Why Did We Choose OpenStack and SDN?
When we first embarked on building a new private cloud in 2015, we needed technology that would last at least five years, so we chose the Red Hat distribution of OpenStack together with software-defined networking from Nuage Networks. The reference architecture for our Paddy Power Betfair OpenStack implementation can be found here; we created it to advise anyone interested in embarking on a similar private cloud journey, as we believed we would have benefited from a similar document at the start of this project and wished to share our experiences:
https://betsandbits.com/2016/10/25/openstack-reference-architecture/
OpenStack was chosen because we wanted API middleware that we could connect networking, compute and storage to and control programmatically. This makes the private cloud solution vendor agnostic: if at a later date we want to introduce a new compute, storage or network vendor, we can substitute them in without having to write new automation orchestration each time.
It also gives us the option of switching out our OpenStack vendor at a later date, or of moving to the upstream OpenStack distribution once we have built the necessary in-house skills. As long as a vendor provides a Nova (Compute), Neutron (Networking) or Cinder (Block Storage) driver for their compute, networking or storage, it can easily be swapped in and used in our private cloud.
Using OpenStack as our automation middleware meant that we could build our continuous delivery pipeline orchestration against the OpenStack APIs, making it compatible with any OpenStack distribution. The alternative was to use vendor APIs directly, which could suddenly change in the next major release, meaning rework, or worse still, leaving us stuck on older versions of software that would eventually go out of support.
For our automation pipeline orchestration we rely heavily on the OpenStack Shade library https://github.com/openstack-infra/shade/tree/master/shade and the Ansible OpenStack modules http://docs.ansible.com/ansible/list_of_cloud_modules.html#openstack
All the OpenStack Ansible modules are written against the Shade client library, so any OpenStack Ansible module that is written is tested to make sure it is interoperable with any OpenStack distribution, be it Red Hat, Mirantis, SUSE and so on. This compatibility is maintained by the OpenStack community and tested as part of the openstack-infra project and Shade releases, and we have also contributed features to the Shade library when required to extend or create new modules to serve our needs.
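To give a flavour of how this looks in practice, here is a minimal Shade sketch of the kind of call our pipelines make when provisioning an immutable virtual machine. The cloud name, image, flavour and network below are placeholders for illustration, not our real pipeline values:

```python
import shade

# Credentials come from clouds.yaml or OS_* environment variables;
# 'private-cloud' is a placeholder cloud name used for illustration.
cloud = shade.openstack_cloud(cloud='private-cloud')

# Provision an instance through the standard OpenStack APIs only, so the
# same code works unchanged against any OpenStack distribution.
server = cloud.create_server(
    name='demo-app-01',        # placeholder server name
    image='rhel-7-base',       # placeholder image
    flavor='m1.medium',        # placeholder flavour
    network='demo-app-net',    # placeholder tenant network
    wait=True,
    timeout=300,
)

print(server.status)
```

The Ansible os_server module drives the same Shade code under the hood, so a playbook task with the equivalent parameters behaves identically.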
Software-defined networking was also deemed a must on the project, as it would help us simplify network operations and allow us to mutate the network in a completely programmatic fashion. At the time the OpenStack Neutron component was not nearly as mature as it is today; as of the Kilo release it struggled to support OpenStack clouds at the scale of 650 compute nodes in a single cloud.
A good article on scaling the Neutron (Networking) service in the latest Newton release can be found here; it seems to have come a long way since the Kilo release and now works at massive scale: https://t.co/ywe9cK4upC
Using the Nuage VSP SDN platform meant that we could scale OpenStack Kilo to the 650 compute nodes (hypervisors) we required per datacenter. Another benefit of using Nuage was multicast support: Neutron to date still does not support multicast, and a lot of our applications need it for cluster discovery in the overlay network. Details on how the Nuage SDN solution handles multicast can be found in this video:
https://www.youtube.com/watch?v=oqF6ezq3eWE
The Nuage VSP solution also meant that we could easily bridge back to our legacy network for application dependencies that had not yet been migrated into OpenStack. This was achieved by making the legacy network routable via the Nuage VSG hardware gateway, which meant we didn't have to fiddle with individual VLANs each time an application needed to talk to the native network. Application access permissions are locked down using Nuage ingress and egress ACL firewall policies on a least-privilege basis, meaning each application is set up with the minimum set of ACL policies it needs to operate.
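To illustrate the least-privilege idea, here is a sketch using the standard OpenStack security-group API via Shade, with placeholder names, ports and CIDRs. The exact mechanism we use to express the Nuage ACL policies differs, but the principle is the same: a dedicated policy per application containing only the flows it actually needs.

```python
import shade

cloud = shade.openstack_cloud(cloud='private-cloud')  # placeholder cloud name

# A dedicated group per application, holding only the rules it needs.
sg = cloud.create_security_group(
    'demo-app-sg', 'Least-privilege rules for demo-app')

# Inbound HTTPS from the front-end subnet only (placeholder CIDR and port).
cloud.create_security_group_rule(
    sg.id, direction='ingress', protocol='tcp',
    port_range_min=443, port_range_max=443,
    remote_ip_prefix='10.10.0.0/16')

# Outbound database traffic to a single backend subnet (placeholder CIDR and port).
cloud.create_security_group_rule(
    sg.id, direction='egress', protocol='tcp',
    port_range_min=5432, port_range_max=5432,
    remote_ip_prefix='10.20.0.0/24')
```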
What Challenges Have We Faced With OpenStack or Nuage VSP SDN?
One of the initial challenges was updating the OpenStack installer. Red Hat had recently moved from the Foreman installer in the Juno release to the new Red Hat OpenStack Installer (OSP Director) in the Kilo release, which is a hardened version of the upstream RDO/TripleO project, so we needed to work out how we would integrate Nuage into this installer. The Red Hat OpenStack Installer uses Heat (OpenStack's orchestration service) templates to install OpenStack, but at the time these only worked for installing pure OpenStack services, not SDN plug-ins, so initially a lot of development work had to go into integrating Nuage into the installer.
Nuage have since integrated these features https://github.com/dttocs/nuage-ospdirector/wiki which will save users from having to do the custom work we had to do when we started our OpenStack implementation. It is important to bear in mind that any service needs a Heat template before it can be installed with OpenStack, as the installer is a full life-cycle management tool. If a service isn't integrated via a Heat template it will be overwritten the next time the installer is run. There is no room for manual tweaks in this world, or you will come unstuck as the infrastructure is scaled out.
Another feature we required was making the solution rack aware, so that it could support the CLOS leaf-spine architecture provided by our Arista switches. This customisation is now going into the upstream OpenStack project: https://review.openstack.org/#/c/377088/
This will make our OpenStack upgrades far easier, as the Nuage plug-in and leaf-spine support are integrated into the installer without the need for bespoke customisations every time a new Red Hat distribution is released.
A real game changer has been our partnership with Red Hat, who invited us to be part of their high-touch programme. This has allowed us to collaborate with Red Hat on a monthly basis on features we would like to see included in OpenStack. CLOS leaf-spine support is one such feature developed as a result of our feedback: Red Hat created a blueprint to implement the features we required in the TripleO/RDO/Red Hat OpenStack Installer so that it can support modern networking needs.
Aside from the challenges with a brand-new installer, we also had some teething problems when we scaled the infrastructure beyond a certain point. Some of the OpenStack default settings were not really set up for the scale we required, which caused some disruption when we hit timeouts as we scaled out, though these were quickly found and increased. Although this is a superficial problem, some timeout settings were initially quite hard to track down, as you had to understand whether the API call was on the controller or compute node side. When raising timeouts it is important that each one is updated across all the required OpenStack services, as operations can cascade between different OpenStack APIs.
So I would advise users to understand how the API calls work and how they traverse each service, so that when an issue is encountered they can track it via the OpenStack request ID in the logs.
When increasing timeouts on Nova and Neutron it is important to make sure that the Nova, Neutron and HAProxy configurations are all raised to sensible values. Another default to look out for is the file descriptor limit on RabbitMQ; in OpenStack this was exceptionally low, meaning that under multiple concurrent API calls it hit the limit and maxed out. Raising these settings was very easy once we found where to look. I do think OpenStack would really benefit from more sensible default settings, and I know it is something being discussed in the OpenStack community.
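As an illustration of the kind of change involved, the sketch below bumps the common oslo.messaging RPC timeout in the Nova and Neutron configuration files. The option names are the standard ones, but the values are examples rather than our production numbers, and on a Director-managed cloud changes like this have to be carried in the installer's templates (see the earlier point about Heat templates) or they will be overwritten on the next run. The HAProxy timeouts and RabbitMQ file descriptor limits live in haproxy.cfg and the RabbitMQ service limits respectively and are tuned separately.

```python
import configparser

# Illustrative values only - not our production settings.
TUNING = {
    '/etc/nova/nova.conf': {
        'DEFAULT': {'rpc_response_timeout': '180'},
        'neutron': {'url_timeout': '180'},   # Nova's client timeout for Neutron calls
    },
    '/etc/neutron/neutron.conf': {
        'DEFAULT': {'rpc_response_timeout': '180'},
    },
}

for path, sections in TUNING.items():
    # RawConfigParser avoids choking on '%' in existing logging format strings.
    cfg = configparser.RawConfigParser()
    cfg.read(path)
    for section, options in sections.items():
        if section != 'DEFAULT' and not cfg.has_section(section):
            cfg.add_section(section)
        for key, value in options.items():
            cfg.set(section, key, value)
    with open(path, 'w') as handle:
        cfg.write(handle)
```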
We also hit some minor bugs with the Pacemaker and RabbitMQ versions in the Kilo release, which were fixed by moving to the latest versions already available. These were all minor niggles that are natural when dealing with any new technology, so there was really nothing out of the ordinary on this front and no show stoppers.
On the Nuage VSP SDN front, one of the main challenges was that we allowed developers to set up their ACL rules themselves, and the Nuage VRS (Open vSwitch) in Nuage VSP version 3.2 has a limit of 100 flows at ingress and 100 flows at egress per VM vport. A few development teams hit this limit initially because their ACL rules were configured in a sub-optimal way, but with a little help these were consolidated down to just the flows required per application, so even our most complex applications contain on average 50 egress and 50 ingress flows, 50% below the limit. Nuage have since raised the limit to 500 ingress/egress flows in the Nuage 4.x release, so it would take a pretty insane application to require 500 ACL rules all to itself on ingress or egress and breach that limit.
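As a rough guard against this, a pipeline step can count the rules in each application's policy and warn before the vport limit is approached. The sketch below uses Shade to count Neutron security-group rules per direction and assumes a roughly one-rule-to-one-flow mapping, which will not hold exactly for every policy; the cloud name and warning threshold are illustrative.

```python
from collections import Counter

import shade

FLOW_LIMIT = 100          # per-vport ingress/egress limit in Nuage VSP 3.2
WARN_THRESHOLD = 0.8      # warn when a policy reaches 80% of the limit

cloud = shade.openstack_cloud(cloud='private-cloud')  # placeholder cloud name

for group in cloud.list_security_groups():
    directions = Counter(rule['direction']
                         for rule in group['security_group_rules'])
    for direction, count in directions.items():
        if count >= FLOW_LIMIT * WARN_THRESHOLD:
            print('{}: {} {} rules (limit {})'.format(
                group['name'], count, direction, FLOW_LIMIT))
```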
The other main challenge we encountered with Nuage VSP was the volume of calls we were making to the Nuage APIs, which meant we had to move to the Nuage 3.2 R10 release: we were spinning up over 1,000 virtual machines a day with a huge number of ACLs, which was very CPU-intensive on the Nuage VSDs. But Nuage, like Red Hat through our partnership programme, analysed how we were using the platform, replicated our set-up in their lab and made some great optimisations to ensure that as we scaled out, all requests for new virtual machines continued to become active within a few seconds, as opposed to having to wait 30 seconds during busy periods. It really does pay to choose your partners with care, as I haven't seen this degree of support from many vendors I have worked with over the years.
What Are the Quantifiable Benefits of OpenStack and Nuage VSP SDN?
So what are the quantifiable benefits of using OpenStack and Nuage VSP SDN together? In our first year of running OpenStack in production, this approach has achieved the following highlights:
- Developers can now self-serve on-boarding of applications and receive compute, networking and storage on demand
- 82 production microservice applications have been migrated onto the OpenStack platform so far using our automated continuous delivery pipelines and are live in production
- We do on average 500 deploys a day to test and production environments on OpenStack
- We provision over 1,100 virtual machines each day across the two OpenStack clouds (one in each of our datacentres), as all our virtual machines are immutable
- We deployed 50 hypervisors using the Red Hat OpenStack Installer across the two OpenStack clouds in one business day
- We peaked at 650 deployments and 2,000+ virtual machines deployed in a single day prior to Christmas
- We now do on average 220 production releases a week on OpenStack
- We currently have 2,207 active virtual machines deployed in OpenStack, which is just % of the end estate for our newly merged Paddy Power Betfair company
- We now run 120 KVM hypervisors per datacentre (240 hypervisors in total), with the end state being 1,300 KVM hypervisors
- We are currently running 17,280 cores on OpenStack with 384 terabytes of storage (the end state is 100,000 cores and 2.08 petabytes of storage)
All in all, not a bad first year running OpenStack in production, with some pretty impressive landmarks. In the new year we will move to the Ocata release of OpenStack, and we are pretty excited about implementing the Manila and Ironic projects to offer bare metal and managed NFS.
Shameless Book Plug
I have also written a book called DevOps for Networking, which looks at some of the techniques that can be applied to automate the data centre, and networking in particular. It also shows how to approach building a DevOps model at a company and how to encourage network teams to automate their jobs. It covers multiple topics such as continuous integration and continuous deployment, as well as OpenStack, Nuage and AWS, plus Ansible. It can be purchased from Packt's website:
https://www.packtpub.com/networking-and-servers/devops-networking
or alternatively on Amazon:
https://www.amazon.co.uk/DevOps-Networking-Steven-Armstrong/dp/1786464853/ref=tmm_pap_swatch_0?_encoding=UTF8&qid=&sr=: