DevOps A Misinterpreted Buzz Word: May 2016

The OpenStack Summit in Texas was the third OpenStack Summit I have attended, the first two being Vancouver and Tokyo last year. A lot had changed in that time for Betfair. At the first summit I attended, in Vancouver, we were simply fact finding, using the conference to see if some of our initial OpenStack theories and designs were valid. In Tokyo we were checking for new developments that could help speed up our pilot, and for project updates on Ironic (Bare Metal) and Manila (File Share As A Service), which we would implement as part of phase two of our OpenStack implementation.

Betfair sent us to those two OpenStack summits to ensure we made the correct architectural decisions when implementing our own OpenStack solution, while avoiding any glaring mistakes other vendors had already made and learned from. We also used the conferences to evaluate the software defined networking (SDN) solutions on offer, and to answer many of our early queries. Should we be using Ceph? Should we use local disk or centralised storage? What OpenStack projects were actually mature enough to use yet? What was the actual situation with Ironic bare metal provisioning?

The OpenStack Summit, put simply, is a very technical and developer-led conference, which sets it apart from most vendor specific summits. It is probably the most self-deprecating conference I have ever attended, as the people presenting are honest and will share mistakes and war stories to help the community and others improve. It doesn't say things are great if it doesn't believe they are, which is a helpful cultural shift that many vendor specific conferences could learn from.

The OpenStack Summit runs around 10 sessions in parallel in 40 minute windows from 9am through to 6pm, using one main conference center supplemented by two hotels, so you can gather the enormity of the set-up, with around 7,500 people attending in Austin. This, remember, is a conference for an open source project, which in itself is very impressive and something of a huge cultural shift in its own right.

Who'd have thought a project like OpenStack, which Gartner called a science experiment in 2013, would attract so many big hitters, with every vendor you can think of attending? Gartner, incidentally, gave a keynote at the OpenStack Summit last week saying OpenStack was ready for production use and a great platform. At one point people said the world was flat, so Gartner too can make mistakes and let's not be too hard on them, despite the fact I have always felt that a magic quadrant sounds like something a salesman would sell you alongside a pyramid scheme, but I digress.

One of the key themes to come out of the OpenStack Summit in Austin was something that has been obvious to me for many years while implementing configuration management processes in industry. Technology rarely causes cloud projects to fail, with only around 6% of projects failing due to technology. It's the company culture, those implementing projects not understanding where the value add is, and not focusing on initial requirements that cause the projects to fail.

A cloud platform is useless to a developer if they have to raise a ticket to an infrastructure team to create a VM and cannot get the specification they want. In that case there is no value add, so failure to change a company's operational model is the main reason cloud projects fail. The value add for Betfair is not implementing OpenStack or a Nuage SDN as standalone initiatives or, as Gartner put it, "science projects". The value add for Betfair is using this fantastic technology as an enabler and advantage to speed up time to market, to automate our whole platform so we can easily roll back or recover from failure, and to give our developers a platform to easily test and innovate on. We use OpenStack as our infrastructure middleware, utilising a consistent set of APIs to control the infrastructure via automated workflows.

So if a cloud platform fails or isn't adopted, it's down to the people who implemented it or didn't change the operational processes and the siloed culture, and very rarely is the failure the underlying technology. If you don't fix the process and people issues, then it doesn't matter what your platform is or how good it is. So companies using ticketing systems and promoting siloed teams and communication, beware: you will most likely fail with your cloud initiative.

The OpenStack Summit in Texas was very different from Vancouver and Tokyo for Betfair. The Betfair team had been selected by the OpenStack community to give 2 of the 400 breakout talks, as well as being nominated for the OpenStack super user award. Although we were ultimately unsuccessful in taking home the super user award, even being nominated was a great achievement, given we are very early in our implementation, only 7 months in to date.

Betfair was one of four finalists for the super user award, beating some very impressive OpenStack users along the way. The OpenStack super user nomination for Betfair was for the pace at which Betfair had implemented an active-active data center with OpenStack and Nuage SDN technology at its heart. We had also created an automated template for on-boarding applications, by automating everything against the OpenStack and Nuage software defined networking APIs. This has enabled deployment of code, platform image, networks, ACL rules and load balancing using common workflows, all with open source tooling.
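To make the idea of a common workflow concrete, here is a minimal sketch in Python. It is not our actual tooling: the `Step` class, the stage names, their ordering and the in-memory `state` set are all hypothetical stand-ins, and a real implementation would call the OpenStack and Nuage APIs where this one just records what it has done. It only illustrates the general pattern of running each stage in order and undoing the completed stages if a later one fails.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Step:
    """One stage of a deployment workflow, paired with its undo action."""
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


def run_workflow(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse order.

    Returns a log of what happened.
    """
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            log.append(f"failed: {step.name} ({exc})")
            for finished in reversed(done):
                finished.rollback()
                log.append(f"rolled back: {finished.name}")
            return log
        done.append(step)
        log.append(f"applied: {step.name}")
    return log


def build_onboarding_steps(app: str, state: Set[str], fail_at: str = "") -> List[Step]:
    """Build hypothetical on-boarding stages for one application.

    The stage names and ordering are assumptions for illustration only.
    Each stage just records itself in `state` instead of calling a real API.
    """
    names = ["network", "acl_rules", "platform_image", "load_balancer", "code"]

    def make(name: str) -> Step:
        key = f"{app}:{name}"

        def apply() -> None:
            if name == fail_at:
                raise RuntimeError("simulated failure")
            state.add(key)

        def rollback() -> None:
            state.discard(key)

        return Step(name, apply, rollback)

    return [make(n) for n in names]


if __name__ == "__main__":
    state: Set[str] = set()
    for line in run_workflow(build_onboarding_steps("demo-app", state)):
        print(line)
```

Because every application goes through the same list of stages, adding a new application is a matter of supplying its parameters rather than writing a new process, and a failure part way through leaves nothing half built.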

I make no apologies for saying it is mightily impressive to go from a 4 week POC, to a pilot, to actually running applications in production, all in 6 months on a brand new platform, while automating every step of the deployment. This is while at the same time trying to change the opinions and ways of working of people who weren't used to the pace at which we had to work, or to using automation in their daily tasks. Sure, we are still iterating bits of the implementation, but we always will be. We will keep improving the automation and learning until we reach near perfection. It's been a herculean effort by the team given the time frames. The proudest part for me is having teams in the UK, Porto and Cluj all collaborate to make this possible. It wasn't siloed teams that made this happen. This was a DevOps initiative in the truest sense across multiple countries, and we need to keep that collaboration going to continue the success.

I understand there will be questions about whether we deserve this super user award nomination yet. In my honest opinion I don't think we would have been worthy winners... YET.
But having attended the conference and spoken to our peers, I do think we were worthy nominees. When people from companies of the calibre of Walmart approach us and say that what we have done with our automated deployment process is what they have been trying to do for years, it really hits home what a huge project this is, what we have achieved in a short space of time, and what a key differentiator it can be for the business.

The purist in me wants the time we win the OpenStack super user award (I like to believe we could one day) to be when we are running all of Paddy Power Betfair's applications in production on OpenStack, and have fully automated everything in the data center top to bottom. By then the quantifiable benefits will be that we have reduced the time to market for our Paddy Power Betfair applications, as well as the time to recover from failure, while giving our developers infrastructure to actually innovate on and facilitate their fantastic ideas, all the while contributing back everything we do to the open source community to help others.

Only then, when we finish our migration project, will we have done what we set out to achieve. We may not catch the eyes of our peers like we have in our first 6 months, or be nominated again for an OpenStack super user award, but achieving all the initial requirements is the measuring stick for our success. The ideal scenario, and the future, is one where our developers are writing code for the best applications possible without worrying about infrastructure issues, while our infrastructure and network engineers are developing code for OpenStack and the open source community to improve and optimise the infrastructure and network we run those applications on.

Why can't Paddy Power Betfair be the new Etsy? I see no reason why we shouldn't be leading the way from now on and setting a new benchmark using the fantastic platform we have built for our developers. As highlighted at the OpenStack Summit, the only thing that could stop us is culture, and I'm quietly confident we will do just fine. Rapid on-boarding, bare metal and containers as a service are our next stop, and I am just a little bit excited about what we can achieve next...

The two Betfair sessions from the Austin OpenStack Summit can be watched below:

DevOps at Betfair using Openstack and SDN:
https://www.youtube.com/watch?v=aKa2idHhk94

Why Betfair chose Openstack - the Road to Their Production Private Cloud:
https://www.youtube.com/watch?v=-Tmuph-vUWU

Monday, 2 May 2016

A Week At The OpenStack Summit In Texas