Tuesday, 6 September 2016

Poking the Beehive

This week we learnt that Facebook crashed a data centre when stress-testing their services.  Why the hell would they do that to themselves?


On the surface it seems crazy: why would you poke the beehive when you know it will make the bees angry?

In fact, it's not a new thing; some teams apparently like to get stung.  Netflix most famously have a Simian Army (Chaos Monkey, Latency Monkey and so on) that pokes the beehive in various ways and at varying scale.  Netflix worked up to disconnecting (and therefore breaking) entire data centres.  Amazon runs similar (but planned) "game days", where relevant teams are pulled in to monitor while dummy traffic is turned up to 11 to stress the various systems.
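To make the idea concrete, here is a toy sketch of a Chaos Monkey-style poke.  It is not Netflix's tool or its API: `Fleet` and `poke` are hypothetical names, and the "instances" are just strings in memory.  It only shows the shape of the experiment: kill something at random, then check whether the service still answers.

```python
import random


class Fleet:
    """Hypothetical in-memory stand-in for a group of redundant instances."""

    def __init__(self, names):
        self.healthy = set(names)

    def terminate_random(self, rng):
        """Kill one healthy instance, chosen at random."""
        victim = rng.choice(sorted(self.healthy))
        self.healthy.discard(victim)
        return victim

    def handle_request(self):
        """Serve a request from any surviving instance; fail if none are left."""
        if not self.healthy:
            raise RuntimeError("total outage: no healthy instances")
        return f"served by {min(self.healthy)}"


def poke(fleet, rng):
    """Terminate a random instance, then report whether requests still succeed."""
    victim = fleet.terminate_random(rng)
    try:
        fleet.handle_request()
        survived = True
    except RuntimeError:
        survived = False
    return victim, survived


if __name__ == "__main__":
    fleet = Fleet(["web-1", "web-2", "web-3"])
    victim, survived = poke(fleet, random.Random())
    print(f"killed {victim}; service still up: {survived}")
```

With three instances the poke is survivable; with a single instance the same poke is a total outage, which is exactly the kind of weakness you would rather find on your own schedule.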

The team at Netflix reasoned as follows: All of the things are going to break one day.  They will probably break at the most inopportune time: during high demand, just after a new release, after dinner, at Christmas time and so on.  It's expected that monitoring will send an alert to the appropriate engineers.  Engineers will look at the problem, recover service, deep dive and ultimately fix the problem at the source.  Break things deliberately and that whole loop gets exercised on your own terms.  The result is a more robust system, an Anti-Fragile architecture.

If your systems are going to break eventually (and they will!), and you care about it when they do (you should!), then why not do all of that work up front, when customers are less likely to notice, and build a more robust product?

Front-loading this operational work is also beneficial from a planning and resource management perspective.  You could plan a 20% buffer for poking the beehive, and work through a lot of the bugs prior to initial release.  It's far easier to justify bug squashing at development time than to try to carve out a lump of time later.

Poking the beehive should be a part of your deployment process, and then be repeated regularly, at random times, throughout the year.  You have the option of going the whole hog (Netflix) and testing your operations team and process too, by not informing them.  Or you could focus on testing the system itself, with an Amazon-style game day where dependent engineering teams are ready to dive into breakages.
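As a rough sketch of what "random times throughout the year" could look like in practice, the snippet below picks a handful of random dates and tags each one as either an announced game day or an unannounced poke.  The function names (`schedule_pokes`, `plan`) and the labels are my own assumptions for illustration, not anything Netflix or Amazon publish.

```python
import random
from datetime import date, timedelta


def schedule_pokes(year, count, seed=None):
    """Pick `count` distinct random days in `year`, in calendar order."""
    rng = random.Random(seed)
    first = date(year, 1, 1)
    days_in_year = (date(year + 1, 1, 1) - first).days
    # sample() guarantees no day is chosen twice
    offsets = rng.sample(range(days_in_year), count)
    return sorted(first + timedelta(days=offset) for offset in offsets)


def plan(year, count, announce, seed=None):
    """Label each scheduled day with the style of exercise (hypothetical labels)."""
    style = "game day (teams on standby)" if announce else "unannounced"
    return [(day, style) for day in schedule_pokes(year, count, seed)]


if __name__ == "__main__":
    for day, style in plan(2016, 6, announce=False):
        print(day.isoformat(), style)
```

Passing a `seed` makes the schedule reproducible, which is handy if a few people need to know the dates in advance while the operations team does not.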

Do you and your team poke the beehive on a regular basis?  What are your experiences with system stability?
