The Infrastructure Engineer: 2016

Tuesday, 6 September 2016

Poking the Beehive

This week we learnt that Facebook crashed a data centre when stress-testing their services. Why the hell would they do that to themselves?

On the surface it would seem crazy: why would you poke the beehive when you know it will make the bees angry?

In fact, it's not a new thing; some teams apparently like to get stung. Netflix most famously has a simian army (Chaos Monkey, Latency Monkey and so on) that pokes the beehive in various ways and at varying scale. Netflix worked up to disconnecting (and therefore breaking) entire data centres. Amazon does something similar with planned "game days", where relevant teams are pulled in to monitor while dummy traffic is turned up to 11 to stress the various systems.

The team at Netflix reasoned as follows: all of the things are going to break one day. They will probably break at the most inopportune time: during high demand, a new release, after dinner, at Christmas time, and so on. When that happens, monitoring is expected to alert the appropriate engineers, who look at the problem, recover service, deep dive and ultimately fix the problem at the source. The result is a more robust system, an anti-fragile architecture.

If your systems are going to break eventually (and they will!), and you care about it when they do (you should!), then why not do all of that work up front, when customers are less likely to notice, and build a more robust product?

Front-loading this operational work is also beneficial from a planning and resource management perspective. You could plan a 20% buffer for poking the beehive, and work through a lot of the bugs prior to initial release. It's far easier to justify bug squashing at development time than to try to carve out a lump of time later.

Poking the beehive should be a part of your deployment process, and then run at regular, random intervals throughout the year. You have the option of going the whole hog (Netflix) and testing your operations team and process too, by not informing them. Or you could focus on testing the system instead, with an Amazon-style game day where dependent engineering teams are ready to dive into breakages.
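
As a rough illustration of what a single poke might look like, here is a minimal Python sketch against a simulated fleet. Everything in it is an assumption made up for illustration: the instance names, the poke_the_beehive and service_is_up helpers, and the "at least two healthy instances" capacity rule. It is not how Netflix or Amazon implement their tooling; a real exercise would terminate real instances and watch real monitoring.

```python
import random


def poke_the_beehive(fleet, rng):
    """Mark one randomly chosen healthy instance as failed; return its name."""
    healthy = [name for name, up in fleet.items() if up]
    if not healthy:
        return None
    victim = rng.choice(healthy)
    fleet[victim] = False
    return victim


def service_is_up(fleet, min_healthy):
    """Assumed rule: the service is up while min_healthy instances remain."""
    return sum(fleet.values()) >= min_healthy


# Hypothetical fleet of four instances, all healthy at the start.
fleet = {"web-1": True, "web-2": True, "web-3": True, "web-4": True}
rng = random.Random(2016)  # seeded so the same game day can be replayed

pokes_survived = 0
while True:
    victim = poke_the_beehive(fleet, rng)
    if victim is None or not service_is_up(fleet, min_healthy=2):
        break
    pokes_survived += 1

print(f"Service survived {pokes_survived} poke(s) before dropping below capacity")
```

The useful number is how many pokes the system absorbs before customers would notice. If the answer is zero, you have found something to fix before release.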

Do you and your team poke the beehive on a regular basis? What are your experiences with system stability?

Tuesday, 23 August 2016

Perfect is the enemy of good (Or: The 80/20% rule)

Ah, the age-old adage "Perfect is the enemy of good", and the 80%/20% rule. Most of us know of it, but how many of us practice it?

The 80%/20% rule (aka the Pareto principle) suggests that 80% of the results come from 20% of the effort. The Pareto principle originally referred to land ownership distribution across a population, but it has since been observed and reused in many other areas. Put simply, aiming for perfect (100%) is a fruitless goal, and you will get lost trying to hit that last 20%. I would suggest that the last 20% is best thought of as complete waste, perhaps even downright dangerous. Here be dragons!

We see it all the time: "I'll do just this one more thing before release." Just one more feature. Just one more bug fix. Just one more unit test. Strangely though, never just one more piece of documentation!

It seems like a commendable goal on the surface: why would one not strive for perfection and make the customer ecstatic? The answer is that it delays delivery and increases opportunity cost:

Delayed delivery: Trying to hit that last 20% unnecessarily delays delivery of your product to the customer. You actually produce negative value by doing this extra work.
Increased opportunity cost: The 80% of your effort that goes into chasing that last 20% of results could be better spent on the valuable first 20% of effort in other projects.
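
To put rough numbers on the opportunity cost, here is a small back-of-the-envelope sketch in Python. It assumes an idealised Pareto split (exactly 20% of the effort buys 80% of the value) and a made-up budget of 100 units of effort; real projects will not divide this neatly.

```python
# Idealised Pareto assumption: the first 20 units of effort on a project
# deliver 80 units of value; reaching 100 units of value takes all 100.
EFFORT_BUDGET = 100      # hypothetical units of effort available
FULL_EFFORT = 100        # effort to take one project to "perfect"
FULL_VALUE = 100         # value of one "perfect" project
FIRST_PASS_EFFORT = 20   # effort for the "good enough" version
FIRST_PASS_VALUE = 80    # value of the "good enough" version

# Option A: polish a single project to perfection.
perfect_projects = EFFORT_BUDGET // FULL_EFFORT
value_perfectionist = perfect_projects * FULL_VALUE

# Option B: ship the good-enough version of as many projects as fit.
good_enough_projects = EFFORT_BUDGET // FIRST_PASS_EFFORT
value_pragmatist = good_enough_projects * FIRST_PASS_VALUE

print(f"Perfectionist: {perfect_projects} project(s), value {value_perfectionist}")
print(f"Pragmatist: {good_enough_projects} project(s), value {value_pragmatist}")
```

Under these assumptions the same budget delivers five good-enough projects instead of one perfect one, for four times the total value.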

One strategy that you could employ to break the cycle is to put yourself in the client's or customer's shoes. Yes, they want the perfect product with all of the features, performance and availability, and none of the bugs. But above all, they want It, Now.

Another approach you could adopt is to cultivate an agile or iterative atmosphere with your clients. You explicitly agree that improvements will be delivered rapidly and in incremental stages. This can be something of a paradigm shift for IT Managers, Engineers or Program Managers who are used to the classical waterfall model, but the outcome is a productive one: a better relationship with the client, a better understanding of requirements, quicker course-correction, quicker product delivery, and ultimately a happier customer. You typically never work on the wasteful long tail, because by that point you have already agreed with your customer that a more valuable task is to be delivered first.

I still find myself stuck in perfectionist mode from time to time, but manage to find a way to snap myself out of it! Where do you find yourself stuck in perfectionist mode? How do you find your way back out of the rabbit hole?