Tuesday, 7 March 2017

S3 down and the aftermath

Last week we learnt that S3 is not infallible, and does^Wwill go down once in a while.  S3 is designed for 99.999999999% durability and 99.99% availability, but its historical track record is far better.  Nevertheless, S3 in us-east-1 still became unavailable, and the impact was felt across the Internet.


This week, hundreds of engineering teams will be scrambling to mitigate this edge case, from adding tactical fail-over hacks to fully re-architecting for a multi-region design.

It's important to put the event in perspective though, and react appropriately - What was the impact radius?  How long did the outage last?  How likely is the issue to occur again?
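
One rough way to put an outage in perspective is to compare its length with the downtime that a 99.99% availability target already allows.  The sketch below is only back-of-the-envelope arithmetic: the function names are made up for illustration, and the 4-hour outage is a hypothetical input, not a measured figure.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year that an availability target permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def years_of_budget_used(outage_minutes: float, availability_percent: float) -> float:
    """How many years' worth of downtime budget a single outage consumes."""
    return outage_minutes / downtime_budget_minutes(availability_percent)


if __name__ == "__main__":
    target = 99.99               # availability design target quoted above
    hypothetical_outage = 4 * 60  # assumed 4-hour outage, for illustration only

    budget = downtime_budget_minutes(target)
    used = years_of_budget_used(hypothetical_outage, target)
    print(f"{target}% availability allows about {budget:.1f} minutes of downtime per year")
    print(f"A {hypothetical_outage}-minute outage uses about {used:.1f} years of that budget")
```

Plug in the real outage length and your own availability target, and weigh the result against how often such an event has actually happened.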

If you continually worry about the never-will-happens, you will never get time to work on the most valuable and impactful changes for your customers.

How are your teams reacting to this outage?  Is the reaction proportionate to the risk?