Last week we learnt that S3 is not infallible, and does^Wwill go down once in a while. S3 is designed for 99.999999999% durability and 99.99% availability, but its historical track record is far better. Nevertheless, S3 in us-east-1 still became unavailable, and it caused a huge impact across the Internet.
This week, hundreds of engineering teams will be scrambling to mitigate this edge case, from adding tactical fail-over hacks to fully re-architecting to a multi-region design.
It's important to put the event in perspective, though, and react appropriately: what was the impact radius? How long was the outage? How likely is the issue to occur again?
If you continually worry about the never-will-happens, you will never get time to work on the most valuable and impactful changes for your customers.
How are your teams reacting to this outage? Is the reaction proportionate to the risk?
Cross-region replication of your S3 bucket could be a viable option for a small amount of data, but if you have a lot of data in your bucket, the replication costs need to be considered [1] (see the sketch after the reference below). And you're going to pay for storing data in a "replicated" location that your app will probably never use, because S3 in IAD won't go down again any time soon. I hope! :)
So, yeah, really two sides of the same coin.
[1] "In addition to the additional data storage charges for the data in the destination bucket, you will also pay the usual AWS price for data transfer between regions". Ref.:
https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/
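For anyone weighing that option, here is a minimal sketch of what turning on cross-region replication could look like with boto3. The bucket names, region, and IAM role ARN are placeholders, both buckets need versioning enabled before replication will work, and replicating into STANDARD_IA is one way to soften the storage bill for a copy your app will hopefully never read.

```python
# A rough sketch of enabling S3 cross-region replication with boto3.
# Assumptions: "my-app-data" and "my-app-data-replica" are placeholder bucket
# names (the replica living in another region), and the IAM role ARN is
# hypothetical. Versioning must be enabled on BOTH buckets beforehand.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Versioning is a prerequisite for replication on the source bucket
# (the destination bucket needs it too).
s3.put_bucket_versioning(
    Bucket="my-app-data",
    VersioningConfiguration={"Status": "Enabled"},
)

# Replicate every object; STANDARD_IA in the destination keeps the storage
# cost of the rarely-read copy a little lower.
s3.put_bucket_replication(
    Bucket="my-app-data",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",  # hypothetical role
        "Rules": [
            {
                "ID": "replicate-everything",
                "Prefix": "",
                "Status": "Enabled",
                "Destination": {
                    "Bucket": "arn:aws:s3:::my-app-data-replica",
                    "StorageClass": "STANDARD_IA",
                },
            }
        ],
    },
)
```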