Monday, 14 May 2018

Enabling memory and disk utilisation on EC2

As is often the case, the documentation does a great job of explaining why you should do things, rather than simply what you should do.  While it's well worth knowing the detail, sometimes you just want to get the thing working first!

Here's how to set up host-based monitoring (memory, disk utilisation etc.):

1. First attach an IAM role to your EC2 instance so that it can push metrics to CloudWatch (a rough CLI sketch follows the note below).
2. Run the commands below to download and install CloudWatchAgent.
3. Within a few minutes, you should see your new metrics flowing into the CloudWatch console at https://eu-west-1.console.aws.amazon.com/cloudwatch/home

Note that we use "resources": ["/"] for disk monitoring.  Without this, you'll get 8x unnecessary metrics for pseudo-filesystems like /dev and /run - and if you're looking to stay within the AWS Free Tier, those extra metrics will push you past the limit.
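If you prefer to do step 1 from the CLI, here's a minimal sketch.  The role/profile names and the instance ID are placeholders, and it assumes the AWS-managed CloudWatchAgentServerPolicy, which grants the permissions the agent needs:

# Create a role that EC2 can assume, and grant it the CloudWatch agent policy
cat <<EOF > trust.json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ec2.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role --role-name CloudWatchAgentRole --assume-role-policy-document file://trust.json
aws iam attach-role-policy --role-name CloudWatchAgentRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy

# Wrap the role in an instance profile and attach it to the instance (placeholder instance ID)
aws iam create-instance-profile --instance-profile-name CloudWatchAgentRole
aws iam add-role-to-instance-profile --instance-profile-name CloudWatchAgentRole --role-name CloudWatchAgentRole
aws ec2 associate-iam-instance-profile --instance-id i-0123456789abcdef0 --iam-instance-profile Name=CloudWatchAgentRole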


# Download and install CloudWatchAgent
wget https://s3.amazonaws.com/amazoncloudwatch-agent/linux/amd64/latest/AmazonCloudWatchAgent.zip
unzip AmazonCloudWatchAgent.zip
sudo ./install.sh

# Create a basic config
cat <<EOF | sudo tee /opt/aws/amazon-cloudwatch-agent/bin/config.json >/dev/null
{
  "metrics": {
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      },
      "swap": {
        "measurement": ["swap_used_percent"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "resources": ["/"],
        "measurement": ["disk_used_percent"],
        "metrics_collection_interval": 60
      }
    }
  }
}
EOF

# Load and check the CloudWatchAgent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status
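Once the agent is up, you can also sanity-check from the CLI rather than the console; by default the agent publishes under the CWAgent namespace:

# Optional: confirm the new metrics exist (they may take a few minutes to appear)
aws cloudwatch list-metrics --namespace CWAgent --region eu-west-1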


The full AWS documentation for this is at https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-first-instance.html

Monday, 23 October 2017

Oh no, Python!

$ python
Python 2.7.10 (default, Jul 14 2015, 19:46:27)

>>> False = True
>>> print False
True
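This works because in Python 2, True and False are just built-in names rather than keywords, so they can be rebound.  Python 3 closes the loophole - roughly as below (the exact error message varies by version):

$ python3
>>> False = True
  File "<stdin>", line 1
SyntaxError: can't assign to keyword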

Wednesday, 11 October 2017

Did Trello fail in building a $1bn business?

Over at https://blog.usejournal.com/why-trello-failed-to-build-a-1-billion-business-e1579511d5dc there is an interesting read on how Trello "should" have run its business and "failed" to build a $1bn business.

The article's premise is that we must be the next billion £$ company!  Control the data, keep customers trapped^Wsafe in the warm embracing arms of a walled garden, then drive income because they have no other option (perhaps I'm being a bit harsh :-)).

There is an alternative: Build useful products, offer them at low cost, and don't fixate on valuations and revenue.

What if you built something so compelling that, if you chose to charge in the future, customers would happily pay a trivial price to continue using the great stuff they have?

Turns out that's what Trello did - They have a loyal fan^Wcustomer base, many of whom (including me) would be happy to pay a fee for continued service.  The silver lining is that they sold up for around half a billion.

There's something special about saying to customers: "Here's our product. It's so good that we're making it really easy to move to other products - We know you'll never decide to leave".  It empowers your customers, and holds your team accountable for doing the right thing by them.

Thursday, 14 September 2017

Synchronised time and measuring latency

Many systems need "synchronised" time, from distributed scheduling to monitoring systems.  Synchronised time can be quite a hard problem to solve if you need sub-second accuracy.

The easiest way to achieve high accuracy is to measure start and end times from the same location, though that may introduce a Single Point Of Failure and/or can have an impact on scalability.


Example


Let's look at a simple latency-measuring system to highlight why this is a problem, and how to handle it.

Alice measures the start time (T1), and Bob measures the end time (T2) of the transaction.




The latency is measured as:

    One-way latency = T2 - T1


This assumes Alice and Bob have a shared concept of time.  If Bob's clock were 1 minute fast, the latency would be out by up to 1 minute.  More generally, the measured one-way latency is:

    Measured one-way latency = (T2 + Bob's delta from Real Time) - (T1 + Alice's delta from Real Time)


To some extent Alice and Bob have a shared concept of time, and we use systems like NTP to synchronise network systems.  As always, there are limits to the accuracy if you look close enough.

NTP keeps device time synced, commonly to at best ~100ms[1], unless you have a GPS source on your local LAN.  Two machines can therefore be up to 200ms apart, so in the worst case:

    Measured one-way latency = (T2 + 100ms) - (T1 - 100ms) = (T2 - T1) + 200ms


For many systems this works fine, but if you're looking to measure sub-second latencies across the network then the error margin is far larger than the required accuracy.

The same problem exists if you measure from a third party - for example, collecting data by packet capture and post-processing it.  If the capture mechanism sees data timestamped by multiple sources, it's unlikely to be accurate enough to calculate latency.


Resolving the problem


One simple and effective way to resolve the problem is to measure both T1 and T2 from the same point.




Time taken for Alice to see the full round trip (RTT): T2 - T1

    One-way latency ≈ (T2 - T1) / 2

For differential measurements (like a latency measurement), we don't need to care how far away from Real Time our clock is, only that it is consistent.  For example, if Alice's clock is 32ms fast at both readings:

    Measured one-way latency = (T2 + 32ms) - (T1 + 32ms) = T2 - T1


The delta is the same at T1 and T2, so it cancels out and our latency measurement is accurate.
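As a rough sketch of measuring from a single point, the snippet below takes both timestamps from the same local clock and halves the round trip for a one-way estimate.  The health-check URL is a placeholder, and it assumes GNU date for nanosecond resolution:

# Both timestamps come from the same clock, so any offset from Real Time cancels out
start=$(date +%s%N)
curl -so /dev/null https://example.com/health
end=$(date +%s%N)
rtt_ms=$(( (end - start) / 1000000 ))
echo "RTT: ${rtt_ms} ms, one-way estimate: $(( rtt_ms / 2 )) ms"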


Conclusion


As is often found, complexity is everywhere if you look close enough.  Keeping things simple can be a great way to engineer yourself out of a hole, and avoid design and support nightmares in the process.

---


[1] NTP Accuracy
http://www.ntp.org/ntpfaq/NTP-s-algo.htm#Q-ACCURATE-CLOCK

$ ntpstat
synchronised to NTP server (1.2.3.4) at stratum 3 
  time correct to within 188 ms
  polling server every 1024 s

Tuesday, 7 March 2017

S3 down and the aftermath

Last week we learnt that S3 is not infallible, and does^Wwill go down once in a while.  S3 is designed for 99.999999999% durability and 99.99% availability, but its historical track record is far better.  Nevertheless, S3 in us-east-1 still became unavailable, and it caused huge impact across the Internet.


This week, hundreds of engineering teams will be scrambling to mitigate this edge case, from adding tactical fail-over hacks to fully re-architecting to a multi-region design.

It's important to put the event in perspective though, and react appropriately - What was the impact radius?  What was the outage length?  How likely is the issue to occur again?

If you continually worry about the never-will-happens, you will never get time to work on the most valuable and impactful changes for your customers.

How are your teams reacting to this outage?  Is the reaction proportionate to the risk?

Friday, 24 February 2017

Most of us don't listen with the intent to understand, we listen with the intent to reply

I've noticed that as I've gained experience over the years, I've also adopted many strong opinions - some of which have turned out to be erroneous and have clouded my judgement.

This is an excellent TED talk that gave an extra dimension on communication: http://www.ted.com/talks/celeste_headlee_10_ways_to_have_a_better_conversation

It's well worth taking 12 minutes out to watch this - What things do you notice that you do?

"Approach a conversation, as if you always have something to learn".

Wednesday, 8 February 2017

Consistency vs Perfection

I prefer known "bad" code to assumed-perfect new code for knowability reasons (Six Sigma) - You can control, fix and support known (bad, but consistent) processes far more easily - And the typical way forward is to bring consistency first.
An example (One that I've trapped myself with in the past):

1. I write one half-good piece of code.
2. I copypasta that into 5 other places (Defect: now was the time to refactor.  Perhaps understandable when you have a large code base).
3. Someone else adds a 2nd similar-but-better piece of code.
4. Someone else adds a 3rd similar-but-even-better piece of code.
5. We now have 3x similar pieces of code that should be consolidated into a single abstract helper method.
6. Perfect code #2 has a bugfix applied, because it turned out that we forgot about an old edge case that method #1 covers.
7. Perfect code #3 gains a cool enhancement, and diverges its purpose away from the other two methods.

The two issues with this design pattern are:
1. The refactor becomes harder.  Divergence breeds further divergence, whereas consistency keeps stuff glued together, makes refactors easier, and therefore makes them more likely to be done.
2. Multiple ways of doing the same thing emerge, each with different performance characteristics (two of the methods having unknown characteristics).  The old code might be "bad", but it's also probably less buggy and a more known quantity.

This is another case of Perfect being the enemy of Good.

Tuesday, 6 September 2016

Poking the Beehive

This week we learnt that Facebook crashed a data centre when stress-testing their services.  Why the hell would they do that to themselves?


On the surface it would seem crazy - Why would you poke the beehive when you know it will make the bees angry?

In fact, it's not a new thing; some teams apparently like to get stung.  Netflix most famously have a simian army (Chaos Monkey, Latency Monkey and so on) that pokes the beehive in various ways and at varying scale.  Netflix worked up to disconnecting (and therefore breaking) entire data centres.  Amazon runs similar (but planned) "game days", where relevant teams are pulled in to monitor while dummy traffic is turned up to 11 to stress the various systems.

The team at Netflix reasoned as follows: All of the things are going to break one day.  They will probably break at the most inopportune time, when there is high demand, a new release, after dinner, Christmas time etc.  It's expected that monitoring will send an alert to the appropriate engineers.  Engineers will look at the problem, recover service, deep dive and ultimately fix the problem at the source.  The result is a more robust system, an Anti-Fragile architecture.

If your systems are going to break eventually (and they will!), and you care about it when they do (you should!), then why not do all of that work up front, when customers are less likely to notice, and build a more robust product?

Front-loading this operational work is also beneficial from a planning and resource management perspective.  You could plan a 20% buffer for poking the beehive, and work through a lot of the bugs prior to initial release.  It's far easier to justify bug squashing at development time than trying to carve out a lump of time later.

Poking the beehive should be a part of your deployment process, and then run at regular, random intervals throughout the year.  You have the option of going the whole hog (Netflix-style) and testing your operations team and process too, by not informing them.  Or you could focus on testing the system instead, with an Amazon-style game day where dependent engineering teams are ready to dive into breakages.

Do you and your team poke the beehive on a regular basis?  What are your experiences with system stability?

Tuesday, 23 August 2016

Perfect is the enemy of good (Or: The 80/20% rule)

Ah, the age-old adage: "Perfect is the enemy of good", and the 80/20 rule.  Most of us know of it, but how many of us practise it?

The 80/20 rule - aka the Pareto principle - suggests that 80% of the results come from 20% of the effort.  The Pareto principle originally described land ownership distribution across a population, but it has been observed and recycled in many other areas.  Put simply, aiming for perfect (100%) is a fruitless goal, and you will get lost trying to hit that last 20%.  I would suggest that that last 20% is actually best thought of as complete waste, perhaps even downright dangerous.  Here be dragons!

We see it all the time: I'll do just this one more thing before release.  Just one more feature.  Just one more bug fix.  Just one more unit test.  Strangely though, never just one more piece of documentation!

It seems like a commendable goal on the surface - Why would one not strive for perfection and make the customer ecstatic?  The answer is that it delays delivery, and increases opportunity cost:
  • Delayed delivery: Trying to hit that last 20% unnecessarily delays delivery of your product to the customer - You actually produce negative value by doing this extra work.
  • Increased opportunity cost: The remaining 80% of time that you spend trying to hit that last 20% could be better spent hitting that valuable first 20% in other projects.


One strategy that you could employ to break the cycle is to put yourself into the client or customer's shoes.  Yes, they want the perfect product with all of the features, performance and availability and none of the bugs.  But above all, they want It, Now.

Another approach you could adopt is to cultivate an agile or iterative atmosphere with your clients.  You explicitly agree that improvements will be delivered rapidly and in incremental stages.  This can be somewhat of a paradigm shift for IT Managers, Engineers or Program Managers who are used to the classical waterfall model, but the outcome is a productive one: a better relationship with the client, a better understanding of requirements, quicker course-correcting, quicker product delivery, and ultimately a happier customer.  You typically never work at the wasteful long tail because, by that point, you have already agreed with your customer that a more valuable task is to be delivered first.

I still find myself stuck in perfectionist mode from time to time, but I manage to find a way to snap myself out of it!  Where do you find yourself stuck in perfectionist mode?  How do you find your way back out of the rabbit hole?