Monday, 23 October 2017

Oh no, Python!

$ python
Python 2.7.10 (default, Jul 14 2015, 19:46:27)

>>> False = True
>>> print False
True

Wednesday, 11 October 2017

Did Trello fail in building a $1bn business?

Over at https://blog.usejournal.com/why-trello-failed-to-build-a-1-billion-business-e1579511d5dc there is an interesting read on how Trello "should" have run its business and "failed" to build a $1bn business.

The article's premise is that we must be the next billion £$ company!  Control the data, keep customers trapped^Wsafe in the warm embracing arms of a walled garden, then drive income because they have no other option (perhaps I'm being a bit harsh :-)).

There is an alternative: Build useful products, offer them at low cost, and don't fixate on valuations and revenue.

What if you built something so compelling that, if you later chose to charge, customers would happily pay a trivial price to continue using the great stuff they have?

Turns out that's what Trello did - They have a loyal fan^Wcustomer base, many of whom (including me) would be happy to pay a fee for continued service.  The silver lining is that they sold up for half a billion.

There's something special about saying to customers: "Here's our product. It's so good that we're making it really easy to move to other products - We know you'll never decide to leave".  It empowers your customers, and holds your team accountable for doing the right thing by them.

Thursday, 14 September 2017

Synchronised time and measuring latency

Many systems need "synchronised" time, from distributed scheduling to monitoring systems.  Synchronised time can be quite a hard problem to solve if you need sub-second accuracy.

The easiest way to achieve high accuracy is to measure the start and end times from the same location, though that may introduce a Single Point Of Failure and/or impact scalability.


Example


Let's look at a simple latency-measuring system to highlight why this is a problem, and how to handle the behaviour.

Alice measures the start time (T1), and Bob measures the end time (T2) of the transaction.

(Diagram: Alice stamps the start of the transaction at T1; Bob stamps its arrival at T2.)

The latency is measured as:

    One-way latency = T2 - T1


This assumes Alice and Bob have a shared concept of time.  If Bob's clock were 1 minute fast, the measured latency would be off by up to 1 minute.  In this case, the measured one-way latency is really:

    One-way latency = (T2 + Bob's delta from Real Time) - (T1 + Alice's delta from Real Time)


To some extent Alice and Bob do have a shared concept of time, and we use systems like NTP to synchronise networked systems.  As always, there are limits to the accuracy if you look closely enough.

NTP ensures device time is synced, commonly to at-best 100ms[1], unless you have a GPS source on your local LAN.  Two machines can therefore be up to 200ms apart, and in the worst case (Alice's clock 100ms slow, Bob's 100ms fast):

    One-way latency = (T2 + 100ms) - (T1 - 100ms) = (T2 - T1) + 200ms


For many systems this works fine, but if you're trying to measure sub-second latencies across the network then the error margin is far larger than the required accuracy.
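
To make the error concrete, here's a small Python simulation (the 80ms true latency and the 100ms clock offsets are made-up numbers for illustration):

# Simulate how clock skew corrupts a two-party latency measurement.
TRUE_LATENCY = 0.080    # real one-way latency: 80ms
ALICE_OFFSET = -0.100   # Alice's clock runs 100ms slow
BOB_OFFSET = +0.100     # Bob's clock runs 100ms fast

real_t1 = 1000.000                  # transaction starts (Real Time)
real_t2 = real_t1 + TRUE_LATENCY    # transaction ends (Real Time)

t1 = real_t1 + ALICE_OFFSET         # what Alice's clock reads
t2 = real_t2 + BOB_OFFSET           # what Bob's clock reads

print("true latency:     {:.0f}ms".format(TRUE_LATENCY * 1000))  # 80ms
print("measured latency: {:.0f}ms".format((t2 - t1) * 1000))     # 280ms

The measurement is out by the full 200ms of combined clock error, dwarfing the 80ms we were trying to measure.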

The same problem exists if you try to measure from a third party - for example, collecting data by packet capture and post-processing it.  If the packet capture mechanism sees data from multiple sources, each with its own clock, it's unlikely to be accurate enough to calculate latency from.


Resolving the problem


One simple and effective way to resolve the problem is to measure both T1 and T2 from the same point.

(Diagram: Alice stamps both the send (T1) and the reply's return (T2) herself, measuring the full round trip.)

Time taken for Alice to see the full round trip (RTT): T2 - T1

    One-way latency ≈ (T2 - T1) / 2

This assumes the outbound and return paths are roughly symmetric.

For differential measurements (like a latency measurement), we don't need to care how far our clock is from Real Time, only that the offset is consistent.  Taking an arbitrary 32ms offset as an example:

    One-way latency = (T2 + 32ms) - (T1 + 32ms) = T2 - T1


The delta is the same at T1 and T2, so it cancels out, and our latency measurement is accurate.
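
Here's a minimal Python sketch of the single-point approach, assuming Python 3 and a hypothetical blocking round-trip call named send_and_wait_for_reply():

import time

def one_way_latency(send_and_wait_for_reply):
    # Both timestamps come from the same monotonic clock, so any
    # offset from Real Time is identical at T1 and T2 and cancels out.
    t1 = time.monotonic()
    send_and_wait_for_reply()  # hypothetical round trip to Bob and back
    t2 = time.monotonic()
    rtt = t2 - t1
    return rtt / 2  # assumes the two directions are symmetric

Using time.monotonic() rather than time.time() also protects the measurement from NTP stepping the clock mid-measurement.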


Conclusion


As is often found, complexity is everywhere if you look closely enough.  Keeping things simple can be a great way to engineer yourself out of a hole, and avoid design and support nightmares in the process.

---


[1] NTP Accuracy
http://www.ntp.org/ntpfaq/NTP-s-algo.htm#Q-ACCURATE-CLOCK

$ ntpstat
synchronised to NTP server (1.2.3.4) at stratum 3 
  time correct to within 188 ms
  polling server every 1024 s

Tuesday, 7 March 2017

S3 down and the aftermath

Last week we learnt that S3 is not infallible, and does^Wwill go down once in a while.  S3 is designed for 99.999999999% durability and 99.99% availability, but its historical track record is far better.  Nevertheless, S3 in us-east-1 still became unavailable, and it caused huge impact across the Internet.


This week, hundreds of engineering teams will be scrambling to mitigate this edge case, from adding tactical fail-over hacks to fully re-architecting to a multi-region design.

It's important to put the event in perspective though, and react appropriately - What was the impact radius?  How long was the outage?  How likely is the issue to occur again?

If you continually worry about the never-will-happens, you will never get time to work on the most valuable and impactful changes for your customers.

How are your teams reacting to this outage?  Is the reaction proportionate to the risk?

Friday, 24 February 2017

Most of us don't listen with the intent to understand, we listen with the intent to reply

I've noticed that as I've gained experience over the years, I've also adopted many strong opinions - some of which turned out to be erroneous, and clouded my judgement.

This is an excellent TED talk that gave an extra dimension on communication: http://www.ted.com/talks/celeste_headlee_10_ways_to_have_a_better_conversation

It's well worth taking 12 minutes out to watch this - What things do you notice that you do?

"Approach a conversation, as if you always have something to learn".

Wednesday, 8 February 2017

Consistency vs Perfection

I prefer known "bad" code to assumed-perfect new code, for knowability reasons (think Six Sigma): you can control, fix and support known (bad, but consistent) processes far more easily, and the typical way forward is to bring consistency first.

An example (one that I've trapped myself with in the past):

1. I write one half-good piece of code.
2. I copypasta that into 5 other places (Defect: this was the moment to refactor.  Perhaps understandable when you have a large code base).
3. Someone else adds a 2nd similar-but-better piece of code.
4. Someone else adds a 3rd similar-but-even-better piece of code.
5. We have 3x different similar pieces of code that should be consolidated into a single abstract helper method.
6. Perfect code #2 has a bugfix applied because it turned out that we forgot about an old edge case that method #1 covers.
7. Perfect code #3 has a cool enhancement, and diverges its purpose away from the other two methods.

The two issues with this pattern are:
1. The refactor becomes harder.  Divergence breeds further divergence.  Consistency keeps stuff glued together, makes for easier refactors, and therefore makes them more likely to be done.
2. Multiple ways of doing the same thing, with differing (and for two of the three methods, unknown) performance characteristics.  The old code might be "bad", but it's also probably less buggy and a more known quantity.
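
To make that concrete, here's a hypothetical sketch (all names invented) of where steps 1-7 leave you:

# Three near-duplicates that started life as copypasta.
def parse_user_v1(line):
    # Original: also handles the legacy "name|email" records.
    sep = "|" if "|" in line else ","
    name, email = line.split(sep, 1)
    return name.strip(), email.strip()

def parse_user_v2(line):
    # "Better" copy: lower-cases emails, but forgot the legacy
    # "|" records - the step-6 bugfix waiting to happen.
    name, email = line.split(",", 1)
    return name.strip(), email.strip().lower()

def parse_user_v3(line):
    # "Even better" copy: diverged - now also returns the email
    # domain (the step-7 enhancement).
    name, email = line.split(",", 1)
    email = email.strip()
    return name.strip(), email, email.split("@", 1)[1]

Every bugfix now has to be discovered and applied three times, and the consolidated helper gets harder to write with each divergence.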

This is another case of Perfect being the enemy of Good.