Availability in Distributed Systems
In a typical distributed software environment, besides latency, a key metric that correlates well with the health of the business is what is known as “availability”.
Availability is usually measured as the percentage of time the software is up (i.e., available). It is vital to modern software organizations. “High availability”, or HA for short, is a term commonly used by software professionals. It refers to a software system that has very little downtime.
Let’s go back to our hypothetical social network from the previous article, in which you are a senior member of the engineering team. Let’s assume that, for some yet-to-be-determined reason, the site is down an average of five minutes a day. Only five minutes! If you are still learning the ropes, that might not sound like much. If you are a seasoned engineer, however, you know that if you don’t reduce that downtime as soon as possible, all hell will break loose!
Why? Because five minutes of downtime a day adds up to about thirty hours of downtime a year. For the key players in the high tech industry, an hour of downtime can result in millions of dollars of lost revenue. For example, Fortune estimates that Facebook makes around $13.3M an hour. Thirty hours of outage therefore add up to around 400 million dollars of lost revenue! That is definitely not a loss any sane business would appreciate.
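The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function names are made up for this example, and it assumes revenue is spread evenly across every hour of the year, which is a simplification.

```python
MINUTES_PER_HOUR = 60
DAYS_PER_YEAR = 365


def annual_downtime_hours(minutes_per_day: float) -> float:
    """Convert an average daily downtime (minutes) into hours per year."""
    return minutes_per_day * DAYS_PER_YEAR / MINUTES_PER_HOUR


def lost_revenue(downtime_hours: float, revenue_per_hour: float) -> float:
    """Estimate lost revenue, assuming revenue is uniform across all hours."""
    return downtime_hours * revenue_per_hour


# Figures from the text: 5 minutes/day of downtime, ~$13.3M revenue per hour.
hours = annual_downtime_hours(5)
loss = lost_revenue(hours, 13.3e6)
print(f"{hours:.1f} hours/year, ~${loss / 1e6:.0f}M lost")
```

The result lands slightly above the rounded figures in the text, because 5 minutes a day is a bit more than 30 hours a year.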
The numbers are immense because there are many millions of users consuming the service at any point in time. Having said all that, there is a balance to be had. If the business ends up spending $700M in engineering resources (hardware, data center leases, maybe cloud service costs, additional salaries, etc.) in order to mitigate $400M of lost revenue, then it’s still a net loss. Experienced leads keep that in mind when planning the availability targets of the software systems they are responsible for.
Now that we have established how critical availability is, let’s dive deeper into how it is measured. As covered earlier, availability is the percentage of time your service is up. This is also known as uptime. In other words, it’s the complement of downtime. Availability is usually calculated per year, and it is commonly described in “nines”. One nine of availability is simply 90% uptime throughout a year. This translates to about 36.5 days of downtime a year, which is very bad if you are Facebook, Instagram, Google, Uber or a software business of a similar caliber. Three nines is 99.9% uptime (~9 hours/year of downtime). Two and a half nines is 99.5%, and so on. One of my favorite references on the uptime nines is this Wikipedia table.
Because seeking perfection in distributed systems is no different from wishing that unicorns would join you for breakfast, there will never be 100% availability. The best you can do is add more nines to your availability target through deep engineering efforts. Six nines (99.9999%), for example, equates to ~32 seconds of downtime per year. An eight nines availability objective tolerates no more than ~316 milliseconds of downtime per year. Obviously, the more critical your system is, the higher its availability needs to be.
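To make the nines concrete, here is a minimal Python sketch that converts an availability percentage into the downtime it allows per year. The helper names are invented for this example, and it assumes a 365-day year, so the results can differ slightly from tables that use a 365.25-day year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Downtime allowed per year for a given availability percentage."""
    return (100 - availability_percent) / 100 * SECONDS_PER_YEAR


def describe(seconds: float) -> str:
    """Render a duration in the largest unit that keeps it readable."""
    if seconds >= 86_400:
        return f"{seconds / 86_400:.1f} days"
    if seconds >= 3_600:
        return f"{seconds / 3_600:.1f} hours"
    if seconds >= 1:
        return f"{seconds:.1f} seconds"
    return f"{seconds * 1_000:.0f} milliseconds"


# Availability levels mentioned in the text.
for percent in (90, 99.5, 99.9, 99.9999, 99.999999):
    allowed = downtime_seconds_per_year(percent)
    print(f"{percent}% uptime -> {describe(allowed)} of downtime per year")
```

Each extra nine divides the allowed downtime by ten, which is why every additional nine tends to cost far more engineering effort than the one before it.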
Availability doesn’t only apply to the whole system, but also to its subsystems. Subsystems are the features, components/services, and databases within the parent system. For example, in our hypothetical social network, the parent system is the site as a whole. Underneath, there are subsystems powering different features like the marketplace, video streaming, ad serving, the news feed, etc. Each feature is powered by a number of software services that must work well together for the subsystem to operate with acceptable availability and latency.
The availability requirements differ from one component to the next based on business needs. In the example of a social network, serving ads is critical for the business, and it must be combined with a highly available news feed in which users can see the ads. Notice in the diagram below how the uptime requirements for the ability to write posts are lower than the uptime requirements for serving ads.
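As a rough illustration of per-component targets, the Python sketch below gives each feature its own yearly downtime budget and checks an observed downtime against it. The target percentages and component names are hypothetical, chosen only to mirror the idea that writing posts can tolerate more downtime than serving ads; they are not taken from the diagram.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year

# Hypothetical targets for illustration only.
AVAILABILITY_TARGETS = {
    "serving_ads": 99.99,
    "news_feed": 99.99,
    "write_posts": 99.9,
}


def downtime_budget_minutes(target_percent: float) -> float:
    """Yearly downtime (minutes) a component may accumulate within its target."""
    return (100 - target_percent) / 100 * MINUTES_PER_YEAR


def meets_target(observed_downtime_minutes: float, target_percent: float) -> bool:
    """True if the observed yearly downtime fits inside the component's budget."""
    return observed_downtime_minutes <= downtime_budget_minutes(target_percent)


# The same hour of downtime in a year is fine for one component but not another.
for component, target in AVAILABILITY_TARGETS.items():
    budget = downtime_budget_minutes(target)
    status = "OK" if meets_target(60, target) else "over budget"
    print(f"{component}: target {target}%, budget {budget:.1f} min/year, 60 min down -> {status}")
```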