
The Challenges of Ensuring High Availability

A service-level agreement (SLA) is an agreement between two or more parties, where one is the customer and the others are service providers. In this article we are going to focus on one aspect of an SLA: service availability.

Service-level agreements are defined at different levels:

  • Customer-based SLA: An agreement with an individual customer group, covering all the services they use. For example, an SLA between an IT service provider and the finance department of a large organization, covering services such as the finance system, payroll system, billing system, and procurement/purchasing system.
  • Service-based SLA: An agreement for all customers using the services being delivered by the service provider. For example:
    • A mobile service provider offers a routine service to all customers, with certain maintenance included as part of the offer, at a universal charge.
    • An email system for the entire organization. Difficulties can arise with this type of SLA because the level of service offered may vary between customers.
  • Multilevel SLA: The SLA is split into different levels, each addressing a different set of customers for the same services, within the same SLA.
    • Corporate-level SLA: Covering all the generic service level management (often abbreviated as SLM) issues appropriate to every customer throughout the organization. These issues are likely to be less volatile and so updates (SLA reviews) are less frequently required.
    • Customer-level SLA: Covering all SLM issues relevant to a particular customer group, regardless of the services being used.
    • Service-level SLA: Covering all SLM issues relevant to specific services, in relation to that specific customer group.

In SLAs, availability is usually expressed as a number of nines, as shown in the table below:

Availability   Nines   Daily downtime   Weekly downtime   Monthly downtime   Yearly downtime
90%            One     2h 24m 0.0s      16h 48m 0.0s      3d 1h 2m 54.6s     36d 12h 34m 55.2s
99%            Two     14m 24.0s        1h 40m 48.0s      7h 18m 17.5s       3d 15h 39m 29.5s
99.9%          Three   1m 26.4s         10m 4.8s          43m 49.7s          8h 45m 57.0s
99.99%         Four    8.6s             1m 0.5s           4m 23.0s           52m 35.7s
99.999%        Five    0.9s             6.0s              26.3s              5m 15.6s

These calculations assume a requirement of continuous uptime (i.e. 24/7, all year long).
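
The downtime figures follow directly from the availability percentage: the allowed downtime is simply (1 - availability) multiplied by the length of the period. A minimal Python sketch, assuming an average Gregorian year of 365.2425 days (which reproduces the values in the table):

```python
# Convert an availability percentage into the maximum allowed downtime,
# assuming continuous (24/7) operation as in the table above.

DAY = 24 * 3600                # seconds in a day
YEAR = 365.2425 * DAY          # average Gregorian year, in seconds
PERIODS = {"daily": DAY, "weekly": 7 * DAY, "monthly": YEAR / 12, "yearly": YEAR}

def downtime(availability_pct: float, period_seconds: float) -> str:
    """Allowed downtime for a period, formatted as 'Xh Ym Zs'."""
    seconds = (1 - availability_pct / 100) * period_seconds
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{int(hours)}h {int(minutes)}m {secs:.1f}s"

for pct in (90, 99, 99.9, 99.99, 99.999):
    print(pct, {name: downtime(pct, period) for name, period in PERIODS.items()})
```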

Unfortunately, guaranteeing a certain percentage of uptime does not depend only on the code we develop but also on external factors: some are out of our control and others cannot be predicted in advance.

Here are some examples of those external factors:

Cloud Outages

There’s a misconception that simply deploying an application in the cloud makes it automatically immune to downtime. Although such incidents are unusual, since the best-known infrastructure providers offer quite high SLAs, there have been several outages that impacted the applications running on them.

Overload

This happens when usage of the system goes beyond what it was designed and planned for, leading to an accumulation of pending requests and creating a snowball effect that ends up rendering the service unusable.

The most common ways of dealing with this problem are:

  • perform load testing in advance to determine the load limit of the service and then set a threshold at which it immediately starts refusing new requests while remaining responsive (see the sketch after this list);
  • provide the system with elasticity characteristics so that it can dynamically adapt to the load inflicted on it: “scale out” happens when there is an increase in load and “scale in” when the pressure on the system decreases.
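
As an illustration of the first approach, here is a minimal load-shedding sketch in Python: once the number of in-flight requests reaches a threshold determined by prior load testing, new requests are rejected immediately so the service stays responsive. The names (MAX_IN_FLIGHT, handle_request, process) are illustrative and not tied to any specific framework:

```python
# Reject work beyond the capacity established through load testing,
# so the service degrades by refusing requests instead of collapsing.

import threading

MAX_IN_FLIGHT = 100                        # limit found through load testing
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle_request(request):
    # Fail fast instead of queueing work the service cannot absorb.
    if not _slots.acquire(blocking=False):
        return {"status": 503, "body": "service overloaded, try again later"}
    try:
        return {"status": 200, "body": process(request)}
    finally:
        _slots.release()

def process(request):
    # Placeholder for the real business logic.
    return f"processed {request}"
```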

Human error

Human error is also one of the main causes of downtime: someone deploys the wrong version of the software, deletes a file or data critical to the service’s operation, or simply hits the wrong button and destroys half the machines of a cluster without realizing they were working in the production environment instead of the pre-production one.

To mitigate this type of occurrence, it is important, above all, to minimize direct interactions between a user and the system. Here are some ways of achieving this:

  • automated deployment processes;
  • automated backups and data restoration, with validations ensuring that this process keeps working over time (see the sketch after this list);
  • segregated access and authorization permissions for making changes in the various service environments (DEV, QA, PROD, etc.).
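
As a purely hypothetical illustration of backup validation, the Python sketch below assumes the latest backup has already been restored into a scratch directory and compares file checksums against a manifest recorded at backup time, so that a broken backup is detected before it is actually needed. The manifest format and directory layout are assumptions for the example:

```python
# Verify a restored backup against the checksums recorded when it was taken.

import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def validate_restore(restored_dir: Path, manifest: dict[str, str]) -> bool:
    """manifest maps relative file names to the checksums recorded at backup time."""
    for name, expected in manifest.items():
        restored = restored_dir / name
        if not restored.exists() or sha256_of(restored) != expected:
            return False
    return True
```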

The overall availability is calculated from the availability of the several components involved:

Total Availability (≈99.78%) = Load Balancer Availability (99.9%) × Hardware Availability (99.99%) × Application Availability (99.99%) × Database Availability (99.9%)

Looking at the values above, it might seem that the availability of the system would be 99.9%, but the correct value is approximately 99.78%, which increases the yearly downtime from roughly 8h 46m to about 19h.
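
This calculation is easy to reproduce: the combined availability of components in series is the product of the individual availabilities, which can then be converted into yearly downtime. A short Python sketch:

```python
# Combined availability of components chained in series, converted into
# yearly downtime (using the same average year length as the table above).

YEAR_HOURS = 365.2425 * 24

def combined_availability(*components: float) -> float:
    total = 1.0
    for availability in components:
        total *= availability
    return total

total = combined_availability(0.999, 0.9999, 0.9999, 0.999)  # LB, HW, app, DB
downtime_hours = (1 - total) * YEAR_HOURS
print(f"{total:.4%} available, ~{downtime_hours:.1f}h of downtime per year")
# -> 99.7801% available, ~19.3h of downtime per year
```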

It’s easy to see that the more components there are in the equation, the more difficult it becomes to guarantee the stability of the system as a whole. One of the basic rules is that the overall availability of the system can never be greater than that of the component with the lowest availability.

Knowing this, to maximize uptime and ensure that SLAs are met, applications must be designed with the awareness that something can go wrong and must include in their architecture ways to mitigate these situations.

Ensure redundancy at all levels

This is the easiest way to increase system availability. Assuming a component has 99.0% availability, if we duplicate it and run the two instances in parallel we obtain 1 - (1 - 0.99)² = 99.99% availability. Unfortunately, the absolute gain shrinks with each additional replica, so higher levels of redundancy yield diminishing returns. However, with the correct architecture and by applying this redundancy at the various levels, it is possible to achieve very high availability values.
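
A short Python sketch of this formula, assuming the replicas fail independently (the system is only down when every replica is down at the same time):

```python
# Availability of N identical, independent replicas running in parallel.

def parallel_availability(single: float, replicas: int) -> float:
    return 1 - (1 - single) ** replicas

for n in (1, 2, 3, 4):
    print(n, f"{parallel_availability(0.99, n):.6%}")
# 1 -> 99.000000%   2 -> 99.990000%   3 -> 99.999900%   4 -> 99.999999%
```

Each extra replica removes 99% of the remaining unavailability, so the absolute gains shrink quickly while the operational cost keeps growing.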

Deploy self-recovery and self-healing mechanisms

  • At the application level: the application must have fault-tolerance capabilities to avoid cascading failures, using patterns such as retry, circuit breaking, and backpressure (a combined retry/circuit-breaker sketch follows this list). It must also be able to introspect its own state, understand whether it is functionally OK or not, and export that information to system supervision services;
  • At the system level: there should be monitoring tools that collect information from the various components that make up the system and react automatically to errors or to signs that the system is at risk, e.g. starting another instance of a service that cannot respond to all requests in a timely manner (load spike), or restarting a process that reported being in a non-recoverable failure state;
  • At the hardware level: run monitoring mechanisms that can detect failures or degradation in the hardware running the system and react appropriately, for example, in the cloud, starting new machines and migrating the services away from the instances that are reporting problems.
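
As an illustration of the retry and circuit-breaking patterns mentioned in the first bullet, here is a minimal, framework-agnostic Python sketch; the thresholds, backoff values, and the wrapped call are assumptions rather than a reference implementation:

```python
# Retry with exponential backoff, wrapped in a very small circuit breaker:
# after several consecutive failures the call is short-circuited for a
# cooling-off period instead of hammering a dependency that is already failing.

import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, retries: int = 3, backoff: float = 0.5):
        # Short-circuit while the breaker is open (cooling-off period).
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency considered down")
            self.failures = 0  # half-open: allow a trial call sequence
        # Retry with exponential backoff before counting a failure.
        for attempt in range(retries):
            try:
                result = fn(*args)
                self.failures = 0
                return result
            except Exception:
                if attempt == retries - 1:
                    break
                time.sleep(backoff * 2 ** attempt)
        self.failures += 1
        self.opened_at = time.monotonic()
        raise RuntimeError("dependency call failed after retries")
```

For example, wrapping a hypothetical fetch_user call as breaker.call(fetch_user, user_id) keeps a temporary outage of that dependency from cascading into the caller.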

Principles of chaos engineering

When the automatic recovery mechanisms included in the system design are not able to deal with a problem, the last line of defense is an operator who is under pressure and has limited time to react in the best way to recover the system. For this to succeed, it is important to have well-defined processes, monitoring tools, and information readily available to teams with the appropriate training and knowledge.

After each downtime situation has been resolved, it must be analyzed in detail, identifying the causes. Then, the system must be updated with capabilities to ensure that the same kind of problem does not recur or, at least, to be able to deal with the situation without downtime.

For every nine added to the availability in an SLA, the difficulty of guaranteeing it grows disproportionately, and it’s essential to have a team with the know-how to apply the concepts that make it possible. This is part of our expertise here at Present Technologies.
