
What You Need To Make Your App a High Availability System: Tackling The Technical Challenges

Whether it’s a web app, a mobile app, or any other kind of application, they all serve the needs of their clients. However, we can’t expect all of them to perform the same under dire circumstances, unless they have the durability of a High Availability system, that is.

Muhammad Rahmatullah 🇮🇩
Published in Life at Mekari
5 min read · Nov 30, 2021


After a few years’ worth of struggle to develop and maintain an application with my team, I found several technical challenges that we needed to tackle before we could make our application more durable. Once we get past these obstacles, it will be easier (hopefully 😅) to turn it into a High Availability system.

Here are the technical challenges you need to face:

Fault-tolerance

Ideally, our system should be able to continue working even when unexpected errors happen. Whether it’s a bug, a network error, or something even worse, our app needs to keep running.

One of the first steps toward that ideal is to isolate each process from the others. For example, in Erlang and Elixir we can use lightweight concurrent processes to isolate one unit of work from another. Because these processes share no memory, if one of them unfortunately crashes, it won’t affect the others.

Another thing we can do is implement a retry mechanism.
Here’s one example:

Suppose the system failed to send an email to a user and the process crashed. The system should automatically retry until the email is sent, but don’t forget to put a limit on the retries: if the system keeps attempting the process indefinitely, the server will experience a heavy load.
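A minimal sketch of such a capped retry with exponential backoff might look like this in Python; the `send_email` stub and its failure pattern are made up for illustration:

```python
import time

def retry(operation, max_attempts=5, base_delay=1.0):
    """Run `operation`, retrying on failure with exponential backoff.

    Capping the attempts keeps a permanently failing task from
    hammering the server forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Wait longer after each failure: base, 2x base, 4x base, ...
            time.sleep(base_delay * 2 ** (attempt - 1))

# Example: an email sender that fails twice before succeeding.
calls = {"count": 0}

def send_email():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("SMTP server unavailable")
    return "sent"

result = retry(send_email, max_attempts=5, base_delay=0.01)
print(result, calls["count"])  # sent 3
```

The backoff spreads retries out over time, so a struggling downstream service gets room to recover instead of being hit in a tight loop.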

Scalability

The next technical challenge we need to face is scalability: the system should be able to handle any possible load. Having a super server that can take 1 billion requests/second would be nice if someday our servers are hit with 1 billion requests, but provisioning for that up front is unnecessary.

Instead, we should focus on enabling our system to be upgraded as time goes by without interrupting its operations: the system needs to be upgradable without stopping (whether from restarting, getting updates, etc.).

We can use autoscaling from our service provider to handle this issue. For example, we can use the provider’s autoscaling service to automatically scale up when the system load increases and scale down when it decreases.
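The decision rule inside such an autoscaler is usually proportional: scale the fleet so that the average load per instance lands near a target. Here is a toy Python sketch of that rule; the 60% CPU target and the instance bounds are assumptions for the example, not values from any specific provider:

```python
import math

def desired_instances(current, cpu_percent, target=60, min_n=2, max_n=10):
    """Toy target-tracking rule: keep average CPU near `target` percent.

    desired = current * (observed load / target load), rounded up
    and clamped to the allowed fleet size.
    """
    desired = math.ceil(current * cpu_percent / target)
    return max(min_n, min(max_n, desired))

print(desired_instances(4, 90))  # 6  -> scale up under heavy load
print(desired_instances(4, 30))  # 2  -> scale down when load drops
print(desired_instances(4, 60))  # 4  -> already at target, no change
```

Rounding up and clamping are deliberate: rounding up errs on the side of capacity, while the bounds stop a metrics glitch from scaling the fleet to zero or to something unaffordable.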

The scaling itself is not limited to vertical scaling; we can use horizontal scaling too.

Distribution

The system should never stop unless its servers suddenly disappear. To keep it from ever stopping, we need multiple machines: if one machine is taken down, another machine should be able to take over.

This means the system should be able to scale horizontally and vertically automatically, based on each machine’s needs.

We can use the autoscaling service to handle this, as I mentioned before.
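The failover part, skipping a dead machine and routing to a live one, can be sketched in a few lines of Python. The machine names and the health check here are invented; in production the check would be an HTTP ping or heartbeat:

```python
def pick_server(servers, is_healthy):
    """Return the first healthy server, so a downed machine is skipped.

    `servers` is an ordered list of replicas; `is_healthy` is whatever
    health check the infrastructure provides.
    """
    for server in servers:
        if is_healthy(server):
            return server
    raise RuntimeError("no healthy servers available")

# Simulate machine-1 being down; traffic fails over to machine-2.
down = {"machine-1"}
healthy = lambda s: s not in down
print(pick_server(["machine-1", "machine-2", "machine-3"], healthy))  # machine-2
```

The key property is that the caller never has to know which machine died; it just keeps getting a working one as long as at least one replica is up.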

Responsiveness

The system should always handle requests reasonably fast. Request handling should not take too much time even if the load increases or an error occurs in the system.

To divide the load when it increases, we can put a load balancer in front of our system, and to handle repeated queries we can use a cache to lighten the performance load.
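Both ideas fit in a short Python sketch: round-robin distribution across backends, and memoization of repeated queries. The backend names and the `query` stub are invented for illustration; a real cache would also need an expiry policy:

```python
import itertools
from functools import lru_cache

# Round-robin load balancing: spread requests evenly across backends.
backends = ["app-1", "app-2", "app-3"]
next_backend = itertools.cycle(backends)

def route(request_id):
    return next(next_backend), request_id

# Caching: a repeated query is served from memory, not recomputed.
@lru_cache(maxsize=1024)
def query(sql):
    # Pretend this is an expensive database call.
    return f"rows for {sql!r}"

print([route(i)[0] for i in range(4)])  # ['app-1', 'app-2', 'app-3', 'app-1']
query("SELECT * FROM users")            # first call: computed (cache miss)
query("SELECT * FROM users")            # second call: served from the cache
print(query.cache_info().hits)          # 1
```

Together they attack responsiveness from both sides: the balancer keeps any single machine from saturating, and the cache removes work entirely for hot queries.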

Here is an example of a simple architecture that I use for my side project:

Live update

The system should be able to release a newer version without restarting any machine. For example, in a streaming service, we don’t want to disconnect all our users just to deploy a newer version of our system.

To handle this issue, we can use blue-green deployment, which reduces the downtime and risk of deploying a newer version.

At any time, only one of the environments is live, with the live environment serving all production traffic. For this example, Blue is currently live and Green is idle.

As you prepare a new version of your software, deployment and the final stage of testing takes place in an environment that is not live: in this example, Green. Once you have deployed and fully tested the software in Green, you switch the router so all incoming requests now go to Green instead of Blue. Green is now live, and Blue is idle.

https://docs.cloudfoundry.org/devguide/deploy-apps/blue-green.html
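The switch described above can be modeled as a tiny Python router; the version strings and class name are made up for the sketch:

```python
class BlueGreenRouter:
    """Toy blue-green switch: two environments, one live at a time.

    Deploy and test the idle environment, then flip `live` in one step,
    so users never talk to an environment mid-deployment.
    """
    def __init__(self):
        self.environments = {"blue": "v1.0", "green": None}
        self.live = "blue"

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version):
        # The new version goes to the idle side; live traffic is untouched.
        self.environments[self.idle] = version

    def switch(self):
        # Instant cutover: new requests now hit the freshly deployed side.
        self.live = self.idle

    def handle_request(self):
        return self.environments[self.live]

router = BlueGreenRouter()
print(router.handle_request())  # v1.0, served from blue
router.deploy("v1.1")           # green gets v1.1 while blue stays live
router.switch()                 # green is now live, blue is idle
print(router.handle_request())  # v1.1
```

A nice side effect: if v1.1 misbehaves, calling `switch()` again rolls back to blue instantly, since the old version is still deployed there.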

Summary:

  1. Fault-tolerance — whatever error hits the system, we want to keep its impact minimal and as localized as possible while the system keeps running and providing service.
  2. Scalability — the system should be able to automatically scale its resources to handle increasing or decreasing load without restarting.
  3. Distribution — the system should keep running if one machine goes down by letting other machines take over its work. Running on multiple machines makes the system resilient.
  4. Responsiveness — the system should handle any type of request reasonably fast, and lengthy requests should not block the rest of the system or degrade its performance.
  5. Live Update — the system should be able to update to a newer version with little to no downtime and risk.

Your app will truly become a High Availability system once you know how to tackle these challenges: being able to consistently provide service to users, rain or shine, isn’t easy. So, for those of you who have tackled these challenges to support your system, here’s a gift for you:


Muhammad Rahmatullah 🇮🇩
An expert full-stack software engineer with years of experience, adept at developing and scaling software solutions. https://www.linkedin.com/in/rahmatullah5/