Welcome to my ultimate guide on how you can achieve resilient software architecture, and in the process, make your application everything-proof. Or at least, minimize the risks that make your solution unavailable to the users. But why should you be interested in the first place?
Intro about (not so) resilient systems
Let’s consider a hypothetical scenario for a moment.
You’re a C-level executive or a manager. You’ve probably educated yourself and read some tech articles outlining the undeniable benefits of a microservices-oriented approach.
💡 BTW, we have TONS of content about microservices software architecture too. Everything tried and tested, so I highly recommend the work of my colleagues there!
You trusted the hype and implemented microservices in your business, and you’ve probably been happy with the results so far. You’ve successfully managed to save money on the resources you didn’t use. Your developers find the new tools pleasant to work with. You feel like you’ve made the right choice and the world has become a better place.
So why would I ruin this peaceful picture by mentioning failure? <record scratch> Please, bear with me.
The cold, hard truth is that everything is going to fail, be it sooner or later. It might not be a bad thing, though.
Sometimes, you just want things to fail fast to immediately detect and fix faults and defects.
On the other hand, you might want things to fail as late as possible (if at all) – just like this one dedicated gamer, who has kept his gaming system on for 20 years in order not to lose his saved data. Or these sysadmins, who managed to move a bare-metal server with over six years of uptime from one data centre to another without turning it off.
Somewhere in-between these two cases there’s resilient software architecture.
A resilient system is always on the verge of failure (while experiencing pressure at scale), but it ideally should keep working flawlessly for as long as possible, even if some of its components did end up failing.
Do I really need this service level? My architecture has been working fine so far
That’s great! But it doesn’t mean you shouldn’t be prepared for some easy-to-predict failures.
What if you decided to host said service using a cloud provider that doesn’t offer multiple availability zones or regions by default (think of regions as geographical locations, to simplify things), so your entire service is hosted in a major city in a European country? Your users located in North America or South-East Asia won’t experience top-notch performance or blazingly fast access to your service, but you’re willing to accept that.
But one day, you suddenly receive critical news. A fire breaks out in the data center that hosts your service and your service goes down. But wait, it gets worse! The backups of your service were stored in the same place, so you also lose something that would have helped you keep your operations up and running.
Had you deployed your service to multiple, distributed locations (and/or cloud service providers), this wouldn’t have happened. Sure enough, some of your users would have a hard time accessing the service, but at least it wouldn’t be completely down.
Well, at least you have defined your infrastructure as code, so it can be brought back up in mere minutes, right? RIGHT?
In the IT world, everything’s always on fire – sometimes quite literally.
I don’t like not being able to recover any data. How do I make my services more resilient?
Thankfully, the microservices-oriented approach to serving applications provides an answer. It offers yet another benefit that I believe doesn’t get enough credit – the possibility of building resilient systems.
A resilient system can withstand failures of any kind:
- problems generated by third-party cloud providers,
- human errors,
- unpredicted traffic,
- high load resulting from the service gaining sudden traction, etc.
…while also being highly available to the end-users, without any drop in the quality of service, which users would notice immediately.
Designing your architecture in a resilient way might seem intimidating and costly at first. Some benefits are immediately noticeable, while the less visible ones build a safety net that will make you way less worried in the long run.
Moreover, the possibility of recovering your system in the event of a catastrophic failure will prevent you from losing more potential revenue. Simply put, it will pay off should the worst happen.
What’s that “safety net” you mentioned?
The safety net ensures that by implementing the resilient architecture model, you never end up losing everything. A replica of your infrastructure can always be restored from another one that’s still operational.
You’re essentially safe from having a single point of failure.
Moreover, the whole infrastructure could potentially be recreated from scratch in mere minutes, if it was found to be faulty – all thanks to defining it as code. Thank you, recovery!
You can also make sure your infrastructure will handle varying kinds of load, from low traffic at night to high load resulting from a successful marketing campaign.
How can the improved resilience benefit the users of large scale systems?
One of the key takeaways of implementing resilient software architecture is the so-called high availability (abbreviated: “HA”).
High availability means that your users are always able to access your service, even if there’s some smoke coming from under the hood that they won’t (and shouldn’t) ever be able to see.
Let’s look at some examples:
- If you run a static blog, high availability could be achieved by caching the content of your blog on some CDN, so that the visitors would still be able to read the articles even in the event of a critical infrastructure failure.
- If you run an e-commerce service operating on several markets in several countries, making copies (replicas) of your infrastructure all around the world would be wise from both the performance standpoint (shorter service access times) and the high availability standpoint (i.e. if a replica failed, the rest would still be available).
Sometimes, only a part of the infrastructure fails. If there’s something wrong with the database system, the user should still be able to access the service and submit their requests, which would be stored until the normal operation of the database system got restored. This way, the user wouldn’t even have to know there were any failures in the system they’ve just used. They are just content that their request has been fulfilled.
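The store-and-forward idea above can be sketched in a few lines of Python. This is a minimal toy, not a production pattern; the `WriteBuffer` and `FakeDB` classes are hypothetical stand-ins for a real request handler and database client:

```python
import queue

class WriteBuffer:
    """Buffers user requests while the database is unavailable,
    then flushes them once it comes back up (hypothetical sketch)."""
    def __init__(self, db):
        self.db = db
        self.pending = queue.Queue()

    def submit(self, request):
        # Always accept the request; persist immediately if we can.
        if self.db.is_up():
            self.db.save(request)
        else:
            self.pending.put(request)  # user still gets a success response

    def flush(self):
        # Called once the database reports healthy again.
        while not self.pending.empty():
            self.db.save(self.pending.get())

class FakeDB:
    """Fake database client used only to demonstrate the flow."""
    def __init__(self):
        self.up = False
        self.rows = []
    def is_up(self):
        return self.up
    def save(self, r):
        self.rows.append(r)

db = FakeDB()
buf = WriteBuffer(db)
buf.submit("order #1")   # DB is down: request is buffered, user unaffected
db.up = True
buf.flush()              # DB recovered: buffered writes finally land
print(db.rows)           # ['order #1']
```

In a real system the buffer would be a durable message queue rather than in-process memory, so the pending requests survive a crash of the service itself.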
What if the failures were caused by some bugs and regressions introduced by your developers? There’s an answer to that too! It’s possible to always deploy the newest version of the application to only a small portion of the user base. If the software is too buggy, the number of affected users won’t be high and the bugs can be fixed before launching the application for a wider audience. Bonus: you can even get metrics about whether the users liked new features or not!
Okay, I’m convinced. Is there anything I could take off the shelf to benefit my app?
Thankfully enough, there are some tried and true patterns you can implement so your infrastructure becomes resilient. There’s no need to reinvent the wheel – the tech industry giants have paved the way for us.
Some descriptions of these patterns will probably sound familiar because they’ve appeared in the aforementioned example cases without being explicitly named. 🙊
Redundancy is probably the most basic and straightforward pattern used in building resilient architecture.
Redundancy stands for making multiple copies (replicas) of the system’s components so that higher availability can be achieved.
It’s important enough that AWS does it by default for services like AWS S3 Standard Bucket, even without the user’s knowledge. That’s how they can offer 99.99% (four-nines) availability on their S3 service. For other components, they advise having at least three replicas of the infrastructure, since it raises the overall system availability to 99.9999% (which is the highly coveted six-nines availability).
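The arithmetic behind that claim is simple, back-of-the-envelope math assuming replicas fail independently: the system is down only when all replicas are down at once.

```python
def system_availability(a: float, n: int) -> float:
    """Combined availability of n independent replicas,
    each individually available a fraction `a` of the time."""
    return 1 - (1 - a) ** n

# Three replicas that are each "only" 99% available:
print(round(system_availability(0.99, 3) * 100, 4))  # 99.9999
```

Real replicas are rarely fully independent (shared networks, shared deploy pipelines), so treat this as an upper bound rather than a guarantee.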
Autoscaling and load balancing
Autoscaling and load balancing both come in handy when you expect your traffic to be variable (to be fair, it’s always the case). Do you see increased traffic when people visit your service at work and then decreased traffic when they get home (and at night)? Or maybe you run a food delivery service, which people use heavily during the weekend and way less throughout the week? Or what if some celebrity was spotted using your service, granting you thousands of new users? Your joy will be quickly killed, just like the app.
Autoscaling enables you to automatically run just enough instances of your service’s component(s) to handle all the incoming demand.
Of course, there also has to be something that redirects your users away from the instances that are already serving enough users – that’s the job of the load balancer.
A load balancer checks which instances of the service’s component are able to handle a user’s request at a given time and redirects the user to one of them.
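One common balancing strategy is “least connections”: send each request to the instance currently serving the fewest users. A minimal sketch, with hypothetical instance names:

```python
class LeastConnectionsBalancer:
    """Toy load balancer: route each request to the instance
    with the fewest active connections."""
    def __init__(self, instances):
        # instance name -> number of requests currently in flight
        self.active = {name: 0 for name in instances}

    def route(self, request):
        target = min(self.active, key=self.active.get)
        self.active[target] += 1
        return target

    def done(self, instance):
        # Called when an instance finishes serving a request.
        self.active[instance] -= 1

lb = LeastConnectionsBalancer(["web-1", "web-2"])
print(lb.route("req-a"))  # web-1
print(lb.route("req-b"))  # web-2  (web-1 is busier now)
```

Real load balancers layer health checks, weights and connection draining on top of this, but the core routing decision is exactly this small.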
Golden images and containerization
How to create new instances of a component when the time’s running short? Some system components can be quite heavy on resources. It’s no good if each of the additional instances takes 10 minutes to boot and start handling users. The “classic” way to handle this problem is by using the so-called “golden images”.
Golden images are special images containing the application runtime and the application itself that require little to no provisioning when booting.
They are, however, slowly becoming a thing of the past now that containerization is gaining mainstream popularity. Containerization offers an answer to the same question, but a better one.
Containers use special software called a container runtime (usually Docker) to run containers from lightweight images. As opposed to golden images, containers don’t ship a full operating system underneath – they share the host’s kernel – which makes them boot faster and consume fewer resources.
Infrastructure as Code (IaC)
Infrastructure as Code is a way to define infrastructure so any changes are traceable and reversible. This means that anyone wanting to see what the infrastructure looks like could just read its code and see how it evolved through time (while also seeing who’s responsible for parts of it, so no sneaky changes please). Moreover, manual configuration (e.g. using a web-based panel of a cloud service provider) becomes obsolete, so no more time has to be wasted on something so repetitive and prone to human error.
Back in the day, physical servers were roughly the only way to host services and applications. Any kind of downtime was costly and usually required someone to physically check the servers, troubleshoot the issues and get them to run again.
That’s why the IT industry put its focus on mutable infrastructure – the longer a physical server went without having to restart it, the better. Any changes to the application, the underlying OS and such were implemented without even rebooting the server.
Unfortunately, something like this usually meant that every server was a bit unique and had its own quirks due to manual (often undocumented) modifications – a phenomenon known as configuration drift.
Fortunately, the era of virtualization came around, bringing change with itself. It was now possible to use golden images and provision them with new changes as needed. This meant that a server failing wasn’t a disaster anymore – one of these could now be recreated in a day at most.
Immutable infrastructure means that any instances created aren’t supposed to be modified.
After an instance (be it a VM or a Docker container) is run from an image (and provisioned, if need be), its configuration shouldn’t ever be changed. Instances aren’t ever being modified – they’re being replaced with newer versions before being decommissioned.
The existence of immutable infrastructure made it possible to introduce several new concepts into the vast world of cloud computing, some of which will be explained below.
Blue-green deployments are a natural extension of the immutable infrastructure paradigm. Since immutable infrastructure means you never change any configuration on the already running instances, how do you implement any changes?
You deploy the instances with the new version of the application alongside the instances with the old version and after making sure the new version works as intended, redirect the traffic from the old (blue) instances to the new (green) instances. That’s exactly what blue-green deployments are about.
You never have any downtime, and in case the new version of the application is found to be faulty, you can perform a rollback to the one that’s served you well so far.
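The mechanics reduce to flipping a pointer. Here’s a minimal sketch of that idea, with all class, colour and version names purely illustrative:

```python
class Router:
    """Blue-green deployment toy: traffic points at one colour
    at a time; switching and rolling back are just a pointer flip."""
    def __init__(self):
        self.environments = {"blue": "app-v1", "green": None}
        self.live = "blue"

    def deploy_green(self, version):
        # New version runs alongside blue, receiving no traffic yet.
        self.environments["green"] = version

    def switch(self):
        self.live = "green" if self.live == "blue" else "blue"

    def serve(self):
        return self.environments[self.live]

router = Router()
router.deploy_green("app-v2")  # new version is up, not yet serving traffic
router.switch()
print(router.serve())          # app-v2
router.switch()                # v2 turned out faulty: instant rollback
print(router.serve())          # app-v1
```

In practice the “pointer” is a load balancer target group or a DNS record, but the operation is just as atomic and just as easy to reverse.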
Canary releases allow you to deploy the new version of the application (often with new, experimental functionality) to a limited group of users. You’ve probably already been a subject of these many times, consciously or not. 😉
The COVID-19 pandemic alone brought quite a few of these changes. For example, Discord’s group video call update was initially enabled for only 5% of the communities, selected randomly. They collected metrics about the performance of the new feature, measured the overall user satisfaction and ultimately decided to enable it for everyone. Just a year later, it’s really hard to believe that it wasn’t a thing until recently, seeing that our team here at TSH gets to meet on Discord every day.
At times, the new feature is released to a specific market for testing. For example, Sony rolled out its VOD program in a single country to test it out – they chose Poland, so our developers could watch some videos after work.
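A common way to pick that limited group of users is deterministic hash bucketing: hash the user’s id into one of 100 buckets and compare with the rollout percentage, so the same user always lands on the same side. A sketch, with the function name and user-id scheme being my own invention:

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the canary group."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket in 0..99
    return bucket < percent

# Roughly 5% of users see the new feature, and each user sees
# the same version on every request (no flickering between them):
users = [f"user-{i}" for i in range(1000)]
canary = [u for u in users if in_canary(u, 5)]
print(len(canary))  # a stable slice of roughly 5% of the user base
```

Because the assignment depends only on the id and the percentage, dialling the rollout from 5% to 20% keeps the original 5% in the canary group and only adds new users to it.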
All major cloud providers strive to offer their clients as many managed services as they possibly can. These can range from basic things like VMs, managed database services, DNS and such to more complex and possibly niche applications like machine learning, IoT or robotics.
Why use these? The answer is simple – they’re usually fairly convenient. They can also save a good amount of money if properly configured.
If we’re to consider their usefulness in terms of resilient software architecture, one thing immediately comes to mind – the provider should always offer an SLA (service level agreement) on their services. It’s the provider’s responsibility to manage the service, make it highly available and keep your data secure from any possible breaches and intrusions.
Let’s take a managed DB as an example since it’s one of the most popular choices. When using a managed DB, you no longer have to update the underlying DB software yourself. You don’t have to set up any replication yourself either. You don’t have to worry about doing the backups yourself – you just create a backup schedule instead and that’s it. The goal is to have some software that just works.
Are managed services the answer to every problem? Not quite.
There are some use cases where using them might prove to be too expensive, especially when misconfigured. Let me give you some real-life examples:
- The goal was to implement the logging of an app running on several VMs located in a private cloud. For simplicity, AWS CloudWatch was chosen as the right tool for this job. At first, everything worked perfectly – the logs were pouring in from the private cloud to AWS and it was possible to analyze them for errors. Then, the developers changed the application’s logging level to “debug”. Since AWS charges for the ingestion of logs coming from outside AWS, this change proved to be rather costly. If a non-managed solution (e.g. the ELK stack) had been implemented from the start, the bill would have been much lower.
- Serverless can be considered a “pinnacle” of managed services since it allows you to run code and fulfil some business requirements without having to provision any infrastructure for it. Then again, if you write some erroneous code (e.g. resulting in an infinite loop), as this developer did, the bill will be pretty big. This was a smaller hobbyist project, but something of this sort could well happen in production.
Moreover, serverless isn’t a good fit for every use case in general, in terms of both performance and cost. You can check it yourself with a cost calculator.
That being said, I believe the benefits usually outweigh the disadvantages. Just make sure they fit your case and when in doubt, hire some experts.
Monitoring, metrics, distributed logging and tracing
When running a service spanning hundreds of instances, it would be near impossible to track its performance and availability without employing a powerful monitoring system. Since system monitoring is one of the most straightforward things to think of while deploying any kind of service anywhere, there are multiple approaches to this subject.
There is a plethora of managed monitoring services available. While you, of course, have to pay for them, you also make sure that the monitoring service isn’t in the same place as the application. If the infrastructure goes down along with the system that’s supposed to monitor it, that system won’t be very useful.
On the other hand, there are lots of open source systems you can host yourself that, if implemented properly, might cut some costs. There are options available – the hard part is choosing the right tool for the job, as usual.
When you have hundreds of instances working at the same time… wait. How do you even tell they’re working at all?
Healthchecks are an answer to this problem. The developers have to implement a simple endpoint in the application (or a microservice) they create. This endpoint can then periodically be queried by something that manages all the microservices (i.e. the orchestration software, or one of its components – a load balancer, a service registry or similar).
When there’s no response, an instance is considered unhealthy and automatically replaced with a healthy one.
Healthchecks can also be implemented as heartbeats, which basically is the same thing, but the other way round – it’s the instances reporting back to the orchestration software.
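The check-and-replace loop can be sketched in miniature. Here the HTTP probes are stood in by plain Python callables, and all instance names are hypothetical:

```python
class Orchestrator:
    """Toy orchestrator: poll each instance's health probe and
    replace the ones that stop answering."""
    def __init__(self, instances):
        # instance name -> callable standing in for GET /health
        self.instances = dict(instances)

    def check_and_heal(self):
        replaced = []
        for name, probe in list(self.instances.items()):
            try:
                healthy = (probe() == "ok")
            except Exception:
                healthy = False   # timeout, connection refused, etc.
            if not healthy:
                # Spin up a replacement, retire the unhealthy instance.
                self.instances[name + "-replacement"] = lambda: "ok"
                del self.instances[name]
                replaced.append(name)
        return replaced

def dead_probe():
    raise TimeoutError("no response")   # this instance never answers

orc = Orchestrator({"api-1": lambda: "ok", "api-2": dead_probe})
print(orc.check_and_heal())    # ['api-2']
print(sorted(orc.instances))   # ['api-1', 'api-2-replacement']
```

Real orchestrators (Kubernetes, ECS and friends) add grace periods and retry thresholds so a single slow response doesn’t immediately kill an instance.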
Caching has been around for a long time in many different contexts, from hard disks and processors to backend applications.
Simply put, caching is the process of saving the most frequently requested data in a place that can be accessed the fastest.
For example, if you discover that the most visited page on your blog is the one you keep the pictures of your cat on, why not cache it? That way, the software serving your blog doesn’t even have to get any load coming from the cat enthusiasts.
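That cat-page scenario boils down to a time-to-live (TTL) cache in front of the page renderer. A minimal sketch, with the class and page names invented for illustration:

```python
import time

class TTLCache:
    """Tiny cache: serve hot pages from memory for `ttl` seconds,
    so the backend never sees that traffic."""
    def __init__(self, ttl=60):
        self.ttl = ttl
        self.store = {}  # key -> (value, stored_at)

    def get(self, key, compute):
        entry = self.store.get(key)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]                    # cache hit: backend untouched
        value = compute()                      # cache miss: hit the backend once
        self.store[key] = (value, time.time())
        return value

hits_to_backend = 0
def render_cat_page():
    global hits_to_backend
    hits_to_backend += 1
    return "<html>cat pictures</html>"

cache = TTLCache(ttl=60)
for _ in range(1000):                  # a thousand cat enthusiasts
    cache.get("/cats", render_cat_page)
print(hits_to_backend)                 # 1
```

A CDN does the same thing at global scale, which is also why a cached static blog can survive its origin server going down entirely.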
The story starts with Netflix. A service operating in 190 countries, generating so much network traffic that it had to be throttled when everyone was locked in at home due to the COVID-19 pandemic, must have rather incredible infrastructure. It’s hard to imagine the load it must constantly be under. They have worked hard to make it possible for everyone to watch their favourite show, while the whole “machine” is two steps away from disaster.
One of the things they came up with was chaos engineering, which basically means tampering with the infrastructure on purpose to see if it can survive if something goes horribly wrong.
One of the components of Netflix’s Chaos Monkey kit, the Latency Monkey, deliberately slows down some packets flowing through the system in order to simulate delays, network outages and connectivity issues. Another component, the Chaos Kong, drops a full AWS region.
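In the spirit of Latency Monkey (though much simplified, and with every name here being my own), latency injection can be sketched as a wrapper that randomly delays a fraction of calls, letting you verify that your timeouts and retries actually work:

```python
import random
import time

def with_chaos_latency(func, probability=0.1, max_delay=0.05):
    """Wrap a call so that some fraction of requests get an
    artificial delay before executing (chaos-engineering sketch)."""
    def wrapped(*args, **kwargs):
        if random.random() < probability:
            time.sleep(random.uniform(0, max_delay))  # injected latency
        return func(*args, **kwargs)
    return wrapped

# Every call to `fetch` now has a chance of being slowed down:
fetch = with_chaos_latency(lambda: "payload", probability=1.0)
print(fetch())  # still returns the payload, just slower sometimes
```

The point of running something like this in a staging (or, if you’re brave like Netflix, production) environment is that the degraded path gets exercised constantly instead of only during a real outage.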
The concept isn’t new, seeing that Netflix released the source code of the Chaos Monkey kit back in 2012, but it’s not widely used yet. This may, however, change, since AWS finally released their own AWS Fault Injection Simulator for everyone as recently as March 2021.
A microservices-based infrastructure consists of many services, each of which fulfils a specific business function. The microservices might not even use the same programming languages and runtimes, which is a good thing, since you can then use the best tool for a given job (e.g. Python for ML applications and such).
On the other hand, it introduces some difficulty – how do you track the traffic between these seemingly incompatible services? How do you measure the response times of specific components in such a setting?
That’s where the service mesh comes in. Its goal is to add observability to the infrastructure.
It makes it easier to observe performance issues, optimize the routes between the components and even reroute some requests so that they don’t hit the components that have failed. As a positive side effect, it also increases the security of the system, since the requests routed through a service mesh can be encrypted.
Deployment of good resiliency practices starts with the people
Just like DevOps isn’t a role (DevOps Engineers are people, people!), resilient software architecture doesn’t apply to just the software, nor just the infrastructure.
It’s a complex process that starts with a specific mindset in a company, so that the developers feel responsible for their code, have an idea of how it’ll impact the system as a whole, and can be upskilled by other developers or DevOps Engineers if needed. It ends with the actual technical implementation.
What tools and solutions can I use to achieve and maintain architecture resilience?
The good news is: there are options.
A vast majority of the patterns mentioned above have lots of competing tools and standards you can use to achieve the desired outcome. I’ll cover some of them below.
Golden images and containerization
If you REALLY need to make a golden image, try Hashicorp Packer. It has builders for every major cloud provider.
Then again, I’d rather recommend the microservices-oriented paradigm, and that’s where Docker shines through. There’s some confusion as to what Docker is (because it’s a company, a container runtime, the container images and so on), so if you’re confused (and you have every right to be so!), this blog post might make matters more clear.
Infrastructure as Code (IaC)
The name itself is a bit confusing since it covers both the software used to automatically create resources on a cloud provider (e.g. AWS) and the software used to provision some already existing resources (e.g. VMs on AWS EC2). The former kind of software does configuration orchestration, while the latter does configuration management.
There aren’t many choices when it comes to configuration orchestration. Hashicorp has, again, created an amazing tool called Terraform and so far, hardly anyone has tried to make a competing one – save for AWS, as one could expect. 😉 AWS CloudFormation isn’t cloud-agnostic, though, which might cause problems if you were to migrate from AWS to some other cloud provider.
There are many choices when it comes to configuration management. Ansible seems to be the most common choice these days, although there are at least five more.
Monitoring, metrics, distributed logging
Since the subject’s so important, you can use lots of both self-hosted and managed services. If you don’t want to host logging software yourself, AWS CloudWatch, Papertrail, Splunk or New Relic will help you gain more insight into how your system operates.
If, on the other hand, you’re fine with implementing things yourself, you might have to use more “building blocks” – Prometheus for monitoring and the ELK stack (composed of Elasticsearch, Logstash and Kibana) for logging.
Istio is the service mesh software that’s gaining the most traction, although there are at least eight other tools you might find to be more tailored to your needs.
And, just like always, TSH invites you to have a look at our handy technology radar 😀 Only tried and tested technologies there!