Chaos Engineering has intrigued me for quite some time. My first encounter with the concept was at an SRE meetup hosted by the folks at Atlassian (back when the pandemic wasn’t a thing). While my current role doesn’t require me to work on these aspects of development, tasks that aren’t our responsibility are more interesting, aren’t they?
As I might not be able to paraphrase it better, here’s a definition from principlesofchaos.org:
“Chaos Engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.”
With rapid changes in the way modern apps are architected and delivered, it is essential to ensure a system’s resiliency. Resiliency can simply be thought of as the ability of a system to stay afloat when a fault occurs. However, staying afloat can mean different things depending on the context. IT downtime can be a nightmare for most modern businesses. Unfortunately, with most distributed systems having a lot of “moving parts”, there can be multiple potential points of failure. Besides, the behavior of the deployment environment can be quite unpredictable. It can be only a matter of time until a domino effect kicks in.
With all these uncertainties around, chaos engineering can help us determine a system’s ability to tolerate inevitable failures and formulate realistic SLAs. As you might already be wondering, we would have to test every scenario and ensure that every layer in the tech stack can handle errors. From the infrastructure level all the way up to the application level, we would have to simulate failures and implement solutions to contain the “blast radius”.
When I was given a demo of how these experiments are conducted, it was fascinating to think about how we need to be concerned about things that we usually do not consider to be in our scope. As an example, you are expected to experimentally determine what would happen to your app if the data center you use has a network outage or a physical disk failure. At the code level, the question might be: what happens if one of the services you depend on intermittently responds with unacceptable latency? I hope this conveys the point. As a developer, I usually have the “works on my system” mentality. However, a successful team needs to ensure that their product can function in the real world.
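To make that latency question concrete, here is a minimal sketch of injecting delay into a call. The `fetch_profile` function is a hypothetical stand-in for any downstream service; the wrapper adds a random delay with some probability, which is the essence of what proxy-based tools do at the network layer.

```python
import random
import time

def fetch_profile(user_id):
    # Hypothetical downstream call; stands in for any dependent service.
    return {"id": user_id, "name": "demo"}

def with_injected_latency(call, p_slow=0.3, delay_s=2.0):
    """Wrap a service call so that, with probability p_slow, it responds
    only after an added delay — simulating an intermittently slow
    dependency. The numbers here are illustrative, not recommendations."""
    def wrapped(*args, **kwargs):
        if random.random() < p_slow:
            time.sleep(delay_s)
        return call(*args, **kwargs)
    return wrapped

# Force the slow path (p_slow=1.0) so the effect is observable.
slow_fetch = with_injected_latency(fetch_profile, p_slow=1.0, delay_s=0.1)
start = time.monotonic()
result = slow_fetch(42)
elapsed = time.monotonic() - start
```

The interesting part of the experiment is not the wrapper itself but what your *calling* code does once `elapsed` exceeds its timeout budget: does it retry, fall back, or hang?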
On this note, I’d like to emphasize the importance of running these tests in production or a production-like environment. Traditional QA tests verify the correctness of our code and its alignment with the requirements. Testing the behavior of the system under conditions such as extreme load and infrastructure failure is an afterthought at best. Furthermore, testing and staging environments can’t be made to fully mimic a production setup. Organic traffic and usage patterns can be modeled accurately only to a limited extent.
Moving to the “discipline” part of the definition, there are guidelines on how to practice chaos engineering systematically. It starts with defining the steady state of the system. Then, the experiments and “game day” strategies are planned. Next, we move on to varying real-world conditions and injecting faults. As with any agile workflow, we need to be proactive in triaging the identified weaknesses and fixing them. Open-source tools such as Chaos Monkey, Litmus, and ToxiProxy can be integrated into your DevOps cycle to automate experiments. Commercial offerings such as the one from Gremlin can provide an integrated platform and centralized visibility.
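The steady-state idea above can be sketched in a few lines. This is an illustrative toy, not the API of any of the tools mentioned: the steady-state hypothesis here is simply “the error rate stays below a threshold”, and `flaky_call` is an assumed dependency that we pretend fails a small fraction of the time.

```python
import random

def steady_state_ok(samples, max_error_rate=0.05):
    """Steady-state hypothesis (assumed for this sketch): the fraction
    of failed requests stays at or below max_error_rate."""
    failures = sum(1 for ok in samples if not ok)
    return failures / len(samples) <= max_error_rate

def run_experiment(call, n=200):
    """Fire n requests against the system and record success/failure."""
    results = []
    for _ in range(n):
        try:
            call()
            results.append(True)
        except Exception:
            results.append(False)
    return results

def flaky_call():
    # Hypothetical dependency with a 2% injected failure rate.
    if random.random() < 0.02:
        raise RuntimeError("injected fault")

baseline = run_experiment(flaky_call)
verdict = steady_state_ok(baseline)
```

In a real experiment, you would measure the steady state before injecting the fault, inject it in a limited blast radius, and then check whether the hypothesis still holds — a failed check is a weakness to triage, not a test to silence.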
Companies across a range of industries — from health care and finance to logistics, energy, and telecommunications — need dependable and reliable software systems, and Chaos Engineering can help you get a step closer to delivering them.