“Chaos engineering” is the practice of planning and running failure tests on live production environments to verify high availability and to check that the recovery time objective (RTO) and recovery point objective (RPO) are met. The idea may sound absurd at first and may raise eyebrows in your organization. But when implemented right, with proper planning, it can keep the organization prepared for the failure of any critical component or module. “Chaos Engineering” is itself carefully planned and detailed. It involves a thorough analysis of the organization's whole environment, an understanding of the current RTO and RPO, and planning and executing tests so that the systems are tested against a wide variety of failures. The objective of planning is to cover as many failure scenarios as possible.

An Example

Let's try to understand it with an example. Imagine we have a critical 3-tier web application hosted with high availability. Measures have been taken to make sure that in a crisis the application can be restored to a working state within the RTO. There are multiple instances hosting the databases and the application, with load balancing and automatic failover. On paper the numbers and the setup look good, but in reality we don't know what kind of problem will come. We might have prepared ourselves to deal with an instance failure or database corruption, but might not have thought of a load balancer failure. And since it was an unforeseen cause of failure, recovering the application might take longer than the RTO and cause major business impact, in reputation, in finances, or both.
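To make the gap in this example concrete, here is a minimal sketch in Python. It assumes we keep a simple record of each component and the recovery time measured in its last failure test. The `Component` class, the `rto_gaps` function and all the numbers are hypothetical and only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    name: str
    # Recovery time measured in the last failure test, in minutes.
    # None means this failure has never been tested.
    measured_recovery_min: Optional[float]


def rto_gaps(components, rto_min):
    """Return (name, reason) pairs for components that put the RTO at risk."""
    gaps = []
    for c in components:
        if c.measured_recovery_min is None:
            gaps.append((c.name, "never tested"))
        elif c.measured_recovery_min > rto_min:
            gaps.append((c.name, "exceeds RTO"))
    return gaps


# Hypothetical figures for the 3-tier application in the example.
components = [
    Component("app instance", 5),
    Component("database", 20),
    Component("load balancer", None),  # the failure nobody planned for
]

gaps = rto_gaps(components, rto_min=30)
print(gaps)  # [('load balancer', 'never tested')]
```

The instance and the database look fine on paper, but the load balancer shows up as an unknown, which is exactly the kind of blind spot the example describes.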

Traditional vs. Chaos

Traditionally, failover testing is done during a new implementation or a very big migration. Routine DR (disaster recovery) testing is done only for the critical components in the environment, because the organization or the service provider doesn't want downtime during testing that could impact the business. DR tests are planned and run during non-critical hours, and only for the modules we know can recover. Meanwhile, most architectures and their components keep changing, due to upgrades or cost/performance optimization. Unless every component in the environment is tested for failure, we can face a problem just like the one in the example above.
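One way to see how changes outrun traditional testing is to compare when each component last changed with when its failure was last tested. The sketch below is only an illustration. The `stale_components` function, the component names and the dates are all made up.

```python
from datetime import date


def stale_components(last_tested, last_changed):
    """Return names of components that changed after their last
    failure test, or that were never tested at all."""
    stale = []
    for name, changed_on in last_changed.items():
        tested_on = last_tested.get(name)
        if tested_on is None or changed_on > tested_on:
            stale.append(name)
    return sorted(stale)


# Hypothetical dates, for illustration only.
last_tested = {
    "database": date(2023, 1, 10),
    "app instance": date(2023, 3, 1),
}
last_changed = {
    "database": date(2023, 2, 15),      # upgraded after its last test
    "app instance": date(2023, 2, 1),   # tested after its last change
    "load balancer": date(2023, 1, 5),  # never tested
}

stale = stale_components(last_tested, last_changed)
print(stale)  # ['database', 'load balancer']
```

Any component on this list has a recovery behaviour that is no longer known, even if it passed a DR test in the past.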

Summary

Testing your live environment gives you more confidence in the stability of your application and also keeps the responsible tech teams on their toes (which is not as bad as it sounds). It helps identify weaknesses and focus areas in production. It also reveals any direct or indirect impact of recent changes in the environment.

Next time we will discuss how to analyze, plan and execute “Chaos” tests. Until then, please let me know in the comments below what you think about the article. Any suggestion is much appreciated.