Finding the golden mean between the amount of engineering effort and potential financial losses is at the heart of service reliability. Since ideal systems do not exist, something will inevitably fail or slow down. We must, however, define precisely what we mean by failure and by acceptable performance. At some point we will reach a situation where lowering latency within our systems any further would require so much work, or work of such complexity, that it becomes too expensive and time-consuming.
In my opinion, the best way out is not to try to avoid failure entirely but to minimize losses to the business. We need to develop recovery runbooks with guidelines that technical teams can actually follow. None of this can be validated without simulating outages, that is, without deliberately creating chaos at every level of the system.
In general, chaos management deals with errors both in the application and at the infrastructure level. We create, orchestrate, and scale failures deliberately in order to find the system's weak spots.
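To make "creating failures deliberately" concrete, here is a minimal sketch of application-level fault injection. Everything in it (the wrapper name, the failure rate, the exception type) is illustrative; real chaos tooling typically injects faults at the infrastructure or network level rather than inside application code.

```python
import random

def chaos_wrapper(func, failure_rate=0.2, exc=ConnectionError):
    """Wrap a callable so that it fails randomly at the given rate.

    Illustrative only: the point is that faults are injected
    deliberately and at a controlled, scalable rate.
    """
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise exc("injected fault")
        return func(*args, **kwargs)
    return wrapped

# A hypothetical downstream call that now fails half the time:
fetch = chaos_wrapper(lambda: "ok", failure_rate=0.5)
```

Scaling the experiment is then just a matter of turning the `failure_rate` knob and watching how the rest of the system reacts.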
Chaos Engineering As A Process
It is hard to say definitively when this process should begin. Some believe it should be integrated into existing DevOps practices; others argue that chaos management should be adopted at the very start of system design, so that an understanding of failure modes and the metrics for working with them exist from day one. As the toolkit matures, it is increasingly wired into the CI/CD pipeline. Chaos management also has a scientific, research-like component: it lets you approach an understanding of the system's steady state, test hypotheses, connect the results to Service Level Objectives and SLAs, and fold proven recovery methods into runbooks and employee training.
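The research loop described above (define the steady state, inject a fault, check whether the hypothesis about the SLO still holds, recover) can be sketched as a small skeleton. The p95 budget, the function names, and the callback structure are all assumptions for illustration, not a real framework's API.

```python
def steady_state_ok(latencies_ms, p95_budget_ms=300):
    """Hypothetical SLO check: 95th-percentile latency within budget."""
    ordered = sorted(latencies_ms)
    idx = max(0, int(0.95 * len(ordered)) - 1)
    return ordered[idx] <= p95_budget_ms

def run_experiment(measure, inject_fault, rollback):
    """Skeleton of a chaos experiment.

    1. Verify the steady-state hypothesis before touching anything.
    2. Inject the fault.
    3. Re-check the hypothesis; always roll back, even on failure.
    """
    assert steady_state_ok(measure()), "system not steady before experiment"
    inject_fault()
    try:
        return steady_state_ok(measure())  # hypothesis: SLO still holds
    finally:
        rollback()
```

Wiring the result of `run_experiment` into a CI/CD gate is one natural way to make the practice routine rather than a one-off exercise.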
Interpreting the results adds further complexity to chaos management. How do we correctly interpret what happens when we experiment on several levels at once, for example, killing pods in Kubernetes while adding noise at the network level?
Here it is important to understand the architecture and implementation details of what we are testing. For example, the Kubernetes API server may start throttling requests when it receives too many of them, which could itself be viewed as a form of chaos management. Why does it do this? Because it cannot keep up on its own, or because one of its dependencies cannot? It is hard to say whether the industry will ever arrive at universal practices that cover chaos management scenarios in a whole-ecosystem fashion, rather than in simple, isolated cases: we killed ten pods out of eleven, and they recovered in one minute.
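Understanding such implementation details changes how clients should behave: a throttled API server answers with HTTP 429, and a well-behaved client retries with exponential backoff instead of hammering it harder. The sketch below assumes a `request` callable that returns a status code; the names and constants are illustrative, not any real client library's API.

```python
import random
import time

def call_with_backoff(request, max_attempts=5, base_delay=0.5):
    """Retry a request while the server throttles (HTTP 429).

    Backs off exponentially with jitter so that many retrying
    clients do not all hit the server again at the same moment.
    """
    for attempt in range(max_attempts):
        status = request()
        if status != 429:
            return status
        delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
        time.sleep(delay)
    raise RuntimeError("still throttled after retries")
```

A chaos experiment that knocks out pods while also adding network noise is much easier to interpret once you know which of these built-in defense mechanisms fired and when.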
However, the industry's evolution also simplifies the task. Cloud computing complements the very idea of creating chaos perfectly. Suppose we are trying to create chaos inside bare-metal infrastructure. In that case, we must either allocate a dedicated test environment, which can be difficult and time-consuming, or put up with constraints imposed by other users of our infrastructure.
Cloud platforms let you spin up an environment quickly, run an experiment, observe the system's reaction, shut the environment down, stop paying, make changes, redeploy, and repeat. Their flexibility also lets you deploy several environments with different versions and run comparative tests.