Microservices Challenges — Distributed Troubleshooting and Monitoring
Microservices architecture brings many great benefits, as we discussed earlier. However, nothing comes free: various challenges arrive as by-products of microservices, and it makes sense to understand them well before beginning the journey into this new world.
One of these challenges is the complexity of troubleshooting that comes with microservice patterns. We shall discuss it, along with possible solutions, in this article.
Troubleshooting is part of development life. Life is comparatively easy when debugging one monolithic system. But the microservices world has many smaller services, which interact with each other in a mix of combinations to support bigger application features.
With such a dynamic interaction matrix, troubleshooting any issue could span a multitude of services. Finding the trail of a request across multiple services and servers, a trail that could itself be dynamic based on the business logic of each service, can be daunting.
This makes troubleshooting extremely complex; it needs deep system knowledge and the persistence to walk through many nodes.
Once one has deep knowledge of the system and its distributed architecture, debugging is possible, but it will still consume a lot of time and energy. Now assume a scenario where this complexity is multiplied by auto-scalable, container-based deployment models, which means server nodes change dynamically based on need.
This is not manageable without proper tooling in place. Even then, it is complex to understand the thousands of service request trails.
Distributed Tracing, Log Aggregation, and Visualization are the saviors.
Distributed Tracing enables tracing the related logs across services and nodes.
It is implemented by adding an identifier to the logs, which helps trace (by correlating) a request across services and server nodes.
It is like attaching one constant to the request at the first service hit. This constant is then passed along to all subsequent service calls. Every service with distributed tracing enabled automatically picks up this constant trace id and uses it when logging its data.
In turn, this makes it easy to correlate the data of the same request across services and servers.
Now we have an identifier that can help identify any given request across the services.
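The propagation idea above can be sketched in a few lines. This is purely illustrative (in practice, instrumentation libraries such as the Zipkin or Jaeger clients handle this); the header name and service name below are assumptions, not a real standard:

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, for illustration only


def log(service: str, trace_id: str, message: str) -> None:
    # Every log line carries the trace id, so an aggregator can correlate it later.
    print(f"[{service}] trace={trace_id} {message}")


def handle_request(headers: dict) -> dict:
    """Entry point of any service: reuse the incoming trace id, or mint a new one."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    log("order-service", trace_id, "processing request")
    # Outgoing calls to downstream services carry the same trace id forward.
    return {**headers, TRACE_HEADER: trace_id}
```

The first service in the chain mints the id; every downstream service finds it in the incoming headers and reuses it, so all log lines for one request share one id.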
The next challenge is that it is not practically possible to log into many different servers and analyze the logs there. Hence, we need a provision where all the logs from different services can be collated and presented in one place for easy and efficient reading.
Log Aggregation tools help here. Every service, while tagging its logs with the trace id, prepares the logging data and sends it to one centralized log aggregator service (mostly using one of the available tools/libraries).
The log aggregator stores the logs in an optimized, search-friendly, scalable data store. Once all log data is available in one place, it becomes possible to use it for different kinds of functions.
These aggregator databases are built to store bulk log entries and to enable fast search over them.
Now this data can be used to present the whole request trail in an easily understandable format, in a UI or through an API.
Once we have this aggregated data, it opens many opportunities, for example:
- See request trails across the services for troubleshooting
- Analyze performance data, i.e. the time taken by each service/operation
- Understand client usage patterns or user behavior, i.e. which service is used more and at what time
- Connect this data with context information for even deeper analysis, for example, which services were used more on New Year's Eve. This can feed important information into system design decisions.
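The first two opportunities above can be sketched against a toy set of aggregated log entries. The record shape (trace id, service, operation, duration) is an assumption for illustration; real aggregators store richer, tool-specific span records:

```python
from collections import defaultdict

# Hypothetical aggregated log entries: (trace_id, service, operation, duration_ms)
entries = [
    ("t1", "gateway",  "route",  5),
    ("t1", "orders",   "create", 40),
    ("t1", "payments", "charge", 120),
    ("t2", "gateway",  "route",  4),
    ("t2", "orders",   "list",   25),
]


def request_trail(entries, trace_id):
    """Troubleshooting view: every entry belonging to one request, in logged order."""
    return [e for e in entries if e[0] == trace_id]


def avg_duration_by_service(entries):
    """Performance view: mean time spent in each service across all requests."""
    durations = defaultdict(list)
    for _, service, _, ms in entries:
        durations[service].append(ms)
    return {s: sum(v) / len(v) for s, v in durations.items()}
```

Because every entry carries the trace id, reconstructing a request trail is a simple filter, and aggregate analytics are a simple group-by; this is exactly what the aggregator's search backend does at scale.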
An obvious question: doesn't logging data for each service call and operation, and then sending it to a centralized collector, add performance overhead? This is true to some extent; however, there are ways to optimize it.
Adding distributed trace tokens and logging is usually managed by instrumentation-driven libraries, which add pre/post hooks around each method to log basic information. The amount of information is configurable.
As this logging is managed by libraries and mostly abstracted away, these libraries are highly optimized for the job.
Sending logging data to the centralized collector (aggregator) can be optimized by making it asynchronous. For example, a logging client on each node can keep local in-memory (and on-disk) storage to collate the logs, and later send them to the centralized server asynchronously using HTTP requests, message queues, etc. This removes most of the reporting overhead from the service's processing flow.
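A minimal sketch of that asynchronous hand-off, assuming an in-memory buffer and a pluggable `send` function (real agents, such as the Jaeger agent or log shippers like Fluentd, add batching, retries, and disk spooling on top):

```python
import queue
import threading


class AsyncLogShipper:
    """Buffers log records locally and ships them to the collector off the request path."""

    def __init__(self, send):
        self._send = send                # delivers one record to the collector
        self._buffer = queue.Queue()     # thread-safe FIFO buffer
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def report(self, record):
        """Called from the service's request flow: cheap, never blocks on the network."""
        self._buffer.put(record)

    def _drain(self):
        # Background thread: pull records and ship them, outside the request path.
        while True:
            record = self._buffer.get()
            if record is None:           # shutdown sentinel
                break
            self._send(record)

    def close(self):
        self._buffer.put(None)
        self._worker.join()
```

The request thread only pays for an in-memory enqueue; the network cost of reporting is absorbed by the background worker, which is the essence of the optimization described above.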
The cost of these benefits cannot be zero; however, it can be contained in most cases. Most tracing libraries are highly configurable for log levels and reporting mechanisms, and provide various options to optimize the process for application-specific use cases.
Tech to Support
There are many libraries and tool options to enable these.
Zipkin & Jaeger are the two most popular Distributed Tracing systems available as of now. Zipkin was developed by Twitter, and Jaeger by Uber.
- Both were open sourced by their respective companies once they reached production readiness, and each now has its own open source community.
- Both support the basic components of Distributed Tracing / Log aggregation (collection) / Visualization, as described above. Additionally, each supports many more useful features on top, specific to the tool.
- Both are quite similar in architecture and usage. However, they differ mostly in deployment style and in the plugins available for different languages and frameworks.
Both are well-documented tools; refer to their respective websites for detailed information.
Refer also to OpenTracing, which is on a mission to standardize tracing APIs and tools. It is an incubating project in the Cloud Native Computing Foundation.
The combination of these tools and design patterns makes distributed troubleshooting and monitoring possible in the microservice ecosystem.
Keep in mind that it is still more complex than monolithic designs. It still needs deep system knowledge across the services and the ability to connect all the dots. It needs experience with both system design and the system itself. However, the combination of these tools makes it feasible.
Hence, if you are not ready for an iota of extra complexity, it is recommended to stay away from microservices-style distributed architecture patterns. The number of services can be overwhelming in such designs due to their fine-grained structure.
Or start with the right tooling in place to manage it well from the beginning. Never try to run a highly distributed microservice-style system without these supporting tools; otherwise you will spend months of effort on troubleshooting, with a high potential to burn out teams physically and mentally, and may lose productivity too.
We shall discuss more challenges like distributed transactions, and managing failures in distributed environments in future articles.
Resource: https://matrixexplorer.medium.com/microservices-challenges-distributed-troubleshooting-and-monitoring-b5129f56701f