Technical experience sharing: using ELK to build a log monitoring system for TB-level daily log volumes

This article describes how to use the ELK Stack to build a log monitoring system that handles TB-level log volumes per day. In an enterprise microservice environment, running hundreds of services is considered a relatively small scale. In production, logs play a very important role: they are needed to troubleshoot exceptions, to optimize performance, and to investigate business problems.

However, with hundreds of services running in production, each service simply writes its logs to local storage. When logs are needed to troubleshoot a problem, it is hard to find the node where the relevant logs live, and it is equally hard to mine the business value hidden in those logs.

Collecting all logs in one place for centralized management, processing them, and turning the results into data that operations and development can use is therefore a feasible approach to log management and operations support, and an urgent need for the enterprise.

Our solution

Based on the above requirements, we launched a log monitoring system, as shown in the figure above:

  • Logs are collected, filtered and cleaned uniformly.

  • A visual interface, monitoring, alerting, and log search are generated.

An overview of the functional flow is shown above:

  • Instrument each service node and collect the relevant logs in real time.

  • The unified log collection service filters and cleans the logs, then generates the visual interface and alerting functions.

Our architecture

① We use FileBeat as the log file collector. Operations staff configure it through our back-end management interface, and each machine runs one FileBeat. The topic corresponding to each FileBeat's logs can be mapped one-to-one or many-to-one, with different strategies depending on the daily log volume.

In addition to business service logs, we also collect MySQL slow query logs and error logs, as well as logs from other third-party services such as Nginx.

Finally, integrated with our automated release platform, each FileBeat process is deployed and started automatically.

② For call stacks, traces, and process monitoring metrics we use an agent-based approach: Elastic APM, so there is no need to change the application code on the business side.

For a business system already in production, changing its code just to add monitoring is undesirable and unacceptable.

Elastic APM helps us collect HTTP call traces, internal method call stacks, the SQL that was executed, and process CPU and memory usage metrics.
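The article does not show how the agent is wired in; one common approach, sketched here only as an assumption, is to self-attach the Elastic APM Java agent at startup with the apm-agent-attach library so the business code itself stays untouched (the service name and server URL below are placeholders):

```java
// Minimal sketch: programmatically attach the Elastic APM Java agent at startup.
import java.util.Map;

import co.elastic.apm.attach.ElasticApmAttacher;

public class ApmBootstrap {

    public static void main(String[] args) {
        ElasticApmAttacher.attach(Map.of(
                "service_name", "order-service",          // hypothetical service name
                "server_url", "http://apm-server:8200",   // placeholder APM server address
                "application_packages", "com.example"));  // packages to include in stack traces

        // ... start the actual application after the agent is attached ...
    }
}
```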

Some people may ask: since Elastic APM can already collect most of these logs, why use FileBeat at all?

True, the information collected by Elastic APM can indeed help us locate more than 80% of problems, but it does not support every language, C for example.

Second, it cannot collect the non-error logs and the so-called key logs you want. For example, when an interface call throws an error, you may want to see the logs immediately before and after the error time, as well as the business-related logs printed to make analysis easier.

Third, there are custom business exceptions, which are not system exceptions but belong to the business domain; APM reports such exceptions as system exceptions.

If you later alert on system exceptions, these business exceptions will interfere with the accuracy of the alerts, and you cannot simply filter them out because there are many types of custom business exceptions.

③ At the same time, we did secondary development on the agent to collect more detailed GC, stack, memory, and thread information.
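The customized agent itself is not shown in the article; as an illustration of the kind of data it can expose, here is a minimal sketch that reads GC, heap, and thread information from the JVM's standard java.lang.management MXBeans (an assumed data source, not the actual implementation):

```java
// Minimal sketch: sample GC, heap, and thread metrics via the standard JMX MXBeans.
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

public class JvmMetricsSampler {

    public static void sample() {
        // GC count and accumulated collection time per collector.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("gc=%s count=%d timeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }

        // Current heap usage.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap usedMB=%d maxMB=%d%n",
                heap.getUsed() / (1024 * 1024), heap.getMax() / (1024 * 1024));

        // Live and peak thread counts (a real agent might also dump stacks here).
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.printf("threads=%d peak=%d%n",
                threads.getThreadCount(), threads.getPeakThreadCount());
    }

    public static void main(String[] args) {
        sample();
    }
}
```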

④ We use Prometheus to collect server metrics.

⑤ Since we are a SaaS business, there are many services, and many service logs cannot be unified or standardized; this is partly a legacy issue. A system that has nothing to do with the business systems has to integrate, directly or indirectly, with the existing business systems, and asking them to change their code just to accommodate you is not something you can push through.

A good design makes itself compatible with others rather than forcing them to adapt. Many logs are simply meaningless. For example, during development, to make troubleshooting and tracking easier, developers print a marker log inside if/else branches that only indicates whether the if block or the else block was taken.

Some services even print Debug-level logs. With limited cost and resources, collecting all logs is unrealistic, and even if resources allowed it, the expense over a year would be substantial.

Therefore, we adopted filtering, cleaning, and dynamic adjustment of log collection priority. First, all logs are collected into a Kafka cluster with a short retention period.

We currently keep about one hour of data, and our resources can handle that for the time being.
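As an illustration of that short retention setting, here is a minimal sketch that creates a raw-log topic with roughly one hour of retention via the Kafka AdminClient; the topic name, partition count, replication factor, and broker address are placeholders rather than values from the article:

```java
// Minimal sketch: create a raw-log topic that only retains about one hour of data.
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class RawLogTopicSetup {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder broker

        try (Admin admin = Admin.create(props)) {
            // Hypothetical sizing: 12 partitions, replication factor 2, 1-hour retention.
            NewTopic rawLogs = new NewTopic("logs-raw", 12, (short) 2)
                    .configs(Map.of(
                            TopicConfig.RETENTION_MS_CONFIG, String.valueOf(60 * 60 * 1000L),
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_DELETE));

            admin.createTopics(List.of(rawLogs)).all().get(); // wait for the topic to be created
        }
    }
}
```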

Log Streams is our stream processing service for filtering and cleaning logs. Why do we need an ETL filter at all?

Because our log service has limited resources. But wait, that doesn't sound right: the original logs were scattered across the local storage of each service, and that consumed resources too.

Now we are merely collecting them centrally, and after collection each service can release part of the resources its logs used to occupy.

Right, by that calculation we are simply shifting the resources the logs consumed on the original services over to the log service, without increasing total resource usage.

However, that is only theory. For online services, expanding resources is easy, but shrinking them is not; in practice it is extremely difficult.

Therefore, it is impossible to reallocate the log-related resources on each service to the log service in a short time. In that case, the log service needs roughly as many resources as all the current services' logs consume combined.

And the longer the retention period, the greater the resource consumption. If, in the short term, the cost of solving a problem that is neither business-critical nor strictly indispensable exceeds the benefit of solving it, I don't think any leader or company with a limited budget would be willing to adopt such a solution.

Therefore, from a cost perspective, we introduced filters into the Log Streams service to drop log data that has no value, thereby reducing the resources the log service consumes.

Technically, we use Kafka Streams for the ETL stream processing, with dynamic filtering and cleaning rules configured through the management interface.

The general rules are as follows (a minimal code sketch of such a filter follows the list):

  • Log collection is configured through the interface; by default, all Error-level logs are collected.

  • Centered on the error time, open a window in the stream processing and collect non-Error logs within a configurable range of N time points before and after; by default only Info level is collected.

  • Each service can configure up to 100 key logs, and all key logs are collected by default.

  • For slow SQL, configure different latency thresholds according to business classification.

  • Collect real-time statistics on business SQL as needed, for example counting the query frequency of similar business SQL within one hour during peak periods. This gives DBAs a basis for optimizing the database, such as creating indexes based on the queried SQL.

  • During peak hours, logs are dynamically cleaned and filtered according to business-type weight, log level, the per-service log cap for a time period, and the time period itself.

  • Dynamically shrink the time window according to different time periods.

  • Log index generation rules: indexes are generated according to the log files the service produces. For example, if a service's logs are split into debug, info, error, and xx_keyword, the generated indexes are also debug, info, error, and xx_keyword, each with the date as a suffix. The purpose is to match how developers habitually use logs.
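The article does not include the actual Log Streams code, so the following is only a minimal sketch of such a filter, assuming Kafka Streams 3.x, JSON log records that carry a level field, and placeholder topic names (logs-raw, logs-clean, sql-logs, sql-frequency-stats); the real rules are loaded dynamically from the management interface:

```java
// Minimal sketch of a level-based log filter plus an hourly SQL frequency count.
import java.time.Duration;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class LogEtlSketch {

    // Levels kept by default; a real filter would pull per-service rules,
    // key-log lists, and peak-hour quotas from the config interface.
    private static final Set<String> KEEP_LEVELS = Set.of("ERROR", "WARN");

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-etl-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Keep Error-level (and other configured) records, drop the rest.
        KStream<String, String> raw = builder.stream("logs-raw");
        raw.filter((service, line) -> keep(line))
           .to("logs-clean"); // cleaned stream, later shipped to Elasticsearch

        // Count similar business SQL per hour (key assumed to be the normalized SQL).
        builder.<String, String>stream("sql-logs")
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
               .count()
               .toStream((windowedSql, count) ->
                       windowedSql.key() + "@" + windowedSql.window().startTime())
               .to("sql-frequency-stats", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    // Naive level check on a JSON-ish line; a real implementation would parse the
    // record and also apply the error-time window and key-log rules from the list above.
    private static boolean keep(String line) {
        if (line == null) {
            return false;
        }
        for (String level : KEEP_LEVELS) {
            if (line.contains("\"level\":\"" + level + "\"")) {
                return true;
            }
        }
        return false;
    }
}
```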

⑦ For the visualization interface we mainly use Grafana. Among its many supported data sources are Prometheus and Elasticsearch, and its integration with Prometheus is essentially seamless. For visual analysis of APM data we mainly use Kibana.

Log visualization

Our log visualization is as follows:

If you found this article helpful, please like and follow to show your support, or follow my official account, where I share more hands-on technical articles and related material so we can all learn and improve together!

 
