Typical Architecture of a Monitoring System

A typical monitoring system architecture can be read from left to right. The collector gathers monitoring data and transmits it to the server, where it is usually written directly into a time-series database. The data in the time-series database is then analyzed and visualized. The most typical form of analysis is evaluating alarm rules, which is the job of the alarm engine in the figure. When the alarm engine generates an alarm event, it hands the event to the alarm sending module, which delivers notifications through different media. Visualization is comparatively simple: monitoring data is rendered in suitable charts so that users can view it, compare it, and carry out daily inspections.

1. Collector

The collector is responsible for collecting monitoring data. There are two typical deployment methods. The first is to deploy the collector alongside the monitored object: for example, a collector runs on every machine and collects that machine's CPU, memory, disk, IO, and network metrics. The second is remote probing: a central machine acts as a probe and checks PING connectivity to many machines at once, or connects to many MySQL instances and runs commands to collect data.

Commonly used open source collectors include the following.

  • Telegraf is an InfluxData product released under the MIT license. The project is very open and has many external contributors. It is mainly used with InfluxDB. Telegraf can also push monitoring data to Prometheus, Graphite, Datadog, OpenTSDB, and many other backends, but its integration with InfluxDB is the smoothest.
  • Exporter is the collector component of the Prometheus ecosystem. Collectors in this ecosystem are scattered: each kind of collection target has its own Exporter. For example, MySQL has mysqld_exporter, Redis has redis_exporter, switches have snmp_exporter, and the JVM has jmx_exporter.
  • Grafana-Agent is an all-in-one collector from Grafana that collects not only metrics but also logs and traces. It is released under the Apache 2.0 license, which is relatively permissive. For logs, Grafana-Agent integrates Promtail, the log collector of the Loki ecosystem; for traces, it integrates OpenTelemetry Collector.
  • Categraf is positioned similarly to Grafana-Agent and supports collecting metrics, logs, and traces. For metrics, Categraf focuses on the Prometheus ecosystem: labels are kept stable, only numeric time-series data is collected, and the data is pushed to back-end storage through Remote Write. Any time-series database that supports the Remote Write protocol can be connected, such as Prometheus, VictoriaMetrics, M3DB, and Thanos.
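
To make the "deployed alongside the monitored object" model concrete, here is a minimal sketch of a local collector that samples a few host values and renders them in the Prometheus text exposition format, the kind of output an Exporter serves. The metric names (disk_total_bytes, disk_used_bytes, cpu_count, load1) and the function names are made up for illustration, and a real collector gathers far more than this.

```python
import os
import shutil


def _escape(value):
    """Escape a label value for the Prometheus text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, labels, value):
    """Render one sample as a single exposition-format line."""
    if not labels:
        return "{} {}".format(name, value)
    label_text = ",".join(
        '{}="{}"'.format(key, _escape(val)) for key, val in sorted(labels.items())
    )
    return "{}{{{}}} {}".format(name, label_text, value)


def collect_host_metrics(hostname, path="/"):
    """Collect a few host-level samples using only the standard library.

    Metric names here are hypothetical, chosen only for this sketch.
    """
    labels = {"host": hostname}
    usage = shutil.disk_usage(path)
    samples = [
        ("disk_total_bytes", dict(labels, path=path), usage.total),
        ("disk_used_bytes", dict(labels, path=path), usage.used),
        ("cpu_count", labels, os.cpu_count() or 0),
    ]
    if hasattr(os, "getloadavg"):  # not available on Windows
        samples.append(("load1", labels, os.getloadavg()[0]))
    return samples


if __name__ == "__main__":
    for name, labels, value in collect_host_metrics("web-1"):
        print(format_sample(name, labels, value))
```

A collector of this kind either exposes these lines over HTTP for the server to scrape or pushes the same samples to the back end; the sketch stops at producing the samples.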

After the collector gathers data, it must be pushed to the server. There are usually two ways to do this: push directly to the time-series database, or push to Kafka first and then write from Kafka into the time-series database.
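
The two paths can be sketched as follows. This is only an in-memory illustration: InMemoryTSDB stands in for a real time-series database and a queue.Queue stands in for a Kafka topic, so none of the names below are real client APIs. A commonly cited reason for the buffered path is that the queue absorbs bursts and decouples collectors from the database, though the trade-offs depend on the deployment.

```python
import queue


class InMemoryTSDB:
    """Stand-in for a time-series database; keeps points in a list."""

    def __init__(self):
        self.points = []

    def write(self, batch):
        self.points.extend(batch)


def push_direct(tsdb, batch):
    """Path 1: the collector writes straight to the time-series database."""
    tsdb.write(batch)


def push_buffered(buffer, batch):
    """Path 2: the collector only appends to the buffer (Kafka in practice)."""
    for point in batch:
        buffer.put(point)


def drain(buffer, tsdb, max_batch=500):
    """Consumer side of path 2: move buffered points into the database.

    Returns the number of points written in this call.
    """
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(buffer.get_nowait())
        except queue.Empty:
            break
    if batch:
        tsdb.write(batch)
    return len(batch)
```

In the buffered path the collector returns as soon as the point is queued; a separate consumer calls drain at its own pace.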

2. Time-series database

The time-series database is the core of a monitoring system's architecture. Older monitoring systems simply reused relational databases; Zabbix, for example, stores time-series data directly in MySQL. MySQL is good at transactional workloads but is not optimized for time-series workloads, so capacity becomes an obvious bottleneck.

OpenTSDB is built on top of HBase, and as it continued to develop, a version built on Cassandra also appeared. Because the underlying storage is HBase, small companies generally cannot afford to operate it, and its user base in China is relatively small. Few people choose OpenTSDB when selecting a time-series database.

InfluxDB has a storage engine, data structures, and access interfaces designed specifically for time-series storage. It is widely used in China, integrates well with Grafana, Telegraf, and similar tools, and has a very complete ecosystem. However, the open source version of InfluxDB is single-node only; there is no open source cluster version. InfluxData is a commercial company and needs revenue to develop sustainably, which is something to weigh when choosing it.

TDengine can be regarded as the Chinese counterpart of InfluxDB. It is optimized for IoT device scenarios and performs well, and it also integrates with Grafana and Telegraf. TDengine is a good choice for workloads that lean toward device monitoring. Its cluster version is open source, which makes it very attractive compared with InfluxDB. Beyond storing time-series data, TDengine also supports stream computing, so users can deploy fewer components.

M3DB is a time-series database from Uber. M3 is said to handle 6.6 billion metrics at Uber, which is a huge volume. M3DB is fully open source, cluster version included, but its architecture is relatively complex and its CPU and memory usage is high, so it has not been widely adopted in China. The architecture and code of M3DB embody a lot of distributed systems design knowledge, which makes it a good project to learn from.

VictoriaMetrics, or VM for short, has a very simple and clear architecture. It uses a merge-read approach, which avoids data migration problems. A lightweight and reliable way to run a cluster is to create a batch of virtual machines in the cloud, attach cloud disks, deploy the VM cluster on them, and keep a single copy of the data.
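
The merge-read idea can be illustrated with a toy model: each write lands on one storage node, and each read asks every node and merges the answers. Because reads never assume where a point lives, adding a node does not require moving old data. This is a simplified sketch under that assumption, not VictoriaMetrics's actual implementation; the class names and the CRC32-based placement are invented for the example.

```python
import zlib


class StorageNode:
    """Toy storage node: series name -> {timestamp: value}."""

    def __init__(self):
        self.series = {}

    def write(self, name, timestamp, value):
        self.series.setdefault(name, {})[timestamp] = value

    def read(self, name, start, end):
        points = self.series.get(name, {})
        return [(ts, val) for ts, val in points.items() if start <= ts <= end]


class Cluster:
    """Writes go to one node; reads fan out to all nodes and merge."""

    def __init__(self, node_count):
        self.nodes = [StorageNode() for _ in range(node_count)]

    def add_node(self):
        # No rebalancing: existing data stays where it was written.
        self.nodes.append(StorageNode())

    def write(self, name, timestamp, value):
        index = zlib.crc32(name.encode()) % len(self.nodes)
        self.nodes[index].write(name, timestamp, value)

    def read(self, name, start, end):
        merged = {}
        for node in self.nodes:  # query every node
            for ts, val in node.read(name, start, end):
                merged[ts] = val  # de-duplicate by timestamp
        return sorted(merged.items())
```

After add_node, new points for a series may land on a different node than older points, yet read still returns the full, time-ordered series. The cost is that every query touches every node.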

TimescaleDB is a time-series database developed by Timescale, Inc. It is delivered as an extension of PostgreSQL.

3. Alarm engine

The core responsibility of the alarm engine is to process alarm rules and generate alarm events. Users typically configure hundreds or even thousands of alarm rules, and some very large companies may need tens of thousands. Each rule contains data filter conditions, a threshold, an execution frequency, and so on. Some monitoring systems offer richer configuration and also support an effective period, a duration, an observation time, and similar settings for each rule.
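
As a rough picture of what such a rule might look like in code, the sketch below models the fields named above as a dataclass. The field and method names are hypothetical and do not follow any particular product's rule schema; observation time is left out for brevity.

```python
import datetime
import operator
from dataclasses import dataclass, field

_COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass
class AlertRule:
    name: str
    metric: str                      # data filter: metric name
    threshold: float
    comparator: str = ">"
    label_filter: dict = field(default_factory=dict)  # data filter: labels
    interval_seconds: int = 15       # execution frequency
    for_seconds: int = 0             # duration the condition must hold
    active_from: datetime.time = datetime.time(0, 0)        # effective period
    active_until: datetime.time = datetime.time(23, 59, 59)

    def matches(self, metric, labels):
        """True if a series passes this rule's data filter."""
        return metric == self.metric and all(
            labels.get(key) == val for key, val in self.label_filter.items()
        )

    def breached(self, value):
        """True if the value violates the threshold."""
        return _COMPARATORS[self.comparator](value, self.threshold)

    def active_at(self, now):
        """True if the rule is inside its effective period at time `now`."""
        return self.active_from <= now <= self.active_until

    def held_long_enough(self, breached_since, now):
        """True if the breach has lasted at least `for_seconds` (epoch seconds)."""
        return now - breached_since >= self.for_seconds
```

An engine would fire an event only when matches, breached, active_at, and held_long_enough all hold for a series.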

Alarm engines usually follow one of two architectures: data-triggered or periodic polling.

Data-triggered means that when the server receives monitoring data, it not only stores the data in the time-series database but also forwards a copy to the alarm engine. For every data point it receives, the alarm engine must determine whether any alarm rule is associated with it and then evaluate the alarm condition. Because both the volume of monitoring data and the number of alarm rules can be large, the alarm engine is sharded, that is, deployed as multiple instances.
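
A minimal sketch of the data-triggered design with sharding might look like this. It assumes points are routed by a hash of the series identity so that the same series always reaches the same engine instance; the class and function names are invented for the example, and rules are reduced to (metric, threshold) pairs.

```python
import zlib


class AlarmEngineShard:
    """One engine instance; evaluates rules against each incoming point."""

    def __init__(self, rules):
        self.rules = rules  # list of (metric, threshold) pairs
        self.events = []

    def on_point(self, metric, labels, value):
        for rule_metric, threshold in self.rules:
            if rule_metric == metric and value > threshold:
                self.events.append((metric, dict(labels), value))


def series_key(metric, labels):
    """Stable identity of a series: metric name plus sorted labels."""
    pairs = ",".join("{}={}".format(k, v) for k, v in sorted(labels.items()))
    return "{}|{}".format(metric, pairs)


def route(shards, metric, labels, value):
    """Forward one point to the shard that owns its series."""
    key = series_key(metric, labels).encode()
    index = zlib.crc32(key) % len(shards)
    shards[index].on_point(metric, labels, value)
    return index
```

Routing by series keeps all points of one series on a single instance, which matters once rules need state such as how long a condition has held.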

Periodic polling has a simple architecture, usually with one coroutine per rule. Each rule is queried and evaluated periodically at the execution frequency the user configured. Because the engine actively issues the queries, it is easy to compute correlations across multiple metrics. Prometheus, Nightingale, and Grafana all use this architecture.

After an event is generated, it is usually handed to a separate module for alarm sending. This module aggregates and converges events, then sends them to different receivers through different notification media according to different conditions.
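
The polling model described above can be sketched with asyncio, using one coroutine per rule. Here the rules are plain dictionaries, `query` is any callable that returns the current value for a query string, and `on_event` receives generated events; all of these shapes are assumptions for illustration. A real engine would issue an asynchronous query to the time-series database instead of calling a local function.

```python
import asyncio


async def poll_rule(rule, query, on_event, rounds=None):
    """Evaluate one rule on its own schedule.

    `rounds` limits the number of evaluations (None means run forever);
    it exists only so the sketch can terminate.
    """
    done = 0
    while rounds is None or done < rounds:
        value = query(rule["query"])
        if value is not None and value > rule["threshold"]:
            on_event({"rule": rule["name"], "value": value})
        done += 1
        await asyncio.sleep(rule["interval"])


async def run_engine(rules, query, on_event, rounds=None):
    """Start one coroutine per rule and wait for all of them."""
    await asyncio.gather(
        *(poll_rule(rule, query, on_event, rounds) for rule in rules)
    )


if __name__ == "__main__":
    current = {"cpu": 95.0, "mem": 10.0}  # stand-in for the database
    rules = [
        {"name": "cpu high", "query": "cpu", "threshold": 90.0, "interval": 1},
        {"name": "mem high", "query": "mem", "threshold": 80.0, "interval": 2},
    ]
    asyncio.run(run_engine(rules, current.get, print, rounds=2))
```

The `on_event` callback is where a separate sending module would take over, for example by queueing the event for aggregation and notification.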

4. Data display

Visualizing monitoring data is also a very common and important requirement. Grafana is the most successful product in this area. It has a plug-in architecture that supports many types of data sources, and its chart types are very rich, so it can be regarded as the de facto standard in the open source field. Grafana is even embedded directly in the commercial products of many companies, which shows how popular it is.

There are usually two kinds of visualization requirements: ad hoc queries and monitoring dashboards. Ad hoc queries arise on the spur of the moment. For example, when a problem occurs in production, we need to dig through the monitoring data and reconstruct what happened in order to troubleshoot. This calls for a metric browsing feature that makes it easy to view data and quickly find the metric we want. Monitoring dashboards are typically used for daily inspections and troubleshooting. They are built by senior engineers and contain the metrics that deserve special attention. To some extent they prompt us to think, and they are an effective way to accumulate knowledge. If you want to understand how a component works, its monitoring dashboard can usually offer some inspiration.

This article is a study note for Day 29 in July. The content comes from the Geek Time course "Operation and Maintenance Monitoring System Practical Notes", which is recommended.

Origin blog.csdn.net/key_3_feng/article/details/132000200