I already have Prometheus. Do I still need Nightingale?

Prometheus is undoubtedly the most popular monitoring project today. If you only monitor machines and network devices, Zabbix can still compete. If you also want to monitor applications, Kubernetes, and other infrastructure, Prometheus is the best choice. Some open source projects even expose metrics in the Prometheus format out of the box, such as newer versions of ZooKeeper, newer versions of RabbitMQ, and Nginx vts. The influence of Prometheus is evident.

In many contexts, the word Prometheus refers not just to the Prometheus project itself but to the whole Prometheus ecosystem: the metric format, the transmission protocols, the query language, the various Exporters, the various compatible storage backends, and so on.

In the Prometheus ecosystem, Exporters handle collection, VictoriaMetrics can handle storage, and Grafana can handle dashboards. That looks complete. So why is there another open source project, Nightingale, that claims to work seamlessly with Prometheus? This article tries to explore that question.

Introduction to Nightingale

An excerpt of the Nightingale project introduction from the Nightingale official website:

Nightingale Monitor is an open source cloud-native observability and analysis tool built on an All-in-One design. It integrates data collection, visualization, monitoring and alerting, and data analysis, is closely integrated with the cloud-native ecosystem, and provides out-of-the-box, enterprise-level monitoring, analysis, and alerting capabilities. Nightingale released v1 on GitHub on March 20, 2020, and has since shipped more than 100 releases.

Nightingale was originally developed and open sourced by Didi, and was donated to the China Computer Federation Open Source Development Committee (CCF ODC) on May 11, 2022, becoming the first open source project CCF ODC accepted as a donation after its establishment. Nightingale's core R&D team is also the original core team behind the Open-Falcon project. Counting from 2014, when Open-Falcon was open sourced, the team has spent 10 years focused on one thing: doing monitoring well.

The project introduction only tells us that Nightingale is a monitoring system; it does not yet show how it differs from Prometheus. Before getting to that, let's look at where Prometheus falls short.

Problems with Prometheus

Prometheus handles collection, storage, and graphing very well. Alerting is the weak point. For some companies, it has the following pain points:

  • A company runs many Prometheus instances, and the alert rules are scattered across multiple YAML files, which makes them hard to manage.
  • Teams want an easy-to-use UI with permission isolation that opens monitoring up to every team in the company, so they can serve themselves instead of going to the monitoring team for everything.
  • Writing PromQL directly to query data and configure alert rules is demanding. Could rule libraries and queries be built in, so that knowledge accumulates and ordinary users can use them out of the box?
  • Alert rules should be more flexible, for example different effective time windows for different rules, plus built-in alert self-healing mechanisms.
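
To make the last point concrete, here is a minimal sketch of what a per-rule effective time window could look like. The `EffectiveWindow` class and its fields are hypothetical names chosen for illustration; this is not Nightingale's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class EffectiveWindow:
    """Hypothetical model of when an alert rule is allowed to fire."""

    weekdays: frozenset  # 0 = Monday ... 6 = Sunday, as in datetime.weekday()
    start: time
    end: time

    def is_active(self, now: datetime) -> bool:
        t = now.time()
        if self.start <= self.end:
            # Same-day window, e.g. 09:00-18:00.
            return now.weekday() in self.weekdays and self.start <= t < self.end
        # Window that crosses midnight, e.g. 22:00-06:00.
        if t >= self.start:
            # Evening part: belongs to today's window.
            return now.weekday() in self.weekdays
        if t < self.end:
            # Early-morning part: belongs to the window that started yesterday.
            return (now.weekday() - 1) % 7 in self.weekdays
        return False


# A rule that should only alert during working hours, Monday to Friday.
business_hours = EffectiveWindow(frozenset({0, 1, 2, 3, 4}), time(9, 0), time(18, 0))
```

An alert engine would check `is_active(datetime.now())` before evaluating or notifying for the rule, so the same threshold can be noisy-by-day and silent-by-night, or the other way around.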

That is what Nightingale was made for. The old version of Nightingale was a self-contained system derived from Open-Falcon, but as Prometheus became popular, Nightingale began to embrace the Prometheus ecosystem. Nightingale can be regarded as an alerting engine for time series data. Nightingale also provides graphs and dashboards, and can even view data from Elasticsearch, Loki, and TDengine. In practice, though, its alerting capabilities are what people use most, and most dashboards are still built in Grafana. The typical Nightingale architecture is as follows:

Can Nightingale completely replace Prometheus?

It is not a replacement but a complement. From Nightingale's point of view, Prometheus mainly serves as a time series database. Besides Prometheus, you can also choose other time series databases such as VictoriaMetrics, Thanos, M3DB, and TDengine. Nightingale acts only as the alerting engine on top of the time series database, and it can connect to Prometheus or to any of the others. Users manage alert rules centrally in Nightingale, which evaluates the data for anomalies, generates alert events, and then handles the follow-up logic such as notification dispatch and alert self-healing.
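
As a rough illustration of what an alert engine does against a Prometheus-compatible database, the sketch below builds an instant query URL for the Prometheus HTTP API (`/api/v1/query`) and turns each series in the response into an alert event. It assumes the usual Prometheus convention that a rule expression such as `up == 0` returns only the series that violate the condition. The function names and the event fields are hypothetical and are not taken from Nightingale's code; the HTTP request itself is left out so the example stays self-contained.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def events_from_response(rule_name: str, body: str) -> list:
    """Turn every series returned by the query into one alert event."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    events = []
    for series in payload["data"]["result"]:
        # An instant vector sample is a [unix_timestamp, "value-as-string"] pair.
        timestamp, value = series["value"]
        events.append(
            {
                "rule": rule_name,
                "labels": series["metric"],
                "value": float(value),
                "timestamp": float(timestamp),
            }
        )
    return events


# A response in the shape the Prometheus HTTP API returns for an instant query.
sample_body = json.dumps(
    {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {
                    "metric": {"__name__": "up", "instance": "10.0.0.1:9100"},
                    "value": [1715226833, "0"],
                }
            ],
        },
    }
)
events = events_from_response("host_down", sample_body)
```

Because the engine only needs a query endpoint that speaks this protocol, the same loop works whether the backend is Prometheus itself or a compatible store.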

In addition, if your time series databases are spread across multiple data centers, the network between the data centers is unreliable, and you want each edge data center to be autonomous so that alerting keeps working even during a network partition, Nightingale is also a good fit. Nightingale calls this the edge data center deployment mode: the time series database and the alerting engine are both deployed in the edge data center, so a broken link does not matter. When the network is healthy, data can be viewed and alert rules managed centrally. The architecture diagram is as follows:

The example above shows a deployment across three data centers. The network link between data center A and the central data center is very good, but the link between data center B and the central data center is not. Each data center has its own time series database. The Nightingale alerting engine in the central data center therefore handles the time series databases of the central data center and data center A directly. The time series database of data center B is handled by data center B's own alerting engine, shown as n9e-edge in the figure. n9e-edge synchronizes the alert rules from the Nightingale instance in the central data center and then evaluates them against the local time series database.

This way, even if data center B is cut off from the central data center, the alert rules have already been synchronized into n9e-edge's memory, so the alerting engine in data center B can still evaluate alerts for the two time series databases in data center B as usual. This improves the overall availability of the monitoring system.
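
The idea of keeping the last synchronized rules in memory can be sketched as follows. `EdgeRuleCache` and `fetch_rules` are hypothetical names used only for illustration; how n9e-edge actually synchronizes and stores rules is not described in this article.

```python
from typing import Callable, Dict, List


class EdgeRuleCache:
    """Hypothetical in-memory rule cache for an edge alert engine."""

    def __init__(self, fetch_rules: Callable[[], List[Dict]]):
        self._fetch_rules = fetch_rules  # pulls rules from the central server
        self._rules: List[Dict] = []
        self.last_sync_ok = False

    def sync(self) -> None:
        """Refresh the rules; on a network failure keep the last good copy."""
        try:
            rules = self._fetch_rules()
        except OSError:
            # ConnectionError and TimeoutError are both subclasses of OSError.
            self.last_sync_ok = False
            return
        self._rules = list(rules)
        self.last_sync_ok = True

    def rules(self) -> List[Dict]:
        """Rules to evaluate locally, even while the center is unreachable."""
        return list(self._rules)
```

The edge engine would call `sync()` on a timer and always evaluate `rules()`, so a failed sync only delays rule changes and never stops alert evaluation.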

In which scenarios should you use Nightingale instead of Prometheus?

It depends on your pain points. If a single Prometheus instance solves your problems well at this stage, there is no need to change. In any company, migrating technical tools meets all kinds of resistance, which is understandable.

If your pain points are alert rule management or the high availability of alerting in edge data centers, you can try Nightingale. Every tool has its own advantages and disadvantages; choose according to your scenario.

Can Nightingale receive alerts from various monitoring systems and provide unified event notification?

Some readers, seeing that Nightingale can connect to various time series databases, evaluate alerts, and generate and dispatch alert events, wonder whether alerts generated by their other monitoring systems could also be sent through Nightingale. That way, notification templates, contacts, login authentication, permissions, and so on could all be managed in one place.

In fact, it cannot. This is a typical event OnCall requirement: collecting alerts from various monitoring systems (such as Prometheus, Zabbix, Open-Falcon, BlueKing, various cloud monitoring services, ElastAlert, etc.) and then handling alert convergence and noise reduction, on-call scheduling, acknowledgement and escalation, flexible condition-based dispatch, and so on in one place. Doing this well justifies a separate product. Let's call it the OnCall product. The relationship between the OnCall product and each monitoring system is:

[Figure: relationship between the OnCall product and the monitoring systems]

That is, the monitoring system (including the various cloud monitoring services) focuses on data collection, storage, visual analysis, and alert evaluation, and is responsible for generating alert events. The alert events are then handed to the OnCall center, which is responsible for convergence and noise reduction, inhibition and silencing, filtering and dispatch, and many other matters.

Good OnCall products are commercial products, such as PagerDuty, FlashDuty, and Opsgenie. You can search for them to find the one that fits your needs.
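
To give a feel for the simplest form of convergence, the sketch below suppresses repeat notifications for the same event within a time window, keyed by the source system and the event's labels. The `Deduplicator` class is a hypothetical illustration and does not describe how any particular OnCall product works.

```python
import hashlib
from typing import Dict


def fingerprint(source: str, labels: Dict[str, str]) -> str:
    """Stable identity for an event: the source system plus its sorted labels."""
    canonical = source + "|" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Deduplicator:
    """Hypothetical noise reducer: one notification per event per window."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self._last_notified: Dict[str, float] = {}

    def should_notify(self, source: str, labels: Dict[str, str], now: float) -> bool:
        key = fingerprint(source, labels)
        last = self._last_notified.get(key)
        if last is not None and now - last < self.window:
            return False  # the same event was already notified recently
        self._last_notified[key] = now
        return True
```

Because the fingerprint includes the source, identical labels arriving from Prometheus and from Zabbix are treated as separate events; merging them would be a further convergence policy on top of this.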

What interesting features does Nightingale have that Prometheus lacks?

Here are a few screenshots of the system, each with a brief introduction.

[Screenshots]

Nightingale does not collect data itself; it can be connected to the various collectors on the market. Among them, the categraf collector integrates with Nightingale most smoothly. If you use categraf as the collector, you can gather various metadata about each machine and build a lightweight machine-level CMDB.

[Screenshot]

Nightingale has built-in alert self-healing: when an alert fires, it can automatically execute a script on the affected machine. You can put automated repair logic in the script.
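
The basic shape of self-healing is a mapping from an alert rule to a command that runs when the rule fires. The sketch below shows that shape only: it runs the command locally with Python's `subprocess.run`, whereas Nightingale dispatches the script to the affected machine, a mechanism this article does not detail. The function name, the event fields, and the action table are all hypothetical.

```python
import subprocess
import sys
from typing import Dict, List, Optional


def run_healing_action(
    event: Dict,
    actions: Dict[str, List[str]],
    timeout_seconds: float = 30.0,
) -> Optional[int]:
    """Run the command registered for the event's rule, if there is one.

    Returns the command's exit code, or None when no action is registered.
    """
    command = actions.get(event["rule"])
    if command is None:
        return None
    # A timeout keeps a stuck repair script from blocking the alert pipeline;
    # subprocess.run raises subprocess.TimeoutExpired if it is exceeded.
    completed = subprocess.run(command, timeout=timeout_seconds, check=False)
    return completed.returncode


# Hypothetical action table: the "repair script" here is a stand-in that exits 0.
actions = {"disk_full": [sys.executable, "-c", "import sys; sys.exit(0)"]}
```

Passing the command as a list rather than a shell string avoids shell injection when parts of the command come from event data.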

[Screenshot]

Nightingale has a built-in metrics view, which will be released in v7 beta3. It will also ship with many commonly used PromQL queries built in that you can run with a single click, which is very friendly to novice users.

Summary

We already have Prometheus, so why do we need Nightingale? This article is an exploratory answer to that question. I hope it helps. Thank you for reading.
