The previous article in this series covered how Prometheus implements process monitoring. In a production environment, when a system process fails, the on-duty operations staff must be notified in real time so they can check whether the system is still running normally. This article explains how to set up monitoring and alert notification based on Prometheus.
Prometheus delegates alert notification to its Alertmanager component. Alertmanager receives alerts from clients such as Prometheus, groups and deduplicates them, and routes them to the correct receiver, so alerts can go to different module owners according to different rules. Alertmanager supports notification channels such as WeChat, email, and webhook; the webhook channel can be connected to chat tools such as DingTalk.
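To make the webhook channel more concrete, here is a minimal Python sketch of what a webhook consumer might do with the JSON that Alertmanager posts. The field names (status, receiver, alerts, labels, annotations) follow Alertmanager's webhook payload; the function name, the sample values, and the message layout are assumptions for illustration. Forwarding the text to a chat tool is left out.

```python
def format_webhook_message(payload: dict) -> str:
    """Turn an Alertmanager webhook payload into a short chat message."""
    status = payload.get("status", "unknown").upper()
    receiver = payload.get("receiver", "?")
    lines = [f"[{status}] receiver={receiver}"]
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            f"- {labels.get('alertname', '?')} on {labels.get('instance', '?')}: "
            f"{annotations.get('summary', '')}"
        )
    return "\n".join(lines)


# Hypothetical payload, trimmed to the fields used above.
sample = {
    "status": "firing",
    "receiver": "dingtalk-webhook",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "ProcessDown", "instance": "192.168.1.108:9256"},
            "annotations": {"summary": "process stopped"},
        }
    ],
}

print(format_webhook_message(sample))
```

A real receiver would expose this behind an HTTP endpoint and pass the resulting text to the chat tool's own API.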
Alerting workflow
- Configure alerting rules in Prometheus
- A monitored target crosses the rule's threshold
- The threshold stays exceeded for the configured duration
- Prometheus pushes the alert to Alertmanager
- Alertmanager processes the alert:
1) Grouping: similar alerts are combined into one notification.
2) Silences: notifications for matching alerts are muted for a set period, for example during a system upgrade.
3) Inhibition: notifications for certain alerts are suppressed while other related alerts are already firing.
- Alertmanager sends the notification through the configured channel (email, DingTalk, WeChat, etc.), and the recipients receive it
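The grouping and deduplication step can be pictured with a small Python sketch. This is a simplified model, not Alertmanager's actual implementation: alerts are plain label dictionaries, exact duplicates are dropped, and the rest are bucketed by the labels listed in group_by. The function name and sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop exact duplicates, then bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    seen = set()
    for labels in alerts:
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue  # the same label set was already received
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical incoming alerts; the second one is an exact duplicate of the first.
alerts = [
    {"alertname": "ProcessDown", "instance": "192.168.1.108:9256"},
    {"alertname": "ProcessDown", "instance": "192.168.1.108:9256"},
    {"alertname": "ProcessDown", "instance": "192.168.1.109:9256"},
    {"alertname": "DiskFull", "instance": "192.168.1.108:9256"},
]

for key, members in group_alerts(alerts, ["alertname"]).items():
    print(key, len(members))
```

Each bucket would become one notification, which is why a burst of similar alerts does not produce a burst of messages.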
Install and deploy Alertmanager
Deploy Alertmanager
Download the binaries:
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar zxvf alertmanager-0.24.0.linux-amd64.tar.gz
mkdir -p /home/prometheus/alertmanager
mv alertmanager-0.24.0.linux-amd64 /home/prometheus/alertmanager/alertmanager
Create an Alertmanager service:
vim /etc/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target
[Service]
User=root
Type=simple
# ExecStart must not contain single or double quotes
ExecStart=/home/prometheus/alertmanager/alertmanager/alertmanager --config.file=/home/prometheus/alertmanager/alertmanager/alertmanager.yml --storage.path=/home/prometheus/alertmanager/alertmanager/data --web.listen-address=:19093 --cluster.listen-address=0.0.0.0:19094 --web.external-url=http://192.168.1.108:19093
Restart=on-failure
[Install]
WantedBy=multi-user.target
Start the service:
systemctl daemon-reload
systemctl enable --now alertmanager
systemctl status alertmanager
Visit http://192.168.1.108:19093 to open the Alertmanager web UI:
Alertmanager configuration
The configuration file is explained below, using email alerts as an example:
vim /home/prometheus/alertmanager/alertmanager/alertmanager.yml
# Email sender
global:
  resolve_timeout: 30s
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxxxxxxvpobcee'
  smtp_hello: '@qq.com'
  smtp_require_tls: false
templates:
  - '/home/prometheus/alertmanager/alertmanager/tmpl/email.tmpl' # Add the templates configuration
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  receiver: 'email'
  routes:
    # Enable this route only after a 'dingtalk-webhook' receiver is defined under receivers;
    # a route that names an undefined receiver prevents the configuration from loading.
    # - receiver: dingtalk-webhook
    #   group_wait: 10s
    - receiver: email
      group_wait: 10s
receivers:
  - name: 'email'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
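The inhibit_rules entry above can be read as: a warning alert is muted while a critical alert with the same alertname, dev, and instance values is firing. The Python sketch below models that check under simplifying assumptions (exact-match comparisons only, alerts as plain label dictionaries). The function and variable names are hypothetical and this is not Alertmanager's code. A label missing on both sides is treated as equal, matching Alertmanager's documented handling of missing and empty labels.

```python
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}


def matches(labels, wanted):
    """True if every wanted label is present with exactly the wanted value."""
    return all(labels.get(name) == value for name, value in wanted.items())


def is_inhibited(target, firing_alerts, rule):
    """True if some firing source alert suppresses the target alert under the rule."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # Missing labels compare as empty strings, so absent on both sides counts as equal.
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


# Hypothetical alerts: one critical, one warning on the same instance, one warning elsewhere.
critical = {"alertname": "ProcessDown", "instance": "a:9256", "severity": "critical"}
warning_same = {"alertname": "ProcessDown", "instance": "a:9256", "severity": "warning"}
warning_other = {"alertname": "ProcessDown", "instance": "b:9256", "severity": "warning"}
firing = [critical, warning_same, warning_other]

print(is_inhibited(warning_same, firing, RULE))
print(is_inhibited(warning_other, firing, RULE))
```

Note that the rule file later in this article labels its alert with severity: error, so this particular inhibit rule would not apply to it.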
Prometheus rules
Create a new rule file and configure the group information, the alert threshold and duration, and the alert labels and annotations.
Rule expressions are written in PromQL. Most size-related metrics are reported in bytes, so thresholds given in K, M, or G must be converted, for example 2M = 2 * 1024 * 1024 bytes.
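As a quick check of that conversion, the Python helper below turns a size such as 2M into a byte count that can be pasted into a rule expression. The helper name and the accepted suffixes (K, M, G as powers of 1024) are assumptions for illustration.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(size: str) -> int:
    """Convert a size such as '2M' or '1.5K' to bytes; plain numbers pass through."""
    size = size.strip().upper()
    if size and size[-1] in UNITS:
        return int(float(size[:-1]) * UNITS[size[-1]])
    return int(size)


print(to_bytes("2M"))  # 2097152, i.e. 2 * 1024 * 1024
```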
The Prometheus rule file is the same whether notifications go to email, DingTalk, or WeChat Work:
vim /home/prometheus/prometheus/rule/qtalk_auth.yml
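groups:
  - name: qtalk_auth process abnormal exit
    rules:
      - alert: Application process qtalk_auth exited abnormally # Alert name
        expr: (namedprocess_namegroup_num_procs{groupname="map[:qtalk_auth]"}) == 0
        for: 30s # How long the condition must hold before the alert is sent
        labels: # Labels attached to the alert
          severity: error
          ip: 192.168.1.108
        annotations: # Annotations that describe the alert in detail
          summary: "Process alert: {{ $labels.instance }} has been stopped abnormally for more than 30 seconds."
          description: "Process {{ $labels.groupname }} on {{ $labels.ip }} stopped abnormally! Check immediately!"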
Check the rule file with promtool; it should report SUCCESS:
/home/prometheus/prometheus/promtool check rules rule/qtalk_auth.yml
Checking rule/qtalk_auth.yml
SUCCESS: 1 rules found
Prometheus configuration
In the Prometheus configuration file, set the IP address and port of the Alertmanager server and the path to the rule files:
vim /home/prometheus/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["192.168.1.108:19093"]
        #- alertmanager:["192.168.1.108:19093"]
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rule/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: 'process'
    static_configs:
      - targets: ['192.168.1.108:9256']
Restart the Prometheus service:
systemctl restart prometheus.service
Email Alert
View Prometheus
On the Prometheus home page, open the Alerts tab to view alert information:
An alert can be in one of three states:
- inactive: the condition is not met; nothing is wrong.
- pending: the threshold has been crossed, but the condition has not yet held for the duration set in the rule's for field.
- firing: the threshold has been crossed for the full for duration, and the alert is sent to Alertmanager.
With for: 30s, an alert that crosses the threshold first enters the pending state and is observed for another 30 seconds.
If the condition still holds after those 30 seconds, the alert moves to the firing state and is sent to Alertmanager.
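The three states can be traced with a small Python sketch. It is a simplified model of one alert, assuming the rule is evaluated every 15 seconds (the evaluation_interval above) with for: 30s. The function name and the sample sequence are hypothetical.

```python
def alert_states(samples, for_seconds, interval):
    """Return the alert state after each rule evaluation.

    samples holds one boolean per evaluation: True means the rule expression
    matched (threshold crossed) at that evaluation.
    """
    states = []
    active_since = None
    for i, breached in enumerate(samples):
        now = i * interval
        if not breached:
            active_since = None  # condition cleared, so the timer resets
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now  # first evaluation at which the threshold was crossed
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Process is up, then down for four evaluations, then up again.
history = [False, True, True, True, True, False]
print(alert_states(history, for_seconds=30, interval=15))
```

In this model a brief dip that recovers within 30 seconds never leaves the pending state, so no notification is sent for it.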
View Alertmanager
Only alerts in the firing state are sent from Prometheus to Alertmanager. Open the Alertmanager home page to view them.
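To check the Alertmanager side without waiting for a real failure, a hand-made alert can be posted to Alertmanager's v2 API (POST /api/v2/alerts). The Python sketch below assumes the address used in this article; the function names, the TestAlert name, and the label values are hypothetical. Nothing is sent until send_test_alert is called explicitly.

```python
import json
import urllib.request
from datetime import datetime, timezone

ALERTMANAGER_URL = "http://192.168.1.108:19093"  # address used in this article


def build_test_alert(alertname, instance, severity="warning"):
    """Build the JSON body (a list of alerts) expected by POST /api/v2/alerts."""
    return [
        {
            "labels": {"alertname": alertname, "instance": instance, "severity": severity},
            "annotations": {"summary": f"Test alert for {instance}"},
            "startsAt": datetime.now(timezone.utc).isoformat(),
        }
    ]


def send_test_alert(alerts, base_url=ALERTMANAGER_URL):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


test_alerts = build_test_alert("TestAlert", "192.168.1.108:9256")
print(json.dumps(test_alerts, indent=2))
# Call send_test_alert(test_alerts) manually once Alertmanager is reachable.
```

If the post succeeds, the test alert should show up on the Alertmanager home page and follow the same routing as alerts pushed by Prometheus.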
Check email
After Prometheus sends an alert to Alertmanager, Alertmanager sends the notification by email according to its receiver settings:
Emails are sent at the intervals set in the route configuration (group_wait, group_interval, and repeat_interval in alertmanager.yml), which can be changed in the configuration file.
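How these intervals interact can be approximated with a short Python sketch. It is a deliberately simplified model, assuming a single group whose alerts keep firing unchanged: the group is first flushed after group_wait, re-checked every group_interval, and a notification is repeated only when repeat_interval has passed since the last one. Real Alertmanager behavior has more cases (new alerts joining a group, resolved notifications). The function name is hypothetical and all times are in seconds.

```python
def notification_times(group_wait, group_interval, repeat_interval, duration):
    """Return the seconds (from first alert) at which notifications would be sent."""
    times = []
    flush = group_wait  # first flush happens after group_wait
    last_sent = None
    while flush <= duration:
        if last_sent is None or flush - last_sent >= repeat_interval:
            times.append(flush)
            last_sent = flush
        flush += group_interval  # later flushes happen every group_interval
    return times


# Values from the route above: group_wait 30s, group_interval 5m, repeat_interval 5m.
print(notification_times(30, 300, 300, duration=900))
```

Under this model, raising repeat_interval is the simplest way to reduce repeated emails for an alert that stays firing.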
At this point, a simple Prometheus-based system monitoring and alert notification service is in place. With it, operations staff learn about system health problems early, which helps keep the system highly available.