7. Prometheus Alert Configuration and Management with Alertmanager

I. Introduction to Prometheus Alert Management

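Alerting in Prometheus is split into two parts. Alerting rules are defined on the Prometheus server, which evaluates them and sends the resulting alerts to Alertmanager. Alertmanager then manages those alerts: it silences, inhibits, and groups them, and sends notifications via email, WeChat, DingTalk, Slack, and other channels.
The main steps for setting up alerts and notifications are:
    Install and configure Alertmanager;
    Configure Prometheus to talk to Alertmanager;
    Create alerting rules in Prometheus.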

1. Core concepts of Alertmanager

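Grouping: grouping combines alerts of a similar nature into a single notification. This is especially useful during large outages, when many systems fail at once and hundreds or thousands of alerts may fire at the same time.
Inhibition: notifications for certain alerts are suppressed while certain other alerts are already firing (configured through inhibit_rules).
Silences: a silence is a simple way to mute alerts for a given period. A silence is defined by matchers, just like a node of the routing tree. Each incoming alert is checked against the equality or regular-expression matchers of every active silence; if it matches, no notification is sent for that alert.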
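
Inhibition is the one concept above that the configurations in this article do not show. As a minimal sketch (the choice of severities and of the instance label is an assumption; source_match and target_match are the keys used by Alertmanager releases of the v0.20 era, while newer releases prefer source_matchers and target_matchers), a rule that mutes warning-level alerts while a critical alert is firing for the same instance could look like this:

```yaml
# alertmanager.yml (fragment, same level as global/route/receivers)
inhibit_rules:
  - source_match:
      severity: critical      # while an alert with this label is firing...
    target_match:
      severity: warning       # ...mute notifications for alerts with this label
    equal: ['instance']       # only when both alerts share the same instance value
```

Without the equal list, one critical alert anywhere would mute every warning, which is rarely what is wanted.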

II. Installing Alertmanager and Setting Up Email Alerts

1. Installation

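Host installation
[root@node1 prometheus]# wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
[root@node1 prometheus]# tar xf alertmanager-0.20.0.linux-amd64.tar.gz
[root@node1 prometheus]# cd alertmanager-0.20.0.linux-amd64
[root@node1 alertmanager-0.20.0.linux-amd64]# ./alertmanager --version
Adjust the configuration as needed, then run the binary.
Docker deployment
[root@node1 ~]# docker pull prom/alertmanager
[root@node1 ~]# docker inspect prom/alertmanager
Prepare the configuration file alertmanager.yml and place it under /opt/prometheus/alertmanager/. The default configuration shipped in the image can be printed from a temporary container and used as a starting point:
[root@node1 ~]# docker run -d --name alertmanager_tmp prom/alertmanager   # temporary container, only used to read the default config
[root@node1 ~]# docker exec -it alertmanager_tmp cat /etc/alertmanager/alertmanager.yml
[root@node1 ~]# docker rm -f alertmanager_tmp
[root@node1 ~]# mkdir -p /opt/prometheus/alertmanager/
[root@node1 ~]# cd /opt/prometheus/alertmanager/
[root@node1 alertmanager]# vim alertmanager.yml
[root@node1 alertmanager]# docker run -d --name alertmanager -p 9093:9093 -v /opt/prometheus/alertmanager/:/etc/alertmanager/ prom/alertmanager   # start Alertmanager
Open the Alertmanager web UI in a browser: http://192.168.42.133:9093/#/alerts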

2. Email alert setup

(1) Configure Alertmanager notifications

[root@node1 ~]# cd /opt/prometheus/alertmanager/
[root@node1 alertmanager]# vim alertmanager.yml
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxxxxx'
  smtp_require_tls: false
 
route:
  receiver: 'mail_163'
 
receivers:
  - name: 'mail_163'
    email_configs:
    - to: '[email protected]'
[root@node1 alertmanager]# docker restart alertmanager
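
With the configuration above, the email receiver only reports firing alerts. If a mail on recovery is also wanted, email_configs accepts a send_resolved flag. A minimal sketch, in which the address is a placeholder and not the one from the original file:

```yaml
# alertmanager.yml (fragment)
receivers:
  - name: 'mail_163'
    email_configs:
    - to: 'ops@example.com'     # placeholder address
      send_resolved: true       # also send a mail when the alert stops firing
```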

(2) Configure Prometheus: add alerting rules

[root@node1 prometheus]# pwd
/opt/prometheus/prometheus
[root@node1 prometheus]# vim rules/node1_alerts.yml
groups:
- name: node1_alerts
  rules:
  - alert: HighNodeCpu
    expr: instance:node_cpu:avg_rate1m > 10
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: High Node CPU for 1 minute
      console: This is a Test
[root@node1 prometheus]# vim prometheus.yml
rule_files:
   - "rules/node1_rules.yml"
   - "rules/*_alerts.yml"   # load the alerting rules
[root@node1 prometheus]# docker restart prometheus-server  
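
The alert expression above refers to instance:node_cpu:avg_rate1m, a recording rule that is expected to come from rules/node1_rules.yml, a file this section does not reproduce. A minimal sketch of such a rule, assuming node_exporter's node_cpu_seconds_total metric and a scrape interval short enough for a 1m range (the expression is an assumption, not the content of the original file):

```yaml
# rules/node1_rules.yml (assumed content)
groups:
- name: node1_rules
  rules:
  - record: instance:node_cpu:avg_rate1m
    # average CPU utilisation per instance in percent: 100 minus the idle share
    expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100
```

If the recording rule does not exist, the alert expression returns no series and HighNodeCpu never fires.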

(3) Configure Prometheus: add Alertmanager

[root@node1 prometheus]# vim prometheus.yml
alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - 192.168.42.133:9093
[root@node1 prometheus]# docker restart prometheus-server   # restart so the alerting section takes effect

(4) Alert test

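[root@master ~]# wget https://cdn.pmylund.com/files/tools/cpuburn/linux/cpuburn-1.0-amd64.tar.gz
[root@master ~]# tar xf cpuburn-1.0-amd64.tar.gz
[root@master ~]# cd cpuburn
[root@master cpuburn]# ./cpuburn
Check the Alerts page of the Prometheus web UI: once the CPU load has stayed above the threshold for the 1m "for" period, HighNodeCpu changes from pending to firing.
Then check the Alertmanager web UI (http://192.168.42.133:9093/#/alerts): the alert has arrived, and the notification email is delivered to the mailbox (shown in a screenshot in the original post).
Note: the notification template can be customized in alertmanager.yml:
templates:  # same level as global
  - 'template/*.tmpl'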
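
Loading template files is only half of the customization: a receiver also has to reference a template. A minimal sketch, assuming a file under template/ defines a template named custom_mail.html with {{ define "custom_mail.html" }} ... {{ end }} (the template name and the address are hypothetical):

```yaml
# alertmanager.yml (fragment)
templates:
  - 'template/*.tmpl'

receivers:
  - name: 'mail_163'
    email_configs:
      - to: 'ops@example.com'                          # placeholder address
        headers:
          # subject built from the notification status and the shared alertname label
          Subject: '[{{ .Status }}] {{ .CommonLabels.alertname }}'
        html: '{{ template "custom_mail.html" . }}'    # hypothetical template name
```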

III. Adding More Alerting Rules (node disk, node targets, Prometheus, systemd)

Alert when the disk is predicted to fill up within 7 days
groups:
- name: node1_alerts
  rules:
  - alert: DiskWillFillIn7Days
    expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 7*24*3600) < 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Disk on {{ $labels.instance }} will fill in approximately 7 days.

Alert when a monitored instance is down

groups:
- name: node1_alerts
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 10s
    labels:
      severity: critical
    annotations:
      summary: Host {{ $labels.instance }} of {{ $labels.job }} is Down!
Alert on Prometheus configuration reload failures and on a lost connection to Alertmanager:
[root@node1 rules]# vim prometheus_alerts.yml
groups:
  - name: prometheus_alerts
    rules:
    - alert: PrometheusConfigReloadFailed
      expr: prometheus_config_last_reload_successful == 0
      for: 1m
      labels:
        severity: warning
      annotations:
        description: Reloading Prometheus config has failed on {{ $labels.instance }}.
    - alert: PrometheusNotConnectedToAlertmanagers
      expr: prometheus_notifications_alertmanagers_discovered < 1
      for: 1m
      labels:
        severity: warning
      annotations:
        description: Prometheus {{ $labels.instance }} is not connected to any Alertmanager.
Alert when a systemd-managed service goes down
groups:
  - name: service_alerts
    rules:
    - alert: NodeServiceDown
      expr: node_systemd_unit_state{state="active"} != 1
      for: 40s
      labels:
        severity: critical
      annotations:
        summary: Service {{ $labels.name }} on {{ $labels.instance }} is no longer active!
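
As written, the systemd rule fires for every unit that is not active, including units that are inactive on purpose. A sketch of a narrower variant, assuming node_exporter runs with its systemd collector enabled (it is disabled by default) and that sshd and docker are the services of interest (both unit names are assumptions):

```yaml
groups:
  - name: service_alerts
    rules:
    - alert: NodeServiceDown
      # restrict the check to an explicit list of units; [.] matches a literal dot
      expr: node_systemd_unit_state{state="active", name=~"(sshd|docker)[.]service"} != 1
      for: 40s
      labels:
        severity: critical
      annotations:
        summary: Service {{ $labels.name }} on {{ $labels.instance }} is no longer active!
```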

IV. Alertmanager Routing Configuration

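The route block defines how alerts are dispatched. It is a tree, and each alert is matched against it depth-first, from left to right (child routes are tried in the order they are listed).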
[root@node1 alertmanager]# vim alertmanager.yml
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxx'
  smtp_require_tls: false
 
route:
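  group_by: ['instance']    # labels to group by; with several labels, alerts must agree on all of them to share a group
  group_wait: 30s           # how long to wait before sending the first notification for a new group
  group_interval: 5m        # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 3h       # how long to wait before re-sending a notification that was already sent successfully
  receiver: mail_qq         # default receiver, required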
  routes:
  - match:
      severity: critical
    receiver: mail_163
  - match_re:
      severity: ^(warning|critical)$
    receiver: mail_qq
 
receivers:
  - name: 'mail_qq'
    email_configs:
    - to: '[email protected]'
  - name: 'mail_163'
    email_configs:
    - to: '[email protected]'
[root@node1 alertmanager]# docker restart alertmanager
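
By default an alert stops at the first child route it matches, so in the configuration above a critical alert reaches mail_163 only and never gets to the second route. If it should be delivered to both mailboxes, the first route can set continue. A minimal sketch that reuses the receiver names from above:

```yaml
# alertmanager.yml (fragment)
route:
  receiver: mail_qq            # default receiver for anything no child route matches
  group_by: ['instance']
  routes:
  - match:
      severity: critical
    receiver: mail_163
    continue: true             # keep evaluating the following sibling routes after a match
  - match_re:
      severity: ^(warning|critical)$
    receiver: mail_qq
```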

V. Alertmanager Silence Configuration

Silences are similar to Zabbix's maintenance mode: alerts that match a silence produce no notifications during the configured time window.
Setting a silence in the web UI
Setting a silence with the command-line tool (amtool)
[root@node1 ~]# docker exec alertmanager /bin/amtool --alertmanager.url=http://192.168.42.133:9093 silence add alertname="InstanceDown" -c "Ignore instance-down alerts"
[root@node1 ~]# docker exec alertmanager /bin/amtool --alertmanager.url=http://192.168.42.133:9093 silence add alertname="InstanceDown" job=~".*CADvisor.*" -c "Ignore cAdvisor instance-down alerts"
[root@node1 ~]# docker exec alertmanager /bin/amtool --alertmanager.url=http://192.168.42.133:9093 silence query
[root@node1 ~]# docker exec alertmanager /bin/amtool --alertmanager.url=http://192.168.42.133:9093 silence expire 840158fb-2185-4568-b6c8-413ceaf7d3a5
[root@node1 ~]# docker exec alertmanager /bin/amtool --help   # list command options
Note: if --alertmanager.url is not given, amtool looks it up in $HOME/.config/amtool/config.yml or /etc/amtool/config.yml.
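
To avoid repeating the URL on every call, it can be stored in that configuration file. A minimal sketch of such a file; the author value is a placeholder, and inside the container the file would have to live in the home directory of the user that amtool runs as:

```yaml
# $HOME/.config/amtool/config.yml
alertmanager.url: "http://192.168.42.133:9093"   # used when --alertmanager.url is not given
author: ops@example.com                          # placeholder; recorded as the creator of new silences
comment_required: true                           # refuse to create a silence without a comment
```

With this in place, a call such as amtool silence query needs no URL flag.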
 

Reposted from www.cnblogs.com/cmxu/p/12291167.html