Installing Alertmanager
Download Alertmanager from https://prometheus.io/download/.
After extracting it, edit alertmanager.yml. We only need the alerting (email) function; modify it as follows:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '***@163.com'
  smtp_auth_username: '***@163.com'
  smtp_auth_password: '******'   # authorization password
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m   # repeat interval; 1m here for testing, use roughly 20m-30m in production
  receiver: 'mail'
receivers:
- name: 'mail'
  email_configs:
  - to: '@@@@@@163.com'
Start it:
nohup ./alertmanager --config.file=/root/alertmanager-0.17.0.linux-amd64/alertmanager.yml &
Modify the Prometheus configuration as follows (Alertmanager runs on the same host):
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093   # Alertmanager is started locally, hence 127.0.0.1; it can also be deployed on another host
rule_files:
  - "rules/*.yml"   # alerting rule files
Add a general node alerting rule with the following content:
groups:
- name: general.rules
  rules:
  # Alert for any instance that is unreachable for >1 minute.
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
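The `for: 1m` clause means the alert does not fire the instant `up == 0` becomes true; it stays in a pending state until the condition has held continuously for one minute. A minimal Python sketch of that pending-to-firing logic (a simplified model for illustration, not Prometheus's actual implementation):

```python
# Simplified model of Prometheus's "for:" behavior: an alert moves from
# pending to firing only after its expression has been true continuously
# for the configured duration.

FOR_DURATION = 60  # seconds, matching "for: 1m"

def alert_state(samples, for_duration=FOR_DURATION):
    """samples: list of (timestamp_seconds, up_value). Returns final state."""
    pending_since = None
    state = "inactive"
    for ts, up in samples:
        if up == 0:  # the rule expression "up == 0" is true
            if pending_since is None:
                pending_since = ts
                state = "pending"
            if ts - pending_since >= for_duration:
                state = "firing"
        else:  # expression false again: reset
            pending_since = None
            state = "inactive"
    return state

print(alert_state([(0, 0), (30, 0)]))           # down for 30s -> pending
print(alert_state([(0, 0), (30, 0), (90, 0)]))  # down for 90s -> firing
```

Note that a target recovering even briefly resets the timer, which is why short flaps do not trigger the alert.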
In the Prometheus UI, check the Targets status; at this point the node http://192.168.199.221:9100/metrics is in the UP state.
Stop node_exporter on 192.168.199.221 and observe again.
Check the Alerts status.
After a short wait, the alert email arrives.
Add a memory usage alerting rule with the following content:
groups:
- name: mem.rules
  rules:
  # Alert when memory usage on any instance exceeds the threshold for >1 minute.
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 5
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage is too high"
      description: "Memory usage on {{ $labels.instance }} of job {{ $labels.job }} has exceeded 5% for more than 1 minute."
Note: since this is only for testing, the threshold is set so that usage above 5% triggers an alert.
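The expression treats memory as "used" only when it is neither free nor held in buffers/cache, then converts that to a percentage of total memory. A plain-Python sketch of the same arithmetic (the metric values below are made-up sample numbers):

```python
# Mirrors the PromQL expression:
# (MemTotal - (MemFree + Buffers + Cached)) / MemTotal * 100

def memory_usage_percent(total, free, buffers, cached):
    """Percentage of memory in use, excluding buffers and page cache."""
    return (total - (free + buffers + cached)) / total * 100

GiB = 1024 ** 3
# Example: 8 GiB total, 4 GiB free, 1 GiB buffers, 1 GiB cached -> 25% used
usage = memory_usage_percent(8 * GiB, 4 * GiB, 1 * GiB, 1 * GiB)
print(round(usage, 1))  # 25.0
print(usage > 5)        # True -> NodeMemoryUsage would fire at the 5% test threshold
```

Subtracting buffers and cache matters: Linux keeps otherwise-idle memory in the page cache, so raw `MemTotal - MemFree` would overstate real usage.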
Reload Prometheus and check the alert status in the Prometheus UI.
Verify that the rule has taken effect.
After a moment, the alert email arrives.
Add a CPU alerting rule file with the following content:
groups:
- name: cpu.rules
  rules:
  # Alert when CPU usage on any instance exceeds the threshold for >1 minute.
  - alert: NodeCpuUsage
    expr: 100 - irate(node_cpu_seconds_total{job="node",mode="idle"}[5m]) * 100 > 1
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "{{ $labels.instance }} CPU usage is too high"
      description: "CPU usage on {{ $labels.instance }} of job {{ $labels.job }} has been too high for more than 1 minute."
The threshold here is set to 1% purely for testing. Reload Prometheus, and after a moment the alert email arrives.
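The CPU expression works on the cumulative counter `node_cpu_seconds_total{mode="idle"}`: `irate()` gives roughly the idle seconds accrued per wall-clock second between the two most recent samples, and subtracting that fraction (times 100) from 100 yields the busy percentage. A sketch of that calculation with made-up counter samples:

```python
# Approximates 100 - irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100
# from two samples of the cumulative idle-seconds counter.

def cpu_busy_percent(idle_prev, idle_now, t_prev, t_now):
    """Busy CPU percentage derived from two idle-counter samples."""
    idle_rate = (idle_now - idle_prev) / (t_now - t_prev)  # idle seconds per second
    return 100 - idle_rate * 100

# Samples 15s apart; the idle counter grew by 13.5s -> 90% idle -> 10% busy
busy = cpu_busy_percent(1000.0, 1013.5, 0.0, 15.0)
print(round(busy, 1))  # 10.0
print(busy > 1)        # True -> NodeCpuUsage would fire at the 1% test threshold
```

Because the counter only ever increases, rate-style functions like `irate()` are the idiomatic way to turn it into an instantaneous utilization figure.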
Reposted from: https://blog.51cto.com/lvsir666/2409063