Prometheus 之 告警规则配置

一、告警规则实现原理

目前Prometheus与Alertmanager已经连通,接下来我们可以针对收集到的各类指标配置报警规则,一旦满足报警规则的设置,则Prometheus将报警信息推送给Alertmanager,进而转发到我们配置的邮件中。在哪里配置?同样是在prometheus-configmap中:
Prometheus 之  告警规则配置

二、修改prometheus配置文件

prometheus-configmap.yaml 配置文件增加rule_files与alert_rules.yml
#一个ConfigMap上传两个配置文件,分别为prometheus.yml与alert_rules.yml

$ vim prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitor
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
      evaluation_interval: 30s
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager:9093
    # Load rules once and periodically evaluate them according to the global  'evaluation_interval'.
    rule_files:        #增加的内容
      - /etc/prometheus/alert_rules.yml
      # - "first_rules.yml"
      # - "second_rules.yml"
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
... # 省略中间部分
  alert_rules.yml: |              #增加的内容
    groups:
    - name: node_metrics
      rules:
      - alert: NodeLoad
        expr: node_load15 < 1
        for: 2m
        annotations:
          summary: "{{$labels.instance}}: Low node load detected"
          description: "{{$labels.instance}}: node load is below 1 (current value is: {{ $value }}"

Prometheus 之  告警规则配置

Prometheus 之  告警规则配置

三、告警规则的理解

  alert_rules.yml: |             
    groups:
    - name: node_metrics
      rules:
      - alert: NodeLoad
        expr: node_load15 < 1    #检查各节点cpu5分钟负载低于1的
        for: 2m
        annotations:
          summary: "{{$labels.instance}}: Low node load detected"
          description: "{{$labels.instance}}: node load is below 1 (current value is: {{ $value }}"
  • group.name:告警分组的名称,一个组下可以配置一类告警规则,比如都是物理节点相关的告警
  • alert:告警规则的名称
  • expr:是用于进行报警规则 PromQL 查询语句,expr通常是布尔表达式,可以让Prometheus根据计算的指标值做 true or false 的判断
  • for:评估等待时间(Pending Duration),用于表示只有当触发条件持续一段时间后才发送告警,在等待期间新产生的告警状态为pending,屏蔽掉瞬时的问题,把焦点放在真正有持续影响的问题上
  • labels:自定义标签,允许用户指定额外的标签列表,把它们附加在告警上,可以用于后面做路由判断,通知到不同的终端,通常被用于添加告警级别的标签
  • annotations:指定了另一组标签,它们不被当做告警实例的身份标识,它们经常用于存储一些额外的信息,用于报警信息的展示之类的

规则配置中,支持模板的方式,其中:

  • {{$labels}}可以获取当前指标的所有标签,支持{{$labels.instance}}或者{{$labels.job}}这种形式
  • {{ $value }}可以获取当前计算出的指标值

一个报警信息在生命周期内有下面3种状态:

  • inactive: 表示当前报警信息处于非活动状态,即不满足报警条件
  • pending: 表示在设置的阈值时间范围内被激活了,即满足报警条件,但是还在观察期内
  • firing: 表示超过设置的阈值时间被激活了,即满足报警条件,且报警触发时间超过了观察期,会发送到Alertmanager端

四、检查告警内容

Prometheus Alerts 内容

Prometheus 之  告警规则配置

收到邮件告警与恢复信息

Prometheus 之  告警规则配置

Prometheus 之  告警规则配置

整个prometheus-configmap.yaml文件内容

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitor
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager:9093    #通过alertmanager的svc方式推送
    # Load rules once and periodically evaluate them according to the global  'evaluation_interval'.
    rule_files:
      - /etc/prometheus/alert_rules.yml
      # - "first_rules.yml"
      # - "second_rules.yml"
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'coredns'
      static_configs:
      - targets: ['10.96.0.10:9153']
    - job_name: 'kubernetes-apiserver'
      static_configs:
      - targets: ['10.96.0.1']         #apiserver的svc
      scheme: https                    #访问方式,默认是http
      tls_config:                          #因apiserver的调用需有授权认证信息
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    - job_name: 'kubernetes-scheduler'
      static_configs:
      - targets: ['10.3.153.200:10251']
    - job_name: 'kubernetes-controller'
      static_configs:
      - targets: ['10.3.153.200:10252']
    - job_name: 'kubernetes-sd-node-exporter'
      kubernetes_sd_configs:
        - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
    - job_name: 'kubernetes-sd-cadvisor'
      kubernetes_sd_configs:
        - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
        - target_label: __address__
          replacement: 10.96.0.1
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
  alert_rules.yml: |              
    groups:
    - name: node_metrics
      rules:
      - alert: NodeLoad
        expr: node_load15 < 1
        for: 2m
        annotations:
          summary: "{{$labels.instance}}: Low node load detected"
          description: "{{$labels.instance}}: node load is below 1 (current value is: {{ $value }}"

猜你喜欢

转载自blog.51cto.com/12965094/2687946