Implementing Monitoring Alerts with Prometheus and Alertmanager
2022-12-11
天草二十六_简村人
I. Background
Prometheus already scrapes a large number of metrics covering machines, applications, middleware, and more, but until now we have had no alerting. After all, nobody can stare at a Grafana dashboard around the clock.
This article covers only the alerting implementation. It assumes you already have a basic understanding of Prometheus and some programming experience.
Using Alertmanager for alerting is the approach officially recommended by Prometheus. The amount of code in this article is very small: all we need to provide is a webhook callback endpoint. The core of the implementation does not live in Alertmanager itself, so I hope that does not disappoint you.
II. Goals
- 1. Detect machine failures, abnormal metrics, and similar problems in a timely manner.
- 2. Add alerting on top of the existing monitoring, applicable to any environment. Alert rules are released the same way application code is: from development to test, and then to production.
III. Deployment diagram
(Deployment diagram omitted.) Prometheus can monitor a wide range of targets; beyond the ones listed here, that also includes containers, Prometheus itself, and more.
IV. Alert implementation
1. prometheus
Startup command: nohup ./prometheus --web.enable-lifecycle --web.enable-admin-api --storage.tsdb.retention=60d &
- Hot reload: curl -X POST http://127.0.0.1:9090/-/reload
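Before triggering a hot reload, the configuration can be validated offline. A minimal sketch, assuming the promtool binary that ships in the same release tarball as prometheus:
# check prometheus.yml and the rule files it references before reloading
./promtool check config prometheus.yml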
prometheus.yml
This is where Alertmanager, the rule files, and the scrape targets are configured. Scrape targets can come from a custom JSON file (file_sd), from a registry such as Consul, or from a plain static list of addresses. (A sketch of such a targets file follows right after this configuration.)
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/opt/prometheus-2.17.2.linux-amd64/rules/*.yml"
  #- "first_rules.yml"
  #- "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'file_sd'
    metrics_path: '/metrics'
    file_sd_configs:
    - files:
      - linux-targets.json
  - job_name: 'consul-prometheus'
    metrics_path: '/mgm/prometheus'
    consul_sd_configs:
    - server: '192.168.50.61:8500'
      services: []
  - job_name: 'cAdvisor'
    metrics_path: '/metrics'
    static_configs:
    - targets: ['192.168.10.150:8091','192.168.10.120:8091','192.168.5.66:8091']
  - job_name: cwp-to-video
    metrics_path: '/mgm/prometheus'
    static_configs:
    - targets: ['192.168.53.29:7109']
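The file_sd job above reads its targets from linux-targets.json, which the original configuration does not show. A hypothetical example of that file, assuming node_exporter targets on the default port 9100 and a job label of linux so that the up{job="linux"} rule below matches:
[
  {
    "targets": ["192.168.10.150:9100", "192.168.10.120:9100"],
    "labels": {
      "job": "linux"
    }
  }
]
Labels declared in the file are attached to every target in the group, and a job label set here takes precedence over the default derived from job_name.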
Rule files (alerting rules)
(Screenshot of the rules directory omitted.) As the main configuration above shows, the rule files are stored under /opt/prometheus-2.17.2.linux-amd64/rules/*.yml.
- node-up.yml
groups:
- name: node-rule
  rules:
  - alert: linux机器
    expr: up{job="linux"} == 0  # only monitor up/down of linux machines, not services
    for: 120s
    labels:
      severity: warning
    annotations:
      summary: "机器{{ $labels.instance }} 挂了"
      description: "报告.请立即查看!"
      value: "{{ $value }}"
- api.yml
groups:
- name: api-rule
  rules:
  - alert: "3秒以上的慢接口"
    expr: sum(increase(http_server_requests_seconds_count{}[1m])) by (application) - sum(increase(http_server_requests_seconds_bucket{le="3.0"}[1m])) by (application) > 10
    for: 120s
    labels:
      severity: warning
      application: "{{$labels.application}}"
    annotations:
      summary: "服务名:{{$labels.application}}3秒以上的慢接口超过10次"
      description: "应用的慢接口(3秒以上)次数的监控"
      value: "{{ $value }}"
  - alert: "5xx错误的接口"
    expr: sum(increase(http_server_requests_seconds_count{status=~"5.."}[1m])) by (application) > 10
    for: 120s
    labels:
      severity: warning
      application: "{{$labels.application}}"
    annotations:
      summary: "服务名:{{$labels.application}}接口出现5xx错误的次数超过10次"
      description: "应用的5xx错误次数的监控"
      value: "{{ $value }}"
- logback.yml
groups:
- name: logback-rule
  rules:
  - alert: "日志报警"
    expr: sum by (application) (increase(logback_events_total{level="error"}[1m])) > 10
    for: 15s
    labels:
      application: "{{$labels.application}}"
      severity: warning
    annotations:
      summary: "服务名:{{$labels.application}}错误日志数超过了每分钟10条"
      description: "应用的报警值: {{ $value }}"
      value: "{{ $value }}"
- disk.yml
groups:
- name: disk-rule
  rules:
  - alert: "磁盘空间报警"
    expr: 100 - (node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) > 80
    for: 60s
    labels:
      severity: warning
    annotations:
      summary: "服务名:{{$labels.instance}}磁盘空间使用超过80%了"
      description: "开发环境机器报警值: {{ $value }}"
      value: "{{ $value }}"
- cpu.yml
groups:
- name: cpu-rule
  rules:
  - alert: "CPU报警"
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
    for: 120s
    labels:
      severity: warning
      instance: "{{$labels.instance}}"
    annotations:
      summary: "linux机器:{{$labels.instance}}CPU使用率超过80%了"
      description: "开发环境机器报警值: {{ $value }}"
      value: "{{ $value }}"
  - alert: "linux load5 over 5"
    for: 120s
    expr: node_load5 > 5
    labels:
      severity: warning
      instance: "{{$labels.instance}}"
    annotations:
      description: "{{ $labels.instance }} over 5, 当前值:{{ $value }}"
      summary: "linux load5 over 5"
      value: "{{ $value }}"
- memory.yml
groups:
- name: memory-rule
  rules:
  - alert: "内存使用率高"
    expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 80
    for: 120s
    labels:
      severity: warning
    annotations:
      summary: "服务名:{{$labels.instance}}内存使用率超过80%了"
      description: "开发环境机器报警,内存使用率过高!"
      value: "{{ $value }}"
  - alert: "内存不足提醒"
    expr: (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 < 10
    for: 120s
    labels:
      severity: warning
    annotations:
      summary: "linux机器:{{$labels.instance}}内存不足低于10%了"
      description: "开发环境机器报警,内存不足!"
      value: "{{ $value }}"
- network.yml
groups:
- name: network-rule
  rules:
  - alert: "eth0 input traffic network over 10M"
    expr: sum by(instance) (irate(node_network_receive_bytes_total{device="eth0",instance!~"172.1.*|172..*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
    for: 60s
    labels:
      severity: warning
      instance: "{{$labels.instance}}"
    annotations:
      summary: "eth0 input traffic network over 10M"
      description: "{{$labels.instance}}流入流量为:{{ $value }}M"
      value: "{{ $value }}"
  - alert: "eth0 output traffic network over 10M"
    expr: sum by(instance) (irate(node_network_transmit_bytes_total{device="eth0",instance!~"172.1.*|175.*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
    for: 60s
    labels:
      severity: warning
      instance: "{{$labels.instance}}"
    annotations:
      summary: "eth0 output traffic network over 10M"
      description: "{{$labels.instance}}流出流量为: {{ $value }}"
      value: "{{ $value }}"
2. alertmanager
Startup command:
nohup ./alertmanager 2>&1 | tee -a alertmanager.log &
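The Alertmanager configuration can also be validated before starting; a minimal sketch, assuming the amtool binary from the Alertmanager release tarball:
# check alertmanager.yml for syntax and semantic errors
./amtool check-config alertmanager.yml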
alertmanager.yml
Here we group and label alerts and configure the custom webhook callback (the actual sending of notifications is implemented behind that endpoint).
global:
  resolve_timeout: 5m
route:
  group_wait: 30s       # wait this long within a group; identical alerts arriving within 30s are batched into one notification
  group_interval: 5m    # if the group content does not change, merge into a single notification and send it after 5m
  repeat_interval: 24h  # re-send interval: if the alert is not resolved within this period, the notification is sent again
  group_by: ['alertname'] # group alerts by these labels
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://192.168.10.47/devops/api/prometheus/notify?env=dev'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
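To verify the whole chain (Alertmanager, webhook, message delivery) end to end, a test alert can be pushed directly into Alertmanager through its v2 HTTP API. A minimal sketch; the alertname and instance values here are made up purely for testing:
curl -X POST http://127.0.0.1:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"webhook-test","severity":"warning","instance":"test-host"},"annotations":{"summary":"manual test alert","description":"fired by hand to verify the webhook receiver"}}]'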
3. Sending the alert
- The payload received by the webhook
{
"receiver":"webhook",
"status":"resolved",
"alerts":[
{
"status":"resolved",
"labels":{
"alertname":"linux机器",
"instance":"192.168.8.18",
"job":"linux",
"severity":"warning"
},
"annotations":{
"description":"报告.请立即查看!",
"summary":"机器192.168.8.18 挂了",
"value":"0"
},
"startsAt":"2022-05-27T16:57:03.205485317+08:00",
"endsAt":"2022-12-12T15:05:03.205485317+08:00",
"generatorURL":"[http://CplusSev0201:9090/graph?g0.expr=up%7Bjob%3D%22linux%22%7D+%3D%3D+0\u0026g0.tab=1](http://cplussev0201:9090/graph?g0.expr=up%7Bjob%3D%22linux%22%7D+%3D%3D+0\u0026g0.tab=1)",
"fingerprint":"f6e41781dace19ad"
}
],
"groupLabels":{
"alertname":"linux机器"
},
"commonLabels":{
"alertname":"linux机器",
"job":"linux",
"severity":"warning"
},
"commonAnnotations":{
"description":"报告.请立即查看!",
"value":"0"
},
"externalURL":"[http://CplusSev0201:9093](http://cplussev0201:9093/)",
"version":"4",
"groupKey":"{}:{alertname=\"linux机器\"}",
"truncatedAlerts":0
}
- Processing the payload
This is much more convenient than implementing delivery inside Alertmanager. The core problem is knowing who owns a given machine or application, and that really belongs in a CMDB. The recommendation is therefore a labeling scheme: tag machines, middleware, and applications, map those tags to the labels on the alert message, and the right recipient is easy to find. This is one of the reasons we write our own callback endpoint.
- Environments are distinguished by the env parameter, which is passed along with the callback request.
- The message body does not use an Alertmanager template, because the messages Alertmanager sends are already batched: it aggregates and de-duplicates the alerts for us.
- Finally, the delivery policy: you can send to a specific person, to a designated group chat, via SMS, or simply log an error (and let Sentry raise the alarm). With a UI, administrators would typically choose a notification time window; after all, a dev or test environment going down is far lower priority than production. The endpoint sketch and the handler body follow below.
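The handler body shown next is only a fragment. A minimal sketch of the Spring MVC endpoint it might live in, based on the webhook URL configured above; the class name, annotations, and return value are assumptions, not taken from the original project:
// Hypothetical controller wrapper; the real project may differ.
import javax.servlet.http.HttpServletRequest;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@Slf4j
@RestController
@RequestMapping("/devops/api/prometheus")
public class PrometheusNotifyController {

    // Matches the receiver URL .../devops/api/prometheus/notify?env=dev in alertmanager.yml
    @PostMapping("/notify")
    public ResponseEntity<?> notifyAlert(@RequestBody String requestJson, HttpServletRequest request) {
        // ... the handler body shown below goes here ...
        return ResponseEntity.ok().build();
    }
}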
JSONObject jsonObject = JSON.parseObject(requestJson);
String alerts = jsonObject.getString("alerts");
if (StringUtils.isEmpty(alerts)) {
    if (log.isWarnEnabled()) {
        log.warn("the alert list in the Prometheus callback is empty");
    }
    return ResponseEntity.noContent().build();
}
// env is one of dev/test/prod; the default is prod
String env = StringUtils.isEmpty(request.getParameter("env")) ? "prod" : request.getParameter("env");
List<AlertDTO> alertDTOList = JSONObject.parseArray(alerts, AlertDTO.class);
StringBuilder content = new StringBuilder("> Prometheus has fired alerts that need attention!!\n");
for (AlertDTO alert : alertDTOList) {
    content.append("> =======start=========\n\n");
    content.append("> **Alert type:** ").append(alert.getLabels().getAlertname()).append("\n\n");
    content.append("> **Summary:** ").append(alert.getAnnotations().getSummary()).append("\n\n");
    content.append("> **Details:** ").append(alert.getAnnotations().getDescription()).append("\n\n");
    content.append("> **Triggering value:** ").append(alert.getAnnotations().getValue()).append("\n\n");
    content.append("> **Triggered at:** ").append(this.formatDateTime(alert.getStartsAt())).append("\n\n");
    content.append("> **Link:** ").append(this.replaceUrl(alert.getGeneratorURL()))
        .append("[click to open](").append(this.replaceUrl(alert.getGeneratorURL())).append(")").append("\n\n");
    content.append("> =======end=========\n\n");
    content.append("\n\n\n\n\n");
}
// different environments use different delivery policies
switch (env) {
    case "dev":
    case "test":
        int iHour = DateUtil.thisHour(true);
        if (iHour >= 8 && iHour <= 18) {
            WxchatMessageUtil.sendByPhone(content.toString(), "150xxxx9916");
        } else {
            log.error("Alert triggered outside working hours; only logging it for the record.");
        }
        break;
    case "prod":
        WxchatMessageUtil.sendByRobot(content.toString(), "a82xx480-3b64-485a-8c25-b90c483308cc");
        break;
    default:
        break;
}
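The AlertDTO type used above is not shown in the original post. A minimal sketch of what it might look like, covering exactly the payload fields the handler reads; Lombok's @Data is an assumption:
// Hypothetical DTO for one element of the "alerts" array in the Alertmanager webhook payload.
import lombok.Data;

@Data
public class AlertDTO {
    private String status;           // "firing" or "resolved"
    private Labels labels;
    private Annotations annotations;
    private String startsAt;
    private String endsAt;
    private String generatorURL;
    private String fingerprint;

    @Data
    public static class Labels {
        private String alertname;
        private String instance;
        private String job;
        private String severity;
        private String application;
    }

    @Data
    public static class Annotations {
        private String summary;
        private String description;
        private String value;
    }
}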
- Sending the alert
What would otherwise be complex configuration inside Alertmanager is implemented in the devops-service application instead, which makes it very flexible to support multiple delivery channels. Strongly recommended.
- How to actually send WeChat Work, SMS, or DingTalk messages is outside the scope of this article, so I will not go into it here.