Implementing Monitoring Alerts with Prometheus/Alertmanager

2022-12-11  天草二十六_简村人

I. Background

Prometheus already collects plenty of metrics covering machines, applications, middleware and more, but alerting has been missing all along; after all, we cannot keep staring at the Grafana dashboards.

This article focuses on implementing alerting; it assumes you already have a basic understanding of Prometheus and some programming experience.

Using Alertmanager for alerting is what Prometheus officially recommends. The amount of implementation code in this article is very small: all that is needed is to configure one callback (webhook) endpoint. The core implementation does not live in Alertmanager itself, so I hope that does not disappoint you.

II. Goals

Turn the metrics Prometheus already scrapes into actionable notifications: define alerting rules in Prometheus, route the firing alerts through Alertmanager, and deliver them to the right people via a custom webhook.

III. Deployment Diagram

[Deployment diagram]

Prometheus can monitor a wide range of targets; besides the ones listed here, it also covers containers, Prometheus itself, and more.

IV. Alert Implementation

1. prometheus

Start command: nohup ./prometheus --web.enable-lifecycle --web.enable-admin-api --storage.tsdb.retention=60d &
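
Since --web.enable-lifecycle is enabled, the configuration and rule files can be reloaded without restarting the process, for example (assuming the default listen port 9090):

curl -X POST http://localhost:9090/-/reload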

prometheus.yml

Here we configure the Alertmanager endpoint, the alerting rule files and the scrape targets. Scraping supports file-based service discovery with custom JSON files, a registry such as Consul, and of course plain static target lists (a sketch of linux-targets.json is shown right after the config).

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/opt/prometheus-2.17.2.linux-amd64/rules/*.yml"
   #- "first_rules.yml"
   #- "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'file_sd'
    metrics_path: '/metrics'
    file_sd_configs:
      - files:
        - linux-targets.json
  - job_name: 'consul-prometheus'
    metrics_path: '/mgm/prometheus'
    consul_sd_configs:
    - server: '192.168.50.61:8500'
      services: []
  - job_name: 'cAdvisor'
    metrics_path: '/metrics'
    static_configs:
    - targets: ['192.168.10.150:8091','192.168.10.120:8091','192.168.5.66:8091']
  - job_name: cwp-to-video
    metrics_path: '/mgm/prometheus'
    static_configs:
    - targets: ['192.168.53.29:7109']
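
The file_sd job above reads its targets from linux-targets.json, which is not shown in the original setup. A minimal sketch (the addresses and the node_exporter port 9100 are placeholders) could look like the following; the job label is overridden to "linux" so that the up{job="linux"} rule below matches:

[
  {
    "targets": ["192.168.10.150:9100", "192.168.10.120:9100"],
    "labels": { "job": "linux" }
  }
]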

Alerting rule files

As the main configuration above shows, the rule files live under /opt/prometheus-2.17.2.linux-amd64/rules/*.yml. The rules below are split across several files, one per category, each starting with its own groups: block.

groups:
- name: node-rule
  rules:
  - alert: linux机器
    expr: up{job="linux"} == 0 #只对linux机器监控上下线,不对服务 
    for: 120s
    labels:
      severity: warning
    annotations:
      summary: "机器{{ $labels.instance }} 挂了"
      description: "报告.请立即查看!"
      value: "{{ $value }}"
groups:
- name: api-rule
  rules:
  - alert: "3秒以上的慢接口"
    expr: sum(increase(http_server_requests_seconds_count{}[1m])) by (application) - sum(increase(http_server_requests_seconds_bucket{le="3.0"}[1m])) by (application) > 10
    for: 120s
    labels:
      severity: warning
      application: "{{$labels.application}}"
    annotations:
      summary: "服务名:{{$labels.application}}3秒以上的慢接口超过10次"
      description: "应用的慢接口(3秒以上)次数的监控"
      value: "{{ $value }}"

  - alert: "5xx错误的接口"
    expr: sum(increase(http_server_requests_seconds_count{status=~"5.."}[1m])) by (application)  > 10
    for: 120s
    labels:
      severity: warning
      application: "{{$labels.application}}"
    annotations:
      summary: "服务名:{{$labels.application}}接口出现5xx错误的次数超过10次"
      description: "应用的5xx错误次数的监控"
      value: "{{ $value }}"
groups:
- name: logback-rule
  rules:
  - alert: "日志报警"
    expr: sum by (application) (increase(logback_events_total{level="error"}[1m]))  > 10
    for: 15s
    labels:
      application: "{{$labels.application}}"
      severity: warning
    annotations:
      summary: "服务名:{{$labels.application}}错误日志数超过了每分钟10条"
      description: "应用的报警值: {{ $value }}"
      value: "{{ $value }}"
groups:
- name: disk-rule
  rules:
  - alert: "磁盘空间报警"
    expr: 100 - (node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) > 80
    for: 60s
    labels:
      severity: warning
    annotations:
      summary: "服务名:{{$labels.instance}}磁盘空间使用超过80%了"
      description: "开发环境机器报警值: {{ $value }}"
      value: "{{ $value }}"
groups:
- name: cpu-rule
  rules:
  - alert: "CPU报警"
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
    for: 120s
    labels:
      severity: warning
      instance: "{{$labels.instance}}"
    annotations:
      summary: "linux机器:{{$labels.instance}}CPU使用率超过80%了"
      description: "开发环境机器报警值: {{ $value }}"
      value: "{{ $value }}"

  - alert: "linux load5 over 5"
    for: 120s
    expr: node_load5 > 5
    labels:
      severity: warning
      instance: "{{$labels.instance}}"
    annotations:
      description: "{{ $labels.instance }} over 5, 当前值:{{ $value }}"
      summary: "linux load5 over 5"
      value: "{{ $value }}"
groups:
- name: memory-rule
  rules:
  - alert: "内存使用率高"
    expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 80
    for: 120s
    labels:
      severity: warning
    annotations:
      summary: "服务名:{{$labels.instance}}内存使用率超过80%了"
      description: "开发环境机器报警,内存使用率过高!"
      value: "{{ $value }}"

  - alert: "内存不足提醒"
    expr: (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 < 10
    for: 120s
    labels:
      severity: warning
    annotations:
      summary: "linux机器:{{$labels.instance}}内存不足低于10%了"
      description: "开发环境机器报警,内存不足!"
      value: "{{ $value }}"
groups:
- name: network-rule
  rules:
  - alert: "eth0 input traffic network over 10M"
    expr: sum by(instance) (irate(node_network_receive_bytes_total{device="eth0",instance!~"172.1.*|172..*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
    for: 60s
    labels:
      severity: warning
      instance: "{{$labels.instance}}"
    annotations:
      summary: "eth0 input traffic network over 10M"
      description: "{{$labels.instance}}流入流量为:{{ $value }}M"
      value: "{{ $value }}"

  - alert: "eth0 output traffic network over 10M"
    expr: sum by(instance) (irate(node_network_transmit_bytes_total{device="eth0",instance!~"172.1.*|175.*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
    for: 60s
    labels:
      severity: warning
      instance: "{{$labels.instance}}"
    annotations:
      summary: "eth0 output traffic network over 10M"
      description: "{{$labels.instance}}流出流量为: {{ $value }}"
      value: "{{ $value }}"

2. alertmanager

Start command

nohup ./alertmanager  2>&1 | tee -a alertmanager.log &

alertmanager.yml

Here we label and group the alerts and configure the custom webhook callback; the actual logic that sends the notifications lives behind that endpoint.

global:
  resolve_timeout: 5m

route:
  group_wait: 30s # wait 30s so that alerts for the same group arriving within this window are batched into one notification
  group_interval: 5m # wait at least 5m before notifying about new alerts added to an already-notified group
  repeat_interval: 24h # if an alert is still firing (not resolved), resend the notification every 24h
  group_by: ['alertname']  # group alerts by alert name
  receiver: 'webhook'

receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://192.168.10.47/devops/api/prometheus/notify?env=dev'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
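
Alertmanager ships with amtool, which can sanity-check this file before starting or reloading (assuming amtool sits next to the alertmanager binary):

./amtool check-config alertmanager.yml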

3. Sending alerts

When an alert fires (or later resolves), Alertmanager POSTs a JSON payload like the following to the configured webhook:

{
    "receiver":"webhook",
    "status":"resolved",
    "alerts":[
        {
            "status":"resolved",
            "labels":{
                "alertname":"linux机器",
                "instance":"192.168.8.18",
                "job":"linux",
                "severity":"warning"
            },
            "annotations":{
                "description":"报告.请立即查看!",
                "summary":"机器192.168.8.18 挂了",
                "value":"0"
            },
            "startsAt":"2022-05-27T16:57:03.205485317+08:00",
            "endsAt":"2022-12-12T15:05:03.205485317+08:00",
            "generatorURL":"[http://CplusSev0201:9090/graph?g0.expr=up%7Bjob%3D%22linux%22%7D+%3D%3D+0\u0026g0.tab=1](http://cplussev0201:9090/graph?g0.expr=up%7Bjob%3D%22linux%22%7D+%3D%3D+0\u0026g0.tab=1)",
            "fingerprint":"f6e41781dace19ad"
        }
    ],
    "groupLabels":{
        "alertname":"linux机器"
    },
    "commonLabels":{
        "alertname":"linux机器",
        "job":"linux",
        "severity":"warning"
    },
    "commonAnnotations":{
        "description":"报告.请立即查看!",
        "value":"0"
    },
    "externalURL":"[http://CplusSev0201:9093](http://cplussev0201:9093/)",
    "version":"4",
    "groupKey":"{}:{alertname=\"linux机器\"}",
    "truncatedAlerts":0
}
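
To exercise the devops-service endpoint without waiting for a real alert, this payload can be replayed by hand (saving it as alert.json is just for illustration):

curl -X POST -H 'Content-Type: application/json' --data @alert.json 'http://192.168.10.47/devops/api/prometheus/notify?env=dev'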

Doing the delivery in our own service is much more convenient than implementing it inside Alertmanager. The core problem is knowing who is responsible for a given machine or application, which really belongs in the CMDB domain. So the recommendation is a labeling mechanism: tag machines, middleware and applications, match those tags against the labels carried by the alert message, and the alert recipient is easy to find. This is one of the reasons we write the callback endpoint ourselves. The handler behind the webhook URL parses the payload and assembles the notification message:

    // Webhook endpoint called by Alertmanager (the alertmanager.yml above points at
    // /devops/api/prometheus/notify?env=...). The surrounding @RestController, the method
    // signature, the mapping path and the final return value are assumed for illustration;
    // AlertDTO, WxchatMessageUtil, formatDateTime() and replaceUrl() are project-local helpers.
    @PostMapping("/api/prometheus/notify")
    public ResponseEntity<Void> notify(@RequestBody String requestJson, HttpServletRequest request) {

        JSONObject jsonObject = JSON.parseObject(requestJson);

        String alerts = jsonObject.getString("alerts");
        if (StringUtils.isEmpty(alerts)) {
            if (log.isWarnEnabled()) {
                log.warn("The alert list in the Prometheus callback is empty");
            }
            return ResponseEntity.noContent().build();
        }

        // Environment: dev/test/prod, defaults to prod
        String env = StringUtils.isEmpty(request.getParameter("env")) ? "prod" : request.getParameter("env");

        List<AlertDTO> alertDTOList = JSONObject.parseArray(alerts, AlertDTO.class);

        StringBuilder content = new StringBuilder("> Prometheus出现告警,需要及时跟进!!\n");

        for (AlertDTO alert : alertDTOList) {

            content.append("> =======start=========\n\n");

            content.append("> **告警类型:** ").append(alert.getLabels().getAlertname()).append("\n\n");
            content.append("> **告警主题:** ").append(alert.getAnnotations().getSummary()).append("\n\n");
            content.append("> **告警详情:** ").append(alert.getAnnotations().getDescription()).append("\n\n");
            content.append("> **触发阈值:** ").append(alert.getAnnotations().getValue()).append("\n\n");

            content.append("> **触发时间:** ").append(this.formatDateTime(alert.getStartsAt())).append("\n\n");
            content.append("> **链接地址:** ").append(this.replaceUrl(alert.getGeneratorURL()))
                    .append("[点击跳转](").append(this.replaceUrl(alert.getGeneratorURL())).append(")").append("\n\n");

            content.append("> =======end=========\n\n");

            content.append("\n\n\n\n\n");
        }

        // Different environments use different delivery strategies
        switch (env) {
            case "dev":
            case "test":
                // In dev/test, only page a person during working hours (08:00-18:00); otherwise just log
                int iHour = DateUtil.thisHour(true);
                if (iHour >= 8 && iHour <= 18) {
                    WxchatMessageUtil.sendByPhone(content.toString(), "150xxxx9916");
                } else {
                    log.error("Alert triggered outside working hours; recorded in the log only.");
                }
                break;
            case "prod":
                WxchatMessageUtil.sendByRobot(content.toString(), "a82xx480-3b64-485a-8c25-b90c483308cc");
                break;
            default:
                break;
        }

        return ResponseEntity.ok().build();
    }
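
The AlertDTO referenced above is not shown in the original project. A minimal sketch that matches the payload fields the handler actually reads (Lombok and fastjson property binding are assumed) could look like this:

import lombok.Data;

// Maps one element of the "alerts" array in the Alertmanager payload.
@Data
public class AlertDTO {
    private String status;
    private Labels labels;
    private Annotations annotations;
    private String startsAt;      // ISO-8601 timestamps, formatted later by formatDateTime()
    private String endsAt;
    private String generatorURL;
    private String fingerprint;

    @Data
    public static class Labels {
        private String alertname;
        private String instance;
        private String job;
        private String severity;
        private String application;
    }

    @Data
    public static class Annotations {
        private String summary;
        private String description;
        private String value;
    }
}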

Configuration that would otherwise be complicated to express in Alertmanager is instead implemented in the devops-service, which can serve multiple delivery channels and stays very flexible. Strongly recommended.
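
WxchatMessageUtil is likewise project-local and not shown. If sendByRobot targets the standard WeChat Work group-robot webhook (an assumption; only the robot key comes from the code above), a bare-bones version could look like this:

import java.util.HashMap;
import java.util.Map;
import org.springframework.web.client.RestTemplate;

public class WxchatMessageUtil {

    private static final RestTemplate REST = new RestTemplate();

    // Posts a markdown message to a WeChat Work group robot identified by its webhook key.
    // Requires a JSON message converter (e.g. Jackson) on the classpath.
    public static void sendByRobot(String content, String robotKey) {
        String url = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=" + robotKey;

        Map<String, Object> markdown = new HashMap<>();
        markdown.put("content", content);

        Map<String, Object> body = new HashMap<>();
        body.put("msgtype", "markdown");
        body.put("markdown", markdown);

        REST.postForObject(url, body, String.class);
    }
}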

[Screenshot: the resulting alert message in WeChat Work]