Implementing Monitoring Alerts with Prometheus and Alertmanager
2022-12-11
天草二十六_简村人
I. Background
Prometheus already scrapes a large number of metrics covering machines, applications, middleware, and more, but until now we have had no alerting. After all, nobody can stare at a Grafana dashboard around the clock.
This article covers only the alerting implementation. It assumes you already have a basic understanding of Prometheus and some programming experience.
Using Alertmanager for alerting is the approach officially recommended by Prometheus. The amount of code in this article is very small: all we need to provide is a webhook callback endpoint. The core of the implementation does not live in Alertmanager itself, so I hope that does not disappoint you.
II. Goals
- 1. Detect machine failures, abnormal metrics, and similar problems in a timely manner.
- 2. Add alerting on top of the existing monitoring, applicable to any environment. Alert rules are released the same way application code is: from development to test, and then to production.
III. Deployment diagram
(Deployment diagram omitted.) Prometheus can monitor a wide range of targets; beyond the ones listed here, that also includes containers, Prometheus itself, and more.
IV. Alert implementation
1. prometheus
Startup command: nohup ./prometheus --web.enable-lifecycle --web.enable-admin-api --storage.tsdb.retention=60d &
- Hot reload: curl -X POST http://127.0.0.1:9090/-/reload
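Before triggering a hot reload, the configuration can be validated offline. A minimal sketch, assuming the promtool binary that ships in the same release tarball as prometheus:
# check prometheus.yml and the rule files it references before reloading
./promtool check config prometheus.yml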
prometheus.yml
This is where Alertmanager, the rule files, and the scrape targets are configured. Scrape targets can come from a custom JSON file (file_sd), from a registry such as Consul, or from a plain static list of addresses. (A sketch of such a targets file follows right after this configuration.)
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/opt/prometheus-2.17.2.linux-amd64/rules/*.yml"
  #- "first_rules.yml"
  #- "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'file_sd'
    metrics_path: '/metrics'
    file_sd_configs:
    - files:
      - linux-targets.json
  - job_name: 'consul-prometheus'
    metrics_path: '/mgm/prometheus'
    consul_sd_configs:
    - server: '192.168.50.61:8500'
      services: []
  - job_name: 'cAdvisor'
    metrics_path: '/metrics'
    static_configs:
    - targets: ['192.168.10.150:8091','192.168.10.120:8091','192.168.5.66:8091']
  - job_name: cwp-to-video
    metrics_path: '/mgm/prometheus'
    static_configs:
    - targets: ['192.168.53.29:7109']
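The file_sd job above reads its targets from linux-targets.json, which the original configuration does not show. A hypothetical example of that file, assuming node_exporter targets on the default port 9100 and a job label of linux so that the up{job="linux"} rule below matches:
[
  {
    "targets": ["192.168.10.150:9100", "192.168.10.120:9100"],
    "labels": {
      "job": "linux"
    }
  }
]
Labels declared in the file are attached to every target in the group, and a job label set here takes precedence over the default derived from job_name.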
Rule files (alerting rules)
(Screenshot of the rules directory omitted.) As the main configuration above shows, the rule files are stored under /opt/prometheus-2.17.2.linux-amd64/rules/*.yml.
- node-up.yml
groups:
- name: node-rule
  rules:
  - alert: linux机器
    expr: up{job="linux"} == 0  # only monitor up/down of linux machines, not services
    for: 120s
    labels:
      severity: warning
    annotations:
      summary: "机器{{ $labels.instance }} 挂了"
      description: "报告.请立即查看!"
      value: "{{ $value }}"
- api.yml
groups:
- name: api-rule
  rules:
  - alert: "3秒以上的慢接口"
    expr: sum(increase(http_server_requests_seconds_count{}[1m])) by (application) - sum(increase(http_server_requests_seconds_bucket{le="3.0"}[1m])) by (application) > 10
    for: 120s
    labels:
      severity: warning
      application: "{{$labels.application}}"
    annotations:
      summary: "服务名:{{$labels.application}}3秒以上的慢接口超过10次"
      description: "应用的慢接口(3秒以上)次数的监控"
      value: "{{ $value }}"
  - alert: "5xx错误的接口"
    expr: sum(increase(http_server_requests_seconds_count{status=~"5.."}[1m])) by (application) > 10
    for: 120s
    labels:
      severity: warning
      application: "{{$labels.application}}"
    annotations:
      summary: "服务名:{{$labels.application}}接口出现5xx错误的次数超过10次"
      description: "应用的5xx错误次数的监控"
      value: "{{ $value }}"
- logback.yml
groups:
- name: logback-rule
  rules:
  - alert: "日志报警"
    expr: sum by (application) (increase(logback_events_total{level="error"}[1m])) > 10
    for: 15s
    labels:
      application: "{{$labels.application}}"
      severity: warning
    annotations:
      summary: "服务名:{{$labels.application}}错误日志数超过了每分钟10条"
      description: "应用的报警值: {{ $value }}"
      value: "{{ $value }}"
- disk.yml
groups:
- name: disk-rule
  rules:
  - alert: "磁盘空间报警"
    expr: 100 - (node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) > 80
    for: 60s
    labels:
      severity: warning
    annotations:
      summary: "服务名:{{$labels.instance}}磁盘空间使用超过80%了"
      description: "开发环境机器报警值: {{ $value }}"
      value: "{{ $value }}"
- cpu.yml
groups:
- name: cpu-rule
  rules:
  - alert: "CPU报警"
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
    for: 120s
    labels:
      severity: warning
      instance: "{{$labels.instance}}"
    annotations:
      summary: "linux机器:{{$labels.instance}}CPU使用率超过80%了"
      description: "开发环境机器报警值: {{ $value }}"
      value: "{{ $value }}"
  - alert: "linux load5 over 5"
    for: 120s
    expr: node_load5 > 5
    labels:
      severity: warning
      instance: "{{$labels.instance}}"
    annotations:
      description: "{{ $labels.instance }} over 5, 当前值:{{ $value }}"
      summary: "linux load5 over 5"
      value: "{{ $value }}"
- memory.yml
groups:
- name: memory-rule
  rules:
  - alert: "内存使用率高"
    expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 80
    for: 120s
    labels:
      severity: warning
    annotations:
      summary: "服务名:{{$labels.instance}}内存使用率超过80%了"
      description: "开发环境机器报警,内存使用率过高!"
      value: "{{ $value }}"
  - alert: "内存不足提醒"
    expr: (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 < 10
    for: 120s
    labels:
      severity: warning
    annotations:
      summary: "linux机器:{{$labels.instance}}内存不足低于10%了"
      description: "开发环境机器报警,内存不足!"
      value: "{{ $value }}"
- network.yml
groups:
- name: network-rule
  rules:
  - alert: "eth0 input traffic network over 10M"
    expr: sum by(instance) (irate(node_network_receive_bytes_total{device="eth0",instance!~"172.1.*|172..*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
    for: 60s
    labels:
      severity: warning
      instance: "{{$labels.instance}}"
    annotations:
      summary: "eth0 input traffic network over 10M"
      description: "{{$labels.instance}}流入流量为:{{ $value }}M"
      value: "{{ $value }}"
  - alert: "eth0 output traffic network over 10M"
    expr: sum by(instance) (irate(node_network_transmit_bytes_total{device="eth0",instance!~"172.1.*|175.*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
    for: 60s
    labels:
      severity: warning
      instance: "{{$labels.instance}}"
    annotations:
      summary: "eth0 output traffic network over 10M"
      description: "{{$labels.instance}}流出流量为: {{ $value }}"
      value: "{{ $value }}"
2. alertmanager
Startup command:
nohup ./alertmanager 2>&1 | tee -a alertmanager.log &
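The Alertmanager configuration can also be validated before starting; a minimal sketch, assuming the amtool binary from the Alertmanager release tarball:
# check alertmanager.yml for syntax and semantic errors
./amtool check-config alertmanager.yml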
alertmanager.yml
Here we group and label alerts and configure the custom webhook callback (the actual sending of notifications is implemented behind that endpoint).
global:
  resolve_timeout: 5m
route:
  group_wait: 30s       # wait this long within a group; identical alerts arriving within 30s are batched into one notification
  group_interval: 5m    # if the group content does not change, merge into a single notification and send it after 5m
  repeat_interval: 24h  # re-send interval: if the alert is not resolved within this period, the notification is sent again
  group_by: ['alertname'] # group alerts by these labels
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://192.168.10.47/devops/api/prometheus/notify?env=dev'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
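To verify the whole chain (Alertmanager, webhook, message delivery) end to end, a test alert can be pushed directly into Alertmanager through its v2 HTTP API. A minimal sketch; the alertname and instance values here are made up purely for testing:
curl -X POST http://127.0.0.1:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"webhook-test","severity":"warning","instance":"test-host"},"annotations":{"summary":"manual test alert","description":"fired by hand to verify the webhook receiver"}}]'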
3. Sending the alert
- The payload received by the webhook
{
"receiver":"webhook",
"status":"resolved",
"alerts":[
{
"status":"resolved",
"labels":{
"alertname":"linux机器",
"instance":"192.168.8.18",
"job":"linux",
"severity":"warning"
},
"annotations":{
"description":"报告.请立即查看!",
"summary":"机器192.168.8.18 挂了",
"value":"0"
},
"startsAt":"2022-05-27T16:57:03.205485317+08:00",
"endsAt":"2022-12-12T15:05:03.205485317+08:00",
"generatorURL":"[http://CplusSev0201:9090/graph?g0.expr=up%7Bjob%3D%22linux%22%7D+%3D%3D+0\u0026g0.tab=1](http://cplussev0201:9090/graph?g0.expr=up%7Bjob%3D%22linux%22%7D+%3D%3D+0\u0026g0.tab=1)",
"fingerprint":"f6e41781dace19ad"
}
],
"groupLabels":{
"alertname":"linux机器"
},
"commonLabels":{
"alertname":"linux机器",
"job":"linux",
"severity":"warning"
},
"commonAnnotations":{
"description":"报告.请立即查看!",
"value":"0"
},
"externalURL":"[http://CplusSev0201:9093](http://cplussev0201:9093/)",
"version":"4",
"groupKey":"{}:{alertname=\"linux机器\"}",
"truncatedAlerts":0
}
- Processing the payload
This is much more convenient than implementing delivery inside Alertmanager. The core problem is knowing who owns a given machine or application, and that really belongs in a CMDB. The recommendation is therefore a labeling scheme: tag machines, middleware, and applications, map those tags to the labels on the alert message, and the right recipient is easy to find. This is one of the reasons we write our own callback endpoint.
- Environments are distinguished by the env parameter, which is passed along with the callback request.
- The message body does not use an Alertmanager template, because the messages Alertmanager sends are already batched: it aggregates and de-duplicates the alerts for us.
- Finally, the delivery policy: you can send to a specific person, to a designated group chat, via SMS, or simply log an error (and let Sentry raise the alarm). With a UI, administrators would typically choose a notification time window; after all, a dev or test environment going down is far lower priority than production. The endpoint sketch and the handler body follow below.
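The handler body shown next is only a fragment. A minimal sketch of the Spring MVC endpoint it might live in, based on the webhook URL configured above; the class name, annotations, and return value are assumptions, not taken from the original project:
// Hypothetical controller wrapper; the real project may differ.
import javax.servlet.http.HttpServletRequest;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@Slf4j
@RestController
@RequestMapping("/devops/api/prometheus")
public class PrometheusNotifyController {

    // Matches the receiver URL .../devops/api/prometheus/notify?env=dev in alertmanager.yml
    @PostMapping("/notify")
    public ResponseEntity<?> notifyAlert(@RequestBody String requestJson, HttpServletRequest request) {
        // ... the handler body shown below goes here ...
        return ResponseEntity.ok().build();
    }
}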
JSONObject jsonObject = JSON.parseObject(requestJson);
String alerts = jsonObject.getString("alerts");
if (StringUtils.isEmpty(alerts)) {
    if (log.isWarnEnabled()) {
        log.warn("the alert list in the Prometheus callback is empty");
    }
    return ResponseEntity.noContent().build();
}
// env is one of dev/test/prod; the default is prod
String env = StringUtils.isEmpty(request.getParameter("env")) ? "prod" : request.getParameter("env");
List<AlertDTO> alertDTOList = JSONObject.parseArray(alerts, AlertDTO.class);
StringBuilder content = new StringBuilder("> Prometheus has fired alerts that need attention!!\n");
for (AlertDTO alert : alertDTOList) {
    content.append("> =======start=========\n\n");
    content.append("> **Alert type:** ").append(alert.getLabels().getAlertname()).append("\n\n");
    content.append("> **Summary:** ").append(alert.getAnnotations().getSummary()).append("\n\n");
    content.append("> **Details:** ").append(alert.getAnnotations().getDescription()).append("\n\n");
    content.append("> **Triggering value:** ").append(alert.getAnnotations().getValue()).append("\n\n");
    content.append("> **Triggered at:** ").append(this.formatDateTime(alert.getStartsAt())).append("\n\n");
    content.append("> **Link:** ").append(this.replaceUrl(alert.getGeneratorURL()))
        .append("[click to open](").append(this.replaceUrl(alert.getGeneratorURL())).append(")").append("\n\n");
    content.append("> =======end=========\n\n");
    content.append("\n\n\n\n\n");
}
// different environments use different delivery policies
switch (env) {
    case "dev":
    case "test":
        int iHour = DateUtil.thisHour(true);
        if (iHour >= 8 && iHour <= 18) {
            WxchatMessageUtil.sendByPhone(content.toString(), "150xxxx9916");
        } else {
            log.error("Alert triggered outside working hours; only logging it for the record.");
        }
        break;
    case "prod":
        WxchatMessageUtil.sendByRobot(content.toString(), "a82xx480-3b64-485a-8c25-b90c483308cc");
        break;
    default:
        break;
}
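The AlertDTO type used above is not shown in the original post. A minimal sketch of what it might look like, covering exactly the payload fields the handler reads; Lombok's @Data is an assumption:
// Hypothetical DTO for one element of the "alerts" array in the Alertmanager webhook payload.
import lombok.Data;

@Data
public class AlertDTO {
    private String status;           // "firing" or "resolved"
    private Labels labels;
    private Annotations annotations;
    private String startsAt;
    private String endsAt;
    private String generatorURL;
    private String fingerprint;

    @Data
    public static class Labels {
        private String alertname;
        private String instance;
        private String job;
        private String severity;
        private String application;
    }

    @Data
    public static class Annotations {
        private String summary;
        private String description;
        private String value;
    }
}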
- Sending the alert
What would otherwise be complex configuration inside Alertmanager is implemented in the devops-service application instead, which makes it very flexible to support multiple delivery channels. Strongly recommended.
- How to actually send WeChat Work, SMS, or DingTalk messages is outside the scope of this article, so I will not go into it here.