Alertmanager Email Alert Configuration

2020-04-08  风吹路过的云

Here is a simple example for Prometheus monitoring: an email alert when disk space runs low.
The prometheus.yml configuration file:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['172.1.5.220:9093']
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "node_down.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['172.1.5.220:9090']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['172.1.5.220:8080']
  - job_name: 'harbor-250'
    static_configs:
      - targets: ['192.168.8.250:4080']
  - job_name: 'java-demo'
    scrape_interval: 5s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['192.168.9.222:8080']
  - job_name: 'node'
    scrape_interval: 8s
    static_configs:
      - targets: ['172.1.5.220:9100', '192.168.9.223:9100', '192.168.8.250:4100']
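
Before restarting Prometheus, it is worth validating this file. A minimal sketch using promtool (shipped with Prometheus); the reload call assumes Prometheus was started with --web.enable-lifecycle, otherwise just restart the process:

# Validate prometheus.yml, including the rule files it references
promtool check config prometheus.yml

# Tell the running Prometheus (address taken from the config above) to reload
curl -X POST http://172.1.5.220:9090/-/reload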

node_down.yml is configured as follows. The HostOutOfDiskSpace rule fires when less than 60% of the filesystem's space is still available.
If you are not sure how to write rules, have a look at the rule collection linked under References at the end; it has plenty of ready-made rules, and that is where this one came from.

groups:
- name: node_down
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      user: test
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."
- name: out_of_disk_space
  rules: 
  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/etc/hostname"}  * 100) / node_filesystem_size_bytes{mountpoint="/etc/hostname"} < 60
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host out of disk space (instance {{ $labels.instance }})"
      description: "Disk is almost full (< 60% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

alertmanager.yml is configured as follows. Since the company uses Alibaba's enterprise mail service, the smarthost is smtp.qiye.aliyun.com:465.

global: 
  smtp_smarthost: 'smtp.qiye.aliyun.com:465'
  smtp_from: 'abc@xxx.com'
  smtp_auth_username: 'abc@xxx.com'
  smtp_auth_password: 't43123456'
  smtp_require_tls: false

route: 
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: live-monitoring

receivers: 
  - name: 'live-monitoring'
    email_configs: 
    - to: '3424354443@qq.com'
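
Alertmanager ships with amtool, which can sanity-check this file and trigger a reload; a minimal sketch:

# Validate alertmanager.yml (routes, receivers, SMTP settings)
amtool check-config alertmanager.yml

# Reload the running Alertmanager with the new config
curl -X POST http://172.1.5.220:9093/-/reload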

The smtp_smarthost here is important. At first I assumed the company domain smtp.xxx.com:25 would be enough, but that produced the following error:

level=error ts=2020-04-08T06:02:44.036Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="send STARTTLS command: x509: certificate is valid for *.mxhichina.com, mxhichina.com, not smtp.xxx.com" context_err="context deadline exceeded"
level=error ts=2020-04-08T06:02:44.036Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="send STARTTLS command: x509: certificate is valid for *.mxhichina.com, mxhichina.com, not smtp.xxx.com"

After searching online, I found the correct smarthost: smtp.qiye.aliyun.com:465.
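
If you run into a similar certificate error, openssl can show which certificate the mail server actually presents. A sketch; smtp.xxx.com stands for the redacted company domain from the error above:

# Implicit TLS on 465: print the subject of the Aliyun smarthost's certificate
openssl s_client -connect smtp.qiye.aliyun.com:465 -servername smtp.qiye.aliyun.com </dev/null | openssl x509 -noout -subject

# STARTTLS on 25: the handshake Alertmanager attempted when it saw the *.mxhichina.com certificate
openssl s_client -connect smtp.xxx.com:25 -starttls smtp </dev/null | openssl x509 -noout -subject
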
The result looks like this:

[Screenshot: the firing alert email, reporting the used disk size]
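
Besides waiting for the email, you can confirm the alert is firing from the command line; a small sketch with amtool:

# List the alerts currently active in Alertmanager
amtool alert query --alertmanager.url=http://172.1.5.220:9093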

References:
https://awesome-prometheus-alerts.grep.to/rules
