Alertmanager Email Alert Configuration

2020-04-08  风吹路过的云

Here is a simple example for Prometheus monitoring: an email alert when disk space runs low.
The prometheus.yml configuration file:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['172.1.5.220:9093']
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "node_down.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['172.1.5.220:9090']
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['172.1.5.220:8080']
  - job_name: 'harbor-250'
    static_configs:
      - targets: ['192.168.8.250:4080']
  - job_name: 'java-demo'
    scrape_interval: 5s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['192.168.9.222:8080']
  - job_name: 'node'
    scrape_interval: 8s
    static_configs:
      - targets: ['172.1.5.220:9100', '192.168.9.223:9100', '192.168.8.250:4100']
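
Before restarting Prometheus, it is worth validating this file. A minimal sketch using promtool (shipped with Prometheus); the reload call assumes Prometheus was started with --web.enable-lifecycle, otherwise just restart the process:

# Validate prometheus.yml, including the rule files it references
promtool check config prometheus.yml

# Tell the running Prometheus (address taken from the config above) to reload
curl -X POST http://172.1.5.220:9090/-/reload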

node_down.yml is configured as follows. The HostOutOfDiskSpace rule fires when less than 60% of the filesystem's space is still available.
If you are not sure how to write rules, have a look at the rule collection linked under References at the end; it has plenty of ready-made rules, and that is where this one came from.

groups:
- name: node_down
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      user: test
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."
- name: out_of_disk_space
  rules: 
  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/etc/hostname"}  * 100) / node_filesystem_size_bytes{mountpoint="/etc/hostname"} < 60
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host out of disk space (instance {{ $labels.instance }})"
      description: "Disk is almost full (< 60% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

alertmanager.yml is configured as follows. Since the company uses Alibaba's enterprise mail service, the smarthost is smtp.qiye.aliyun.com:465.

global: 
  smtp_smarthost: 'smtp.qiye.aliyun.com:465'
  smtp_from: 'abc@xxx.com'
  smtp_auth_username: 'abc@xxx.com'
  smtp_auth_password: 't43123456'
  smtp_require_tls: false

route: 
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: live-monitoring

receivers: 
  - name: 'live-monitoring'
    email_configs: 
    - to: '3424354443@qq.com'
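
Alertmanager ships with amtool, which can sanity-check this file and trigger a reload; a minimal sketch:

# Validate alertmanager.yml (routes, receivers, SMTP settings)
amtool check-config alertmanager.yml

# Reload the running Alertmanager with the new config
curl -X POST http://172.1.5.220:9093/-/reload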

The smtp_smarthost here is important. At first I assumed the company domain smtp.xxx.com:25 would be enough, but that produced the following error:

level=error ts=2020-04-08T06:02:44.036Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="send STARTTLS command: x509: certificate is valid for *.mxhichina.com, mxhichina.com, not smtp.xxx.com" context_err="context deadline exceeded"
level=error ts=2020-04-08T06:02:44.036Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="send STARTTLS command: x509: certificate is valid for *.mxhichina.com, mxhichina.com, not smtp.xxx.com"

After searching online, I found the correct smarthost: smtp.qiye.aliyun.com:465.
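
If you run into a similar certificate error, openssl can show which certificate the mail server actually presents. A sketch; smtp.xxx.com stands for the redacted company domain from the error above:

# Implicit TLS on 465: print the subject of the Aliyun smarthost's certificate
openssl s_client -connect smtp.qiye.aliyun.com:465 -servername smtp.qiye.aliyun.com </dev/null | openssl x509 -noout -subject

# STARTTLS on 25: the handshake Alertmanager attempted when it saw the *.mxhichina.com certificate
openssl s_client -connect smtp.xxx.com:25 -starttls smtp </dev/null | openssl x509 -noout -subject
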
The result looks like this:

[Screenshot: the firing alert email, reporting the used disk size]
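
Besides waiting for the email, you can confirm the alert is firing from the command line; a small sketch with amtool:

# List the alerts currently active in Alertmanager
amtool alert query --alertmanager.url=http://172.1.5.220:9093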

References:
https://awesome-prometheus-alerts.grep.to/rules
