Cloud Native

17. Prometheus in Practice

2022-06-10  負笈在线

AlertmanagerConfig (example and API reference):
https://github.com/prometheus-operator/prometheus-operator/blob/master/example/user-guides/alerting/alertmanager-config-example.yaml
https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#alertmanagerconfig
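For orientation, below is a minimal AlertmanagerConfig sketch modeled on the linked example; the resource name, label, receiver name, and webhook URL are placeholders, and the object is only picked up if its labels match the Alertmanager's alertmanagerConfigSelector:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: config-example            # placeholder name
  namespace: monitoring
  labels:
    alertmanagerConfig: example   # must match the Alertmanager's alertmanagerConfigSelector
spec:
  route:
    groupBy: ['job']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: 'webhook'
  receivers:
  - name: 'webhook'
    webhookConfigs:
    - url: 'http://example.com/'  # placeholder webhook endpoint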

(1) PrometheusRule

The alerting rules configured by default can be listed with the following command:

# kubectl get prometheusrule -n monitoring
NAME                              AGE
alertmanager-main-rules           19d
kube-prometheus-rules             19d
kube-state-metrics-rules          19d
kubernetes-monitoring-rules       19d
node-exporter-rules               19d
prometheus-k8s-prometheus-rules   19d
prometheus-operator-rules         19d

The detailed configuration of a specific rule can also be viewed with -oyaml:

# kubectl get prometheusrule -n monitoring node-exporter-rules    -oyaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
...
spec:
  groups:
  - name: node-exporter
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up.
        runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 24 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning

- alert: the name of the alerting rule
- annotations: annotations for the alert, usually the human-readable alert message
- expr: the alerting expression
- for: the evaluation wait time; the alert is sent only after the condition has held for this duration
- labels: labels attached to the alert, used for alert routing

(2) Domain Access Latency Alert

Suppose you need to monitor domain access latency and raise an alert when the latency exceeds 1 second. You can create a PrometheusRule like the following:

# cat blackbox.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: blackbox-exporter
    prometheus: k8s
    role: alert-rules
  name: blackbox
  namespace: monitoring
spec:
  groups:
  - name: blackbox-exporter
    rules:
    - alert: DomainAccessDelayExceeds1s
      annotations:
        description: 'Domain {{ $labels.instance }} probe latency is above 1 second; current latency: {{ $value }}'
        summary: Domain probe access latency exceeds 1 second
      expr: sum(probe_http_duration_seconds{job=~"blackbox"}) by (instance) > 1
      for: 1m
      labels:
        severity: warning
        type: blackbox
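Before creating the rule, the expression can be tried in the Prometheus expression browser. Assuming the blackbox exporter targets are already being scraped under a job named blackbox, dropping the threshold shows the current per-domain latency:

sum(probe_http_duration_seconds{job=~"blackbox"}) by (instance)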

Create and then verify the PrometheusRule:

# kubectl create -f blackbox.yaml
prometheusrule.monitoring.coreos.com/blackbox created
# kubectl get -f blackbox.yaml
NAME       AGE
blackbox   65s
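Note that the Prometheus Operator only loads PrometheusRule objects selected by the Prometheus resource's spec.ruleSelector (older kube-prometheus releases match exactly the prometheus: k8s and role: alert-rules labels used above). Assuming the default kube-prometheus install, where the Prometheus object is named k8s in the monitoring namespace, the selector can be checked with:

# kubectl get prometheus k8s -n monitoring -o jsonpath='{.spec.ruleSelector}'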

Afterwards, the rule can also be seen in the Prometheus Web UI:



If any domain's probe latency exceeds 1 s, the alert is triggered, as shown below:



Because the alert routing does not yet match the blackbox-monitoring labels, the alert is delivered to the default receiver, which is email:
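The current routing can be inspected before changing it. Assuming the default kube-prometheus layout, where the Alertmanager configuration lives in the alertmanager-main Secret under the key alertmanager.yaml, it can be dumped with:

# kubectl get secret alertmanager-main -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d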

Next, alerts can be sent to specific recipients according to actual business needs. Here the routing is changed so that domain-probe alerts go to WeChat; the configuration is as follows (excerpt):

- match:
    type: blackbox
  receiver: "wechat-ops"
  repeat_interval: 10m
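For context, the following sketch shows where this excerpt sits inside alertmanager.yaml; the default receiver name, email address, and WeChat Work credentials are placeholders that must be replaced with the values already configured in the environment:

route:
  receiver: Default                # default receiver (email in this setup); placeholder name
  group_by: ['alertname']
  routes:
  - match:
      type: blackbox               # label set in the PrometheusRule above
    receiver: "wechat-ops"
    repeat_interval: 10m
receivers:
- name: Default
  email_configs:
  - to: ops@example.com            # placeholder address
- name: "wechat-ops"
  wechat_configs:
  - corp_id: "your-corp-id"        # placeholder WeChat Work credentials
    agent_id: "1000001"
    api_secret: "your-api-secret"
    to_party: "1"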

After that, the alert is received on WeChat:

