Prometheus: Basic Concepts and Usage

2022-03-13 fengzhihai

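1. Overview

At the time of writing (around 11 a.m. on March 13, 2022), the latest Prometheus release is 2.33, and the whole installation and usage process here follows the official website. All components are installed directly on the hosts, although you can of course build the same environment with Docker or Kubernetes. I use Consul as the configuration center: alertmanager, the exporters, and instrumented services register with Consul, and the Prometheus server reads that information from Consul to complete its configuration.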

2. Environment

2.1 Server information

ip	hostname	remark
192.168.13.211	vm-master-01	consul, grafana, node_exporter
192.168.13.12	vm-master-02	alertmanager, prometheus, node_exporter
192.168.13.225	vm-master-03	node_exporter

2.2 Component information

component	version	url
consul	v1.11.4	http://192.168.13.211:8500/ui/dc1/services
grafana	8.4.3	http://192.168.13.211:3000
node_exporter	1.3.1	http://192.168.13.225:9100/metrics
prometheus server	2.33.4	http://192.168.13.12:9090
prometheus instrumentation		http://192.168.13.12:9090/metrics

3. Official documentation

# Prometheus website
https://prometheus.io/
# Downloads for prometheus, alertmanager, and exporters (mysql, node, etc.)
https://prometheus.io/download/
# Consul website
https://www.consul.io/

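4. Setup

Here I first build the whole environment, then go back through the official documentation to explain the setup process and the environment.

4.1 Setting up Consul

4.1.1 Install directly with yum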
yum-config-manager --add-repo https://rpm.releases.hashicorp.com/RHEL/hashicorp.repo
yum -y install consul
4.1.2 Consul configuration file, data, and log directories
mkdir /data/consul/{config,data,logs} -p
[root@vm-master-01 ~]# cat /data/consul/config/consul.json 
{
  "datacenter": "DC1",
  "data_dir": "/data/consul/data",
  "log_file": "/data/consul/logs/consul.log",
  "log_level": "INFO", 
  "node_name": "consul-01",
  "server": true,
  "ui": true,
  "bootstrap_expect": 1,
  "bind_addr": "192.168.13.211",
  "client_addr": "0.0.0.0",
  "raft_protocol": 3,
  "enable_debug": false,
  "rejoin_after_leave": true,
  "enable_syslog": false
}
4.1.3 consul.service
[root@vm-master-01 ~]# cat /usr/lib/systemd/system/consul.service
[Unit]
Description=Consul
Documentation=https://www.consul.io/

[Service]
ExecStart=/usr/bin/consul agent -config-dir=/data/consul/config
ExecReload=/bin/kill -HUP $MAINPID
KillSignal=SIGINT
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
4.1.4 Start
systemctl daemon-reload
systemctl restart consul.service

4.2 Grafana

4.2.1 Install from the rpm package
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.4.3-1.x86_64.rpm
rpm -i --nodeps grafana-enterprise-8.4.3-1.x86_64.rpm
systemctl start grafana-server

4.3 Prometheus server

4.3.1 Install
wget https://github.com/prometheus/prometheus/releases/download/v2.33.4/prometheus-2.33.4.linux-386.tar.gz
tar -xf prometheus-2.33.4.linux-386.tar.gz
cd prometheus-2.33.4.linux-386
./prometheus

4.4 node_exporter

4.4.1 Install

wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar -xf node_exporter-1.3.1.linux-amd64.tar.gz
cp node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin/node_exporter
4.4.2 Configuration file
[root@vm-master-03 node_exporter-1.3.1.linux-386]# cat /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
#User=prometheus
ExecStart=/usr/local/bin/node_exporter  \
            --collector.ntp \
            --collector.mountstats \
            --collector.systemd \
            --collector.tcpstat
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
Restart=always
[Install]
WantedBy=multi-user.target
4.4.3 Start
systemctl start node_exporter

4.5 Alertmanager

4.5.1 Install
wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz
tar -xf alertmanager-0.23.0.linux-amd64.tar.gz
cp alertmanager-0.23.0.linux-amd64/alertmanager /usr/local/bin/alertmanager
4.5.2 Alertmanager configuration, data, and template directories
mkdir /data/alertmanager/{config,data,templates} -p
4.5.3 Configuration file
[root@vm-master-02 alertmanager-0.23.0.linux-amd64]# cat /data/alertmanager/config/alertmanager.yml   
global:
  resolve_timeout: 2m
  smtp_smarthost: 'smtp.exmail.qq.com:465'
  smtp_from: 'xxx@xxx.com'
  smtp_auth_username: 'xxx@xxx.com'
  smtp_auth_password: 'xxx'
  smtp_hello: 'xxx.com'
  smtp_require_tls: false

templates:
- '/data/alertmanager/templates/test.tmpl'
route:
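  group_by: ['alertname'] # merge alerts of the same kind into a single notification
  group_wait: 30s         # when a new group appears, wait 30s for further alerts and send them together
  group_interval: 5m      # if new alerts join a group that was already notified, wait this long before notifying again; group_wait no longer applies
  repeat_interval: 10m    # interval between repeated notifications; 10m or 30m is a reasonable choice
  receiver: 'sendemail'   # default receiver, used when an alert matches none of the child routes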
  routes:
  - match:
      severity: 'critical'
    receiver: 'sendemail'
  - match:
      alertname: 'InstanceDown'
    receiver: 'test'

receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
- name: 'sendemail'
  email_configs:
  - to: 'xxx@xxx.com'
    html: '{{ template "test.html" . }}' # template used for the email body
    send_resolved: true
    headers:
      subject: "[prometheus] Alert email"
      from: "Ops Alert Center"
      to: "R&D Department"
- name: 'test'
  email_configs:
  - to: 'xxx@qq.com'
    send_resolved: true

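# Inhibition configuration
inhibit_rules: # inhibition rules
- source_match: # while an alert with the source labels is firing, alerts with the target labels are muted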
    severity: 'high'
  target_match:
    severity: 'high' 
  equal: ['instance']
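
The routing part of this file can be read as a small tree walk. The sketch below is only my approximation of that behaviour in Python, not Alertmanager's code: `pick_receiver` is a hypothetical helper, it handles plain `match` equality only, and it ignores the `continue` option. The first child route is keyed on `severity`.

```python
def pick_receiver(route, labels):
    """Return the receiver for `labels` by walking the routing tree.

    Child routes are tried in order; the first child whose `match` labels
    all equal the alert's labels wins. If no child matches, the alert stays
    with the current route's receiver.
    """
    for child in route.get("routes", []):
        match = child.get("match", {})
        if all(labels.get(k) == v for k, v in match.items()):
            # A child without its own receiver inherits the parent's.
            child = {"receiver": route["receiver"], **child}
            return pick_receiver(child, labels)
    return route["receiver"]


# The route tree from alertmanager.yml, reduced to the fields used here.
route = {
    "receiver": "sendemail",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "sendemail"},
        {"match": {"alertname": "InstanceDown"}, "receiver": "test"},
    ],
}

print(pick_receiver(route, {"alertname": "InstanceDown", "severity": "high"}))
```

Because children are tried in order, an `InstanceDown` alert that also carries `severity: critical` would go to `sendemail` rather than `test`.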
4.5.4 Alert template
[root@vm-master-02 alertmanager-0.23.0.linux-amd64]# cat /data/alertmanager/templates/test.tmpl 
{{ define "test.html" }}
{{ range .Alerts }}
====================start====================<br>
Severity: {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} <br>
Triggered at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
====================end====================<br>
{{ end }}
{{ end }}
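
The constant `28800e9` in the template is a duration in nanoseconds, that is 8 hours, which shifts the UTC start time to UTC+8. A small Python check of that arithmetic follows; `to_local` and the sample timestamp are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

OFFSET_NS = 28800e9  # the constant used in the template, in nanoseconds


def to_local(starts_at_utc):
    """Shift a UTC timestamp by the template's fixed offset and format it."""
    shifted = starts_at_utc + timedelta(seconds=OFFSET_NS / 1e9)
    # Same layout as the Go format string "2006-01-02 15:04:05".
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


starts_at = datetime(2022, 3, 13, 3, 0, 0, tzinfo=timezone.utc)
print(to_local(starts_at))
```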
4.5.5 systemd service
[root@vm-master-02 alertmanager-0.23.0.linux-amd64]# cat /usr/lib/systemd/system/alertmanager.service 
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/alertmanager  --config.file=/data/alertmanager/config/alertmanager.yml \
          --storage.path="/data/alertmanager/data/" \
          --data.retention=48h
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
Restart=always

[Install]
WantedBy=multi-user.target
4.5.6 Start
systemctl start alertmanager

4.6 Reconfiguring Consul

4.6.1 alertmanager
[root@vm-master-01 ~]# cat /data/consul/config/alertmanager.json 
{
  "service": {
    "id": "alertmanager-vm-master-02",
    "name": "vm-master-02",
    "address": "192.168.13.12",
    "port": 9093,
    "tags": ["alertmanager"],
    "checks": [{
      "http": "http://192.168.13.12:9093/metrics",
      "interval": "5s"
    }]
  }
}
4.6.2 nodes
[root@vm-master-01 ~]# cat /data/consul/config/nodes.json 
{
  "services": [
    {
      "id": "node_exporter-vm-master-01",
      "name": "vm-master-01",
      "address": "192.168.13.211",
      "port": 9100,
      "tags": ["nodes"],
      "checks": [{
        "http": "http://192.168.13.211:9100/metrics",
        "interval": "5s"
      }]
    },
    {
      "id": "node_exporter-vm-master-02",
      "name": "vm-master-02",
      "address": "192.168.13.12",
      "port": 9100,
      "tags": ["nodes"],
      "checks": [{
        "http": "http://192.168.13.12:9100/metrics",
        "interval": "5s"
      }]
    },
    {
      "id": "node_exporter-vm-master-03",
      "name": "vm-master-03",
      "address": "192.168.13.225",
      "port": 9100,
      "tags": ["nodes"],
      "checks": [{
        "http": "http://192.168.13.225:9100/metrics",
        "interval": "5s"
      }]
    }
  ]
}
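
The three node_exporter entries differ only in hostname and IP, so the file can also be generated. Below is a minimal Python sketch, assuming the same field layout as the hand-written nodes.json; the function name `node_exporter_services` is my own.

```python
import json


def node_exporter_services(hosts, port=9100, tag="nodes", interval="5s"):
    """Build a Consul `services` document for node_exporter instances.

    `hosts` maps hostname -> IP address.
    """
    services = []
    for hostname, ip in hosts.items():
        services.append({
            "id": f"node_exporter-{hostname}",
            "name": hostname,
            "address": ip,
            "port": port,
            "tags": [tag],
            "checks": [{
                "http": f"http://{ip}:{port}/metrics",
                "interval": interval,
            }],
        })
    return {"services": services}


hosts = {
    "vm-master-01": "192.168.13.211",
    "vm-master-02": "192.168.13.12",
    "vm-master-03": "192.168.13.225",
}
doc = node_exporter_services(hosts)
print(json.dumps(doc, indent=2))
```

The printed JSON can be saved as /data/consul/config/nodes.json and picked up with `consul reload`.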
4.6.3 Prometheus instrumentation
[root@vm-master-01 ~]# cat /data/consul/config/prometheus-server.json 
{
  "service": {
    "id": "prometheus-server-vm-master-02",
    "name": "vm-master-02",
    "address": "192.168.13.12",
    "port": 9090,
    "tags": ["prometheus"],
    "checks": [{
      "http": "http://192.168.13.12:9090/metrics",
      "interval": "5s"
    }]
  }
}
4.6.4 Reload Consul
consul reload
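After the reload, the Consul UI looks like this:

Consul service registration list 1

Consul service registration list 2

Because some of the services registered above were not yet running on their servers, some service health checks fail in the Consul UI.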

4.7 Reconfiguring prometheus.yml

# Prometheus configuration
[root@vm-master-02 prometheus-2.33.4.linux-386]# cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - timeout: 10s
    consul_sd_configs:
    - server: "192.168.13.211:8500" 
      tags:
      - "alertmanager"

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    consul_sd_configs:
    - server: "192.168.13.211:8500"
      tags:
      - "prometheus"
      refresh_interval: 1m

  - job_name: "alertmanager"
    consul_sd_configs:
    - server: "192.168.13.211:8500"
      tags:
      - "alertmanager"
      refresh_interval: 2m

  - job_name: "nodes"
    consul_sd_configs:
    - server: "192.168.13.211:8500"
      tags:
      - "nodes"
      refresh_interval: 2m

    relabel_configs:
    - source_labels:
      - __scheme__
      - __address__
      - __metrics_path__
      regex: '(https?)(.*)'
      target_label: "endpoint"
      replacement: "${1}://${2}"
      separator: ""

    metric_relabel_configs:
    - source_labels:
      - __name__
      regex: "go_info.*"
      action: "drop"

# Alerting rules
[root@vm-master-02 prometheus-2.33.4.linux-386]# cat rules/node.yml 
groups:
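- name: node # group name, must be unique within this file
  rules:
  # Alert for any instance that is unreachable for more than 30 seconds.
  - alert: InstanceDown # alert name, must be unique within the group; alertname=InstanceDown
    expr: up == 0  # expression; each series it returns is a candidate alert
    for: 30s  # how long the condition (up == 0) must hold before the alert fires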
    labels:
      severity: high
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."

  # Alert for any instance whose 1-minute load average is below 0.5.
  - alert: Load1Idle # alertname=Load1Idle
    expr: node_load1 < 0.5
    for: 30s
    labels: 
      severity: high
    annotations:
      summary: "Instance {{ $labels.instance }} is idle"
      description: "{{ $labels.instance }} load1 is lt 0.5 for more than 30 seconds."
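
To make the `for` clause concrete: with `evaluation_interval: 15s` and `for: 30s`, an alert is first pending and only fires once the expression has kept returning a result for 30 seconds. The Python sketch below is a simplified model of that state machine; `alert_states` is a hypothetical helper, not Prometheus code.

```python
def alert_states(samples, hold_seconds, interval_seconds):
    """Return 'inactive' / 'pending' / 'firing' for each rule evaluation.

    `samples` is a sequence of booleans saying whether the rule expression
    returned a result at each evaluation.
    """
    active_for = None  # seconds since the expression first became true
    states = []
    for is_true in samples:
        if not is_true:
            active_for = None  # the condition cleared, so the timer resets
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + interval_seconds
        states.append("firing" if active_for >= hold_seconds else "pending")
    return states


print(alert_states([True, True, True, False, True], 30, 15))
```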

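Summary: node_exporter, alertmanager, and the Prometheus server's own instrumentation are all registered in Consul; the Prometheus server reads that information from Consul and builds its target configuration from it.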

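5. Walking through the official documentation

5.1 Prometheus architecture

Prometheus architecture diagram

prometheus server: periodically scrapes metrics from its targets; every scraped target must expose an HTTP endpoint. It is mainly responsible for collecting and storing the data, and it provides the PromQL query language.
client library: gives applications that want native instrumentation a convenient way to implement it.
alertmanager: a standalone Prometheus component that receives the alerts fired by the Prometheus server and provides flexible notification handling (grouping, inhibition, and routing to receivers such as email or webhooks).
pushgateway: lets clients actively push metrics to it, while Prometheus simply scrapes the gateway on its usual schedule; mostly used by short-lived jobs that push their metrics to this intermediate gateway.
exporter: the general term for data-collection components. An exporter gathers data from its target and converts it into the format Prometheus supports. Unlike traditional collection agents, it does not send data to a central server but waits for the central server to come and scrape it.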

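5.2 Prometheus terminology

5.2.1 Time series (samples)

eg:
cpu_usage{core="1",ip="127.0.0.1"}                     14.04
         series identifier                            sample value
Each point in a time series is called a sample. A sample consists of three parts:
    * metric: the metric name plus the label set that describes the sample;
    * timestamp: a timestamp with millisecond precision;
    * value: a float64 value representing the sample's value.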
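
As a rough mental model, a sample can be written down as a small Python data class. The class name `Sample`, its field names, and the timestamp value are my own choices for illustration, not Prometheus internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    name: str          # metric name
    labels: tuple      # (label, value) pairs identifying the series
    timestamp_ms: int  # milliseconds since the Unix epoch
    value: float       # float64 sample value

    def series_id(self):
        """Render the series identifier: name plus label set."""
        body = ",".join(f'{k}="{v}"' for k, v in self.labels)
        return f"{self.name}{{{body}}}"


s = Sample("cpu_usage", (("core", "1"), ("ip", "127.0.0.1")), 1647140400000, 14.04)
print(s.series_id(), s.value)
```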

5.2.2 Metric types

Prometheus metric types

5.2.3 instance and job

instance and job

5.2.4 exporter, instrumentation, pushgateway

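For Prometheus to scrape data from a target, the target must expose an HTTP endpoint, but not every monitored object has one. There are generally three ways to help an application or device expose its metrics over HTTP:
1. exporters: for targets that have no built-in metrics endpoint, or whose metrics interface is not in the Prometheus HTTP format. A separate exporter is deployed that reads the target's metrics and converts them into the Prometheus format.
2. instrumentation: a measurement system built into the application itself.
3. pushgateway: lets clients actively push metrics to it, while Prometheus simply scrapes the gateway on its usual schedule; mostly used by short-lived jobs that push their metrics to this intermediate gateway.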
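
To show what "exposing an HTTP endpoint" means in practice, here is a minimal sketch using only the Python standard library. In a real application you would normally use an official client library; the metric name `app_requests_total`, the port 8000, and the helper names are all assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# name -> (help text, current value); a stand-in for real application counters
COUNTERS = {"app_requests_total": ("Total requests handled.", 3)}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(COUNTERS), end="")
```

Calling `serve()` would make the process scrapeable at http://<host>:8000/metrics, which is the role an exporter plays for software that cannot be instrumented directly.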

5.3 Scrape lifecycle

Scrape lifecycle

Walkthrough of the official documentation

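5.4 Prometheus server configuration file

https://prometheus.io/docs/prometheus/latest/configuration/configuration/
Broadly, the Prometheus server configuration file consists of the following parts:
  global # global settings, such as how often target instances are scraped
  alerting # where alerts are sent (Alertmanager settings)
  rule_files # rule files
  scrape_configs # scrape settings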

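5.4.1 scrape_configs

scrape_configs defines how Prometheus finds and scrapes its targets. A few commonly used mechanisms:

5.4.1.1 static_configs

# Static configuration
  static_configs:
  # target addresses to scrape
  - targets: ['localhost:9090', 'localhost:9191']
    # labels added to every metric scraped from these targets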
    labels:
      env: label
      app: label

5.4.1.2 file_sd_configs

 # File-based service discovery
  file_sd_configs:
    - files:
      - targets/*.yml
      - targets/*.json
      - nodes.yml
      # interval at which the files are re-read, default 5m
      refresh_interval: 10m
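
The files matched by these patterns contain lists of target groups, each with `targets` and optional `labels`. The following Python sketch generates such a JSON document; the helper `file_sd_groups`, the `env` label, and the example addresses are assumptions for illustration.

```python
import json


def file_sd_groups(targets_by_env):
    """Build the list-of-groups structure expected by file_sd_configs."""
    return [
        {"targets": sorted(targets), "labels": {"env": env}}
        for env, targets in sorted(targets_by_env.items())
    ]


groups = file_sd_groups({
    "prod": ["192.168.13.211:9100", "192.168.13.12:9100"],
    "test": ["192.168.13.225:9100"],
})
print(json.dumps(groups, indent=2))
```

Writing that output to a file such as targets/nodes.json lets Prometheus pick up changes on the next refresh without a restart.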

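5.4.1.3 consul_sd_configs

Targets are discovered from Consul; this is the mechanism used in the environment I built above.

5.4.2 relabel_configs

Relabeling is a powerful tool for dynamically rewriting a target's label set before it is scraped. Multiple relabeling steps can be configured per scrape configuration. They are applied to each target's label set in the order in which they appear in the configuration file.
Initially, aside from the configured per-target labels, a target's job label is set to the job_name value of the respective scrape configuration. The __address__ label is set to the target's <host>:<port> address. After relabeling, the instance label defaults to the value of __address__ if it was not set during relabeling. The __scheme__ and __metrics_path__ labels are set to the target's scheme and metrics path respectively. The __param_<name> label is set to the value of the first passed URL parameter called <name>.
Additional labels prefixed with __meta_ may be available during the relabeling phase. They are set by the service discovery mechanism that provided the target and vary between mechanisms.
Labels starting with __ are removed from the label set once target relabeling is complete.
If a relabeling step only needs to store a label value temporarily (as input to a later relabeling step), use the __tmp label name prefix. This prefix is guaranteed never to be used by Prometheus itself.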

# The source labels select values from existing labels. Their content is concatenated
# using the configured separator and matched against the configured regular expression
# for the replace, keep, and drop actions.
[ source_labels: '[' <labelname> [, ...] ']' ]

# Separator placed between concatenated source label values.
[ separator: <string> | default = ; ]

# Label to which the resulting value is written in a replace action.
# It is mandatory for replace actions. Regex capture groups are available.
[ target_label: <labelname> ]

# Regular expression against which the extracted value is matched.
[ regex: <regex> | default = (.*) ]

# Modulus to take of the hash of the source label values.
[ modulus: <uint64> ]

# Replacement value against which a regex replace is performed if the
# regular expression matches. Regex capture groups are available.
[ replacement: <string> | default = $1 ]

# Action to perform based on regex matching.
[ action: <relabel_action> | default = replace ]
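
The fields above are easier to follow with a worked example. The sketch below imitates a single `replace` step in Python: join the source label values with the separator, match the fully anchored regex, and write the expanded replacement into the target label. It is an approximation for illustration only; `relabel_replace` is a hypothetical helper and Python's `re` stands in for Prometheus's regex engine.

```python
import re


def relabel_replace(labels, source_labels, regex, target_label,
                    replacement="$1", separator=";"):
    """Apply one `replace` relabel step and return the new label set."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # the regex is anchored at both ends
    if match is None:
        return dict(labels)  # no match: the label set is left untouched
    # Translate $1 / ${1} references into Python's \g<1> syntax.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


target = {
    "__scheme__": "http",
    "__address__": "192.168.13.12:9100",
    "__metrics_path__": "/metrics",
}
# (https?) is used rather than (http|https): with a first-match alternation
# the shorter branch wins and the "s" of "https" would end up in group 2.
out = relabel_replace(
    target,
    ["__scheme__", "__address__", "__metrics_path__"],
    r"(https?)(.*)", "endpoint", "${1}://${2}", separator="",
)
print(out["endpoint"])
```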

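5.4.3 metric_relabel_configs

metric_relabel_configs is applied after the scrape but before the data is written to storage. So if there are metrics you want to filter out, or metrics produced by the scrape itself that you want to adjust, metric_relabel_configs is the place to do it.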
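
As an illustration of the `drop` action used in my prometheus.yml (regex `go_info.*` on `__name__`), the following Python sketch filters scraped samples by metric name. `drop_metrics` is a hypothetical helper and the sample values are invented.

```python
import re


def drop_metrics(samples, name_regex):
    """Discard samples whose metric name fully matches `name_regex`."""
    pattern = re.compile(name_regex)
    return [s for s in samples if not pattern.fullmatch(s["__name__"])]


scraped = [
    {"__name__": "go_info", "value": 1.0},
    {"__name__": "go_goroutines", "value": 8.0},
    {"__name__": "node_load1", "value": 0.2},
]
kept = drop_metrics(scraped, "go_info.*")
print([s["__name__"] for s in kept])
```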

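5.5 Alert inhibition

Alertmanager's inhibition mechanism keeps users from receiving a flood of notifications for alerts that are merely consequences of a problem that has already raised an alert. The configuration from the official documentation (https://prometheus.io/docs/alerting/latest/configuration/#inhibit_rule) is shown in the screenshot below:

Inhibition configuration

When a new alert matches the target_match and target_match_re rules:
    if an alert that has already been sent satisfies source_match or source_match_re,
    and the labels listed in equal are identical in both alerts (both have key1, with the same value1), inhibition kicks in and the new alert is not sent.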
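
The rule can be sketched in Python as follows. This is my simplified reading of the documented behaviour (equality matchers only, no regex matchers); `is_inhibited` and the critical/warning rule are illustrative assumptions.

```python
def matches(labels, matchers):
    """True if every matcher label has exactly the given value."""
    return all(labels.get(k) == v for k, v in matchers.items())


def is_inhibited(target, firing, rule):
    """Return True if `target` is muted by some alert in `firing`."""
    if not matches(target, rule["target_match"]):
        return False
    target_is_source = matches(target, rule["source_match"])
    for source in firing:
        if not matches(source, rule["source_match"]):
            continue
        # An alert matching both sides cannot be inhibited by alerts that
        # also match both sides (this includes the alert itself).
        if target_is_source and matches(source, rule["target_match"]):
            continue
        if all(source.get(k) == target.get(k) for k in rule["equal"]):
            return True
    return False


rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["instance"],
}
crit_a = {"alertname": "InstanceDown", "severity": "critical", "instance": "a"}
warn_a = {"alertname": "Load1Idle", "severity": "warning", "instance": "a"}
warn_b = {"alertname": "Load1Idle", "severity": "warning", "instance": "b"}
firing = [crit_a, warn_a, warn_b]
print([is_inhibited(a, firing, rule) for a in firing])
```

Under this reading, a rule whose source and target matchers are identical, like the `severity: 'high'` rule in 4.5.3, never mutes anything, so in practice the two sides should select different alerts.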

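5.6 Connecting Prometheus to Grafana and building dashboards

5.6.1 Adding Prometheus to Grafana

Adding the Prometheus data source in Grafana

5.6.2 Importing a dashboard template

https://grafana.com/grafana/dashboards/?search=node

Dashboard template gallery

Choosing the ID of the desired dashboard template

Importing the template into Grafana by its ID

5.6.3 Monitoring dashboard

Monitoring dashboard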