Operations Framework Architecture

Monitoring the JVM with Prometheus + Grafana

2021-03-04  lix22

Overview

Figure: system architecture (体系结构.jpg)
Component overview
Monitoring setup

spring-actuator + micrometer + Prometheus + Grafana + <WebHook>


Prometheus

Features
Architecture
Figure: Prometheus architecture (prometheus-architecture.jpg)
Data collection

GET /actuator/prometheus

Figure: sample data returned by the metrics endpoint (指标查询接口数据.jpg)
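To make the shape of the scraped data concrete, here is a minimal sketch that parses Prometheus text-format samples of the kind the /actuator/prometheus endpoint serves. The SAMPLE text is invented for illustration (the metric names are ones Micrometer typically exports, the values are made up), and parse_metrics is a hypothetical helper, not part of any library. It deliberately ignores optional timestamps and other corner cases of the format.

```python
import re

# Invented sample in the Prometheus text exposition format (not real application output).
SAMPLE = """\
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="PS Eden Space"} 1.048576E7
jvm_memory_used_bytes{area="nonheap",id="Metaspace"} 5.24288E7
jvm_threads_live_threads 42.0
"""

# Metric name, optional {label="value",...} block, then the sample value.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples; comment lines are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a # HELP / # TYPE comment
        match = LINE_RE.match(line)
        if match is None:
            continue  # anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Calling parse_metrics(SAMPLE) yields one tuple per sample line; each metric name, optionally narrowed by its labels, is what Prometheus stores as a separate time series.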
Data queries

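Prometheus provides a web UI for running PromQL queries against the time-series data. The metric names shown above can be used directly as query expressions, and a wide range of query functions is supported; see https://prometheus.io/docs/prometheus/latest/querying/functions/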
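The same queries can also be issued programmatically through the Prometheus HTTP API (/api/v1/query). The following is a minimal sketch: the address http://localhost:9090 is an assumption (the default local port), and build_query_url, extract_samples and run_query are hypothetical helper names introduced here for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # assumption: default local Prometheus address


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Build the URL for an instant query against /api/v1/query."""
    query_string = urlencode({"query": expr})
    return f"{base_url}/api/v1/query?{query_string}"


def extract_samples(response):
    """Turn an instant-vector API response into (labels, value) pairs."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error')}")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each result holds its labels under "metric" and [timestamp, "value"] under "value".
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def run_query(expr, base_url=PROMETHEUS_URL):
    """Execute the query over HTTP (needs a reachable Prometheus server)."""
    with urlopen(build_query_url(expr, base_url), timeout=5) as resp:
        return extract_samples(json.load(resp))
```

For example, run_query("up != 1") would return the instances Prometheus currently considers down, provided a server is reachable at the assumed address.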

Figure: a PromQL query in the web UI (promSQL查询.jpg)
Alerting rules

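Rules defined in configuration files are evaluated against the queried time-series data on every evaluation cycle to decide whether an alert should be raised; the loaded rules can be viewed in the web UI.
Alerts are sent to the AlertManager component, which manages them centrally.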

groups:
- name: Instances
  rules:
  - alert: InstanceDown
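    expr: up != 1  # rule expression; any PromQL query is supported
    for: 1m  # the alert is sent only after the rule has matched continuously for 1 minute; if a re-check within that window no longer matches, no alert is sent
    labels:  # labels attached to the alert
      severity: page # alert severity; this field can be used later to inhibit unwanted alerts
      status: High
    annotations:  # descriptive information for the alert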
      description: "Application: {{ $labels.job }} Instance: {{ $labels.instance }} is Down ! ! !"
      value: '{{ $value }}'
      summary:  "Instance {{ $labels.instance }} down"
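The effect of the for clause can be illustrated with a small simulation. This is a simplified model, not Prometheus's actual implementation: alert_states is a hypothetical function, and it assumes evaluations happen at a fixed interval, so that for: 1m with a 15s evaluation interval spans 4 intervals.

```python
def alert_states(matches, for_evals):
    """Return the alert state after each rule evaluation.

    matches   -- one boolean per evaluation: did `expr` return a result?
    for_evals -- number of evaluation intervals covered by the `for` duration
                 (for: 1m with a 15s evaluation interval gives 4)
    """
    states = []
    held = None  # intervals elapsed since the expression first matched
    for matched in matches:
        if not matched:
            held = None  # the condition cleared, so the pending timer resets
            states.append("inactive")
            continue
        held = 0 if held is None else held + 1
        states.append("firing" if held >= for_evals else "pending")
    return states
```

With these assumptions, alert_states([True] * 5, 4) stays pending for the first four evaluations and fires on the fifth, one minute after the first match, while a single non-matching evaluation in between resets the alert to inactive.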
AlertManager

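Alerting rules on the Prometheus server send alerts to AlertManager. AlertManager then manages those alerts: it silences, inhibits, and aggregates them, and sends notifications via email, on-call notification systems, chat platforms, and similar channels.
Here we use the WebHook method: alerts are sent to a designated endpoint, where we can decide for ourselves how to notify and whom to notify based on the alert data.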

{
  "receiver": "web\\.hook",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "InstanceDown",
        "instance": "10.10.10.10:9000",
        "job": "app-1",
        "severity": "page",
        "status": "High"
      },
      "annotations": {
        "description": "Application: app-1 Instance: 10.10.10.10:9000 is Down ! ! !",
        "summary": "Instance 10.10.10.10:9000 down",
        "value": "0"
      },
      "startsAt": "2021-02-20T09:11:43.380766777Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://VM-102-32-centos:9090/graph?g0.expr=up+%21%3D+1\u0026g0.tab=1",
      "fingerprint": "8a9aadd8d34d09f7"
    },
    {
      "status": "resolved",
      "labels": {
        "alertname": "InstanceDown",
        "instance": "10.10.10.10:9001",
        "job": "app-2",
        "severity": "page",
        "status": "High"
      },
      "annotations": {
        "description": "Application: app-2 Instance: 10.10.10.10:9001 is Down ! ! !",
        "summary": "Instance 10.10.10.10:9001 down",
        "value": "0"
      },
      "startsAt": "2021-02-20T09:11:28.380766777Z",
      "endsAt": "2021-02-20T09:13:43.380766777Z",
      "generatorURL": "http://VM-102-32-centos:9090/graph?g0.expr=up+%21%3D+1\u0026g0.tab=1",
      "fingerprint": "6070b8cb7389ffc2"
    }
  ],
  "groupLabels": {
    "alertname": "InstanceDown"
  },
  "commonLabels": {
    "alertname": "InstanceDown",
    "severity": "page",
    "status": "High"
  },
  "commonAnnotations": {
    "value": "0"
  },
  "externalURL": "http://VM-102-32-centos:9093",
  "version": "4",
  "groupKey": "{}:{alertname=\"InstanceDown\"}",
  "truncatedAlerts": 0
}
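A receiving endpoint for a payload like the one above could look roughly like the following sketch, using only the Python standard library. The names summarize, AlertHandler and serve are hypothetical, port 8080 is an assumption that must match the receiver URL configured in AlertManager, and printing stands in for whatever notification channel is actually chosen.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Build one human-readable line per alert in a webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} severity={severity}: {summary}".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "?"),
                severity=labels.get("severity", "?"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body that AlertManager posts to the webhook URL.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        for line in summarize(payload):
            print(line)  # placeholder: swap in mail, chat, SMS, ...
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Listen for webhook calls; the port is an assumption."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling serve() starts the listener. Routing on labels such as severity or job inside summarize is one way to pick the notification method and the recipients per alert.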

Grafana

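Grafana is an open-source application for visualizing large volumes of measurement data. It offers powerful, elegant ways to create, share, and explore data, and its dashboards display data from your various metric data sources.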

Data sources

Deployment

Add the spring-actuator and micrometer components to the application
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
Install Prometheus & AlertManager
nohup ./prometheus --web.enable-lifecycle 2>&1 & 
curl -XPOST http://localhost:9090/-/reload 
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
      - 127.0.0.1:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
    - "./rules/*.yml"
    
# Scrape configurations: one job per monitored application,
# each listing the instances to scrape.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'app-1'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    metrics_path: '/actuator/prometheus'
    static_configs:
    - targets: ['10.10.10.10:9000', '10.10.10.11:9000']

  - job_name: 'app-2'
    metrics_path: '/actuator/prometheus'
    static_configs:
    - targets: ['10.10.10.11:9001']
nohup ./alertmanager 2>&1 & 
curl -XPOST http://localhost:9093/-/reload 
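route:
  group_by: ['alertname']  # labels used to group alerts
  group_wait: 20s  # how long to wait after a new group appears, so that alerts of the same group can be merged into one notification
  group_interval: 5m  # interval between notifications for the same group, counted from the previous notification
  repeat_interval: 3m  # how long to wait before re-sending a notification for alerts that are still firing
  receiver: 'web.hook'  # name of the receiver that handles the alerts
receivers:
  - name: 'web.hook'
    webhook_configs:
    - send_resolved: false  # whether to notify when a firing alert has resolved; defaults to true
      url: 'http://10.10.10.11:8080/xxx/xxx'  # alert notification endpoint
inhibit_rules:  # inhibition rules: once an alert for some problem is firing, avoid flooding users with the other alerts caused by that same problem
  - source_match:  # matchers for the source (inhibiting) alert
      severity: 'critical'
    target_match:  # matchers for the target (inhibited) alert
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']  # source and target must have equal values for all of these labels, otherwise the inhibition does not apply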
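How an inhibition rule decides whether to mute an alert can be sketched as follows. This is an illustrative model rather than AlertManager's code: matches and is_inhibited are hypothetical helpers, alerts are reduced to plain label dictionaries, and only exact-value matchers are handled. Following the AlertManager documentation, a label that is missing is treated the same as a label with an empty value when comparing the equal labels.

```python
def matches(alert_labels, matchers):
    """True if every matcher label has exactly the given value on the alert."""
    return all(alert_labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing_alerts, rule):
    """Check whether `target` (a dict of labels) is muted by `rule`."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # A missing label counts as an empty value on both sides.
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False


# The rule from the configuration above, expressed as a dictionary.
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}
```

Under this model a warning alert is muted only while a critical alert with the same alertname, dev and instance values is firing; a warning on a different instance still gets through.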
Install Grafana