DevOps框架架构

Prometheus监控MQ集群

2021-07-21  本文已影响0人  阿当运维

mq 为Rabbitmq 。

下面是批量安装rabbitmq_exporter

[root@222-ansible mq_nodeexpoter]# cat install.yml 
---
- hosts: mq
  gather_facts: no
  tasks:
   - name: 复制软件包
     unarchive: src=rabbitmq_exporter-0.25.2.linux-amd64.tar.gz dest=/usr/local/
   - name: 重命名
     shell: mv /usr/local/rabbitmq_exporter-0.25.2.linux-amd64 /usr/local/rabbitmq_exporter
   - name: 复制启动脚本
     copy: src=rabbitmq_exporter.sh dest=/usr/local/rabbitmq_exporter/
   - name: 启动
     shell: /bin/sh /usr/local/rabbitmq_exporter/rabbitmq_exporter.sh start

启动脚本(自己简单写的,实现启动关闭即可):

[root@222-ansible mq_nodeexpoter]# cat rabbitmq_exporter.sh 
#!/bin/bash 

if [[ $1 == "start" ]];then

    RABBIT_USER=admin RABBIT_PASSWORD=admin OUTPUT_FORMAT=json PUBLIC_PORT=9090 RABBIT_URL=http://localhost:15672 nohup /usr/local/rabbitmq_exporter/rabbitmq_exporter >/dev/null  2>&1 &
    if [[ $? -eq 0 ]];then
                echo "rabbitmq_exporter is started!"
        else
                echo "mq_exporter start fail!"
        fi

elif [[ $1 == "stop" ]];then
    pid=$(ps -ef|grep rabbitmq_export|grep -v "grep"|awk -F" " '{print $2}')
    kill $pid
    if [[ $? -eq 0 ]];then
        echo "rabbitmq_exporter is stoped!"
    else
        echo "mq_exporter stop fail!"
    fi
else 
    echo "fail input"
fi
 - job_name: "RabbitMq-group"
    metrics_path: '/metrics'
    static_configs:
    - targets: ['192.168.1.199:9090','192.168.1.220:9090','192.168.1.221:9090']
      labels:
        dev: 测试环境
        project: mq

监控告警配置

大致分三步:

  1. 首次配置alertmanger配置文件,定义邮件服务,路由树分组,接收人
  2. 配置prometheus配置文件,配置上alertmanger的通信地址,和rules规则文件目录位置
  3. 写rules告警触发规则
global:
  resolve_timeout: 5m
  # 邮箱服务器
  smtp_smarthost: 'smtp.qq.com:25'
  smtp_from: 'asdcheru5896@qq.com'
  smtp_auth_username: 'asdcheru5896@qq.com'
  smtp_auth_password: 'mnzvqgnelcdobddc'
  smtp_require_tls: false

# 配置路由树
route:
  group_by: ['k8s'] # 根据告警规则组名进行分组
  group_wait: 30s # 分组内第一个告警等待时间,10s内如有第二个告警会合并一个告警
  group_interval: 10s # 发送新告警间隔时间
  repeat_interval: 50m # 重复告警间隔发送时间
  receiver: 'yunwei'


# 接收人
receivers:
 - name: 'yunwei'
   email_configs:
   - to: 'xxx@xxx.cc'
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - 172.17.0.3:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
groups:
- name: k8s
 rules:
 - alert: MqgoupsDown
   expr: rabbitmq_running == 0
   for: 1m
   labels:
     severity: warning
   annotations:
     summary: "{{ $labels.instance }} mq集群发生异常"
     description: "{{$labels.instance}}: {{$labels.job }} 集群异常(当前值: {{ $value }}),异常主机为{{ $labels.node }}"
 
 - alert: MqInstanceDown
   expr: rabbitmq_up==0
   for: 1m
   labels:
     severity: warning
   annotations:
     summary: "{{ $labels.instance }} mq主机发生宕机"
     description: "{{$labels.instance}}: {{$labels.job }} 主机异常(当前值: {{ $value }})"

注:
groups: 是分组名。如果多个规则都用同一个分组,那所有条目触发告警的时候,只会中一个分组的名义汇聚在一个邮件中,这也是Promethues三大特性之一: 分组。
告警的触发器就是 promql查询语句做条件,可以实现在prometheus界面查一遍在写这里。
邮件中的信息,想要打印出异常主机的信息,是通过{{$labels.标签}}来控制。

测试,停掉其中一台mq服务。


image.png

Grafana图形化展示

模板4371和4279

image.png

注意:


image.png
上一篇下一篇

猜你喜欢

热点阅读