Automated Monitoring with Zabbix

Zabbix: Auto-Discovering SSDs/HDDs and Monitoring Their Endurance and Health

2019-08-14  圣地亚哥_SVIP

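Background:

A batch of our deployed Ceph clusters uses SSDs as cache disks. SSDs wear out as they are read and written, so the remaining life of these SSDs needs to be monitored.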

Goal:

A Zabbix platform is already in place. It should discover SSDs automatically and register the corresponding items and alerts.

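Monitored items:

SSD endurance (percentage of life used), SSD health status, and HDD health status.

Requirements:

Disks must be discovered and registered automatically.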

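Procedure

The commands used to inspect the disks are listed below.

Note: the disks must support SMART and have it enabled.

Install the smartmontools package (the lsscsi command used below is provided by the lsscsi package).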

List the disk device nodes:

# lsscsi | grep "disk" | awk '{ print $NF }'
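
For readers who prefer to do the filtering in Python, the following is a minimal sketch of the same idea. The function name disk_nodes and the sample text are illustrative assumptions, not part of the original scripts, and the sample is not output captured from the article's hosts. Unlike grep "disk", which matches the word anywhere on the line, this sketch checks only the type column and also skips entries whose last column is not a /dev/ path.

```python
def disk_nodes(lsscsi_output):
    """Return the device node (last column) of each lsscsi line of type 'disk'."""
    nodes = []
    for line in lsscsi_output.splitlines():
        fields = line.split()
        # fields[0] is the [H:C:T:L] tuple, fields[1] the device type column.
        if len(fields) >= 3 and fields[1] == "disk" and fields[-1].startswith("/dev/"):
            nodes.append(fields[-1])
    return nodes


# Illustrative lsscsi-style text (hypothetical, for demonstration only).
SAMPLE = """\
[0:0:0:0]    disk    ATA      SAMSUNG MZ7LM480 204Q  /dev/sda
[0:0:1:0]    disk    ATA      ST4000NM0035-1V4 TN03  /dev/sdb
[1:0:0:0]    disk    VENDOR   MODEL            1.00  -
[5:0:0:0]    cd/dvd  HL-DT-ST DVD-ROM DU90N    D301  /dev/sr0
"""

print(disk_nodes(SAMPLE))  # ['/dev/sda', '/dev/sdb']
```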

Determine the disk type:

# smartctl -i /dev/sda
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Device Model:     SAMSUNG MZ7LM480HMHQ-00005
    Serial Number:    S2UJNX0K630474
    LU WWN Device Id: 5 002538 c40af1275
    Firmware Version: GXT5204Q
    User Capacity:    480,103,981,056 bytes [480 GB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    Solid State Device
    Form Factor:      2.5 inches
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Fri Aug  9 16:04:25 2019 CST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

The line to look for is "Rotation Rate: Solid State Device", which identifies the disk as an SSD.
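
The check on the Rotation Rate line can be sketched in Python as follows. The function name disk_type and the two sample strings are assumptions made for illustration. The sketch returns None when no Rotation Rate line is present, so the caller can decide how to treat such devices; the discovery script later in this article instead treats every non-SSD result as an HDD.

```python
import re


def disk_type(smartctl_info):
    """Classify 'smartctl -i' text as 'ssd' or 'hdd' using its Rotation Rate line."""
    match = re.search(r"^[ \t]*Rotation Rate:[ \t]*(.+)$", smartctl_info, re.MULTILINE)
    if match is None:
        return None  # no Rotation Rate line found
    value = match.group(1).strip().lower()
    return "ssd" if value == "solid state device" else "hdd"


# Two lines taken from the smartctl -i output shown above.
SSD_INFO = """\
Device Model:     SAMSUNG MZ7LM480HMHQ-00005
Rotation Rate:    Solid State Device
"""

# Hypothetical line for a rotating disk, for comparison.
HDD_INFO = "Rotation Rate:    7200 rpm\n"

print(disk_type(SSD_INFO))  # ssd
print(disk_type(HDD_INFO))  # hdd
```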

SSD endurance (percentage of rated life already used):

# smartctl -l devstat /dev/sda
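
The value of interest in the devstat log is the "Percentage Used Endurance Indicator" row. The sketch below shows one way to pull it out in Python. The function name used_endurance is hypothetical, and the sample text is an illustrative excerpt in the usual smartctl column layout (Page, Offset, Size, Value, Flags, Description), not output from the article's disks.

```python
def used_endurance(devstat_output):
    """Return the 'Percentage Used Endurance Indicator' value, or None if absent."""
    for line in devstat_output.splitlines():
        if "Percentage Used Endurance Indicator" in line:
            fields = line.split()
            # The fourth column holds the value; it can be '-' if unsupported.
            if len(fields) >= 4 and fields[3].isdigit():
                return int(fields[3])
    return None


# Illustrative excerpt (hypothetical values).
SAMPLE = """\
Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              25  ---  Lifetime Power-On Resets
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               3  N--  Percentage Used Endurance Indicator
"""

print(used_endurance(SAMPLE))  # 3
```

Returning None instead of raising lets the caller send an empty value to Zabbix when a disk does not expose this statistic.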

SSD health (a healthy drive reports PASSED; a drive with a problem reports FAILED!):

# smartctl -H /dev/sda
SMART overall-health self-assessment test result: PASSED

HDD health (SAS/SCSI drives report the result in this form):

# smartctl -H /dev/sda
SMART Health Status: OK
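
Because the two output formats differ, it helps to normalize them into one value before sending it to Zabbix. The following Python sketch illustrates that normalization. The function name health_status is an assumption, and the 1/0 encoding mirrors the UP/Down convention used by blk_parse.py later in this article.

```python
def health_status(smartctl_health):
    """Return 1 (UP) if the health line ends in PASSED or OK, otherwise 0 (Down)."""
    for line in smartctl_health.splitlines():
        if "health" in line.lower() and ":" in line:
            # Take the text after the last colon as the verdict.
            verdict = line.rsplit(":", 1)[1].strip().upper()
            return 1 if verdict in ("PASSED", "OK") else 0
    return 0  # no health line at all is treated as Down


ATA_OK = "SMART overall-health self-assessment test result: PASSED\n"
SCSI_OK = "SMART Health Status: OK\n"
ATA_BAD = "SMART overall-health self-assessment test result: FAILED!\n"

print(health_status(ATA_OK), health_status(SCSI_OK), health_status(ATA_BAD))  # 1 1 0
```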

Configuring auto-discovery and registration in Zabbix


The auto-discovery script (SSD/HDD), blk_discovery.py:

#!/usr/bin/env python

# Discover block devices for Zabbix low-level discovery (LLD).
# Usage: ./blk_discovery.py {type}
#    type: ssd/hdd/all
# Example:
#   ./blk_discovery.py ssd
# Returns JSON:
#  {
#    "data": [
#            {
#               "{#DEV}": "/dev/sda",
#               "{#DEVTYPE}": "ssd"
#             },
#            {
#               "{#DEV}": "/dev/sdb",
#               "{#DEVTYPE}": "ssd"
#            }
#            ]
# }

from __future__ import print_function  # keeps the script valid on Python 2 and 3

import sys
import json
import subprocess

blk_type = sys.argv[1]


def run(cmd):
    # Run a shell pipeline and return (exit status, stripped stdout).
    # subprocess replaces the Python 2-only "commands" module.
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    output = proc.communicate()[0]
    return proc.returncode, output.strip()


def discovery_blk():
    result = {"data": []}
    status, output = run("lsscsi | grep 'disk' | awk '{ print $NF }'")
    devs = output.split() if status == 0 else []
    for dev in devs:
        cmmd = "smartctl -i %s | grep 'Rotation Rate:' | awk -F':' '{ print $NF }'" % dev
        status, output = run(cmmd)
        if status != 0:
            continue
        # Anything that does not report "Solid State Device" is treated as an HDD.
        dev_type = "ssd" if output.lower() == "solid state device" else "hdd"
        if blk_type in (dev_type, "all"):
            result["data"].append({"{#DEV}": dev, "{#DEVTYPE}": dev_type})
    # Always print valid JSON, even when no device was found.
    print(json.dumps(result, sort_keys=True, indent=2))


discovery_blk()

The script that reports SSD endurance and SSD/HDD status, blk_parse.py:

#!/usr/bin/env python

# Parse block device status
# Usage: ./blk_parse.py {dev} {feature}
# Example:
# ssd endurance:
#   ./blk_parse.py /dev/sda endurance
# Returns:
#   - 34  # which means the SSD has consumed 34% of its life
# ssd/hdd status:
#   ./blk_parse.py /dev/sda status
# Returns:
#   - UP(1),
#   - Down(0)

from __future__ import print_function  # keeps the script valid on Python 2 and 3

import sys
import subprocess

key = sys.argv[1]
feature = sys.argv[2]


class BlkStatus(object):
    UP = 1
    Down = 0


def run(cmd):
    # Run a shell pipeline and return (exit status, stripped stdout).
    # subprocess replaces the Python 2-only "commands" module.
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    output = proc.communicate()[0]
    return proc.returncode, output.strip()


def get_status(dev):
    cmmd = "smartctl -H %s | grep -i 'health' | awk '{ print $NF }'" % dev
    status, output = run(cmmd)
    if status != 0:
        return ""
    if output.upper() in ("OK", "PASSED"):
        return BlkStatus.UP
    return BlkStatus.Down


def get_endurance(dev):
    cmmd = "smartctl -l devstat %s | grep 'Used Endurance' | awk '{ print $4 }'" % dev
    status, output = run(cmmd)
    # The pipeline exits 0 even when grep matches nothing, so also reject
    # empty or non-numeric output before converting it.
    if status != 0 or not output.isdigit():
        return ""
    return int(output)


def blk_parse():
    result = ""
    if feature == "endurance":
        result = get_endurance(key)
    elif feature == "status":
        result = get_status(key)
    print(result)


blk_parse()

On every Ceph (agent) node, copy the two scripts above to /etc/zabbix/script and make them executable:

# chmod +x /etc/zabbix/script/blk_discovery.py
# chmod +x /etc/zabbix/script/blk_parse.py

In the directory /etc/zabbix/zabbix_agentd.d/, add a configuration file named blk-status.conf:

UserParameter=blk_discovery[*],sudo /etc/zabbix/script/blk_discovery.py $1
UserParameter=blk.status[*],sudo /etc/zabbix/script/blk_parse.py $1 $2
UserParameter=blk.hdd.status[*],sudo /etc/zabbix/script/blk_parse.py $1 "status"

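Note: the last two entries call the same script. They are duplicated only so that their keys differ; otherwise item prototypes with the same key cannot be added under different discovery rules.

Restart zabbix-agent:

# systemctl restart zabbix-agent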
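
To make the role of $1 and $2 concrete, here is a simplified Python sketch of how a flexible user parameter turns an item key into a command line. It is an illustration only, not the agent's actual implementation: it does not handle quoted parameters or commas inside parameters. The function name expand_user_parameter and the item keys blk.status[/dev/sda,endurance] and blk.hdd.status[/dev/sdb] are assumptions about how the item prototypes might be keyed once {#DEV} has been resolved.

```python
import re


def expand_user_parameter(command, key):
    """Substitute $1..$9 in a UserParameter command with the key's parameters."""
    match = re.match(r"^[^\[]+\[(.*)\]$", key)
    params = [p.strip() for p in match.group(1).split(",")] if match else []

    def replace(m):
        index = int(m.group(1)) - 1
        # A reference to a parameter that was not supplied expands to nothing.
        return params[index] if index < len(params) else ""

    return re.sub(r"\$([1-9])", replace, command)


print(expand_user_parameter(
    "sudo /etc/zabbix/script/blk_parse.py $1 $2",
    "blk.status[/dev/sda,endurance]"))
# sudo /etc/zabbix/script/blk_parse.py /dev/sda endurance

print(expand_user_parameter(
    'sudo /etc/zabbix/script/blk_parse.py $1 "status"',
    "blk.hdd.status[/dev/sdb]"))
# sudo /etc/zabbix/script/blk_parse.py /dev/sdb "status"
```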

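Configure LLD in the Zabbix web console. The platform already has a Ceph host monitoring template, and the LLD rules are added to that template:

  1. Configure the SSD discovery rule
     Discovery rule
  2. Configure the item prototypes
     SSD status item, SSD endurance item
  3. Configure the trigger prototypes
     SSD status alert, SSD endurance alert
  4. Configure the graph prototype
     SSD endurance trend graph
  5. Configure HDD discovery and registration
     HDD discovery rule
  6. Configure the HDD status item prototype
     HDD disk status
  7. Configure the HDD trigger prototype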

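Below is the latest data collected by the automatically discovered and registered items:

Data

This completes the automatic discovery and monitoring of SSDs and HDDs in Zabbix.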
