Cloud Native

16. Getting Started with Prometheus Monitoring

2022-06-10  負笈在线

(1) Prometheus Architecture




- Prometheus Server: the core component of the Prometheus ecosystem. It scrapes and stores time-series data and provides data querying and alerting-rule configuration and management;

- Alertmanager: the alerting component of the Prometheus ecosystem. Prometheus Server sends alerts to Alertmanager, which routes them to the configured recipients or groups according to its routing configuration. Alertmanager can deliver notifications via email, webhook, WeChat, DingTalk, SMS, and other channels;

- Grafana: visualizes the data, making it easy to query and observe;

- Push Gateway: Prometheus itself pulls data, but some metrics are short-lived and may be gone before the next scrape. The Push Gateway solves this: clients push their data to it, and Prometheus then pulls the data from the Push Gateway;

- Exporter: collects monitoring data. For example, host metrics are collected by node_exporter and MySQL metrics by mysqld_exporter. An Exporter exposes an endpoint, typically /metrics, from which Prometheus scrapes the data;

- PromQL: strictly speaking not a component, but the language used to query the data. Just as databases are queried with SQL and Loki with LogQL, Prometheus data is queried with PromQL;

- Service Discovery: automatic discovery of monitoring targets. Commonly used mechanisms include Kubernetes-, Consul-, Eureka-, and file-based discovery.
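The pull model described above can be sketched with nothing but the Python standard library: an "exporter" is just an HTTP server that returns metrics in the Prometheus text format on /metrics, and the server side scrapes it over HTTP. The metric name demo_requests_total and its value below are invented for illustration.

```python
import http.server
import threading
import urllib.request

# Invented metric in the Prometheus text exposition format.
METRICS = (
    "# HELP demo_requests_total Total requests handled.\n"
    "# TYPE demo_requests_total counter\n"
    "demo_requests_total 42\n"
)

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

# Bind to an ephemeral port and serve in the background.
server = http.server.HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Prometheus "pulls" exactly like this: an HTTP GET on /metrics.
url = f"http://127.0.0.1:{server.server_address[1]}/metrics"
scraped = urllib.request.urlopen(url).read().decode()
server.shutdown()
print(scraped.strip().splitlines()[-1])  # -> demo_requests_total 42
```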

(2) Installing Prometheus

The Kube-Prometheus project is at https://github.com/prometheus-operator/kube-prometheus/
First, use that page to find the Kube-Prometheus Stack release that matches your Kubernetes version:

# git clone -b release-0.8 https://github.com/prometheus-operator/kube-prometheus.git
# cd kube-prometheus/manifests

Install the Prometheus Operator:

# kubectl create -f setup/
namespace/monitoring created
...
deployment.apps/prometheus-operator created
service/prometheus-operator created
serviceaccount/prometheus-operator created

Check the status of the Operator Pod:

# kubectl get po -n monitoring
NAME READY STATUS RESTARTS AGE
prometheus-operator-bb5c5b6c8-xtkdn 2/2 Running 0 25s

Once the Operator Pod is running, install the Prometheus Stack:

# kubectl create -f .
alertmanager.monitoring.coreos.com/main created
...
service/prometheus-k8s created
serviceaccount/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus-k8s created

Check the status of the Prometheus Pods:

# kubectl get po -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 59s
alertmanager-main-1 2/2 Running 0 59s
alertmanager-main-2 2/2 Running 0 59s
blackbox-exporter-7f88596689-fl2v8 3/3 Running 0 59s
grafana-766bfd54b9-cchqm 1/1 Running 0 58s
kube-state-metrics-5fd8b545b-hrzxc 3/3 Running 0 58s
node-exporter-265df 2/2 Running 0 58s
node-exporter-5qj7b 2/2 Running 0 58s
node-exporter-lxngk 2/2 Running 0 58s
node-exporter-n8p7w          2/2 Running 0 58s
node-exporter-xjjf2 2/2 Running 0 58s
prometheus-adapter-5b849bbc57-tlwvd 1/1 Running 0 58s
prometheus-adapter-5b849bbc57-xjznh 1/1 Running 0 58s
prometheus-k8s-0 2/2 Running 1 57s
prometheus-k8s-1 2/2 Running 1 57s
prometheus-operator-bb5c5b6c8-xtkdn 2/2 Running 0 18m

Change the Grafana Service to the NodePort type:

# kubectl edit svc grafana -n monitoring

Find the NodePort assigned to the Grafana Service:

# kubectl get svc grafana -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
grafana NodePort 192.168.183.25 <none> 3000:31266/TCP 4m56s

Grafana can then be reached through port 31266 on any node that runs kube-proxy:


The default Grafana login is admin/admin. Change the Prometheus Service to NodePort in the same way:
# kubectl edit svc prometheus-k8s -n monitoring
# kubectl get svc -n monitoring prometheus-k8s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-k8s NodePort 192.168.135.107 <none> 9090:31922/TCP 9m52s

The Prometheus Web UI is then reachable on port 31922:



Note: a few alerts fire by default right after installation; they can be ignored for now.

(3) Monitoring Cloud-Native and Non-Cloud-Native Applications

1. Sources of Monitoring Data

The most commonly used Exporters are listed below:

Type               Exporters
Database           MySQL Exporter, Redis Exporter, MongoDB Exporter, MSSQL Exporter
Hardware           Apcupsd Exporter, IoT Edison Exporter, IPMI Exporter, Node Exporter
Message queue      Beanstalkd Exporter, Kafka Exporter, NSQ Exporter, RabbitMQ Exporter
Storage            Ceph Exporter, Gluster Exporter, HDFS Exporter, ScaleIO Exporter
HTTP service       Apache Exporter, HAProxy Exporter, Nginx Exporter
API service        AWS ECS Exporter, Docker Cloud Exporter, Docker Hub Exporter, GitHub Exporter
Logging            Fluentd Exporter, Grok Exporter
Monitoring system  Collectd Exporter, Graphite Exporter, InfluxDB Exporter, Nagios Exporter, SNMP Exporter
Other              Blackbox Exporter, JIRA Exporter, Jenkins Exporter, Confluence Exporter

2. Monitoring a Cloud-Native Application: Etcd

Test access to the Etcd metrics endpoint:

# curl -s --cert /etc/kubernetes/pki/etcd/etcd.pem --key /etc/kubernetes/pki/etcd/etcd-key.pem https://YOUR_ETCD_IP:2379/metrics -k | tail -1
promhttp_metric_handler_requests_total{code="503"} 0

The certificate locations can be found in the Etcd configuration file (note that its path varies between clusters; with kubeadm it is typically /etc/kubernetes/manifests/etcd.yaml):

# grep -E "key-file|cert-file" /etc/etcd/etcd.config.yml
 cert-file: '/etc/kubernetes/pki/etcd/etcd.pem'
 key-file: '/etc/kubernetes/pki/etcd/etcd-key.pem'

Creating the Etcd Service
First, create a Service and Endpoints for Etcd:

# vim etcd-svc.yaml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app: etcd-prom
  name: etcd-prom
  namespace: kube-system
subsets:
- addresses: 
  - ip: YOUR_ETCD_IP01
  - ip: YOUR_ETCD_IP02
  - ip: YOUR_ETCD_IP03
  ports:
  - name: https-metrics
    port: 2379   # etcd port
    protocol: TCP
---
apiVersion: v1
kind: Service 
metadata:
  labels:
    app: etcd-prom
  name: etcd-prom
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 2379
    protocol: TCP
    targetPort: 2379
  type: ClusterIP

Replace YOUR_ETCD_IP with the IPs of your own Etcd hosts. Also note that the port name is https-metrics; it must match the ServiceMonitor created later. Create the resources and check the Service's ClusterIP:

# kubectl create -f etcd-svc.yaml
# kubectl get svc -n kube-system etcd-prom
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
etcd-prom ClusterIP 192.168.2.188 <none> 2379/TCP 8s

Test access through the ClusterIP:

# curl -s --cert /etc/kubernetes/pki/etcd/etcd.pem --key /etc/kubernetes/pki/etcd/etcd-key.pem https://192.168.2.188:2379/metrics -k | tail -1
promhttp_metric_handler_requests_total{code="503"} 0

Create a Secret with the Etcd certificates (adjust the paths to your environment):

# kubectl create secret generic etcd-ssl --from-file=/etc/kubernetes/pki/etcd/etcd-ca.pem --from-file=/etc/kubernetes/pki/etcd/etcd.pem --from-file=/etc/kubernetes/pki/etcd/etcd-key.pem -n monitoring
secret/etcd-ssl created

Mount the certificates into the Prometheus containers. Since Prometheus is deployed by the Operator, it is enough to edit the Prometheus resource and add the Secret name under spec (secrets: [etcd-ssl]):

# kubectl edit prometheus k8s -n monitoring

After saving and exiting, the Prometheus Pods restart automatically. Once they are back up, verify that the certificates are mounted (any Prometheus Pod will do):

# kubectl get po -n monitoring -l app=prometheus
NAME READY STATUS RESTARTS AGE
prometheus-k8s-0 4/4 Running 1 29s
# kubectl exec -n monitoring prometheus-k8s-0 -c prometheus -- ls /etc/prometheus/secrets/etcd-ssl/
etcd-ca.pem
etcd-key.pem
etcd.pem

Creating the Etcd ServiceMonitor
Next, create the ServiceMonitor for Etcd:

# cat servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd
  namespace: monitoring
  labels:
    app: etcd
spec:
  jobLabel: k8s-app
  endpoints:
    - interval: 30s
      port: https-metrics  # must match Service.spec.ports.name
      scheme: https
      tlsConfig:
        caFile: /etc/prometheus/secrets/etcd-ssl/etcd-ca.pem # certificate paths
        certFile: /etc/prometheus/secrets/etcd-ssl/etcd.pem
        keyFile: /etc/prometheus/secrets/etcd-ssl/etcd-key.pem
        insecureSkipVerify: true  # skip certificate verification
  selector:
    matchLabels:
      app: etcd-prom  # must match the Service's labels
  namespaceSelector:
    matchNames:
    - kube-system

Compared with the previous ServiceMonitors, this one adds a tlsConfig block; metrics endpoints served over plain HTTP do not need it. Create the ServiceMonitor:

# kubectl create -f servicemonitor.yaml
servicemonitor.monitoring.coreos.com/etcd created

Once created, the corresponding configuration appears in the Prometheus Web UI (not shown here).
Grafana Configuration
Next, open Grafana and add a dashboard for Etcd:


Click the "+" icon ---> Import, then enter the Etcd dashboard URL https://grafana.com/grafana/dashboards/3070, as shown below:

Click Load, select the Prometheus data source, and click Import:

The state of the Etcd cluster is then visible:

3. Monitoring Non-Cloud-Native Applications with an Exporter

This section uses MySQL as a test case to demonstrate how to monitor a non-cloud-native application with an Exporter.
Deploying the Test Case
First, deploy MySQL into the Kubernetes cluster and configure its credentials:

# kubectl create deploy mysql --image=registry.cn-beijing.aliyuncs.com/dotbalo/mysql:5.7.23
deployment.apps/mysql created
# set the root password
# kubectl set env deploy/mysql MYSQL_ROOT_PASSWORD=mysql
deployment.apps/mysql env updated
# check that the Pod is running
# kubectl get po -l app=mysql
NAME         READY STATUS RESTARTS AGE
mysql-69d6f69557-5vnvg 1/1 Running 0 47s

Create a Service to expose MySQL:

# kubectl expose deploy mysql --port 3306
service/mysql exposed
# kubectl get svc -l app=mysql
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
mysql ClusterIP 192.168.140.81 <none> 3306/TCP 29s

Check that the Service is reachable:

# telnet 192.168.140.81 3306
Trying 192.168.140.81...
Connected to 192.168.140.81.
Escape character is '^]'.
J
;FuNUnhZmysql_native_password^Connection closed by foreign host.

Log in to MySQL and create the user and privileges the Exporter needs (if you already have a MySQL instance to monitor, you can run just this step against it):

# kubectl exec -ti mysql-69d6f69557-5vnvg -- bash
root@mysql-69d6f69557-5vnvg:/# mysql -uroot -pmysql
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.7.23 MySQL Community Server (GPL)
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporter' WITH MAX_USER_CONNECTIONS 3;
Query OK, 0 rows affected (0.01 sec)
mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
Query OK, 0 rows affected (0.00 sec)
mysql> quit
Bye
root@mysql-69d6f69557-5vnvg:/# exit
exit

Deploy the MySQL Exporter to collect the monitoring data:

# cat mysql-exporter.yaml 
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: mysql-exporter
  template:
    metadata:
      labels:
        k8s-app: mysql-exporter
    spec:
      containers:
      - name: mysql-exporter
        image: registry.cn-beijing.aliyuncs.com/dotbalo/mysqld-exporter 
        env:
         - name: DATA_SOURCE_NAME
           value: "exporter:exporter@(mysql.default:3306)/"
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9104
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-exporter
  namespace: monitoring
  labels:
    k8s-app: mysql-exporter
spec:
  type: ClusterIP
  selector:
    k8s-app: mysql-exporter
  ports:
  - name: api
    port: 9104
    protocol: TCP

Note the DATA_SOURCE_NAME setting: change exporter:exporter@(mysql.default:3306)/ to your actual values; the format is USERNAME:PASSWORD@(MYSQL_HOST_ADDRESS:MYSQL_PORT)/.
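As an illustration of the DSN format only (the helper mysql_dsn is made up, not part of mysqld_exporter), the string can be assembled like this:

```python
# mysql_dsn is a made-up helper; only the output format matters here.
def mysql_dsn(user: str, password: str, host: str, port: int = 3306) -> str:
    return f"{user}:{password}@({host}:{port})/"

# Values matching the manifest above; substitute your own.
dsn = mysql_dsn("exporter", "exporter", "mysql.default")
print(dsn)  # -> exporter:exporter@(mysql.default:3306)/
```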
Create the Exporter:

# kubectl create -f mysql-exporter.yaml
deployment.apps/mysql-exporter created
service/mysql-exporter created
# kubectl get -f mysql-exporter.yaml
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/mysql-exporter 1/1 1 1 39s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/mysql-exporter ClusterIP 192.168.150.122 <none> 9104/TCP 39s

Use the Service address to check that metrics are returned:

# curl 192.168.150.122:9104/metrics | tail -1
promhttp_metric_handler_requests_total{code="503"} 0

ServiceMonitor and Grafana Configuration
Create the ServiceMonitor:

# cat mysql-sm.yaml 
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mysql-exporter
  namespace: monitoring
  labels:
    k8s-app: mysql-exporter
    namespace: monitoring
spec:
  jobLabel: k8s-app
  endpoints:
  - port: api
    interval: 30s
    scheme: http
  selector:
    matchLabels:
      k8s-app: mysql-exporter
  namespaceSelector:
    matchNames:
    - monitoring

Note that matchLabels and endpoints must match the labels and port name of the mysql-exporter Service. Create the ServiceMonitor:

# kubectl create -f mysql-sm.yaml
servicemonitor.monitoring.coreos.com/mysql-exporter created

The target then appears in the Prometheus Web UI:


Import the Grafana dashboard from https://grafana.com/grafana/dashboards/6239; the import steps are the same as before and are not repeated here. Once imported, the monitoring data appears in Grafana:

4. Troubleshooting: ServiceMonitor Finds No Targets

Take kube-controller-manager and kube-scheduler as an example and first inspect their ServiceMonitors:

# kubectl get servicemonitor -n monitoring kube-controller-manager kube-scheduler
NAME AGE
kube-controller-manager 39h
kube-scheduler 39h
# kubectl get servicemonitor -n monitoring kube-controller-manager -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
...
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-controller-manager

This ServiceMonitor matches Services in the kube-system namespace that carry the label app.kubernetes.io/name=kube-controller-manager. Check whether such a Service exists:

# kubectl get svc -n kube-system -l app.kubernetes.io/name=kube-controller-manager
No resources found in kube-system namespace.

No Service carries this label, which is why no monitoring targets are found. Create the Service and Endpoints manually, pointing at your own Controller Manager instances:

apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager-prom
  namespace: kube-system
subsets:
- addresses:
  - ip: YOUR_CONTROLLER_IP01
  - ip: YOUR_CONTROLLER_IP02
  - ip: YOUR_CONTROLLER_IP03
  ports:
  - name: http-metrics
    port: 10252
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager-prom
  namespace: kube-system
spec:
  ports:
  - name: http-metrics
    port: 10252
    protocol: TCP
    targetPort: 10252
  sessionAffinity: None
  type: ClusterIP

Remember to change YOUR_CONTROLLER_IP in the Endpoints to the IPs of your own Controller Manager hosts, then create the Service and Endpoints:

# kubectl create -f controller.yaml
endpoints/kube-controller-manager-prom created
service/kube-controller-manager-prom created

Check the created Service and Endpoints:

# kubectl get svc -n kube-system kube-controller-manager-prom
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-controller-manager-prom ClusterIP 192.168.213.1 <none> 10252/TCP 34s

At this point the Service may still be unreachable: depending on how the cluster was installed, Controller Manager and Scheduler may be listening on 127.0.0.1 and therefore cannot be reached from outside. Change the bind address to 0.0.0.0:

# sed -i "s#address=127.0.0.1#address=0.0.0.0#g" /usr/lib/systemd/system/kube-controller-manager.service
# systemctl daemon-reload
# systemctl restart kube-controller-manager

Access the Controller Manager metrics endpoint through the Service's ClusterIP:

# kubectl get svc -n kube-system kube-controller-manager-prom
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-controller-manager-prom ClusterIP 192.168.213.1 <none> 10252/TCP 5m59s
# curl -s 192.168.213.1:10252/metrics | tail -1
workqueue_work_duration_seconds_count{name="DynamicServingCertificateController"} 3

Edit the ServiceMonitor so that its port name and scheme match the Service:

# kubectl edit servicemonitor kube-controller-manager -n monitoring

After a few minutes, the Controller Manager target appears in the Prometheus Web UI:




When a ServiceMonitor finds no targets, the general troubleshooting steps are:
- Confirm that the ServiceMonitor was created successfully
- Confirm that Prometheus generated the corresponding scrape configuration
- Confirm that a Service exists that matches the ServiceMonitor's selector
- Confirm that the application's metrics endpoint is reachable through the Service
- Confirm that the Service's port name and scheme match the ServiceMonitor

(4) Blackbox Monitoring

Recent versions of the Prometheus Stack install the Blackbox Exporter by default, which can be verified with:

# kubectl get po -n monitoring -l app.kubernetes.io/name=blackbox-exporter
NAME READY STATUS RESTARTS AGE
blackbox-exporter-7f88596689-fl2v8 3/3 Running 0 8d

A Service is created as well; the Blackbox Exporter can be called through it with some query parameters:

# kubectl get svc -n monitoring -l app.kubernetes.io/name=blackbox-exporter
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
blackbox-exporter ClusterIP 192.168.204.117 <none> 9115/TCP,19115/TCP 8d

For example, to probe the status of the gaoxin.kubeasy.com website (any public domain or internal company domain will do):

# curl -s "http://192.168.204.117:19115/probe?target=gaoxin.kubeasy.com&module=http_2xx" | tail -1
probe_tls_version_info{version="TLS 1.2"} 1

Here /probe is the endpoint path, target is the probe target, and module selects which module to use for the probe.
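As a sketch of how the curl call above is assembled, the probe URL is just the /probe path plus ordinary query parameters (the helper probe_url is made up for illustration; the host and port come from the Service output above):

```python
from urllib.parse import urlencode

# probe_url is a made-up helper showing how the /probe request is formed.
def probe_url(exporter: str, target: str, module: str = "http_2xx") -> str:
    query = urlencode({"target": target, "module": module})
    return f"http://{exporter}/probe?{query}"

url = probe_url("192.168.204.117:19115", "gaoxin.kubeasy.com")
print(url)
# -> http://192.168.204.117:19115/probe?target=gaoxin.kubeasy.com&module=http_2xx
```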
If the cluster does not have the Blackbox Exporter, it can be installed by following https://github.com/prometheus/blackbox_exporter.

(5) Prometheus Static Configuration

First create an empty file and build a Secret from it; that Secret will serve as Prometheus's static configuration:

# touch prometheus-additional.yaml
# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
secret/additional-configs created

After creating the Secret, edit the Prometheus resource to reference it:

# kubectl edit prometheus -n monitoring k8s
  additionalScrapeConfigs:
    key: prometheus-additional.yaml
    name: additional-configs
    optional: true

After saving, the change takes effect without restarting the Prometheus Pods. Static scrape configurations can now be added to prometheus-additional.yaml; the blackbox-monitoring configuration is used here as a demonstration:

- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]  # Look for a HTTP 200 response.
  static_configs:
    - targets:
      - http://gaoxin.kubeasy.com    # Target to probe with http.
      - https://www.baidu.com   # Target to probe with https.
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:19115  # The blackbox exporter's real hostname:port.
1)  targets: the probe targets; change them to your own
2)  module (under params): which module to use for probing
3)  replacement: the address of the Blackbox Exporter
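What the three relabel_configs rules do to a single target can be mimicked in a few lines. This reproduces only these specific rules (the default replace action on a single source label), not Prometheus's full relabeling engine:

```python
# Labels for one static target before the scrape; only these rules are modeled.
labels = {"__address__": "http://gaoxin.kubeasy.com"}

# Rule 1: copy __address__ into __param_target (becomes ?target=... on /probe)
labels["__param_target"] = labels["__address__"]
# Rule 2: copy __param_target into instance (the label shown on the series)
labels["instance"] = labels["__param_target"]
# Rule 3: overwrite __address__ with the exporter itself, so Prometheus
# actually scrapes the Blackbox Exporter rather than the probed site
labels["__address__"] = "blackbox-exporter:19115"

print(labels["instance"])     # -> http://gaoxin.kubeasy.com
print(labels["__address__"])  # -> blackbox-exporter:19115
```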

The content is the same as in a traditional Prometheus configuration file; only the corresponding job needs to be added. Update the Secret from the file:

# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml --dry-run=client -oyaml | kubectl replace -f - -n monitoring

About a minute after the update, the configuration appears in the Prometheus Web UI:


Once the target's state is UP, import the blackbox-monitoring dashboard (https://grafana.com/grafana/dashboards/13659):

The other modules are used similarly; see https://github.com/prometheus/blackbox_exporter.

(6) Monitoring Windows (External) Hosts with Prometheus

The Exporter for Linux hosts is https://github.com/prometheus/node_exporter
The Exporter for Windows hosts is https://github.com/prometheus-community/windows_exporter

First download the Exporter onto the Windows host (MSI downloads: https://github.com/prometheus-community/windows_exporter/releases):


Double-click the downloaded installer to complete the installation; the corresponding process then appears in Task Manager:

Windows Exporter listens on port 9182, which serves the Windows monitoring data. Add the following to the static configuration file:
- job_name: 'WindowsServerMonitor'
  static_configs:
    - targets:
      - "1.1.1.1:9182"
      labels:
        server_type: 'windows'
  relabel_configs:
    - source_labels: [__address__]
      target_label: instance

targets lists the hosts to monitor; for multiple Windows hosts, add one line per host (each host needs the Exporter installed). The data then appears in the Prometheus Web UI:


Finally, import the dashboard from https://grafana.com/grafana/dashboards/12566:

(7) PromQL Basics

1. First Steps with PromQL

The Graph tab of the Prometheus Web UI provides a simple query entry point where PromQL expressions can be written and validated, as shown below:


Enter up and click Execute to see all healthy targets:

Filter by label to show only the node-exporter job: up{job="node-exporter"}

Note that up{job="node-exporter"} is an exact match; PromQL also supports the following matchers:
- !=: not equal;
- =~: the label value matches the regular expression;
- !~: the label value does not match the regular expression.
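A detail worth knowing: PromQL regex matchers are fully anchored, so the pattern must match the entire label value. Python's re.fullmatch reproduces this behavior (the helper label_matches is made up for illustration):

```python
import re

def label_matches(pattern: str, value: str) -> bool:
    # PromQL implicitly wraps the pattern in ^(?:...)$, i.e. a full match.
    return re.fullmatch(pattern, value) is not None

print(label_matches(r"/dev/.*", "/dev/sda1"))   # -> True  (=~ "/dev/.*")
print(label_matches(r"dev", "/dev/sda1"))       # -> False (must match fully)
print(label_matches(r"shm|tmpfs", "tmpfs"))     # -> True  (!~ would exclude it)
```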

To see which host metrics are available, type node and the UI suggests all host-related metric names:



To query the total disk size on each host in the Kubernetes cluster, use node_filesystem_size_bytes:



Query the size of a specific partition: node_filesystem_size_bytes{mountpoint="/"}

Or query the size of partitions whose device starts with /dev/ and whose mount point is not /boot (result not shown):

node_filesystem_size_bytes{device=~"/dev/.*", mountpoint!="/boot"}

Query the change of available disk space on host k8s-master01 over the last 5 minutes:

node_filesystem_avail_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"}[5m]

The supported range units are:
- s: seconds
- m: minutes
- h: hours
- d: days
- w: weeks
- y: years
To query the available disk space as of 10 minutes ago, add the offset modifier:

node_filesystem_avail_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"} offset 10m

Query the 5-minute range of available disk space, ending 10 minutes ago:

node_filesystem_avail_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"}[5m] offset 10m

2. PromQL Operators

The disk-space query above returns results like this:



The bytes can be converted to GB (or MB) by dividing:

node_filesystem_avail_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"} / 1024 / 1024 / 1024

The 1024 / 1024 / 1024 can also be written as (1024 ^ 3):

node_filesystem_avail_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"} / (1024 ^ 3)

The result, shown below, is about 15 GB:



This can be cross-checked on the host itself:

# df -Th | grep /dev/mapper/centos-root
/dev/mapper/centos-root xfs 36G 20G 16G 57% /

Above, "/" is division and "^" is exponentiation. The following arithmetic operators are supported:
- +: addition
- -: subtraction
- *: multiplication
- /: division
- ^: exponentiation
- %: modulo
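The byte-to-GB conversion above is plain arithmetic and can be sanity-checked anywhere; the 16G figure below is taken from the df output above:

```python
# 1 GiB = 1024 ** 3 bytes; dividing three times by 1024 is the same thing.
avail_bytes = 16 * 1024 ** 3              # roughly the 16G "Avail" shown by df
print(avail_bytes / 1024 / 1024 / 1024)   # -> 16.0
print(avail_bytes / 1024 ** 3)            # -> 16.0
```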

To compute the available ratio of the root partition on k8s-master01:

node_filesystem_avail_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"} / node_filesystem_size_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"}

Query the available ratio of the root partition on all hosts:

node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}

Multiply by 100 to get a percentage directly:

(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100

Find the hosts whose root partition is more than 60% free:

(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100 > 60

PromQL supports the following comparison operators:
- ==: equal
- !=: not equal
- >: greater than
- <: less than
- >=: greater than or equal
- <=: less than or equal

Hosts whose root partition is more than 30% but at most 60% free:

30 < (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100 <= 60

The same query can be written as a conjunction with and:

(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100 > 30 and (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100 <=60

Besides and, PromQL supports or and unless:
- and: intersection
- or: union
- unless: exclusion

Query free disk space while excluding the shm and tmpfs filesystems:

node_filesystem_free_bytes unless node_filesystem_free_bytes{device=~"shm|tmpfs"}

This particular query can also be written directly as:

node_filesystem_free_bytes{device!~"shm|tmpfs"}

3. Common PromQL Functions

Use sum to total the free root-partition space across all monitored hosts:

sum(node_filesystem_free_bytes{mountpoint="/"}) / 1024^3

The same approach totals all requests:

sum(http_request_total)

Aggregate the request counts by the statuscode label:

sum(http_request_total) by (statuscode)

Aggregate further by both statuscode and handler:

sum(http_request_total) by (statuscode, handler)

Show the top five series:

topk(5, sum(http_request_total) by (statuscode, handler))

Show the bottom three:

bottomk(3, sum(http_request_total) by (statuscode, handler))

Find the minimum value among the results:

min(node_filesystem_avail_bytes{mountpoint="/"})

The maximum:

max(node_filesystem_avail_bytes{mountpoint="/"})

The average:

avg(node_filesystem_avail_bytes{mountpoint="/"})

Round up to the nearest integer (2.79 → 3):

ceil(node_filesystem_files_free{mountpoint="/"} / 1024 / 1024)

Round down (2.79 → 2):

floor(node_filesystem_files_free{mountpoint="/"} / 1024 / 1024)

Sort the results in ascending order:

sort(sum(http_request_total) by (handler, statuscode))

Sort in descending order:

sort_desc(sum(http_request_total) by (handler, statuscode))

The predict_linear function supports forecasting and predictive alerting. For example, based on one day of data, predict whether the partition will run out of free space within 4 hours:

predict_linear(node_filesystem_files_free{mountpoint="/"}[1d], 4*3600) < 0
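Conceptually, predict_linear fits a least-squares line to the samples in the range and extrapolates the given number of seconds past the last sample. Below is a minimal sketch with invented sample data (free space dropping 1 GiB per hour), not Prometheus's actual implementation:

```python
# Least-squares sketch of predict_linear; the sample data is invented.
def predict_linear(samples, seconds_ahead):
    """samples: list of (unix_ts, value); extrapolate past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept

gib = 1024 ** 3
# 10 GiB free at t=0, losing 1 GiB per hour, sampled hourly for 6 hours:
samples = [(3600 * h, (10 - h) * gib) for h in range(6)]
# 5 GiB left at the last sample; 4 more hours at -1 GiB/h leaves about 1 GiB:
print(round(predict_linear(samples, 4 * 3600) / gib, 6))  # -> 1.0
```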

A few other important functions are increase, rate, and irate. increase computes the growth of a counter over a time range (counters only), while rate and irate compute per-second growth rates. For example, how much a request counter grew in one hour:

increase(http_request_total{handler="/api/datasources/proxy/:id/*",method="get",namespace="monitoring",service="grafana",statuscode="200"}[1h])

Dividing the one-hour increase by that window (3600 seconds) gives the growth rate:

increase(http_request_total{handler="/api/datasources/proxy/:id/*",method="get",namespace="monitoring",service="grafana",statuscode="200"}[1h]) / 3600

Compared with increase, rate computes the per-second growth rate over the given window directly; the same one-hour rate can be written as:

rate(http_request_total{handler="/api/datasources/proxy/:id/*",method="get",namespace="monitoring",service="grafana",statuscode="200"}[1h])
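The relationship between increase and rate can be shown with two toy counter samples taken one hour apart (a simplification: the real functions also extrapolate to the exact window boundaries and handle counter resets):

```python
# Toy counter samples at the edges of a 1-hour window; the values are invented.
t0, v0 = 0, 1000        # counter value at the start of the window
t1, v1 = 3600, 4600     # counter value at the end

increase = v1 - v0            # what increase() reports: growth over the window
rate = increase / (t1 - t0)   # what rate() reports: per-second growth

print(increase)  # -> 3600
print(rate)      # -> 1.0
```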
