16. Getting Started with Prometheus Monitoring
(1) Prometheus Architecture
![](https://img.haomeiwen.com/i20896689/034ae2e54d865cbd.png)
![](https://img.haomeiwen.com/i20896689/261794f662d4faab.png)
![](https://img.haomeiwen.com/i20896689/f7a44a72bc4600b0.png)
- Prometheus Server: the core component of the Prometheus ecosystem. It scrapes and stores time-series data, and provides data querying as well as configuration management for alerting rules.
- Alertmanager: the alerting component of the Prometheus ecosystem. Prometheus Server sends alerts to Alertmanager, which routes them to the configured recipients or groups according to its routing configuration. Alertmanager supports notification channels such as email, Webhook, WeChat, DingTalk, and SMS.
- Grafana: visualizes the data, making it easier to query and observe.
- Push Gateway: Prometheus itself pulls data, but some metrics are short-lived and may be lost if they are not scraped in time. The Push Gateway addresses this: clients push data to it, and Prometheus then pulls that data from the gateway.
- Exporter: collects monitoring data. For example, host metrics can be collected with node_exporter and MySQL metrics with mysql_exporter. An Exporter exposes an endpoint, typically /metrics, from which Prometheus scrapes the data.
- PromQL: strictly speaking not a component, but the language used to query the data. Just as SQL queries a relational database and LogQL queries Loki, PromQL queries Prometheus.
- Service Discovery: automatic discovery of monitoring targets. Common mechanisms include discovery based on Kubernetes, Consul, Eureka, and files.
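To make the pull model and service discovery concrete, a minimal scrape configuration might look like the following sketch (the job names, the example target address, and the `app=node-exporter` label filter are illustrative assumptions, not taken from a real cluster):

```yaml
scrape_configs:
  # Static pull: Prometheus periodically scrapes /metrics on each listed target.
  - job_name: 'static-example'
    static_configs:
      - targets: ['192.0.2.10:9100']   # hypothetical node_exporter address

  # Kubernetes service discovery: endpoints are discovered from the API server,
  # then filtered by their labels via relabeling.
  - job_name: 'kubernetes-endpoints'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: node-exporter
        action: keep
```

With kube-prometheus, configuration like this is normally generated by the Operator from ServiceMonitor resources rather than written by hand, as the following sections show.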
(2) Installing Prometheus
Kube-Prometheus项目地址:https://github.com/prometheus-operator/kube-prometheus/
First, use the project page to find the Kube Prometheus Stack release that matches your Kubernetes version:
![](https://img.haomeiwen.com/i20896689/04295093d604669a.png)
# git clone -b release-0.8 https://github.com/prometheus-operator/kube-prometheus.git
# cd kube-prometheus/manifests
Install the Prometheus Operator:
# kubectl create -f setup/
namespace/monitoring created
...
deployment.apps/prometheus-operator created
service/prometheus-operator created
serviceaccount/prometheus-operator created
Check the status of the Operator Pod:
# kubectl get po -n monitoring
NAME READY STATUS RESTARTS AGE
prometheus-operator-bb5c5b6c8-xtkdn 2/2 Running 0 25s
Once the Operator Pod is running, install the Prometheus Stack:
# kubectl create -f .
alertmanager.monitoring.coreos.com/main created
...
service/prometheus-k8s created
serviceaccount/prometheus-k8s created
servicemonitor.monitoring.coreos.com/prometheus-k8s created
Check the status of the Prometheus Pods:
# kubectl get po -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 59s
alertmanager-main-1 2/2 Running 0 59s
alertmanager-main-2 2/2 Running 0 59s
blackbox-exporter-7f88596689-fl2v8 3/3 Running 0 59s
grafana-766bfd54b9-cchqm 1/1 Running 0 58s
kube-state-metrics-5fd8b545b-hrzxc 3/3 Running 0 58s
node-exporter-265df 2/2 Running 0 58s
node-exporter-5qj7b 2/2 Running 0 58s
node-exporter-lxngk 2/2 Running 0 58s
node-exporter-n8p7w 2/2 Running 0 58s
node-exporter-xjjf2 2/2 Running 0 58s
prometheus-adapter-5b849bbc57-tlwvd 1/1 Running 0 58s
prometheus-adapter-5b849bbc57-xjznh 1/1 Running 0 58s
prometheus-k8s-0 2/2 Running 1 57s
prometheus-k8s-1 2/2 Running 1 57s
prometheus-operator-bb5c5b6c8-xtkdn 2/2 Running 0 18m
Change the Grafana Service to the NodePort type:
# kubectl edit svc grafana -n monitoring
![](https://img.haomeiwen.com/i20896689/f79eeb0f66c0cfb3.png)
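The edit shown above boils down to changing `spec.type`; a sketch of the relevant fragment (the port values follow the Service listing below, and the nodePort is auto-assigned by Kubernetes unless set explicitly):

```yaml
spec:
  ports:
  - port: 3000
    targetPort: 3000   # illustrative; keep whatever targetPort the Service already uses
  type: NodePort       # changed from ClusterIP
```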
Check the NodePort assigned to the Grafana Service:
# kubectl get svc grafana -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
grafana NodePort 192.168.183.25 <none> 3000:31266/TCP 4m56s
Grafana can then be reached via the IP of any node running the kube-proxy service, on port 31266:
![](https://img.haomeiwen.com/i20896689/f102439903206f33.png)
The default Grafana login is admin/admin. Next, change the Prometheus Service to NodePort in the same way:
# kubectl edit svc prometheus-k8s -n monitoring
# kubectl get svc -n monitoring prometheus-k8s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-k8s NodePort 192.168.135.107 <none> 9090:31922/TCP 9m52s
The Prometheus Web UI is then reachable on port 31922:
![](https://img.haomeiwen.com/i20896689/1bfed2463c16dc88.png)
Note: a few alerts fire by default after installation; they can be ignored for now.
(3) Monitoring Cloud-Native and Non-Cloud-Native Applications
1. Sources of monitoring data
Commonly used Exporters include:
Type | Exporter |
---|---|
Database | MySQL Exporter, Redis Exporter, MongoDB Exporter, MSSQL Exporter |
Hardware | Apcupsd Exporter, IoT Edison Exporter, IPMI Exporter, Node Exporter |
Message queue | Beanstalkd Exporter, Kafka Exporter, NSQ Exporter, RabbitMQ Exporter |
Storage | Ceph Exporter, Gluster Exporter, HDFS Exporter, ScaleIO Exporter |
HTTP service | Apache Exporter, HAProxy Exporter, Nginx Exporter |
API service | AWS ECS Exporter, Docker Cloud Exporter, Docker Hub Exporter, GitHub Exporter |
Logging | Fluentd Exporter, Grok Exporter |
Monitoring system | Collectd Exporter, Graphite Exporter, InfluxDB Exporter, Nagios Exporter, SNMP Exporter |
Other | Blackbox Exporter, JIRA Exporter, Jenkins Exporter, Confluence Exporter |
2. Monitoring a cloud-native application: Etcd
Test access to the Etcd metrics endpoint:
# curl -s --cert /etc/kubernetes/pki/etcd/etcd.pem --key /etc/kubernetes/pki/etcd/etcd-key.pem https://YOUR_ETCD_IP:2379/metrics -k | tail -1
promhttp_metric_handler_requests_total{code="503"} 0
The certificate paths can be found in the Etcd configuration file (note that its location varies between clusters; with kubeadm it is usually /etc/kubernetes/manifests/etcd.yaml):
# grep -E "key-file|cert-file" /etc/etcd/etcd.config.yml
cert-file: '/etc/kubernetes/pki/etcd/etcd.pem'
key-file: '/etc/kubernetes/pki/etcd/etcd-key.pem'
Creating the Etcd Service
First, create a Service and Endpoints for Etcd:
# vim etcd-svc.yaml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app: etcd-prom
  name: etcd-prom
  namespace: kube-system
subsets:
- addresses:
  - ip: YOUR_ETCD_IP01
  - ip: YOUR_ETCD_IP02
  - ip: YOUR_ETCD_IP03
  ports:
  - name: https-metrics
    port: 2379 # etcd port
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: etcd-prom
  name: etcd-prom
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 2379
    protocol: TCP
    targetPort: 2379
  type: ClusterIP
Make sure to replace YOUR_ETCD_IP with the IPs of your own Etcd hosts, and note that the port name is https-metrics; it must match the ServiceMonitor created later. Because the Service defines no selector, Kubernetes does not manage its Endpoints automatically; the manually created Endpoints object above supplies the backend addresses. Create the resources and check the Service's ClusterIP:
# kubectl create -f etcd-svc.yaml
# kubectl get svc -n kube-system etcd-prom
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
etcd-prom ClusterIP 192.168.2.188 <none> 2379/TCP 8s
Test access via the ClusterIP:
# curl -s --cert /etc/kubernetes/pki/etcd/etcd.pem --key /etc/kubernetes/pki/etcd/etcd-key.pem https://192.168.2.188:2379/metrics -k | tail -1
promhttp_metric_handler_requests_total{code="503"} 0
Create a Secret containing the Etcd certificates (adjust the paths to your environment):
# kubectl create secret generic etcd-ssl --from-file=/etc/kubernetes/pki/etcd/etcd-ca.pem --from-file=/etc/kubernetes/pki/etcd/etcd.pem --from-file=/etc/kubernetes/pki/etcd/etcd-key.pem -n monitoring
secret/etcd-ssl created
Mount the certificates into the Prometheus containers (since Prometheus is deployed by the Operator, only the Prometheus custom resource needs to be modified):
# kubectl edit prometheus k8s -n monitoring
![](https://img.haomeiwen.com/i20896689/b9aac77e2afdd36a.png)
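The edit in the screenshot amounts to adding the Secret name to the Prometheus resource's `spec.secrets` list; the Operator mounts every Secret listed there into the Prometheus containers under /etc/prometheus/secrets/&lt;secret-name&gt;/. A sketch of the relevant fragment:

```yaml
spec:
  # ...existing fields unchanged...
  secrets:
  - etcd-ssl   # mounted at /etc/prometheus/secrets/etcd-ssl/
```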
After saving and exiting, the Prometheus Pods restart automatically. Once they are back up, verify that the certificates are mounted (any Prometheus Pod will do):
# kubectl get po -n monitoring -l app=prometheus
NAME READY STATUS RESTARTS AGE
prometheus-k8s-0 4/4 Running 1 29s
# kubectl exec -n monitoring prometheus-k8s-0 -c prometheus -- ls /etc/prometheus/secrets/etcd-ssl/
etcd-ca.pem
etcd-key.pem
etcd.pem
Creating the Etcd ServiceMonitor
Next, create a ServiceMonitor for Etcd:
# cat servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd
  namespace: monitoring
  labels:
    app: etcd
spec:
  jobLabel: k8s-app
  endpoints:
  - interval: 30s
    port: https-metrics # must match Service.spec.ports.name
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-ssl/etcd-ca.pem # certificate paths
      certFile: /etc/prometheus/secrets/etcd-ssl/etcd.pem
      keyFile: /etc/prometheus/secrets/etcd-ssl/etcd-key.pem
      insecureSkipVerify: true # disable certificate verification
  selector:
    matchLabels:
      app: etcd-prom # must match the Service's labels
  namespaceSelector:
    matchNames:
    - kube-system
Compared with the earlier ServiceMonitors, this one adds a tlsConfig section; metrics endpoints served over plain HTTP do not need it. Create the ServiceMonitor:
# kubectl create -f servicemonitor.yaml
servicemonitor.monitoring.coreos.com/etcd created
Once created, the corresponding target appears in the Prometheus Web UI; this is not shown here.
Grafana configuration
Next, open Grafana and add a dashboard for Etcd:
![](https://img.haomeiwen.com/i20896689/0afa5f8044481114.png)
Click the "+" icon, then Import, and enter the Etcd Grafana dashboard URL https://grafana.com/grafana/dashboards/3070, as shown below:
![](https://img.haomeiwen.com/i20896689/fbb556a4e9d7186b.png)
Click Load, select the Prometheus data source, and click Import:
![](https://img.haomeiwen.com/i20896689/c708f745e65f1ad3.png)
The Etcd cluster status is then visible:
![](https://img.haomeiwen.com/i20896689/2446abb7d2dd89f3.png)
3. Monitoring a non-cloud-native application with an Exporter
This section uses MySQL as a test case to demonstrate how to monitor a non-cloud-native application with an Exporter.
Deploying the test case
First, deploy MySQL into the Kubernetes cluster; afterwards, simply configure the MySQL permissions:
# kubectl create deploy mysql --image=registry.cn-beijing.aliyuncs.com/dotbalo/mysql:5.7.23
deployment.apps/mysql created
# Set the password
# kubectl set env deploy/mysql MYSQL_ROOT_PASSWORD=mysql
deployment.apps/mysql env updated
# Check that the Pod is running
# kubectl get po -l app=mysql
NAME READY STATUS RESTARTS AGE
mysql-69d6f69557-5vnvg 1/1 Running 0 47s
Create a Service to expose MySQL:
# kubectl expose deploy mysql --port 3306
service/mysql exposed
# kubectl get svc -l app=mysql
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
mysql ClusterIP 192.168.140.81 <none> 3306/TCP 29s
Check that the Service works:
# telnet 192.168.140.81 3306
Trying 192.168.140.81...
Connected to 192.168.140.81.
Escape character is '^]'.
J
;FuNUnhZmysql_native_password^Connection closed by foreign host.
Log in to MySQL and create the user and grants needed by the Exporter (if you already have a MySQL instance you want to monitor, you only need to perform this step against it):
# kubectl exec -ti mysql-69d6f69557-5vnvg -- bash
root@mysql-69d6f69557-5vnvg:/# mysql -uroot -pmysql
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.7.23 MySQL Community Server (GPL)
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporter' WITH MAX_USER_CONNECTIONS 3;
Query OK, 0 rows affected (0.01 sec)
mysql> GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
Query OK, 0 rows affected (0.00 sec)
mysql> quit
Bye
root@mysql-69d6f69557-5vnvg:/# exit
exit
Deploy MySQL Exporter to collect the MySQL metrics:
# cat mysql-exporter.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: mysql-exporter
  template:
    metadata:
      labels:
        k8s-app: mysql-exporter
    spec:
      containers:
      - name: mysql-exporter
        image: registry.cn-beijing.aliyuncs.com/dotbalo/mysqld-exporter
        env:
        - name: DATA_SOURCE_NAME
          value: "exporter:exporter@(mysql.default:3306)/"
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9104
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-exporter
  namespace: monitoring
  labels:
    k8s-app: mysql-exporter
spec:
  type: ClusterIP
  selector:
    k8s-app: mysql-exporter
  ports:
  - name: api
    port: 9104
    protocol: TCP
Pay attention to the DATA_SOURCE_NAME setting: replace exporter:exporter@(mysql.default:3306)/ with your own values. The format is USERNAME:PASSWORD@(MYSQL_HOST:MYSQL_PORT)/.
Create the Exporter:
# kubectl create -f mysql-exporter.yaml
deployment.apps/mysql-exporter created
service/mysql-exporter created
# kubectl get -f mysql-exporter.yaml
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/mysql-exporter 1/1 1 1 39s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/mysql-exporter ClusterIP 192.168.150.122 <none> 9104/TCP 39s
Use the Service address to check that metrics can be retrieved:
# curl 192.168.150.122:9104/metrics | tail -1
promhttp_metric_handler_requests_total{code="503"} 0
ServiceMonitor and Grafana configuration
Configure the ServiceMonitor:
# cat mysql-sm.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mysql-exporter
  namespace: monitoring
  labels:
    k8s-app: mysql-exporter
    namespace: monitoring
spec:
  jobLabel: k8s-app
  endpoints:
  - port: api
    interval: 30s
    scheme: http
  selector:
    matchLabels:
      k8s-app: mysql-exporter
  namespaceSelector:
    matchNames:
    - monitoring
Note that matchLabels and the endpoint port must match the mysql-exporter Service created above. Then create the ServiceMonitor:
# kubectl create -f mysql-sm.yaml
servicemonitor.monitoring.coreos.com/mysql-exporter created
The target then appears in the Prometheus Web UI:
![](https://img.haomeiwen.com/i20896689/02c88ee26efd51a8.png)
Import the Grafana dashboard from https://grafana.com/grafana/dashboards/6239; the import steps are the same as before and are not repeated here. Once imported, the metrics are visible in Grafana:
![](https://img.haomeiwen.com/i20896689/b9a9882493ac8672.png)
4. Troubleshooting a ServiceMonitor that finds no targets
# kubectl get servicemonitor -n monitoring kube-controller-manager kube-scheduler
NAME AGE
kube-controller-manager 39h
kube-scheduler 39h
# kubectl get servicemonitor -n monitoring kube-controller-manager -oyaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
...
port: https-metrics
scheme: https
tlsConfig:
insecureSkipVerify: true
jobLabel: app.kubernetes.io/name
namespaceSelector:
matchNames:
- kube-system
selector:
matchLabels:
app.kubernetes.io/name: kube-controller-manager
This ServiceMonitor matches Services in the kube-system namespace carrying the label app.kubernetes.io/name=kube-controller-manager. Check whether such a Service exists:
# kubectl get svc -n kube-system -l app.kubernetes.io/name=kube-controller-manager
No resources found in kube-system namespace.
No Service carries this label, which is why no monitoring target is found. In this case, manually create a Service and Endpoints pointing at your own Controller Manager instances:
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager-prom
  namespace: kube-system
subsets:
- addresses:
  - ip: YOUR_CONTROLLER_IP01
  - ip: YOUR_CONTROLLER_IP02
  - ip: YOUR_CONTROLLER_IP03
  ports:
  - name: http-metrics
    port: 10252
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager-prom
  namespace: kube-system
spec:
  ports:
  - name: http-metrics
    port: 10252
    protocol: TCP
    targetPort: 10252
  sessionAffinity: None
  type: ClusterIP
Replace YOUR_CONTROLLER_IP in the Endpoints with the IPs of your own Controller Manager hosts, then create the Service and Endpoints:
# kubectl create -f controller.yaml
endpoints/kube-controller-manager-prom created
service/kube-controller-manager-prom created
Check the created Service and Endpoints:
# kubectl get svc -n kube-system kube-controller-manager-prom
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-controller-manager-prom ClusterIP 192.168.213.1 <none> 10252/TCP 34s
At this point the Service may still be unreachable: depending on how the cluster was set up, the Controller Manager and Scheduler may listen only on 127.0.0.1 and therefore cannot be accessed externally. Change their bind address to 0.0.0.0:
# sed -i "s#address=127.0.0.1#address=0.0.0.0#g" /usr/lib/systemd/system/kube-controller-manager.service
# systemctl daemon-reload
# systemctl restart kube-controller-manager
Access the Controller Manager metrics endpoint via the Service's ClusterIP:
# kubectl get svc -n kube-system kube-controller-manager-prom
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-controller-manager-prom ClusterIP 192.168.213.1 <none> 10252/TCP 5m59s
# curl -s 192.168.213.1:10252/metrics | tail -1
workqueue_work_duration_seconds_count{name="DynamicServingCertificateController"} 3
Change the ServiceMonitor configuration to match the Service:
# kubectl edit servicemonitor kube-controller-manager -n monitoring
![](https://img.haomeiwen.com/i20896689/44dc18e24cd8d6f1.png)
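The edit shown in the screenshot changes the endpoint definition to match the Service created above; a sketch of the resulting fragment (field values inferred from that Service, not copied from a live cluster):

```yaml
spec:
  endpoints:
  - interval: 30s
    port: http-metrics   # was https-metrics; must match the new Service's port name
    scheme: http         # was https; the manually created Service serves plain HTTP
```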
After a few minutes, the Controller Manager target appears in the Prometheus Web UI:
![](https://img.haomeiwen.com/i20896689/08a2bfc50e2bc379.png)
![](https://img.haomeiwen.com/i20896689/99f87cee1001a7c8.png)
When monitoring an application via a ServiceMonitor, if no target is found, the troubleshooting steps are roughly:
- Confirm the ServiceMonitor was created successfully
- Confirm Prometheus has generated the corresponding configuration
- Confirm a Service matching the ServiceMonitor exists
- Confirm the application's metrics endpoint is reachable through the Service
- Confirm the Service's port and scheme match the ServiceMonitor
(4) Black-Box Monitoring
Recent versions of the Prometheus Stack install Blackbox Exporter by default; check with:
# kubectl get po -n monitoring -l app.kubernetes.io/name=blackbox-exporter
NAME READY STATUS RESTARTS AGE
blackbox-exporter-7f88596689-fl2v8 3/3 Running 0 8d
A Service is also created, through which Blackbox Exporter can be called with probe parameters:
# kubectl get svc -n monitoring -l app.kubernetes.io/name=blackbox-exporter
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
blackbox-exporter ClusterIP 192.168.204.117 <none> 9115/TCP,19115/TCP 8d
For example, to check the status of the site gaoxin.kubeasy.com (any public domain or internal company domain can be probed), run:
# curl -s "http://192.168.204.117:19115/probe?target=gaoxin.kubeasy.com&module=http_2xx" | tail -1
probe_tls_version_info{version="TLS 1.2"} 1
Here /probe is the endpoint, target is the probe target, and module selects the probing module.
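Modules such as http_2xx are defined in the Blackbox Exporter configuration file; a minimal sketch of what such a module definition looks like (the values here are illustrative defaults, not taken from this cluster's configuration):

```yaml
modules:
  http_2xx:            # the module name referenced by the ?module= parameter
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      # with no valid_status_codes listed, any 2xx response counts as success
```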
If Blackbox Exporter is not installed in your cluster, see https://github.com/prometheus/blackbox_exporter for installation instructions.
(5) Prometheus Static Configuration
First create an empty file and build a Secret from it; this Secret will serve as Prometheus's static configuration:
# touch prometheus-additional.yaml
# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
secret/additional-configs created
After creating the Secret, edit the Prometheus resource:
# kubectl edit prometheus -n monitoring k8s
additionalScrapeConfigs:
  key: prometheus-additional.yaml
  name: additional-configs
  optional: true
![](https://img.haomeiwen.com/i20896689/69b480ca913f81e0.png)
Save and exit; the change takes effect without restarting the Prometheus Pods. Next, add some static configuration to prometheus-additional.yaml; here a black-box monitoring job is used as the example:
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]  # Look for a HTTP 200 response.
  static_configs:
  - targets:
    - http://gaoxin.kubeasy.com  # Target to probe with http.
    - https://www.baidu.com      # Target to probe with https.
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: blackbox-exporter:19115  # The blackbox exporter's real hostname:port.
1) targets: the probe targets; change them to suit your environment
2) params: which module to probe with
3) replacement: the address of the Blackbox Exporter
The content is the same as in a traditional Prometheus configuration; only the corresponding job needs to be added. Then update the Secret from the file:
# kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml --dry-run=client -oyaml | kubectl replace -f - -n monitoring
About a minute after the update, the configuration appears in the Prometheus Web UI:
![](https://img.haomeiwen.com/i20896689/137335bf85732790.png)
Once the target's state is UP, import the black-box monitoring dashboard (https://grafana.com/grafana/dashboards/13659):
![](https://img.haomeiwen.com/i20896689/40ae713bc56a3b2a.png)
Other modules are used in a similar way; see https://github.com/prometheus/blackbox_exporter.
(6) Monitoring Windows (External) Hosts with Prometheus
The Exporter for Linux hosts is https://github.com/prometheus/node_exporter; the Exporter for Windows hosts is https://github.com/prometheus-community/windows_exporter.
First download the Exporter onto the Windows host (MSI downloads: https://github.com/prometheus-community/windows_exporter/releases):
![](https://img.haomeiwen.com/i20896689/366321d7e0f82f26.png)
After downloading, double-click the file to install it; the corresponding process then appears in Task Manager:
![](https://img.haomeiwen.com/i20896689/f87d4d6b44dcca41.png)
Windows Exporter exposes port 9182, through which the Windows metrics are served. Next, add the following to the static configuration file:
- job_name: 'WindowsServerMonitor'
  static_configs:
  - targets:
    - "1.1.1.1:9182"
    labels:
      server_type: 'windows'
  relabel_configs:
  - source_labels: [__address__]
    target_label: instance
targets lists the hosts to monitor; for multiple Windows hosts, add one line per host, and of course each host must run the Exporter. The data then appears in the Prometheus Web UI:
![](https://img.haomeiwen.com/i20896689/a8dd7eeb4bc0b4a8.png)
Then import the dashboard (https://grafana.com/grafana/dashboards/12566):
![](https://img.haomeiwen.com/i20896689/06a1ca32e37f4d28.png)
(7) Introduction to PromQL, the Prometheus Query Language
1. A first taste of PromQL
The Graph tab of the Prometheus Web UI provides a simple entry point for querying data; PromQL expressions can be written and validated there, as shown:
![](https://img.haomeiwen.com/i20896689/102f2030c13fcb91.png)
Enter up and click Execute to see the healthy targets:
![](https://img.haomeiwen.com/i20896689/06c8ce7670fa8658.png)
Use a label selector to filter for the node-exporter job, with the expression up{job="node-exporter"}:
![](https://img.haomeiwen.com/i20896689/b613df7426073dd9.png)
Note that up{job="node-exporter"} is an exact match. PromQL also supports the following matchers:
- !=: not equal;
- =~: matches series whose label value matches the regular expression;
- !~: like =~, but negated — the label value must not match the regular expression.
To browse the host metrics, type node and the UI suggests all host-related metric names:
![](https://img.haomeiwen.com/i20896689/80f2c7c5d49b7f4e.png)
To query the total disk size of each host in the Kubernetes cluster, use node_filesystem_size_bytes:
![](https://img.haomeiwen.com/i20896689/587174de3648379d.png)
Query the size of a specific partition with node_filesystem_size_bytes{mountpoint="/"}:
![](https://img.haomeiwen.com/i20896689/fe7d23b8e17e22d4.png)
Or query the size of partitions whose device starts with /dev/ and whose mountpoint is not /boot (result not shown):
node_filesystem_size_bytes{device=~"/dev/.*", mountpoint!="/boot"}
Query how the available disk space on host k8s-master01 changed over the last 5 minutes:
node_filesystem_avail_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"}[5m]
![](https://img.haomeiwen.com/i20896689/6657475ec4f986a8.png)
The supported range units are:
- s: seconds
- m: minutes
- h: hours
- d: days
- w: weeks
- y: years
To query the disk space that was available 10 minutes ago, add the offset modifier:
node_filesystem_avail_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"} offset 10m
Query the 5-minute range of available disk space as of 10 minutes ago:
node_filesystem_avail_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"}[5m] offset 10m
2. PromQL operators
The PromQL queries above returned the hosts' disk space data:
![](https://img.haomeiwen.com/i20896689/1c61cfa169f96a58.png)
The bytes can be converted to GB (or MB) with division:
node_filesystem_avail_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"} / 1024 / 1024 / 1024
1024 / 1024 / 1024 can also be written as (1024 ^ 3):
node_filesystem_avail_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"} / (1024 ^ 3)
The result, about 15 GB in this case, is shown below:
![](https://img.haomeiwen.com/i20896689/477a8c780a0689f8.png)
The value can be cross-checked on the host:
# df -Th | grep /dev/mapper/centos-root
/dev/mapper/centos-root xfs 36G 20G 16G 57% /
The "/" above is arithmetic division and "^" is exponentiation; the following operators are also supported:
- +: addition
- -: subtraction
- *: multiplication
- /: division
- ^: exponentiation
- %: modulo
To compute the available-space ratio of the root partition on k8s-master01:
node_filesystem_avail_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"} / node_filesystem_size_bytes{instance="k8s-master01", mountpoint="/", device="/dev/mapper/centos-root"}
Query the root-partition availability ratio of all hosts:
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
![](https://img.haomeiwen.com/i20896689/a44a6535c82f7919.png)
Multiply the result by 100 to get a percentage directly:
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100
![](https://img.haomeiwen.com/i20896689/eb984af0cdb00f0f.png)
Find the hosts in the cluster whose root partition is more than 60% free:
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100 > 60
![](https://img.haomeiwen.com/i20896689/13cc3b5481c85ecb.png)
PromQL also supports the following comparison operators:
- ==: equal
- !=: not equal
- >: greater than
- <: less than
- >=: greater than or equal
- <=: less than or equal
Hosts whose disk availability is above 30% and at most 60%:
30 < (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100 <= 60
The same can be written as a conjunction with and:
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100 > 30 and (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} ) * 100 <=60
Besides and, or and unless are also supported:
- and: intersection
- or: union
- unless: exclusion
Query the hosts' free disk space while excluding the shm and tmpfs filesystems:
node_filesystem_free_bytes unless node_filesystem_free_bytes{device=~"shm|tmpfs"}
This can also be written directly with a negated matcher:
node_filesystem_free_bytes{device!~"shm|tmpfs"}
3. Common PromQL functions
Use the sum function to total the free root-partition space across all monitored hosts:
sum(node_filesystem_free_bytes{mountpoint="/"}) / 1024^3
![](https://img.haomeiwen.com/i20896689/e49886682885d212.png)
The same approach computes the total number of requests:
sum(http_request_total)
![](https://img.haomeiwen.com/i20896689/0d7313357d50451a.png)
Aggregate the request counts by the statuscode label:
sum(http_request_total) by (statuscode)
![](https://img.haomeiwen.com/i20896689/3540fd5f208fd318.png)
Aggregate further by both statuscode and handler:
sum(http_request_total) by (statuscode, handler)
![](https://img.haomeiwen.com/i20896689/b1e441b5a29661e9.png)
Find the top five series:
topk(5, sum(http_request_total) by (statuscode, handler))
Take the bottom three:
bottomk(3, sum(http_request_total) by (statuscode, handler))
Find the minimum of the results:
min(node_filesystem_avail_bytes{mountpoint="/"})
The maximum:
max(node_filesystem_avail_bytes{mountpoint="/"})
The average:
avg(node_filesystem_avail_bytes{mountpoint="/"})
Round up to the nearest integer (ceil), 2.79 → 3:
ceil(node_filesystem_files_free{mountpoint="/"} / 1024 / 1024)
Round down (floor), 2.79 → 2:
floor(node_filesystem_files_free{mountpoint="/"} / 1024 / 1024)
Sort the results in ascending order:
sort(sum(http_request_total) by (handler, statuscode))
Sort in descending order:
sort_desc(sum(http_request_total) by (handler, statuscode))
The predict_linear function supports predictive analysis and predictive alerting; for example, based on one day of data, predict whether a partition's free space will drop below 0 in 4 hours (the example uses node_filesystem_files_free, the free inode count; node_filesystem_avail_bytes works the same way for bytes):
predict_linear(node_filesystem_files_free{mountpoint="/"}[1d], 4*3600) < 0
Besides the above, a few other functions are especially important: increase, rate, and irate. increase computes the growth of a value over a time range (it applies only to counter-type metrics), while rate and irate compute growth rates. For example, to query how much a request counter grew over 1 hour:
increase(http_request_total{handler="/api/datasources/proxy/:id/*",method="get",namespace="monitoring",service="grafana",statuscode="200"}[1h])
Dividing the 1-hour increase by that time span in seconds gives the per-second growth rate:
increase(http_request_total{handler="/api/datasources/proxy/:id/*",method="get",namespace="monitoring",service="grafana",statuscode="200"}[1h]) / 3600
Compared with increase, rate directly computes a metric's per-second average growth rate over the given time range; the same 1-hour calculation can be done with rate:
rate(http_request_total{handler="/api/datasources/proxy/:id/*",method="get",namespace="monitoring",service="grafana",statuscode="200"}[1h])