Prometheus

2020-10-14 蜀山_竹君子

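1. Introduction

Prometheus is an open-source monitoring and alerting toolkit. It was created in 2012 by former Google engineers working at SoundCloud, has been developed as a community open-source project, and was publicly released in 2015. In 2016 it became the second project hosted by the Cloud Native Computing Foundation, after Kubernetes.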

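1.1 Features

A multi-dimensional time-series data model, in which each series is identified by a metric name and key-value pairs (labels);
A flexible query language (PromQL);
Easy deployment: no dependency on distributed storage; single server nodes are autonomous;
Time series are collected over HTTP using a pull model;
Pushing time series is supported through an intermediary gateway (Pushgateway);
Monitoring targets are found through service discovery or static configuration;
Multiple modes of graphing and dashboarding are supported.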
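
To make the data model more concrete, here is a minimal Python sketch (not Prometheus code) of how a time series can be identified by its metric name plus its set of key-value labels. The metric name `http_requests_total`, the label names, and the sample values are made up for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]


def series_key(metric: str, labels: Dict[str, str]) -> SeriesKey:
    """Identify a series by metric name plus label set (label order is irrelevant)."""
    return metric, frozenset(labels.items())


# Hypothetical samples: (metric, labels, timestamp in seconds, value)
samples = [
    ("http_requests_total", {"method": "GET", "status": "200"}, 1, 10.0),
    ("http_requests_total", {"status": "200", "method": "GET"}, 2, 12.0),
    ("http_requests_total", {"method": "POST", "status": "500"}, 2, 1.0),
]

# Group samples into series: same name and same labels means same series.
store: Dict[SeriesKey, List[Tuple[int, float]]] = {}
for metric, labels, ts, value in samples:
    store.setdefault(series_key(metric, labels), []).append((ts, value))
```

The first two samples land in the same series because they share a metric name and label set; the third differs in its labels and therefore forms a separate series.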

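1.2 Architecture

2. Integrating Prometheus with Spring Cloud

Spring Cloud applications expose their monitoring metrics through Actuator, which gives the Prometheus server an endpoint to scrape.

2.1 Adding the Prometheus dependencies to Spring Boot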

        <!-- prometheus monitor start -->
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-core</artifactId>
            <version>1.2.0</version>
        </dependency>
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-prometheus</artifactId>
            <version>1.2.0</version>
        </dependency>
        <!-- prometheus monitor end -->

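2.2 Adding the Prometheus configuration

# Actuator configuration
management:
  metrics:
    export:
      prometheus:
        enabled: true
        step: 1m
        descriptions: true
    # Time all web requests (management.metrics.web.server.auto-time-requests)
    web:
      server:
        auto-time-requests: true
  endpoints:
    web:
      exposure:
        include: health,prometheus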

Visit {application-url}/actuator/prometheus to see the metrics in the Prometheus text format.
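
As a rough illustration of what that endpoint returns, the sketch below parses a small sample of the Prometheus text format into (name, labels, value) tuples. It is a simplified parser written for this article, not the official client library: it ignores timestamps and does not unescape label values. The sample text is illustrative rather than captured from a real application.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# metric_name{label="value",...} value
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text-format lines, skipping comments and blank lines."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # "# HELP" and "# TYPE" lines carry metadata only
        match = _LINE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative sample of what /actuator/prometheus might return.
SAMPLE = '''\
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="PS Eden Space",} 1.048576E7
process_files_open_files 42.0
'''

parsed = parse_metrics(SAMPLE)
```

In practice the text would come from an HTTP GET against the endpoint; the parsing step stays the same.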

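2.3 Integrating Prometheus with Nacos

Nacos server has built-in Prometheus support; exposing its metrics endpoint through configuration is all that is needed to connect it to Prometheus.
In the cluster's application.properties file, expose the metrics data: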

management.endpoints.web.exposure.include=*

Visit {ip}:8848/nacos/actuator/prometheus and check that the metrics data is returned.

3. Analysis of Monitoring Metrics

The Spring Cloud integration with Prometheus uses the monitoring metrics provided by Spring Boot 2.0 Actuator.

3.1 Spring Boot 2.0 Actuator monitoring metrics

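No.	Metric	Description	Monitored	Monitoring approach	Importance
---	JVM	---
1	jvm.memory.max	Maximum memory available to the JVM
2	jvm.memory.committed	Memory committed to the JVM	Yes	Display and monitor heap memory and Metaspace	Important
3	jvm.memory.used	Memory used by the JVM	Yes	Display and monitor heap memory and Metaspace	Important
4	jvm.buffer.memory.used	Memory used by JVM buffer pools
5	jvm.buffer.count	Current number of buffers
6	jvm.threads.daemon	Number of JVM daemon threads	Yes	Shown on the monitoring page
7	jvm.threads.live	Number of live JVM threads	Yes	Shown on the monitoring page; alert when the threshold is reached	Important
8	jvm.threads.peak	Peak number of JVM threads	Yes	Shown on the monitoring page
9	jvm.classes.loaded	Number of classes currently loaded
10	jvm.classes.unloaded	Number of classes unloaded since the JVM started
11	jvm.gc.memory.allocated	Memory allocated in the young generation between GCs
12	jvm.gc.memory.promoted	Memory promoted to the old generation during GC
13	jvm.gc.max.data.size	Maximum size of the old generation
14	jvm.gc.live.data.size	Size of the old generation after a full GC
15	jvm.gc.pause	Time spent in GC pauses	Yes	Shown on the monitoring page
---	TOMCAT	---
16	tomcat.sessions.created	Number of sessions created by Tomcat
17	tomcat.sessions.expired	Number of expired Tomcat sessions
18	tomcat.sessions.active.current	Number of currently active Tomcat sessions
19	tomcat.sessions.active.max	Maximum number of concurrently active Tomcat sessions	Yes	Shown on the monitoring page; exceeding the threshold can trigger an alert or dynamic scale-out	Important
20	tomcat.sessions.alive.max.second	Longest time a Tomcat session stayed alive, in seconds
21	tomcat.sessions.rejected	Number of sessions rejected after the configured session maximum was exceeded	Yes	Shown on the monitoring page to help analyze problems
22	tomcat.global.error	Total number of errors	Yes	Shown on the monitoring page to help analyze problems
23	tomcat.global.sent	Number of bytes sent
24	tomcat.global.request.max	Longest request duration
25	tomcat.global.request	Global request count and total time
26	tomcat.global.received	Number of bytes received
27	tomcat.servlet.request	Servlet request count and total time
28	tomcat.servlet.error	Total number of servlet errors
29	tomcat.servlet.request.max	Longest servlet request duration
30	tomcat.threads.busy	Number of busy Tomcat threads	Yes	Shown on the monitoring page; used to check whether threads are stuck
31	tomcat.threads.current	Current number of Tomcat threads (including daemon threads)	Yes	Shown on the monitoring page	Important
32	tomcat.threads.config.max	Configured maximum number of Tomcat threads	Yes	Shown on the monitoring page	Important
33	tomcat.cache.access	Number of Tomcat cache reads
34	tomcat.cache.hit	Number of Tomcat cache hits
---	CPU, etc.	---
35	system.cpu.count	Number of CPUs
36	system.load.average.1m	System load average over the last minute	Yes	Alert when the threshold is exceeded	Important
37	system.cpu.usage	System-wide CPU usage
38	process.cpu.usage	CPU usage of the current process	Yes	Alert when the threshold is exceeded
39	http.server.requests	HTTP request statistics	Yes	Show the 10 URLs with the most requests and the longest response times; count non-200 requests	Important
40	process.uptime	Application uptime	Yes	Shown on the monitoring page
41	process.files.max	Maximum number of file handles allowed	Yes	Used together with the number of currently open handles
42	process.start.time	Application start time	Yes	Shown on the monitoring page
43	process.files.open	Number of file handles currently open	Yes	Monitor file handle usage; alert when the threshold is exceeded	Important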
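
The table lists the dotted Micrometer meter names, while the Prometheus endpoint shows them with underscores and, in many cases, a unit suffix. The sketch below approximates that renaming for simple gauges and counters. It is a simplification written for this article, not Micrometer's actual naming code, and the `base_unit` and `is_counter` arguments are assumptions the caller has to supply.

```python
import re


def to_prometheus_name(meter_name: str, base_unit: str = "", is_counter: bool = False) -> str:
    """Approximate the Prometheus name for a dotted Micrometer meter name.

    Simplified: handles only plain gauges and counters, not timers or summaries.
    """
    # Characters that are not valid in a Prometheus metric name become underscores.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", meter_name)
    # The base unit, when there is one, is appended as a suffix.
    if base_unit and not name.endswith("_" + base_unit):
        name += "_" + base_unit
    # Counters conventionally end in _total.
    if is_counter and not name.endswith("_total"):
        name += "_total"
    return name
```

For example, `to_prometheus_name("jvm.memory.used", "bytes")` gives `jvm_memory_used_bytes`, which is the form to use when writing queries or alert rules against these metrics.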

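3.2 Observing and analyzing Prometheus data in Grafana

Grafana supports many monitoring data sources. Installation and data source configuration are not covered here; this section focuses on how to use the core monitoring data.
The production data source has already been configured. Use the Instance drop-down to select the system whose metrics you want to observe and analyze:

Monitoring dashboard
The application field does not yet show the application name for each IP. This requires custom development by the operations team and cannot be added for now; it will be added later. In the meantime, the application name for each IP:HOST can be looked up on the wiki under 运维服务器 - 端口总览 (Ops Servers - Port Overview).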

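3.2.1 Basic Statistics

The top of the Grafana dashboard shows basic application statistics.

Start time: corresponds to process.start.time, the application start time; it shows when the system was most recently started.
Uptime: corresponds to process.uptime, how long the application has been running.
Heap Used: corresponds to jvm.memory.used for the heap area; heap memory usage, a key metric.
Non-Heap Used: corresponds to jvm.memory.used for the non-heap area (Metaspace, Code Cache, and so on).
Process Open Files:
process.files.open: number of file handles currently open, a key metric
process.files.max: maximum number of file handles
CPU Usage: CPU usage monitoring
system.cpu.usage: system-wide CPU usage
process.cpu.usage: CPU usage of the current process
Load Average:
system.load.average.1m: the system load average, a key metric. The lower the value (for example 0.2 or 0.3), the less work the machine is doing and the lighter the system load.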
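
Two of these panels are easier to judge as ratios: open file handles relative to the limit, and load average relative to the number of CPUs. The sketch below shows one possible threshold check. The 0.8 and 1.0 thresholds and the alert names are assumptions chosen for illustration, not values from the dashboard.

```python
from typing import List


def fd_usage_ratio(open_files: float, max_files: float) -> float:
    """Fraction of the file handle limit in use (process.files.open / process.files.max)."""
    if max_files <= 0:
        raise ValueError("max_files must be positive")
    return open_files / max_files


def load_per_core(load_1m: float, cpu_count: int) -> float:
    """system.load.average.1m divided by system.cpu.count."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_1m / cpu_count


def check(open_files: float, max_files: float, load_1m: float, cpu_count: int,
          fd_threshold: float = 0.8, load_threshold: float = 1.0) -> List[str]:
    """Return the names of the checks that exceed their (assumed) thresholds."""
    alerts: List[str] = []
    if fd_usage_ratio(open_files, max_files) > fd_threshold:
        alerts.append("file_handles")
    if load_per_core(load_1m, cpu_count) > load_threshold:
        alerts.append("load")
    return alerts
```

With 900 of 1000 handles open and a load of 0.3 on 4 CPUs, `check(900, 1000, 0.3, 4)` flags only `file_handles`.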

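JVM Statistics - Memory

JVM memory statistics for the application, giving a direct view of JVM memory usage, buffer usage, class loading, thread counts, and so on.

PS Eden Space (heap): heap usage of the young generation Eden space; it directly reflects memory allocation for newly created objects
max: jvm.memory.max, maximum JVM memory
committed: jvm.memory.committed, memory committed to the JVM; displayed and monitored for heap memory and Metaspace; important
used: jvm.memory.used, memory used by the JVM
PS Old Gen (heap): heap usage of the old generation; it directly reflects memory taken by large and long-lived objects
max: jvm.memory.max, maximum JVM memory
committed: jvm.memory.committed, memory committed to the JVM; displayed and monitored for heap memory and Metaspace; important
used: jvm.memory.used, memory used by the JVM
PS Survivor Space (heap): heap usage of the young generation Survivor spaces, which reflects how objects age and are promoted; monitoring this space helps detect "premature promotion"
max: jvm.memory.max, maximum JVM memory
committed: jvm.memory.committed, memory committed to the JVM; displayed and monitored for heap memory and Metaspace; important
used: jvm.memory.used, memory used by the JVM
Code Cache (non-heap): the memory area where the JVM stores generated native code. JIT compilation, JNI, and so on all produce native code, and JIT-generated native code takes up most of the Code Cache
Compressed Class Space (non-heap): memory allocated to the compressed class pointer space
Metaspace (non-heap): memory allocated to Java class metadata. Java 8 removed the permanent generation (PermGen) and introduced Metaspace in its place
Classes: class loading statistics
Classes Unloaded: number of classes unloaded
Classes Loaded: number of classes loaded
Mapped Buffers: memory allocated to memory-mapped buffers; can be ignored
Direct Buffers: memory used by the JVM's direct buffers
Memory Allocate/Promote: memory allocated in the young generation and memory promoted to the old generation during GC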
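
The per-pool panels above come down to comparing used memory against the pool maximum. The sketch below computes that ratio per memory pool from (name, labels, value) samples, assuming the Prometheus-style names `jvm_memory_used_bytes` and `jvm_memory_max_bytes` with an `id` label naming the pool. A maximum of -1 means the JVM reports no defined limit for that pool, so no ratio is returned for it. The sample values are made up.

```python
from typing import Dict, List, Optional, Tuple

Sample = Tuple[str, Dict[str, str], float]


def pool_usage(samples: List[Sample]) -> Dict[str, Optional[float]]:
    """Return used/max per memory pool, or None where the pool has no defined max."""
    used: Dict[str, float] = {}
    limit: Dict[str, float] = {}
    for name, labels, value in samples:
        pool = labels.get("id")
        if pool is None:
            continue
        if name == "jvm_memory_used_bytes":
            used[pool] = value
        elif name == "jvm_memory_max_bytes":
            limit[pool] = value
    usage: Dict[str, Optional[float]] = {}
    for pool, used_bytes in used.items():
        max_bytes = limit.get(pool, -1.0)
        usage[pool] = used_bytes / max_bytes if max_bytes > 0 else None
    return usage


# Hypothetical values for two pools.
example = [
    ("jvm_memory_used_bytes", {"area": "heap", "id": "PS Old Gen"}, 50.0),
    ("jvm_memory_max_bytes", {"area": "heap", "id": "PS Old Gen"}, 200.0),
    ("jvm_memory_used_bytes", {"area": "nonheap", "id": "Metaspace"}, 10.0),
    ("jvm_memory_max_bytes", {"area": "nonheap", "id": "Metaspace"}, -1.0),
]
usage = pool_usage(example)
```

Here the old generation is 25% full, while Metaspace has no ratio because its maximum is undefined.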

JVM Statistics - GC

JVM garbage collection statistics: monitors GC time, GC count, and JVM pause time.

GC Count: number of garbage collections
GC Stop the World Duration: duration of stop-the-world GC pauses
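
Both GC panels are derived from cumulative values, so a single reading says little; the difference between two scrapes is what matters. The sketch below computes the average pause between two scrapes, assuming the cumulative `jvm_gc_pause_seconds_sum` and `jvm_gc_pause_seconds_count` series. The numbers in the usage note are made up.

```python
def average_pause(prev_sum: float, prev_count: float,
                  curr_sum: float, curr_count: float) -> float:
    """Average GC pause in seconds between two scrapes of cumulative sum and count."""
    delta_count = curr_count - prev_count
    if delta_count <= 0:
        # No collections in the interval, or the counters were reset by a restart.
        return 0.0
    return (curr_sum - prev_sum) / delta_count
```

If the pause total rises from 1.0 s to 1.5 s while the count rises from 10 to 12, `average_pause(1.0, 10, 1.5, 12)` gives 0.25 s per collection.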

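HTTP Statistics

HTTP request statistics: monitors request counts, response status codes, and response times.

Request Count: number of requests per URL

Time range

Request status code: number of non-200 requests
Request time: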
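
As a sketch of how the request-count and status-code figures could be derived, the function below aggregates (name, labels, value) samples into the busiest URIs and the number of non-200 requests. It assumes the Prometheus-style series name `http_server_requests_seconds_count` with `uri` and `status` labels; the sample data is made up.

```python
from collections import Counter
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]


def summarize_requests(samples: List[Sample], top_n: int = 10):
    """Return (top URIs by request count, total number of non-200 requests)."""
    per_uri: Counter = Counter()
    non_200 = 0.0
    for name, labels, value in samples:
        if name != "http_server_requests_seconds_count":
            continue
        per_uri[labels.get("uri", "UNKNOWN")] += value
        if labels.get("status") != "200":
            non_200 += value
    return per_uri.most_common(top_n), non_200


# Hypothetical request counters for two URIs.
requests = [
    ("http_server_requests_seconds_count", {"uri": "/a", "status": "200"}, 5.0),
    ("http_server_requests_seconds_count", {"uri": "/a", "status": "500"}, 2.0),
    ("http_server_requests_seconds_count", {"uri": "/b", "status": "200"}, 3.0),
]
top, errors = summarize_requests(requests, top_n=1)
```

Because these counters are cumulative, a dashboard would normally apply the same aggregation to the increase over a time window rather than to the raw totals.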