Prometheus Programming in Practice (Part 1): Background Knowledge

2019-08-07, by 简单是美美

1. What Prometheus Is and Is Not Suited For

The Prometheus website describes the scenarios it fits as follows:

Prometheus works well for recording any purely numeric time series. It fits both machine-centric monitoring as well as monitoring of highly dynamic service-oriented architectures.

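Prometheus's core data model is the time series, so it is well suited to recording time series data and to aggregating and analyzing what has been recorded. It collects metrics by sampling at regular intervals, which makes it a good fit for gathering near-real-time monitoring data from service clusters; this is one reason it is so widely used for monitoring Kubernetes clusters.
Because the data in Prometheus is sampled, it does not fit every scenario. The Prometheus website describes where it falls short as follows: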

Prometheus values reliability. You can always view what statistics are available about your system, even under failure conditions. If you need 100% accuracy, such as for per-request billing, Prometheus is not a good choice as the collected data will likely not be detailed and complete enough.

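In other words, Prometheus is not a good choice when the complete data must be collected and 100% accuracy must be guaranteed.
My understanding of the 100% accuracy mentioned here has two aspects: 1. A query against the Prometheus time series database does not look for a sample whose timestamp exactly matches the query timestamp; it returns the most recent sample at or before the query timestamp, provided that sample falls within the lookback window (5 minutes by default). 2. The timestamp of an ingested sample is normally assigned by the Prometheus server; supplying your own timestamps is possible but comes with some restrictions.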
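The first point can be made concrete with a small sketch. It is only my approximation of the lookup behaviour, not the server's actual code: the names `instant_value` and `LOOKBACK_MS` are made up for illustration, the handling of the exact window boundary is an assumption, and staleness markers are ignored.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple

Sample = Tuple[int, float]       # (timestamp in milliseconds, value)
LOOKBACK_MS = 5 * 60 * 1000      # assumed default lookback window: 5 minutes


def instant_value(samples: Sequence[Sample], query_ts_ms: int,
                  lookback_ms: int = LOOKBACK_MS) -> Optional[Sample]:
    """Return the newest sample at or before query_ts_ms, if it is recent enough."""
    timestamps = [ts for ts, _ in samples]          # samples must be sorted by time
    idx = bisect_right(timestamps, query_ts_ms) - 1
    if idx < 0:
        return None                                 # nothing at or before the query time
    ts, value = samples[idx]
    if query_ts_ms - ts > lookback_ms:
        return None                                 # the newest sample is too old
    return ts, value


# One sample per minute, timestamps in milliseconds (illustrative values).
series = [(0, 10.0), (60_000, 12.0), (120_000, 9.0)]
print(instant_value(series, 150_000))   # (120000, 9.0): no exact match is needed
print(instant_value(series, 500_000))   # None: the last sample is older than 5 minutes
```

The query at 150 s returns the sample taken at 120 s even though no sample exists at 150 s, which is exactly why the result cannot be treated as an exact reading for that instant.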

2. Prometheus Background Knowledge

2.1. Prometheus Knowledge Map

Figure 1.png

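Prometheus consists of the server, exporters, the Pushgateway, and Alertmanager. In practice only the server and the exporters are required; the Pushgateway and Alertmanager are optional, depending on the use case.
The core data model of Prometheus is a time series model built on metric data. Each time series is uniquely identified by its metric name together with its set of labels. For example, “vehicle_passed_num{instance="viid",job="zhangkai",lane_no="1",tollgate_no="1"}” and “vehicle_passed_num{instance="viid",job="zhangkai",lane_no="1",tollgate_no="2"}” share the same metric name but differ in one label value, so they are two distinct time series.
For each time series, Prometheus stores a list of (timestamp, value) pairs, where the timestamp is a UTC time with millisecond precision and the value is a float64.
A time series looks like this: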

Figure 2.png
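The identity rule and the storage layout described above can be sketched as a dictionary keyed by metric name plus label set. This is a toy in-memory model for illustration, not how the real storage engine works; `series_key`, `append`, and `store` are names I made up, and the timestamps and values are invented.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

# A series is identified by its metric name plus its full label set.
SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]

store: Dict[SeriesKey, List[Tuple[int, float]]] = defaultdict(list)


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """Build a hashable identity; label order does not matter."""
    return name, frozenset(labels.items())


def append(name: str, labels: Dict[str, str], ts_ms: int, value: float) -> None:
    """Append one (millisecond timestamp, float value) pair to the series."""
    store[series_key(name, labels)].append((ts_ms, float(value)))


base = {"instance": "viid", "job": "zhangkai", "lane_no": "1"}
append("vehicle_passed_num", {**base, "tollgate_no": "1"}, 1563518528000, 42)
append("vehicle_passed_num", {**base, "tollgate_no": "2"}, 1563518528000, 17)
append("vehicle_passed_num", {**base, "tollgate_no": "1"}, 1563518588000, 40)

print(len(store))   # 2: the same metric name with different label values is two series
```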

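Prometheus collects data mainly by pulling: it periodically fetches metrics over HTTP from the configured targets. The samples a target exposes may carry their own timestamps; when they do not, Prometheus stamps them with its own server's system time as it writes them to its time series database.
Prometheus also offers the Pushgateway as a kind of pseudo-push mechanism. I call it pseudo because the Pushgateway merely holds the pushed values in memory, and in the end the Prometheus server still scrapes them on its regular schedule. The Pushgateway has some issues that can make it unsuitable in certain cases: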

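Metrics POSTed to the Pushgateway cannot carry custom timestamps.
For any given time series, the Pushgateway keeps only the current value: when the Prometheus server's scrape interval is longer than the push interval, some pushed values may never be collected; when the scrape interval is shorter than the push interval, the same value may be collected more than once.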
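The second issue is easy to see in a small simulation. This is a simplified model under my own assumptions (integer time steps, one series, a push at every multiple of the push period), and `simulate` is a hypothetical helper, not part of any Prometheus tool.

```python
from typing import List, Optional


def simulate(push_period: int, scrape_period: int, duration: int) -> List[int]:
    """Return the push number the scraper sees at each scrape."""
    current: Optional[int] = None       # the gateway keeps only the latest value
    scraped: List[int] = []
    for t in range(duration + 1):
        if t % push_period == 0:
            current = t // push_period  # push number 0, 1, 2, ... overwrites the old one
        if t % scrape_period == 0 and current is not None:
            scraped.append(current)
    return scraped


# Scrape interval longer than push interval: pushes 1, 2, 4, 5, 7, 8 are never seen.
print(simulate(push_period=1, scrape_period=3, duration=9))   # [0, 3, 6, 9]

# Scrape interval shorter than push interval: every push is collected three times.
print(simulate(push_period=3, scrape_period=1, duration=6))   # [0, 0, 0, 1, 1, 1, 2]
```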

The figure below shows some of the metrics the Prometheus server exposes about itself, served at “/metrics”.

Figure 3.png

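2.2. Metric Types

Prometheus provides four metric types: Counter, Gauge, Histogram, and Summary.
 Counter: a cumulative metric that represents a count that only ever increases. Typical uses are the number of requests served, tasks completed, or errors encountered. For example: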

# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 1.639836264e+09

 Gauge: a metric that represents an instantaneous measurement, which can go up or down. For example:

# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 1.9423568e+07
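The difference between the two types can be sketched in a few lines. These classes are a toy model and not the API of any official client library; the class layout, the `expose` helper, and the metric names `http_requests_total` and `queue_length` are made up for illustration.

```python
class Counter:
    """Cumulative value: it only goes up while the process is running."""
    kind = "counter"

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """Instantaneous value: it can be set to anything at any time."""
    kind = "gauge"

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = float(value)


def expose(metric) -> str:
    """Render a metric as a TYPE line followed by a sample line."""
    return f"# TYPE {metric.name} {metric.kind}\n{metric.name} {metric.value}"


requests = Counter("http_requests_total")
requests.inc()
requests.inc(2)

queue = Gauge("queue_length")
queue.set(7)
queue.set(4)          # a gauge may go down again

print(expose(requests))
print(expose(queue))
```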

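 Histogram: a histogram samples observations (typically things like request durations or response sizes) and counts them in configurable buckets. The buckets are cumulative: each one counts the observations less than or equal to its le bound. It also provides the sum of all observed values and their total count. For example: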

# TYPE prometheus_http_response_size_bytes histogram
prometheus_http_response_size_bytes_bucket{handler="/",le="100"} 1
prometheus_http_response_size_bytes_bucket{handler="/",le="1000"} 1
prometheus_http_response_size_bytes_bucket{handler="/",le="10000"} 1
prometheus_http_response_size_bytes_bucket{handler="/",le="100000"} 1
prometheus_http_response_size_bytes_bucket{handler="/",le="1e+06"} 1
prometheus_http_response_size_bytes_bucket{handler="/",le="1e+07"} 1
prometheus_http_response_size_bytes_bucket{handler="/",le="1e+08"} 1
prometheus_http_response_size_bytes_bucket{handler="/",le="1e+09"} 1
prometheus_http_response_size_bytes_bucket{handler="/",le="+Inf"} 1
prometheus_http_response_size_bytes_sum{handler="/"} 29
prometheus_http_response_size_bytes_count{handler="/"} 1
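To show why every bucket in the sample above holds 1, here is a minimal sketch of cumulative bucket counting. It is my own toy model, not a client library: the `Histogram` class is hypothetical, the `handler` label is left out for brevity, and the bucket bounds are read off the sample output.

```python
import math
from typing import List, Sequence


class Histogram:
    """Toy histogram with cumulative buckets, a running sum, and a count."""

    def __init__(self, name: str, bounds: Sequence[float]) -> None:
        self.name = name
        self.bounds = sorted(bounds) + [math.inf]     # the +Inf bucket always exists
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1            # every bucket with bound >= value
        self.total += value
        self.count += 1

    def lines(self) -> List[str]:
        out = []
        for bound, n in zip(self.bounds, self.bucket_counts):
            le = "+Inf" if math.isinf(bound) else f"{bound:g}"
            out.append(f'{self.name}_bucket{{le="{le}"}} {n}')
        out.append(f"{self.name}_sum {self.total:g}")
        out.append(f"{self.name}_count {self.count}")
        return out


h = Histogram("prometheus_http_response_size_bytes", [10 ** k for k in range(2, 10)])
h.observe(29)     # one 29-byte response falls into every bucket, since 29 <= 100
print("\n".join(h.lines()))
```

A second observation of, say, 500 bytes would leave the le="100" bucket at 1 and raise all the larger buckets to 2.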

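 Summary: like a histogram, a summary samples observations (typically request durations and response sizes). It also provides the total count of observations and the sum of all observed values, but what sets it apart is that it calculates configurable quantiles over a sliding time window. For example: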

# TYPE prometheus_notifications_latency_seconds summary
prometheus_notifications_latency_seconds{alertmanager="http://172.16.64.159:8081/api/v1/alerts",quantile="0.5"} 0.008799
prometheus_notifications_latency_seconds{alertmanager="http://172.16.64.159:8081/api/v1/alerts",quantile="0.9"} 0.0140762
prometheus_notifications_latency_seconds{alertmanager="http://172.16.64.159:8081/api/v1/alerts",quantile="0.99"} 0.065916
prometheus_notifications_latency_seconds_sum{alertmanager="http://172.16.64.159:8081/api/v1/alerts"} 17.340421599999974
prometheus_notifications_latency_seconds_count{alertmanager="http://172.16.64.159:8081/api/v1/alerts"} 1454
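The split between windowed quantiles and cumulative sum and count can be illustrated as follows. This is a deliberately naive sketch: it keeps every recent observation and sorts them, whereas real client libraries use streaming estimators. The `WindowSummary` class, its nearest-rank quantile rule, and the sample data are my own assumptions.

```python
import math
from collections import deque
from typing import Deque, Tuple


class WindowSummary:
    """Toy summary: quantiles over a sliding window, sum and count over all time."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self.recent: Deque[Tuple[float, float]] = deque()   # (time, value) in the window
        self.total = 0.0                                    # never expires
        self.count = 0                                      # never expires

    def _expire(self, now: float) -> None:
        while self.recent and now - self.recent[0][0] > self.window:
            self.recent.popleft()

    def observe(self, now: float, value: float) -> None:
        self.recent.append((now, value))
        self.total += value
        self.count += 1
        self._expire(now)

    def quantile(self, q: float, now: float) -> float:
        self._expire(now)
        values = sorted(v for _, v in self.recent)
        if not values:
            return math.nan                                 # nothing observed recently
        rank = max(0, math.ceil(q * len(values)) - 1)       # nearest-rank method
        return values[rank]


s = WindowSummary(window_seconds=60)
for t, latency in enumerate([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]):
    s.observe(now=t, value=latency)

print(s.quantile(0.5, now=9), s.quantile(0.99, now=9))
print(s.quantile(0.5, now=1000), s.count, s.total)   # quantile expires, totals do not
```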

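2.3. Query Expressions

Query expressions are the core concept in programming against Prometheus. Prometheus provides a functional query language called PromQL (Prometheus Query Language) that lets users select and aggregate time series data in real time. The result of an expression can be shown as a graph, viewed as tabular data in Prometheus's expression browser, or consumed by external systems through the HTTP API.
Query expressions can retrieve either the instant value of a time series or its values over a given time range.
Suppose the exporter records, once per minute, the number of vehicles that passed through each lane of each tollgate during that minute.
As the figure below shows, this queries the instant vehicle count for lane 1 of tollgate 1: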

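Figure 4.png

A query expression can also return the individual vehicle counts recorded over the last 5 minutes, as shown below:

Figure 5.png

The two query modes can be pictured as follows: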

Figure 6.png

Figure 7.png

Prometheus provides an HTTP API for both query modes:
Instant queries
Endpoint: GET /api/v1/query
Query parameters: query=<string>, time=<rfc3339 | unix_timestamp>, timeout=<duration>
Range queries
Endpoint: GET /api/v1/query_range
Query parameters: query=<string>, start=<rfc3339 | unix_timestamp>, end=<rfc3339 | unix_timestamp>, step=<duration | float>, timeout=<duration>
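As a rough illustration of how a client might build requests for these two endpoints, the sketch below only assembles the URLs and sends nothing. The server address `http://localhost:9090` and the helper names are assumptions of mine; the paths and parameter names are the ones listed above.

```python
from typing import Optional, Union
from urllib.parse import urlencode

BASE_URL = "http://localhost:9090"     # assumed address of the Prometheus server

Time = Union[str, int, float]          # RFC 3339 string or Unix timestamp


def instant_query_url(query: str, time: Optional[Time] = None,
                      timeout: Optional[str] = None) -> str:
    """Build the URL for GET /api/v1/query."""
    params = {"query": query}
    if time is not None:
        params["time"] = time
    if timeout is not None:
        params["timeout"] = timeout
    return f"{BASE_URL}/api/v1/query?{urlencode(params)}"


def range_query_url(query: str, start: Time, end: Time, step: Union[str, float],
                    timeout: Optional[str] = None) -> str:
    """Build the URL for GET /api/v1/query_range."""
    params = {"query": query, "start": start, "end": end, "step": step}
    if timeout is not None:
        params["timeout"] = timeout
    return f"{BASE_URL}/api/v1/query_range?{urlencode(params)}"


selector = 'vehicle_passed_num{tollgate_no="1",lane_no="1"}'
print(instant_query_url(selector, time=1563518528))
print(range_query_url(selector, start=1563518228, end=1563518528, step="1m"))
```

`urlencode` percent-encodes the braces and quotes in the selector, which matters because PromQL expressions are rarely URL-safe as written.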

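2.4. Alert Management

Prometheus lets you define alerting rules and raises an alert when a rule's condition is met. The condition is a query expression. Alerting rules are configured on the server side, in rule files referenced from the Prometheus configuration file.
As the figure below shows, we configured the expression “sum_over_time(vehicle_passed_num{instance="viid",job="zhangkai"}[5m]) < 200” as an alerting rule on the server. The expression is evaluated separately for each time series, so an alert fires for every lane whose vehicle count over the last 5 minutes is below 200.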

Figure 8.png
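What the rule expression computes can be approximated in plain Python. This is a sketch of the semantics as I understand them, not the PromQL engine: the function names are made up, the treatment of the window boundary is an assumption, and the per-minute counts are invented (only the total of 129 echoes a value that appears in the alert payload later in this section).

```python
from typing import Dict, List, Optional, Tuple

Sample = Tuple[int, float]             # (timestamp in ms, value)
LaneKey = Tuple[str, str]              # (tollgate_no, lane_no)

MINUTE_MS = 60_000
WINDOW_MS = 5 * MINUTE_MS
THRESHOLD = 200


def sum_over_time(samples: List[Sample], now_ms: int,
                  window_ms: int = WINDOW_MS) -> Optional[float]:
    """Sum the samples in (now - window, now]; None if the window is empty."""
    values = [v for ts, v in samples if now_ms - window_ms < ts <= now_ms]
    return sum(values) if values else None


def evaluate_rule(series: Dict[LaneKey, List[Sample]], now_ms: int) -> Dict[LaneKey, float]:
    """Apply the '< 200' condition to each series independently."""
    firing = {}
    for key, samples in series.items():
        total = sum_over_time(samples, now_ms)
        if total is not None and total < THRESHOLD:
            firing[key] = total        # one alert per series below the threshold
    return firing


now = 10 * MINUTE_MS
series = {
    ("1", "1"): [(now - i * MINUTE_MS, 50.0) for i in range(5)],
    ("2", "1"): [(now - i * MINUTE_MS, v)
                 for i, v in enumerate([25.0, 26.0, 26.0, 26.0, 26.0])],
}
print(evaluate_rule(series, now))      # {('2', '1'): 129.0}
```

Tollgate 1 totals 250 and stays quiet, while tollgate 2 totals 129 and fires, so each lane gets its own alert rather than one alert for all lanes combined.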

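The Prometheus server configuration file also specifies where alerts are sent once they fire; in our setup the target address is “172.16.64.159:8081”. Capturing the traffic sent to that address with a packet capture tool shows that the Prometheus server sends the alerts as a POST request to the URI “/api/v1/alerts”.

Figure 9.png
The request body is JSON. In the example below, the alert name “5分钟内过车数据小于200” means "fewer than 200 vehicles passed within 5 minutes":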

[
    {
        "labels": {
            "alertname": "5分钟内过车数据小于200",
            "instance": "viid",
            "job": "zhangkai",
            "lane_no": "1",
            "status": "warning",
            "tollgate_no": "2",
            "value": "129"
        },
        "annotations": {
            "description": "viid:5分钟内过车数据小于200:129",
            "summary": "viid:5分钟内过车数据小于200:129"
        },
        "startsAt": "2019-07-19T06:42:08.639138396Z",
        "endsAt": "2019-07-19T06:45:08.639138396Z",
        "generatorURL": "http://68724a9ce8bb:9090/graph?g0.expr=sum_over_time%28vehicle_passed_num%7Binstance%3D%22viid%22%2Cjob%3D%22zhangkai%22%7D%5B5m%5D%29+%3C+200\u0026g0.tab=1"
    },
    {
        "labels": {
            "alertname": "5分钟内过车数据小于200",
            "instance": "viid",
            "job": "zhangkai",
            "lane_no": "1",
            "status": "warning",
            "tollgate_no": "3",
            "value": "193"
        },
        "annotations": {
            "description": "viid:5分钟内过车数据小于200:193",
            "summary": "viid:5分钟内过车数据小于200:193"
        },
        "startsAt": "2019-07-19T06:42:08.639138396Z",
        "endsAt": "2019-07-19T06:45:08.639138396Z",
        "generatorURL": "http://68724a9ce8bb:9090/graph?g0.expr=sum_over_time%28vehicle_passed_num%7Binstance%3D%22viid%22%2Cjob%3D%22zhangkai%22%7D%5B5m%5D%29+%3C+200\u0026g0.tab=1"
    }
]
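A receiver on the other end only needs to parse this JSON array. The sketch below shows one way to pull out the fields used in this example; `summarize_alerts` is a hypothetical helper, the shortened sample body is my own, and the "value" label exists only because our rule attaches it.

```python
import json
from typing import Any, Dict, List


def summarize_alerts(body: str) -> List[Dict[str, Any]]:
    """Extract a few fields from each alert in a POSTed JSON array."""
    summaries = []
    for alert in json.loads(body):
        labels = alert.get("labels", {})
        summaries.append({
            "alertname": labels.get("alertname"),
            "tollgate_no": labels.get("tollgate_no"),
            "lane_no": labels.get("lane_no"),
            "value": labels.get("value"),        # label values are always strings
            "startsAt": alert.get("startsAt"),
        })
    return summaries


# A shortened body with the same shape as the captured request.
sample_body = json.dumps([{
    "labels": {"alertname": "demo", "lane_no": "1", "tollgate_no": "2", "value": "129"},
    "annotations": {"summary": "viid:demo:129"},
    "startsAt": "2019-07-19T06:42:08.639138396Z",
    "endsAt": "2019-07-19T06:45:08.639138396Z",
}])

for item in summarize_alerts(sample_body):
    print(item)
```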