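Kyverno Monitoring
Monitoring Kyverno policy metrics with Prometheus
Introduction
As a cluster administrator, you may benefit from being able to monitor the state and execution of the Kyverno policies applied across your cluster. This includes any changes made to policies, any activity associated with incoming requests, and the results those requests produce. When enabled, monitoring lets you visualize and alert on applied policies, which is critical to overall cluster observability and compliance.
In addition, you can scope your monitoring targets to the rule, policy, or cluster level, which lets you extract finer-grained information from the collected metrics.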
Installation and Configuration
When you install Kyverno with Helm, a service named kyverno-svc-metrics is created in the kyverno namespace, and it exposes port 8000. The corresponding section of values.yaml is:
...
metricsService:
  create: true
  type: ClusterIP
  ## Kyverno's metrics server will be exposed at this port
  port: 8000
  ## The Node's port which will allow access Kyverno's metrics at the host level. Only used if service.type is NodePort.
  nodePort:
  ## Provide any additional annotations which may be required. This can be used to
  ## set the LoadBalancer service type to internal only.
  ## ref: https://kubernetes.io/docs/concepts/services-networking/service/#internal-load-balancer
  ##
  annotations: {}
...
By default, the service type is ClusterIP, which means its metrics can only be scraped by a Prometheus server running inside the cluster.
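For the in-cluster case, a Prometheus scrape job could look like the following minimal sketch. It assumes the default service name, namespace, and port described above; the job name and the 30s interval are illustrative choices and are not part of the Kyverno chart.

```yaml
# prometheus.yml (fragment): hypothetical scrape job for Kyverno's metrics service
scrape_configs:
  - job_name: kyverno          # illustrative job name
    scrape_interval: 30s       # assumed interval; tune to your needs
    metrics_path: /metrics     # Prometheus default path, assumed to match Kyverno's endpoint
    static_configs:
      - targets:
          # <service>.<namespace>.svc:<port>, using the chart defaults above
          - kyverno-svc-metrics.kyverno.svc:8000
```

A static target keeps the sketch short; clusters that run the Prometheus Operator or Kubernetes service discovery would typically express the same target differently.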
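In some cases, the Prometheus server may run outside your workload cluster as a shared service. In these scenarios, you will want the kyverno-svc-metrics service to be exposed publicly so that the metrics (available on port 8000) can reach the Prometheus server outside your cluster.
Services can be exposed to external clients through an Ingress, or by using the LoadBalancer or NodePort service types.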
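To expose the kyverno-svc-metrics service as a NodePort on port 8000 of each host/node, configure your values.yaml before installing the Helm chart as follows. Note that 8000 lies outside the default Kubernetes NodePort range (30000-32767), so this value is accepted only if the API server's --service-node-port-range includes it.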
...
metricsService:
  create: true
  type: NodePort
  ## Kyverno's metrics server will be exposed at this port
  port: 8000
  ## The Node's port which will allow access Kyverno's metrics at the host level. Only used if service.type is NodePort.
  nodePort: 8000
  ## Provide any additional annotations which may be required. This can be used to
  ## set the LoadBalancer service type to internal only.
  ## ref: https://kubernetes.io/docs/concepts/services-networking/service/#internal-load-balancer
  ##
  annotations: {}
...
To expose the kyverno-svc-metrics service using the LoadBalancer type, configure your values.yaml before installing the Helm chart as follows:
...
metricsService:
  create: true
  type: LoadBalancer
  ## Kyverno's metrics server will be exposed at this port
  port: 8000
  ## The Node's port which will allow access Kyverno's metrics at the host level. Only used if service.type is NodePort.
  nodePort:
  ## Provide any additional annotations which may be required. This can be used to
  ## set the LoadBalancer service type to internal only.
  ## ref: https://kubernetes.io/docs/concepts/services-networking/service/#internal-load-balancer
  ##
  annotations: {}
...
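The annotations field mentioned in the comments above is where a cloud-specific internal load balancer setting would go. The sketch below assumes an AWS cluster using the legacy in-tree annotation; other providers use different keys (for example, Azure uses service.beta.kubernetes.io/azure-load-balancer-internal), so treat the key as an assumption to verify against your provider's documentation.

```yaml
metricsService:
  create: true
  type: LoadBalancer
  port: 8000
  annotations:
    # Assumption: AWS; keeps the load balancer reachable only from inside the VPC
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
```

An internal load balancer suits the shared-Prometheus scenario when that server sits in the same private network, since the metrics endpoint never becomes reachable from the public internet.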
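Configuring Metrics
When installing Kyverno through Helm, you can also configure which metrics are exposed.
- When configuring the Helm chart, you can choose which namespaces to include and/or exclude for metrics export. This is useful when you want to keep Kyverno from exposing metrics for namespaces of little interest that you only work with periodically, such as test namespaces. Likewise, you can include only specific namespaces if you want to monitor Kyverno activity for a particular set of critical namespaces. Exporting the right set of namespaces, rather than all of them, can substantially reduce the memory footprint of Kyverno's metrics exporter.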
...
config:
  metricsConfig:
    namespaces: {
      "include": [],
      "exclude": []
    }
    # 'namespaces.include': list of namespaces to capture metrics for. Default: all namespaces included.
    # 'namespaces.exclude': list of namespaces to NOT capture metrics for. Default: [], none of the namespaces excluded.
...
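Note: if a namespace is listed under both include and exclude, exclude takes precedence over include.
- You can also configure a metrics refresh interval, after which the metrics registry clears all associated metrics. This cleanup resets the memory footprint of Kyverno's metrics exporter, which is particularly useful when the exporter's overall memory footprint is a concern.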
...
config:
  # rate at which metrics should reset so as to clean up the memory footprint of kyverno metrics, if you might be expecting high memory footprint of Kyverno's metrics.
  metricsRefreshInterval: 24h
  # Default: 0, no refresh of metrics
...
Note: you do not lose earlier metrics, because the samples that were already scraped are retained in the Prometheus backend.
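Putting the two settings together, a values.yaml fragment might look like the sketch below. The namespace names scratch-tests and ci-sandbox are hypothetical, and the key layout simply mirrors the fragments shown above.

```yaml
config:
  metricsConfig:
    # hypothetical namespaces; an empty include list keeps the default of "all namespaces"
    namespaces: {
      "include": [],
      "exclude": ["scratch-tests", "ci-sandbox"]
    }
  # clear the metrics registry once a day to bound the exporter's memory use
  metricsRefreshInterval: 24h
```

Excluding a few noisy namespaces is usually the smaller change; an include list is the better fit when only a handful of critical namespaces matter.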
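Metrics and Dashboards
Policy and Rule Count
This metric tracks the number of policies and rules currently active in the cluster, as well as rules that were created in the past and are no longer active.
Metric Name
kyverno_policy_rule_info_total
Metric Value
- 0: the rule no longer exists in the cluster (although it was created in the past).
- 1: the rule currently exists and is active in the cluster.
Use Cases
- A cluster admin wants to know the average number of cluster policies in the cluster since last year.
- A cluster admin wants to track the trend in the number of policies applied in the default namespace.
- A cluster admin wants to track and see in which month the default namespace had the most policies.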
Filter Labels
Label | Allowed Values | Description |
---|---|---|
policy_validation_mode | "enforce", "audit" | Validation failure action (validationFailureAction) of the rule's parent policy |
policy_type | "cluster", "namespaced" | Kind of the rule's parent policy. Kind: ClusterPolicy or Kind: Policy |
policy_background_mode | "true", "false" | Policy's set background mode |
policy_name | | Name of the policy to which the rule belongs |
policy_namespace | | Namespace in which this Policy resides (only for policies with kind: Policy). For ClusterPolicies, this field will be "-" |
rule_name | | Name of the rule, in the above policy, which is evaluating in this situation |
rule_type | "validate", "mutate", "generate" | Rule's behavior type. For rule_execution_cause="background_scan", it will always be "validate" as background scans only run validate rules |
status_ready | "true", "false" | Readiness of the policy. When ready, the policy is able to serve admission requests |
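Useful Queries
- Track the number of currently active cluster policies: count(count(kyverno_policy_rule_info_total{policy_type="cluster"} == 1) by (policy_name))
- Track the number of "validate" rules (from both cluster and namespaced policies) added to the cluster during the last minute: count((kyverno_policy_rule_info_total{rule_type="validate"} == 1) unless (kyverno_policy_rule_info_total{rule_type="validate"} offset 1m == 1))
- Track the total number of mutate rules added in the past 24 hours: count((kyverno_policy_rule_info_total{rule_type="mutate"} == 1) unless (kyverno_policy_rule_info_total{rule_type="mutate"} offset 24h == 1))
- Track the total number of active policies that use enforce mode and background mode: count(count(kyverno_policy_rule_info_total{policy_validation_mode="enforce", policy_background_mode="true"}==1) by (policy_name))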
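If counts like these are charted often, they could be precomputed with Prometheus recording rules. The following is a minimal sketch; the group name and the recorded series names are hypothetical and follow the common level:metric:operation naming convention.

```yaml
groups:
  - name: kyverno-policy-counts          # hypothetical group name
    rules:
      # number of distinct cluster policies with at least one active rule
      - record: cluster:kyverno_active_cluster_policies:count
        expr: count(count(kyverno_policy_rule_info_total{policy_type="cluster"} == 1) by (policy_name))
      # number of active rules per rule type
      - record: rule_type:kyverno_active_rules:count
        expr: count(kyverno_policy_rule_info_total == 1) by (rule_type)
```

Recorded series are cheap to query over long ranges, which helps with use cases such as averaging the policy count over a year.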
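Policy and Rule Execution
This metric tracks the results of rules executed as part of incoming resource requests and background scans. It can be further aggregated to track results at the policy level.
Metric Name
kyverno_policy_results_total
Metric Value
Counter - a monotonically increasing integer representing the number of results/executions associated with the rule that the metric sample corresponds to.
Use Cases
- A cluster admin wants to track the number of incoming resource requests over the past 24 hours that resulted in a PASS status for any cluster policy.
- A cluster admin wants to track the number of Deployment objects that violated a specific cluster policy named sample-cluster-policy when they were created.
- A cluster admin wants to track the number of incoming resource requests in the default namespace that were blocked from creation in the past hour because they violated some Kyverno policy.
- A user has a dedicated namespace in which they create a large number of Kubernetes resources at once, and wants to track how many of them violate existing cluster policies.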
Filter Labels
Label | Allowed Values | Description |
---|---|---|
policy_validation_mode | "enforce", "audit" | Validation failure action (validationFailureAction) of the rule's parent policy |
policy_type | "cluster", "namespaced" | Kind of the rule's parent policy. Kind: ClusterPolicy or Kind: Policy |
policy_background_mode | "true", "false" | Policy's set background mode |
policy_name | | Name of the policy to which the rule belongs |
policy_namespace | | Namespace in which this Policy resides (only for policies with kind: Policy). For ClusterPolicies, this field will be "-" |
resource_kind | "Pod", "Deployment", "StatefulSet", "ReplicaSet", etc. | Kind of this resource |
resource_namespace | | Namespace in which this resource lies |
resource_request_operation | "create", "update", "delete" | Whether the requested resource is being created, updated, or deleted |
rule_name | | Name of the rule, in the above policy, which is evaluating in this situation |
rule_result | "PASS", "FAIL" | Result of the rule's execution |
rule_type | "validate", "mutate", "generate" | Rule's behavior type. For rule_execution_cause="background_scan", it will always be "validate" as background scans only run validate rules |
rule_execution_cause | "admission_request", "background_scan" | Identifies whether the rule is executing in response to an admission request or a periodic background scan. Background scans run only validate rules, whereas admission requests run all validate, mutate, and generate rules |
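Useful Queries
- Track the number of rule executions that failed over the past 24 hours for policies in the default namespace, grouped by rule type (validate, mutate, generate); the matcher accepts the result value in either letter case: sum(increase(kyverno_policy_results_total{policy_namespace="default", rule_result=~"fail|FAIL"}[24h])) by (rule_type)
- Track the per-minute rate of rule executions triggered by incoming Pod requests on the cluster: rate(kyverno_policy_results_total{resource_kind="Pod", rule_execution_cause="admission_request"}[1m])*60
- Track the number of policies that ran as part of background scans over the past 2 hours: count(count(increase(kyverno_policy_results_total{rule_execution_cause="background_scan"}[2h]) > 0) by (policy_name))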
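The same counter can drive alerting. The sketch below is a hypothetical Prometheus alerting rule that fires when rules of enforce-mode policies fail during admission; the alert name, the 15m window, the for duration, and the severity label are illustrative assumptions, and the rule_result matcher accepts either letter case.

```yaml
groups:
  - name: kyverno-policy-results         # hypothetical group name
    rules:
      - alert: KyvernoEnforcePolicyFailures
        # failed rule executions of enforce-mode policies over the last 15 minutes
        expr: |
          sum by (policy_name, resource_namespace) (
            increase(kyverno_policy_results_total{
              policy_validation_mode="enforce",
              rule_execution_cause="admission_request",
              rule_result=~"fail|FAIL"
            }[15m])
          ) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Policy {{ $labels.policy_name }} has failing rules in {{ $labels.resource_namespace }}"
```

Grouping by policy and resource namespace keeps each alert actionable: it names the policy involved and where the offending requests came from.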
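Policy Rule Execution Latency
This metric tracks how long an individual rule takes to execute, whether for an incoming resource request or a background scan. It can be further aggregated to show latency at the policy level.
Metric Name
kyverno_policy_execution_duration_seconds
Metric Value
Histogram - a float value representing the latency of the rule's execution in seconds.
Use Cases
- A cluster admin wants to gauge how efficiently policies execute by tracking the average latency of Kyverno policy execution over the past 24 hours.
- A cluster admin wants to find the rules that cause the highest latency within a given cluster policy.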
Filter Labels
Label | Allowed Values | Description |
---|---|---|
policy_validation_mode | "enforce", "audit" | Validation failure action (validationFailureAction) of the rule's parent policy |
policy_type | "cluster", "namespaced" | Kind of the rule's parent policy. Kind: ClusterPolicy or Kind: Policy |
policy_background_mode | "true", "false" | Policy's set background mode |
policy_name | | Name of the policy to which the rule belongs |
policy_namespace | | Namespace in which this Policy resides (only for policies with kind: Policy). For ClusterPolicies, this field will be "-" |
resource_kind | "Pod", "Deployment", "StatefulSet", "ReplicaSet", etc. | Kind of this resource |
resource_namespace | | Namespace in which this resource lies |
resource_request_operation | "create", "update", "delete" | Whether the requested resource is being created, updated, or deleted |
rule_name | | Name of the rule, in the above policy, which is evaluating in this situation |
rule_result | "PASS", "FAIL" | Result of the rule's execution |
rule_type | "validate", "mutate", "generate" | Rule's behavior type. For rule_execution_cause="background_scan", it will always be "validate" as background scans only run validate rules |
rule_execution_cause | "admission_request", "background_scan" | Identifies whether the rule is executing in response to an admission request or a periodic background scan. Background scans run only validate rules, whereas admission requests run all validate, mutate, and generate rules |
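Useful Queries
Because this metric is a histogram, Prometheus exposes it as _bucket, _sum, and _count series; the averages below divide _sum by _count.
- Track the average latency of rule executions, grouped by rule type (validate, mutate, generate): sum(kyverno_policy_execution_duration_seconds_sum) by (rule_type) / sum(kyverno_policy_execution_duration_seconds_count) by (rule_type)
- List validate rules ranked by their average latency over the past 24 hours, highest first: sort_desc(sum(increase(kyverno_policy_execution_duration_seconds_sum{rule_type="validate"}[24h])) by (policy_name, rule_name) / sum(increase(kyverno_policy_execution_duration_seconds_count{rule_type="validate"}[24h])) by (policy_name, rule_name))
- Track the average policy-level execution latency of enforce policies in the "default" namespace: sum(kyverno_policy_execution_duration_seconds_sum{policy_validation_mode="enforce", policy_namespace="default", policy_type="namespaced"}) by (policy_name) / sum(kyverno_policy_execution_duration_seconds_count{policy_validation_mode="enforce", policy_namespace="default", policy_type="namespaced"}) by (policy_name)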
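Percentiles can also be derived from the histogram's buckets. A minimal sketch of a recording rule for the 95th-percentile rule latency per rule type follows; the group name, the recorded series name, and the 5m rate window are assumptions chosen for illustration.

```yaml
groups:
  - name: kyverno-rule-latency           # hypothetical group name
    rules:
      # p95 rule execution latency per rule type, estimated from histogram buckets
      - record: rule_type:kyverno_policy_execution_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, rule_type) (
              rate(kyverno_policy_execution_duration_seconds_bucket[5m])
            )
          )
```

The le label has to survive the aggregation, since histogram_quantile needs the bucket boundaries to interpolate the percentile.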
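Admission Review Latency
This metric tracks the end-to-end latency of an individual admission review, that is, of an incoming resource request that triggers a set of policies and rules.
Metric Name
kyverno_admission_review_duration_seconds
Metric Value
Histogram - a float value representing the latency of the admission review in seconds.
Use Cases
- A cluster admin wants to know how fast or slow admission reviews are for incoming Deployment creation requests in the default namespace.
- A cluster admin wants to be alerted as soon as the p95 latency of admission reviews for incoming Pod creation requests exceeds a certain threshold.
Filter Labels
Label | Allowed Values | Description |
---|---|---|
resource_kind | "Pod", "Deployment", "StatefulSet", "ReplicaSet", etc. | Kind of this resource |
resource_namespace | | Namespace in which this resource lies |
resource_request_operation | "create", "update", "delete" | Whether the requested resource is being created, updated, or deleted |
Useful Queries
- Average latency of admission reviews triggered by incoming resource requests, grouped by resource kind: sum(kyverno_admission_review_duration_seconds_sum) by (resource_kind) / sum(kyverno_admission_review_duration_seconds_count) by (resource_kind)
- Approximate maximum latency of admission reviews triggered by incoming Pod requests over the past 24 hours (limited by the histogram's bucket boundaries): histogram_quantile(1, sum(increase(kyverno_admission_review_duration_seconds_bucket{resource_kind="Pod"}[24h])) by (le))
- List the kinds of admission requests that consumed the most total review time over the past 60 minutes, highest first: sort_desc(sum(increase(kyverno_admission_review_duration_seconds_sum[60m])) by (resource_kind, resource_namespace, resource_request_operation))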
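The p95 alerting use case mentioned for this metric could be sketched as a Prometheus alerting rule like the one below. It assumes the metric is exposed as a histogram with _bucket series; the alert name, the 1-second threshold, the 5m rate window, and the severity label are illustrative assumptions rather than recommended values.

```yaml
groups:
  - name: kyverno-admission-latency      # hypothetical group name
    rules:
      - alert: KyvernoPodAdmissionReviewSlow
        # p95 latency of admission reviews for Pod creation, over a 5m window
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (
              rate(kyverno_admission_review_duration_seconds_bucket{
                resource_kind="Pod",
                resource_request_operation="create"
              }[5m])
            )
          ) > 1
        labels:
          severity: warning
        annotations:
          summary: "p95 admission review latency for Pod creation is above 1s"
```

No for clause is set, so the alert fires on the first evaluation that crosses the threshold; adding one trades immediacy for fewer false positives.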
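Admission Requests Count
This metric tracks the number of admission requests received by Kyverno.
Metric Name
kyverno_admission_requests_total
Metric Value
Counter - a monotonically increasing integer representing the number of admission requests associated with the sample.
Use Cases
- A cluster admin wants to know how many admission requests were triggered in the past 24 hours, as an indication of how active Kyverno is.
- A cluster admin wants to know what percentage of all incoming admission requests to Kyverno correspond to resource creation.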
Filter Labels
Label | Allowed Values | Description |
---|---|---|
resource_kind | "Pod", "Deployment", "StatefulSet", "ReplicaSet", etc. | Kind of this resource |
resource_namespace | | Namespace in which this resource lies |
resource_request_operation | "create", "update", "delete" | Whether the requested resource is being created, updated, or deleted |
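Useful Queries
- Total number of admission requests triggered in the past 24 hours: sum(increase(kyverno_admission_requests_total{}[24h]))
- Percentage of all incoming admission requests that are resource creation requests: 100 * sum(kyverno_admission_requests_total{resource_request_operation="create"}) / sum(kyverno_admission_requests_total{})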
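Policy Changes Count
This metric tracks the history of all Kyverno policy-related changes, such as policy creations, updates, and deletions.
Metric Name
kyverno_policy_changes_total
Metric Value
Counter - a monotonically increasing integer representing the total number of policy-level changes associated with the metric sample.
Use Cases
- A cluster admin wants to track how many cluster policies were created in the past year.
- An end user wants to track how many policies (kind: Policy) were created in their own namespace.
- A cluster admin wants to see how many policies with validationFailureAction: enforce and background mode enabled were created since last week.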
Filter Labels
Label | Allowed Values | Description |
---|---|---|
policy_validation_mode | "enforce", "audit" | Validation failure action (validationFailureAction) of the rule's parent policy |
policy_type | "cluster", "namespaced" | Kind of the rule's parent policy. Kind: ClusterPolicy or Kind: Policy |
policy_background_mode | "true", "false" | Policy's set background mode |
policy_name | | Name of the policy to which the rule belongs |
policy_namespace | | Namespace in which this Policy resides (only for policies with kind: Policy). For ClusterPolicies, this field will be "-" |
policy_change_type | "create", "update", "delete" | Action that was performed on the policy in this policy change |
Useful Queries
- Track the number of audit-mode cluster policies created in the past 60 minutes: sum(increase(kyverno_policy_changes_total{policy_type="cluster", policy_change_type="create", policy_validation_mode="audit"}[60m]))
- List the namespaced policies deleted from the "default" namespace in the past 5 minutes: kyverno_policy_changes_total{policy_type="namespaced", policy_namespace="default", policy_change_type="delete"}[5m]
- Track the number of changes made to the cluster policy named "sample-policy": sum(kyverno_policy_changes_total{policy_type="cluster", policy_name="sample-policy"})
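Policy deletions are often worth a notification. The following hypothetical alerting rule is a sketch under assumptions: the alert name, the 10m window, and the severity label are illustrative. Note that increase() only measures growth of a series that already exists, so the very first deletion recorded for a given label set may not trigger it.

```yaml
groups:
  - name: kyverno-policy-changes         # hypothetical group name
    rules:
      - alert: KyvernoPolicyDeleted
        # deletions recorded per policy over the last 10 minutes
        expr: |
          sum by (policy_type, policy_namespace, policy_name) (
            increase(kyverno_policy_changes_total{policy_change_type="delete"}[10m])
          ) > 0
        labels:
          severity: info
        annotations:
          summary: "Kyverno policy {{ $labels.policy_name }} was deleted"
```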
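Grafana Dashboard
A ready-to-use dashboard for Kyverno metrics.
Installation
- Download the dashboard's JSON and save it as kyverno-dashboard.json:
curl https://raw.githubusercontent.com/kyverno/grafana-dashboard/master/grafana/dashboard.json -o kyverno-dashboard.json
- Open your Grafana portal and go to the option for importing a dashboard.
- Click the "Upload JSON file" button, select the kyverno-dashboard.json file you obtained in the first step, and click Import.
- Configure the fields according to your preferences, then click Import.
- Your dashboard is now ready to use.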
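As an alternative to importing through the UI, Grafana can load dashboards from disk through a provisioning file. The following is a sketch under assumptions: the provider name, the folder, and both file paths are illustrative, and kyverno-dashboard.json would need to be placed in the configured directory. Dashboards exported for UI import may reference a data source input that has to be resolved by hand when provisioned this way.

```yaml
# e.g. /etc/grafana/provisioning/dashboards/kyverno.yaml (path is an assumption)
apiVersion: 1
providers:
  - name: kyverno                  # hypothetical provider name
    folder: Kyverno                # Grafana folder that will hold the dashboard
    type: file
    options:
      # directory containing kyverno-dashboard.json
      path: /var/lib/grafana/dashboards
```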