K8s Cloud Native

Grafana + Prometheus (for Java Developers)

2025-09-06  _浅墨_

Prometheus handles scraping, time-series storage, and simple alerting-rule evaluation;
Grafana handles visualization, alert routing/notification display, and dashboards.

A Java application uses a library such as Micrometer to expose JVM and business metrics in a format Prometheus can scrape (/actuator/prometheus). Prometheus scrapes them and runs PromQL for alerting and analysis, while Grafana turns those queries into polished monitoring and alerting dashboards.

Core Component Responsibilities (in one sentence)

When to Choose Them (use cases)

Quick Hands-On

A. Exposing Metrics in Spring Boot (Micrometer + Prometheus)

Maven dependencies

<!-- Spring Boot Actuator + Micrometer Prometheus registry -->
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

application.properties / application.yml (expose the prometheus endpoint)

management.endpoints.web.exposure.include=health,info,prometheus
management.endpoint.prometheus.enabled=true
management.server.port=8080
# If Spring Security is in use, allow access to /actuator/prometheus or configure a bearer token

A simple business counter and latency example

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class DemoController {
    private final Counter requests;
    private final Timer latency;
    private final MeterRegistry registry;

    public DemoController(MeterRegistry registry) {
        this.registry = registry;
        // Micrometer convention: dotted names plus key/value tag pairs.
        // The Prometheus registry renders these as app_requests_total and
        // app_request_latency_seconds_* — don't bake _total/_seconds into
        // the name yourself; the registry appends the suffixes.
        this.requests = registry.counter("app.requests", "app", "demo");
        this.latency = registry.timer("app.request.latency", "app", "demo");
    }

    @GetMapping("/hello")
    public String hello() {
        requests.increment();
        Timer.Sample sample = Timer.start(registry);
        try {
            // business logic
            return "ok";
        } finally {
            // record the elapsed time into the latency timer
            sample.stop(latency);
        }
    }
}

Visiting http://your-app:8080/actuator/prometheus will show the metrics in the Prometheus text format.
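
For readers who haven't seen the raw scrape output, here is a dependency-free sketch of the text exposition format that /actuator/prometheus emits; the metric name, label, and value are illustrative:

```java
import java.util.Map;

// Minimal sketch of one counter sample in the Prometheus text
// exposition format (the format the scrape endpoint serves).
public class ExpositionDemo {
    // Render, e.g.:
    //   # TYPE app_requests_total counter
    //   app_requests_total{app="demo"} 42.0
    static String renderCounter(String name, Map<String, String> labels, double value) {
        StringBuilder sb = new StringBuilder();
        sb.append("# TYPE ").append(name).append(" counter\n");
        sb.append(name).append('{');
        boolean first = true;
        for (var e : labels.entrySet()) {
            if (!first) sb.append(',');
            sb.append(e.getKey()).append("=\"").append(e.getValue()).append('"');
            first = false;
        }
        sb.append("} ").append(value).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(renderCounter("app_requests_total", Map.of("app", "demo"), 42.0));
    }
}
```

Real output also carries # HELP lines and, for timers, _count/_sum/_bucket series; this sketch only shows the basic shape.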

B. Prometheus Scrape Config (minimal prometheus.yml example)

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'spring-apps'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['app1:8080', 'app2:8080']

In Kubernetes you typically use kubernetes_sd_configs + relabel_configs, or the Prometheus Operator's ServiceMonitor, for automatic target discovery.

Common (simplified) ServiceMonitor example on K8s (Prometheus Operator)

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-servicemonitor
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: myapp
  namespaceSelector:
    any: true
  endpoints:
  - port: http
    path: /actuator/prometheus
    interval: 15s

Common PromQL (examples + notes)

The examples below assume Micrometer has already produced its default metrics such as http_server_requests_seconds_bucket, http_server_requests_seconds_count, and jvm_memory_used_bytes.

  1. 95th-percentile latency in seconds (histogram-based)
histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket{job="myapp"}[5m])) by (le))
  2. Error rate over 5 minutes (status label matching 5xx)
sum(rate(http_server_requests_seconds_count{job="myapp", status=~"5.."}[5m]))
/
sum(rate(http_server_requests_seconds_count{job="myapp"}[5m]))
  3. Heap usage (percent)
(jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}) * 100
  4. Process CPU usage (1-minute rate, combined with node exporter / cAdvisor)
100 * rate(process_cpu_seconds_total{job="myapp"}[1m]) / machine_cpu_cores
  5. Is an instance down
up{job="myapp"} == 0
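
These queries can also be evaluated programmatically through Prometheus's HTTP API (GET /api/v1/query). A minimal sketch using only the JDK's HttpClient; the base URL http://prometheus:9090 and the class name are assumptions for illustration:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Sketch: run a PromQL instant query against Prometheus's HTTP API.
public class PromQuery {
    // Build /api/v1/query?query=... — the PromQL must be URL-encoded.
    static URI buildQueryUri(String baseUrl, String promql) {
        return URI.create(baseUrl + "/api/v1/query?query="
                + URLEncoder.encode(promql, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        URI uri = buildQueryUri("http://prometheus:9090", "up{job=\"myapp\"} == 0");
        try {
            HttpResponse<String> resp = HttpClient.newHttpClient()
                    .send(HttpRequest.newBuilder(uri).GET().build(),
                          HttpResponse.BodyHandlers.ofString());
            // Body is JSON: {"status":"success","data":{"resultType":"vector",...}}
            System.out.println(resp.body());
        } catch (Exception e) {
            System.out.println("Prometheus not reachable: " + e.getMessage());
        }
    }
}
```

The same endpoint backs Grafana's panels; querying it directly is handy in scripts or smoke tests.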

Grafana Dashboard Essentials (practice)

Alert Rule Examples (Prometheus rule files)

groups:
- name: java.rules
  rules:
  - alert: HighHeapUsage
    expr: (jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Heap usage >80% on {{ $labels.instance }}"
      description: "Heap usage has been >80% for 5 minutes. value={{ $value }}"

  - alert: HighErrorRate
    expr: (sum(rate(http_server_requests_seconds_count{job="myapp",status=~"5.."}[5m])) by (instance)
           / sum(rate(http_server_requests_seconds_count{job="myapp"}[5m])) by (instance)) > 0.05
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High error rate (>5%) on {{ $labels.instance }}"
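
As a sanity check against the HighHeapUsage expression, the same heap ratio can be read in-process from the JDK's MemoryMXBean, which is the source Micrometer's jvm_memory_used_bytes / jvm_memory_max_bytes heap gauges draw from; the class name is illustrative:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Sketch: compute used/max heap locally, mirroring the
// jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}
// ratio that the HighHeapUsage rule evaluates.
public class HeapCheck {
    static double heapRatio() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();                 // -1 if the max is undefined
        return max > 0 ? (double) heap.getUsed() / max : Double.NaN;
    }

    public static void main(String[] args) {
        System.out.printf("heap used ratio: %.2f%n", heapRatio());
    }
}
```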

Alertmanager (routing example)

route:
  receiver: 'team-email'
receivers:
- name: 'team-email'
  email_configs:
  - to: 'oncall@example.com'
    from: 'prometheus@your.org'
    smarthost: 'smtp.your.org:587'
    auth_username: 'prometheus'
    auth_identity: 'prometheus'

Kubernetes + Production Notes (quick checklist)

Typical Troubleshooting Flow (case study: a latency spike in production)

  1. In Grafana, check the 95th-percentile latency panel and confirm the time window of the incident.
  2. In the same window, check error rate, CPU usage, GC pause, and thread count.
  3. If GC pauses grew: check jvm_gc_pause_seconds and heap usage; consider rolling back the deployment or scaling up pod replicas.
  4. If CPU spiked but request volume did not: check external dependencies (DB, Redis) for latency and check for connection-pool exhaustion (hikaricp_connections_active).
  5. Use topk(10, increase(http_server_requests_seconds_count[5m])) to find the hottest endpoints and locate the code.
  6. Finally, annotate the incident in Grafana with the root cause and remediation steps for the post-mortem.
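
The GC pauses consulted in step 3 have a local counterpart in the JDK's GC MXBeans, which is where jvm_gc_pause_seconds-style data ultimately originates; a small sketch (class name illustrative):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: print cumulative GC counts and times per collector.
// A rising slope of getCollectionTime() between two samples is
// what a rate() over GC-pause metrics shows on a dashboard.
public class GcStats {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```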

Best Practices (lessons learned)
