Kubernetes Nvidia GPU Monitor

2021-03-05  Anoyi

▶ Export Metrics

1. Prerequisites

The GPU nodes are assumed to already have the NVIDIA driver and a GPU-capable container runtime installed, so that containers scheduled on them can access the GPUs.

2. Label the GPU nodes

kubectl label nodes <node-name> device_type=gpu
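
An optional check that the label took effect:

kubectl get nodes -l device_type=gpu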

3. Run DCGM Exporter on the GPU nodes

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: dcgm-exporter
  template:
    metadata:
      labels:
        k8s-app: dcgm-exporter
    spec:
      nodeSelector:
        device_type: gpu
      hostNetwork: true
      hostPID: true
      containers:
        - name: dcgm-exporter
          image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04"
          imagePullPolicy: Always
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
          ports:
            - name: metrics
              containerPort: 9400
              hostPort: 9400
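
A minimal way to roll this out, assuming the manifest above is saved as dcgm-exporter.yaml (an arbitrary file name):

kubectl apply -f dcgm-exporter.yaml
kubectl -n kube-system get pods -l k8s-app=dcgm-exporter -o wide

One dcgm-exporter pod should be Running on every node labeled device_type=gpu.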

For more details, see https://github.com/NVIDIA/gpu-monitoring-tools

4. Test the metrics endpoint

The previous step exposes port 9400 on each GPU host:

curl <host-ip>:9400/metrics

The metrics look like the following, showing both GPUs of a single server:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
......

DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 1290
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 877
DCGM_FI_DEV_MEMORY_TEMP{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 39
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 42
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 57.555000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 154680858400
......

DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 1290
DCGM_FI_DEV_MEM_CLOCK{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 877
DCGM_FI_DEV_MEMORY_TEMP{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 40
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 43
DCGM_FI_DEV_POWER_USAGE{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 55.157000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="1",UUID="GPU-a381d221-0718-a65d-a9bc-512d4e0fb9e2",device="nvidia1"} 148793918798
.....

▶ Collect Metrics with Prometheus

1. Create a ConfigMap

Each scrape job corresponds to one GPU server:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
data:
  prometheus.yml: |
    scrape_configs:
    - job_name: 'metrics-gpu-1'
      honor_labels: true
      static_configs:
        - targets: ['<host01-ip>:9400']
          labels:
            instance: GN1
    - job_name: 'metrics-gpu-2'
      honor_labels: true
      static_configs:
        - targets: ['<host02-ip>:9400']
          labels:
            instance: GN2
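
Here <host01-ip> and <host02-ip> are the addresses of the labeled GPU nodes, where DCGM Exporter listens on host port 9400. Assuming the ConfigMap is saved as prometheus-config.yaml (an arbitrary file name), it can be applied with:

kubectl apply -f prometheus-config.yaml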

2. Deploy Prometheus

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
spec:
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      volumes:
        - name: prometheus
          configMap:
            name: prometheus-config
      serviceAccountName: admin-user
      containers:
        - name: prometheus
          image: "prom/prometheus:latest"
          volumeMounts:
            - name: prometheus
              mountPath: /etc/prometheus/
          imagePullPolicy: Always
          ports:
            - containerPort: 9090
              protocol: TCP

3. Create the Prometheus Service

kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: prometheus
  name: prometheus-service
  namespace: kube-system
spec:
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    k8s-app: prometheus
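
After applying the Deployment and Service, a quick way to check that Prometheus is scraping both exporters is to port-forward the Service and open the targets page:

kubectl -n kube-system port-forward svc/prometheus-service 9090:9090

With the port-forward running, http://localhost:9090/targets should list metrics-gpu-1 and metrics-gpu-2 as UP.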

▶ Visualize Metrics with Grafana

1. Deploy Grafana

kind: Deployment
apiVersion: apps/v1
metadata:
  name: grafana
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: grafana
  template:
    metadata:
      labels:
        k8s-app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:latest
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: <your-password>
            - name: GF_SECURITY_ADMIN_USER
              value: <your-username>
          ports:
            - containerPort: 3000
              protocol: TCP

2. Create the Grafana Service

kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: grafana
  name: grafana-service
  namespace: kube-system
spec:
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 31111
  selector:
    k8s-app: grafana
  type: NodePort
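
Once the Grafana Deployment and Service are applied, the assigned NodePort can be confirmed with:

kubectl -n kube-system get svc grafana-service

The PORT(S) column should show 3000:31111/TCP.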

3. Access Grafana

Web address: http://<kubernetes-node-ip>:31111/ . The username and password are the ones configured in step 1.

4. Add a Data Source

Go to Settings -> Data Sources -> Add data source -> Prometheus. Example configuration:
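
Since Grafana and Prometheus run in the same cluster, the URL field can point at the Prometheus Service created earlier through its in-cluster DNS name, for example http://prometheus-service.kube-system:9090 (or the full form http://prometheus-service.kube-system.svc.cluster.local:9090); the remaining fields can usually keep their defaults.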

Click Save & Test to connect Grafana to the Prometheus data.

5. Build a custom GPU dashboard

For example, to display GPU temperature:

# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-679826e5-629f-6b0e-d235-8d47cc5ac02f",device="nvidia0"} 42

To chart the temperature of each GPU, the query is simply DCGM_FI_DEV_GPU_TEMP.

Other useful queries:
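
Any of the metrics from the sample output above can be charted the same way; the descriptions below come from their HELP lines:

DCGM_FI_DEV_SM_CLOCK: SM clock frequency (in MHz)
DCGM_FI_DEV_MEM_CLOCK: memory clock frequency (in MHz)
DCGM_FI_DEV_MEMORY_TEMP: memory temperature (in C)
DCGM_FI_DEV_POWER_USAGE: power draw (in W)
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION: total energy consumption

To limit a panel to one server, filter on the instance label set in the Prometheus config, for example DCGM_FI_DEV_GPU_TEMP{instance="GN1"}.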
