Setting up GPU monitoring on k8s

2020-04-24  流月汐志

Deployment architecture

Deployment method: Kubernetes
Node monitoring and GPU monitoring

GPU monitoring

Project used
pod-gpu-metrics-exporter

Required environment

Installing the environment

Install script (Ubuntu):
install-nvidia-docker.sh

#!/bin/bash

pwd=$1

if [[ -z "${pwd}" ]]
then
    echo "please run [bash $0 <pwd>]"
    exit 1
fi

# Install docker
# sudo only reads the password from stdin when given -S
echo "${pwd}" | sudo -S apt-get update

echo "${pwd}" | sudo -S apt-get install -y curl && \
curl -fsSL https://get.docker.com -o get-docker.sh && \
echo "${pwd}" | sudo -S sh get-docker.sh
# digisky is the login user in the original setup; substitute your own
echo "${pwd}" | sudo -S usermod -aG docker digisky
echo "${pwd}" | sudo -S systemctl enable docker
# nvidia-docker
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

echo "${pwd}" | sudo -S apt-get update && echo "${pwd}" | sudo -S apt-get install -y nvidia-container-toolkit nvidia-container-runtime
# nvidia-container-runtime: install daemon.json so nvidia becomes the default runtime
echo "${pwd}" | sudo -S cp -f daemon.json /etc/docker/daemon.json

echo "${pwd}" | sudo -S systemctl restart docker
# gpu-monitoring-tools-master

daemon.json

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "registry-mirrors": ["https://vs2fctcq.mirror.aliyuncs.com"]
}
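
A malformed or incomplete daemon.json can stop dockerd from starting, so it may be worth checking the file before the install script copies it and restarts Docker. The following is only a minimal sketch and not part of the original script: the function name check_daemon_json is hypothetical, and it greps for the default-runtime key instead of parsing the JSON.

```shell
#!/bin/bash
# Hypothetical helper (not from the original article): succeeds only if the
# given daemon.json declares "nvidia" as the default Docker runtime.
check_daemon_json() {
    local file=$1
    # The file must exist and be readable.
    [[ -r "${file}" ]] || return 1
    # Match "default-runtime": "nvidia", allowing optional whitespace.
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "${file}"
}
```

One way to use it is `check_daemon_json daemon.json && echo "daemon.json looks usable"` before running the install script. After Docker has restarted, `docker info` should list nvidia as the default runtime.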

pod-gpu-metrics-exporter.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/name: gpu-metrics-exporter
    app.kubernetes.io/version: latest
  name: gpu-metrics-exporter
  namespace: monitor
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: pod-gpu-metrics-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: pod-gpu-metrics-exporter
        app.kubernetes.io/part-of: gpu-metrics-exporter
        app.kubernetes.io/version: latest
      name: pod-gpu-metrics-exporter
    spec:
      containers:
      - image: xxx/pod-gpu-metrics-exporter:latest
        imagePullPolicy: Always
        name: pod-nvidia-gpu-metrics-exporter
        ports:
        - containerPort: 9400
          hostPort: 59101
          name: gpu-port
          protocol: TCP
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
        - mountPath: /run/prometheus
          name: device-metrics
          readOnly: true
      - image: xxx/dcgm-exporter:latest
        imagePullPolicy: Always
        name: nvidia-dcgm-exporter
        volumeMounts:
        - mountPath: /run/prometheus
          name: device-metrics
      dnsPolicy: ClusterFirst
#      imagePullSecrets:
#      - name: hub-out
      restartPolicy: Always
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
      - emptyDir:
          medium: Memory
        name: device-metrics

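Collected metrics explained

Metric                      Description
dcgm_fan_speed_percent      GPU fan speed (%)
dcgm_sm_clock               GPU SM clock (MHz)
dcgm_memory_clock           GPU memory clock (MHz)
dcgm_gpu_temp               GPU operating temperature (°C)
dcgm_power_usage            GPU power draw (W)
dcgm_pcie_tx_throughput     Total bytes transmitted over PCIe TX (KB)
dcgm_pcie_rx_throughput     Total bytes received over PCIe RX (KB)
dcgm_pcie_replay_counter    Total number of PCIe retries
dcgm_gpu_utilization        GPU utilization (%)
dcgm_mem_copy_utilization   GPU memory utilization (%)
dcgm_enc_utilization        GPU encoder utilization (%)
dcgm_dec_utilization        GPU decoder utilization (%)
dcgm_xid_errors             Value of the last XID error on the GPU
dcgm_power_violation        Duration of throttling caused by the power limit (us)
dcgm_thermal_violation      Duration of throttling caused by thermal constraints (us)
dcgm_sync_boost_violation   Duration of throttling caused by sync-boost constraints (us)
dcgm_fb_free                Free GPU framebuffer memory (MiB)
dcgm_fb_used                Used GPU framebuffer memory (MiB)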
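
To spot-check that an exporter is actually serving one of these metrics, its Prometheus text output can be filtered by metric name. This is a rough sketch under stated assumptions: the function name filter_metric is hypothetical, and the sample input below is made up for illustration (real output carries more labels and different values).

```shell
#!/bin/bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# only the sample lines of one metric (HELP/TYPE comment lines are skipped).
filter_metric() {
    local name=$1
    # A sample line starts with the metric name followed by '{' or a space.
    grep -E "^${name}(\{| )"
}

# Made-up sample input for illustration only.
sample='# HELP dcgm_gpu_temp GPU temperature (in C).
# TYPE dcgm_gpu_temp gauge
dcgm_gpu_temp{gpu="0"} 41
dcgm_gpu_utilization{gpu="0"} 7'

printf '%s\n' "${sample}" | filter_metric dcgm_gpu_temp
```

Against a live node, the same filter could be fed from curl, for example `curl -s http://<node-ip>:59101/gpu/metrics | filter_metric dcgm_gpu_temp`. Port 59101 is the hostPort from the DaemonSet above. The /gpu/metrics path is an assumption taken from the upstream project and should be checked against the image in use.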

Node monitoring

Reference YAML

YAML after modification, tested successfully

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/name: node-exporter
  name: node-exporter
  namespace: monitor
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
    spec:
      containers:
      - args:
        - --web.listen-address=0.0.0.0:59100
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.wifi
        - --no-collector.hwmon
        - --collector.filesystem.ignored-mount-points=^/(var.*|run.*|boot.*|snap.*|dev|proc|sys|var/lib/docker/.+)($|/)
        - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
        image: xxx/node-exporter:latest
        imagePullPolicy: IfNotPresent
        name: node-exporter
        ports:
        # When hostNetwork is true, containerPort and hostPort must be set to the same value
        - containerPort: 59100
          hostPort: 59100
          name: node-port
          protocol: TCP
        resources:
          limits:
            cpu: 250m
            memory: 180Mi
          requests:
            cpu: 102m
            memory: 180Mi
        securityContext:
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/proc
          name: proc
        - mountPath: /host/sys
          name: sys
        - mountPath: /host/root
          mountPropagation: HostToContainer
          name: root
          readOnly: true
      # The following settings let the exporter collect the node's real data
      hostIPC: true
      hostNetwork: true
      hostPID: true
      # Secret for pulling from the image registry
#      imagePullSecrets:
#      - name: hub-out
      nodeSelector:
        beta.kubernetes.io/os: linux
      restartPolicy: Always
      volumes:
      - hostPath:
          path: /proc
          type: ""
        name: proc
      - hostPath:
          path: /sys
          type: ""
        name: sys
      - hostPath:
          path: /
          type: ""
        name: root

prometheus

Scrape targets are discovered through file_sd_configs.
prometheus.yaml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus-dev'
    file_sd_configs:
    - files:
      - prometheus-etc.json
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.20.75:9093']
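
The article does not show prometheus-etc.json, the file that file_sd_configs reads. A file_sd file is a JSON list of groups, each with a "targets" array and optional "labels". The sketch below prints such a list for the two host ports used above: 59100 for node-exporter and 59101 for the GPU exporter. The function names, the "exporter" label and the example IPs are all assumptions for illustration, not taken from the original setup.

```shell
#!/bin/bash
# Hypothetical helper: print "ip:port" entries as a comma-separated list of
# JSON strings for one port.
join_targets() {
    local port=$1; shift
    local ip out=""
    for ip in "$@"; do
        out+="${out:+, }\"${ip}:${port}\""
    done
    printf '%s' "${out}"
}

# Hypothetical helper: print a file_sd target list for the given node IPs,
# one group per exporter. The "exporter" label name is an assumption.
make_file_sd() {
    printf '[\n'
    printf '  {"targets": [%s], "labels": {"exporter": "node"}},\n' "$(join_targets 59100 "$@")"
    printf '  {"targets": [%s], "labels": {"exporter": "gpu"}}\n' "$(join_targets 59101 "$@")"
    printf ']\n'
}

# Placeholder node IPs, not from the original article.
make_file_sd 192.168.20.10 192.168.20.11
```

Redirecting the output to the file that `files:` points at (for example `make_file_sd <node-ips> > prometheus-etc.json`) lets Prometheus pick up the targets without a restart. If the GPU exporter serves its metrics on a path other than /metrics, a `__metrics_path__` label in that group can override the scrape path.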

grafana

Reference sites:
