Monitoring GPU Servers with Prometheus

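1. Background

Our environment recently gained 30 new GPU servers, which need to be monitored through the existing Prometheus deployment. The exporters we already run do not expose GPU metrics, but some research turned up a third-party exporter, nvidia_gpu_exporter, that covers the requirement.

GitHub repository: https://github.com/utkuozdemir/nvidia_gpu_exporter

Official installation guide: https://github.com/utkuozdemir/nvidia_gpu_exporter/blob/master/INSTALL.md

2. Deploying the exporter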

Download URL: https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v1.1.0/nvidia_gpu_exporter_1.1.0_linux_x86_64.tar.gz

cd /opt
VERSION=1.1.0
wget https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v${VERSION}/nvidia_gpu_exporter_${VERSION}_linux_x86_64.tar.gz
tar -xvzf nvidia_gpu_exporter_${VERSION}_linux_x86_64.tar.gz
./nvidia_gpu_exporter --help

3. Running the exporter

nohup ./nvidia_gpu_exporter --web.listen-address=":9835" --nvidia-smi-command="nvidia-smi" --web.read-timeout=40s --web.read-header-timeout=40s --web.write-timeout=40s --web.idle-timeout=60s --log.level=info > nvidia_gpu_exporter.log 2>&1 &
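
A process started with nohup is not restarted after a reboot or a crash. As a rough sketch of one alternative, the helper below writes a minimal systemd unit for the exporter. The function name write_exporter_unit and the unit contents are illustrative assumptions, and the binary path /opt/nvidia_gpu_exporter assumes the archive was unpacked in /opt as in the deployment step.

```shell
# Hypothetical helper: writes a minimal systemd unit for the exporter to the
# path given as $1. Assumes the binary lives at /opt/nvidia_gpu_exporter.
write_exporter_unit() {
  cat > "$1" <<'EOF'
[Unit]
Description=Nvidia GPU Exporter
After=network-online.target

[Service]
Type=simple
ExecStart=/opt/nvidia_gpu_exporter --web.listen-address=:9835
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (as root):
#   write_exporter_unit /etc/systemd/system/nvidia_gpu_exporter.service
#   systemctl daemon-reload
#   systemctl enable --now nvidia_gpu_exporter
```

With a unit like this, the exporter's log output goes to the journal (journalctl -u nvidia_gpu_exporter) instead of nvidia_gpu_exporter.log.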

4. Testing access

curl http://127.0.0.1:9835/metrics
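
The raw /metrics output also contains generic process and Go runtime metrics, so it can help to count only the GPU samples. The helper below is a small sketch for that. The name count_samples is hypothetical, and the nvidia_smi_ prefix used in the usage line is an assumption about how this exporter names its GPU metrics, so check it against your own output.

```shell
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints how many lines start with the prefix given as $1.
count_samples() {
  # grep -c exits non-zero when nothing matches, so mask that with "|| true".
  grep -c "^$1" || true
}

# Usage:
#   curl -s http://127.0.0.1:9835/metrics | count_samples nvidia_smi_
```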

5. Prometheus configuration

  - job_name: "nvidia-gpu"
    scrape_interval: 15s
    metrics_path: /metrics
    consul_sd_configs:
      # consulIP is a placeholder for each Consul server address
      - server: "consulIP:8500"
        services: ["nvidia-gpu"]
      - server: "consulIP:8500"
        services: ["nvidia-gpu"]
      - server: "consulIP:8500"
        services: ["nvidia-gpu"]
    relabel_configs:
      # keep only targets whose Consul tags contain nvidia-gpu
      - source_labels: [__meta_consul_tags]
        regex: .*nvidia-gpu.*
        action: keep
      # copy Consul service metadata (type, team, ip) onto the target as labels
      - regex: __meta_consul_service_metadata_(.+)
        action: labelmap

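6. Registering with Consul

The service name and tag must match what the Prometheus job selects (services: ["nvidia-gpu"] and the tag regex .*nvidia-gpu.*), otherwise the target is never discovered.

cat GPUIP.json
{
  "ID": "ops-software-gpu-36-12",
  "Name": "nvidia-gpu",
  "Tags": [
    "nvidia-gpu"
  ],
  "Address": "GPUIP",
  "Port": 9835,
  "Meta": {
    "type": "host",
    "team": "ops",
    "ip": "GPUIP"
  },
  "EnableTagOverride": false,
  "Check": {
    "HTTP": "http://GPUIP:9835/metrics",
    "Interval": "10s"
  },
  "Weights": {
    "Passing": 10,
    "Warning": 1
  }
}

Register the service:

curl -X PUT http://consulIP:8500/v1/agent/service/register -d @GPUIP.json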
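
With 30 servers, writing one JSON file per host by hand gets tedious. The following is a minimal sketch of generating the definition from an IP and registering a list of hosts in a loop. The helper names make_service_json and register_hosts, the "gpu-<ip>" ID scheme, and the host-list file are all assumptions for illustration. The service name and tag are set to nvidia-gpu so that they match what the Prometheus job selects.

```shell
# Hypothetical helper: prints a Consul service definition for one GPU host.
# $1 is the host IP; the ID scheme "gpu-<ip>" is an illustrative assumption.
make_service_json() {
  cat <<EOF
{
  "ID": "gpu-$1",
  "Name": "nvidia-gpu",
  "Tags": ["nvidia-gpu"],
  "Address": "$1",
  "Port": 9835,
  "Meta": {"type": "host", "team": "ops", "ip": "$1"},
  "Check": {"HTTP": "http://$1:9835/metrics", "Interval": "10s"}
}
EOF
}

# Hypothetical helper: registers every IP listed in file $2 (one per line)
# with the Consul agent at $1, for example:
#   register_hosts consulIP:8500 gpu_hosts.txt
register_hosts() {
  while read -r ip; do
    [ -n "$ip" ] || continue   # skip blank lines
    make_service_json "$ip" |
      curl -fsS -X PUT "http://$1/v1/agent/service/register" -d @-
  done < "$2"
}
```

Afterwards, curl http://consulIP:8500/v1/agent/services lists what the agent has registered, which is a quick way to confirm that the loop did what was intended.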

7. Grafana dashboard

Search the official Grafana site for dashboard ID 14575 and import it.
