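1. Background
Our company recently added 30 GPU servers that need to be monitored through the existing Prometheus setup. The exporters we currently run with Prometheus do not cover GPU metrics.
After some research, a third-party exporter, nvidia_gpu_exporter, turned out to meet the requirement.
GitHub: https://github.com/utkuozdemir/nvidia_gpu_exporter
Official installation guide: https://github.com/utkuozdemir/nvidia_gpu_exporter/blob/master/INSTALL.md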
2. Deploy the exporter
Download URL: https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v1.1.0/nvidia_gpu_exporter_1.1.0_linux_x86_64.tar.gz
cd /opt
VERSION=1.1.0
wget https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v${VERSION}/nvidia_gpu_exporter_${VERSION}_linux_x86_64.tar.gz
tar -xvzf nvidia_gpu_exporter_${VERSION}_linux_x86_64.tar.gz
./nvidia_gpu_exporter --help
3. Run the exporter
nohup ./nvidia_gpu_exporter --web.listen-address=":9835" --nvidia-smi-command="nvidia-smi" --web.read-timeout=40s --web.read-header-timeout=40s --web.write-timeout=40s --web.idle-timeout=60s --log.level=info > nvidia_gpu_exporter.log 2>&1 &
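A process started with nohup does not come back after a reboot, so on hosts that use systemd a unit file is a common alternative. The following is a minimal sketch, not the project's official unit: the function name gen_exporter_unit, the unit description and the restart policy are illustrative assumptions, while the binary path and listen address follow the commands above.

```shell
# Print a systemd unit for the exporter to stdout.
# $1 = path to the binary (default: /opt/nvidia_gpu_exporter, where the tarball was unpacked)
# $2 = listen address (default: :9835, as in the nohup command)
# Description and restart policy are assumptions for illustration.
gen_exporter_unit() {
    bin="${1:-/opt/nvidia_gpu_exporter}"
    addr="${2:-:9835}"
    cat <<EOF
[Unit]
Description=Nvidia GPU Exporter
After=network.target

[Service]
Type=simple
ExecStart=${bin} --web.listen-address=${addr}
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}
```

A possible way to use it (as root): gen_exporter_unit > /etc/systemd/system/nvidia_gpu_exporter.service, then systemctl daemon-reload and systemctl enable --now nvidia_gpu_exporter.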
4. Test access
curl http://127.0.0.1:9835/metrics
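The raw /metrics output also contains Go runtime and process metrics, so it can be hard to see at a glance whether GPU data is present. The sketch below counts GPU samples in whatever is piped into it. It assumes the exporter's metric names start with nvidia_smi_ (check this against your own output); the function name count_gpu_samples is hypothetical.

```shell
# Count metric samples whose name starts with "nvidia_smi_" (assumed prefix).
# Reads Prometheus text-format output on stdin and prints the count.
# HELP/TYPE comment lines start with "#" and are therefore not counted.
count_gpu_samples() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the
    # function usable in scripts that run under "set -e".
    grep -c '^nvidia_smi_' || true
}
```

Usage: curl -s http://127.0.0.1:9835/metrics | count_gpu_samples. A result of 0 suggests the exporter is running but could not read data from nvidia-smi.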
5. Prometheus scrape configuration
  - job_name: "nvidia-gpu"
    scrape_interval: 15s
    metrics_path: /metrics
    consul_sd_configs:
      - server: "consulIP:8500"
        services: ["nvidia-gpu"]
      - server: "consulIP:8500"
        services: ["nvidia-gpu"]
      - server: "consulIP:8500"
        services: ["nvidia-gpu"]
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*nvidia-gpu.*
        action: keep
      - regex: __meta_consul_service_metadata_(.+)
        action: labelmap
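6. Register with Consul
The service name and tag in the definition must match the services list and the tag filter of the Prometheus job in section 5, otherwise the target will not be discovered.
cat GPUIP.json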
{
  "ID": "ops-software-gpu-36-12",
  "Name": "nvidia-gpu",
  "Tags": [
    "nvidia-gpu"
  ],
  "Address": "GPUIP",
  "Port": 9835,
  "Meta": {
    "type": "host",
    "team": "ops",
    "ip": "GPUIP"
  },
  "EnableTagOverride": false,
  "Check": {
    "HTTP": "http://GPUIP:9835/metrics",
    "Interval": "10s"
  },
  "Weights": {
    "Passing": 10,
    "Warning": 1
  }
}
Register the service:
curl -X PUT http://consulIP:8500/v1/agent/service/register -d @GPUIP.json
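With 30 servers, writing one JSON file per host by hand gets tedious. The sketch below prints a service definition for a given IP so it can be piped straight to the registration endpoint. The function name gen_service_json, the ID scheme gpu-<ip> and the host list file gpu_hosts.txt are hypothetical; the service name and tag are set to nvidia-gpu so that they match the services list and tag filter of the Prometheus job.

```shell
# Print a Consul service definition for one GPU host.
# $1 = host IP. The "gpu-<ip>" ID scheme is an illustrative assumption.
gen_service_json() {
    ip="$1"
    cat <<EOF
{
  "ID": "gpu-${ip}",
  "Name": "nvidia-gpu",
  "Tags": ["nvidia-gpu"],
  "Address": "${ip}",
  "Port": 9835,
  "Meta": {"type": "host", "team": "ops", "ip": "${ip}"},
  "Check": {"HTTP": "http://${ip}:9835/metrics", "Interval": "10s"}
}
EOF
}

# Example loop (not executed here); gpu_hosts.txt holds one IP per line:
# while read -r ip; do
#     gen_service_json "$ip" |
#         curl -s -X PUT http://consulIP:8500/v1/agent/service/register -d @-
# done < gpu_hosts.txt
```

Generating the definition on the fly keeps the address, Meta.ip and health-check URL in sync for every host, since all three come from the same argument.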
7. Grafana dashboard
Import dashboard ID 14575, which can be found by searching the official Grafana dashboards site.
If you have questions, contact the author on WeChat.