Monitoring Kubernetes and OpenShift - Monitoring GPU (beta)

August 6, 2019

If you are using NVIDIA GPU devices for your workloads, including machine learning (ML), high performance computing (HPC), financial analytics, and video transcoding, you want to be able to monitor how efficiently you are using these devices.

We provide a solution, based on the nvidia-smi tool, that lets you monitor the GPU devices attached to your Kubernetes and OpenShift nodes and review GPU and memory utilization, power consumption, and more. It is currently in beta, and you will need to add the required dashboards to your configuration manually. In upcoming versions we will include these dashboards as part of our application.

Please review the documentation on installation.

NVIDIA (GPU)

We use the nvidia-smi tool to collect the data, which lets us install the collection component on any Kubernetes or OpenShift version. The official NVIDIA monitoring tool requires Kubernetes 1.13+, which is a significant limitation, considering that you can't run it on the most popular OpenShift version, 3.11 (which is based on Kubernetes 1.11). If you prefer to use NVIDIA/gpu-monitoring-tools, you can use our Prometheus annotations to collect these metrics and forward them to Splunk.
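To give a sense of the kind of data nvidia-smi exposes, here is a minimal Python sketch that queries a few per-GPU fields in CSV form and parses them. It is an illustration only, not how our collector is implemented: the helper names parse_gpu_csv and query_gpus and the selection of fields are assumptions made for this example. The --query-gpu and --format options belong to nvidia-smi itself, and the fields available on your driver can be listed with nvidia-smi --help-query-gpu.

```python
import csv
import io
import subprocess

# Fields requested from `nvidia-smi --query-gpu` (an illustrative selection).
FIELDS = [
    "index",
    "name",
    "utilization.gpu",
    "utilization.memory",
    "memory.used",
    "memory.total",
    "power.draw",
    "temperature.gpu",
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU.

    Values are kept as strings, because nvidia-smi can report placeholders
    such as "[N/A]" for fields a device does not support.
    """
    gpus = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        # Skip empty or whitespace-only lines.
        if not any(cell.strip() for cell in row):
            continue
        gpus.append(dict(zip(FIELDS, (cell.strip() for cell in row))))
    return gpus


def query_gpus():
    """Run nvidia-smi on the local node and return the parsed rows.

    Hypothetical helper: requires nvidia-smi to be on PATH.
    """
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

On a node with the NVIDIA driver installed, calling query_gpus() would return one dictionary per GPU, which could then be printed or forwarded as an event. With nounits, nvidia-smi omits the unit suffixes from the values.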

About Outcold Solutions