
Monitoring OpenShift - Version 5

Monitoring GPU (beta)

Monitoring Nvidia GPU devices

Installing the collection

Prerequisites

If not all nodes in your cluster have GPU devices attached, label the GPU nodes, for example

oc label nodes <gpu-node-name> hardware-type=NVIDIAGPU

The DaemonSet that we use below relies on this label.
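
To verify which nodes carry the label, you can list them with a label selector:

oc get nodes -l hardware-type=NVIDIAGPU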

Nvidia-SMI DaemonSet

We use the nvidia-smi tool to collect metrics from the GPU devices. You can find the documentation for this tool at https://developer.nvidia.com/nvidia-system-management-interface. We also use a set of annotations to convert the output of this tool into an easily parsable CSV format, which helps us configure field extraction with Splunk.
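
To illustrate what the replace annotations do, here is a rough GNU sed equivalent applied to a sample output line (the values are made up, and the actual replacements are performed by collectord, not sed):

echo '    0    45    28     -     -     0     0     0     0   5000   810' \
  | sed -E 's/^#.*$//' \
  | sed -E 's/(^\s+)|(\s+$)//g' \
  | sed -E 's/\s+/,/g' \
  | sed -E 's/-//g'
# expected output: 0,45,28,,,0,0,0,0,5000,810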

Create a file named nvidia-smi.yaml with the following content.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: collectorforopenshift-nvidia-smi
  namespace: collectorforopenshift
  labels:
    app: collectorforopenshift-nvidia-smi
spec:
  updateStrategy:
    type: RollingUpdate

  selector:
    matchLabels:
      daemon: collectorforopenshift-nvidia-smi

  template:
    metadata:
      name: collectorforopenshift-nvidia-smi
      labels:
        daemon: collectorforopenshift-nvidia-smi
      annotations:
        collectord.io/logs-joinpartial: 'false'
        collectord.io/logs-joinmultiline: 'false'
        # remove headers
        collectord.io/logs-replace.1-search: '^#.*$'
        collectord.io/logs-replace.1-val: ''
        # trim spaces from both sides
        collectord.io/logs-replace.2-search: '(^\s+)|(\s+$)'
        collectord.io/logs-replace.2-val: ''
        # convert the console-formatted line into CSV
        collectord.io/logs-replace.3-search: '\s+'
        collectord.io/logs-replace.3-val: ','
        # replace '-' placeholders with empty values
        collectord.io/logs-replace.4-search: '-'
        collectord.io/logs-replace.4-val: ''
        # nothing to report from pmon - just ignore the line
        collectord.io/pmon--logs-replace.0-search: '^\s+\d+(\s+-)+\s*$'
        collectord.io/pmon--logs-replace.0-val: ''
        # set log source types
        collectord.io/pmon--logs-type: openshift_gpu_nvidia_pmon
        collectord.io/dmon--logs-type: openshift_gpu_nvidia_dmon
    spec:
      # Make sure to attach a matching label to the GPU node
      # $ oc label nodes <gpu-node-name> hardware-type=NVIDIAGPU
      # nodeSelector:
      #   hardware-type: NVIDIAGPU  
      hostPID: true
      containers:
      - name: pmon
        image: nvidia/cuda:latest
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi pmon -s um --count 1 --id {}'; sleep 30 ;done"
      - name: dmon
        image: nvidia/cuda:latest
        args:
          - "bash"
          - "-c"
          - "while true; do nvidia-smi --list-gpus | cut -d':' -f 3 | cut -c2-41 | xargs -L4 echo | sed 's/ /,/g' | xargs -I {} bash -c 'nvidia-smi dmon -s pucvmet --count 1 --id {}'; sleep 30 ;done"

Apply this DaemonSet to your cluster with

oc apply -f nvidia-smi.yaml
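
To verify that the DaemonSet pods have started and that nvidia-smi produces output, you can check the pods and their logs (the pod name below is a placeholder):

oc get pods -n collectorforopenshift -l daemon=collectorforopenshift-nvidia-smi
oc logs -n collectorforopenshift <pod-name> -c dmon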

Starting with version 5.11, you should see the data in the dashboard NVIDIA (GPU).


About Outcold Solutions

Outcold Solutions provides solutions for monitoring Kubernetes, OpenShift and Docker clusters in Splunk Enterprise and Splunk Cloud. We offer certified Splunk applications, which give you insights across all container environments. We help businesses reduce the complexity of logging and monitoring by providing easy-to-use and easy-to-deploy solutions for Linux and Windows containers. We deliver applications that help developers monitor their applications and help operators keep their clusters healthy. With the power of Splunk Enterprise and Splunk Cloud, we offer one solution to keep all the metrics and logs in one place, allowing you to quickly address complex questions on container performance.