Building a Custom NVIDIA DCGM Grafana Dashboard on OpenShift

Apr 23, 2024

Custom Panel on top of default DCGM dashboard

Monitoring GPU performance is crucial when managing clusters for AI, machine learning, or HPC workloads. Thanks to the community-driven Grafana Operator and the powerful telemetry stack in OpenShift, setting up a custom Grafana dashboard to monitor NVIDIA GPUs using DCGM (Data Center GPU Manager) metrics has never been easier. In this post, I’ll walk through the process step by step.

Installing Grafana Operator

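The first step is to install the Grafana Operator from OperatorHub in OpenShift. I chose version 5.6.0 by Grafana Labs for its latest features and compatibility.

Deploy the operator with the following manifest, or install it from the OpenShift console instead. Note that the manifest subscribes to the v5 channel with automatic approval, so OLM installs the newest release in that channel rather than pinning a specific version.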

subscription.yaml

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/grafana-operator.grafana-operator: ""
  name: grafana-operator
  namespace: grafana-operator
spec:
  channel: v5
  installPlanApproval: Automatic
  name: grafana-operator
  source: community-operators
  sourceNamespace: openshift-marketplace

oc apply -f subscription.yaml
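
OLM only acts on a Subscription when its namespace exists and contains an OperatorGroup, and the console install flow normally takes care of that for you. If you apply the manifest by hand, something like the following sketch may be needed first. It assumes the operator should watch all namespaces (the Grafana instance in this post lives in a separate grafana namespace) and that the operator supports that install mode; the OperatorGroup name is my own choice.

```yaml
# Sketch of the prerequisites for subscription.yaml (assumptions noted inline).
apiVersion: v1
kind: Namespace
metadata:
  name: grafana-operator   # namespace referenced by the Subscription
---
apiVersion: v1
kind: Namespace
metadata:
  name: grafana            # namespace used later for the Grafana instance
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: grafana-operator   # hypothetical name, any name works
  namespace: grafana-operator
# An empty spec (no targetNamespaces or selector) selects all namespaces.
spec: {}
```

Save it under a name of your choice and apply it with oc apply -f before applying subscription.yaml.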

Configuring the Grafana Instance

Next, deploy a Grafana instance by applying the following grafana.yaml with oc apply -f grafana.yaml.

apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  namespace: grafana
spec:
  config:
    auth:
      disable_login_form: 'false'
      disable_signout_menu: 'true'
    auth.anonymous:
      enabled: 'true'
    auth.basic:
      enabled: 'true'
    log:
      level: warn
      mode: console
  route:
    spec:
      tls:
        termination: edge

Logging in to Grafana

You should be able to access the Grafana instance through its OpenShift route. You can also retrieve the admin credentials from a secret in the grafana namespace.

# Get grafana routes
oc get routes -n grafana

# Command to extract Grafana admin credentials
oc extract secrets/grafana-admin-credentials -n grafana --to=-

Integrating Prometheus as a Data Source

The goal here is to reuse the existing OpenShift Prometheus instance as a Grafana data source. Prometheus runs in the openshift-monitoring namespace.

Use https://prometheus-k8s.openshift-monitoring:9091 as the endpoint for Grafana's data source. For authentication, generate a service account token with the command below. Tokens created this way expire (typically after about an hour by default), so pass --duration if you need a longer-lived one.

# Command to generate a token for Prometheus
oc create token prometheus-adapter -n openshift-monitoring

Now open the Grafana UI, navigate to Settings -> Data Sources, and configure the Prometheus data source as shown in the screenshots. Add a custom HTTP header with the following name and value:

Header: Authorization

Value: Bearer <TOKEN_FROM_ABOVE_COMMAND>
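
The same data source can probably be declared through the operator instead of the UI. The sketch below uses the operator's GrafanaDatasource resource and rests on several assumptions that are not part of the setup above: the Grafana resource carries a dashboards: "grafana" label (the grafana.yaml shown earlier sets no labels, so you would need to add one), the token is stored in a Secret that I have called prometheus-token under the key token, and TLS verification is skipped because the short service hostname may not match the serving certificate.

```yaml
# Sketch only: declarative equivalent of the UI steps, under the assumptions above.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus          # hypothetical resource name
  namespace: grafana
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana" # assumed label, must be present on the Grafana resource
  valuesFrom:
    # Injects the Secret value into the header value defined below.
    - targetPath: "secureJsonData.httpHeaderValue1"
      valueFrom:
        secretKeyRef:
          name: "prometheus-token"  # hypothetical Secret holding the token
          key: "token"
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy
    url: https://prometheus-k8s.openshift-monitoring:9091
    isDefault: true
    jsonData:
      httpHeaderName1: "Authorization"
      tlsSkipVerify: true   # assumption, drop this if the certificate validates
    secureJsonData:
      httpHeaderValue1: "Bearer ${token}"
```

Keeping the token in a Secret avoids pasting it into the resource itself, but it still expires on the schedule it was created with.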

Importing and Customizing the NVIDIA DCGM Dashboard

Start with the base NVIDIA DCGM dashboard JSON file, which can be fetched with:

# Download the NVIDIA DCGM Grafana dashboard JSON file
curl -LfO https://github.com/NVIDIA/dcgm-exporter/raw/main/grafana/dcgm-exporter-dashboard.json

Import this JSON file into your Grafana instance and customize it further.
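
If you would rather have the operator perform the import, a GrafanaDashboard resource pointing at the same URL is one option. This is a sketch, not a tested configuration: it assumes the Grafana resource has a dashboards: "grafana" label (not set in the grafana.yaml above), that the operator pod can reach GitHub, and the resource name is my own.

```yaml
# Sketch only: lets the operator fetch and import the base DCGM dashboard.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: nvidia-dcgm         # hypothetical resource name
  namespace: grafana
spec:
  instanceSelector:
    matchLabels:
      dashboards: "grafana" # assumed label, must be present on the Grafana resource
  # Same file as the curl command above.
  url: https://github.com/NVIDIA/dcgm-exporter/raw/main/grafana/dcgm-exporter-dashboard.json
```

Changes made in the UI to an operator-managed dashboard may be overwritten when the operator reconciles, so the manual import remains the simpler route if you plan to customize panels by hand.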

Add the following queries as shown in the screenshots. Make sure to set the query format to “Table”.

DCGM_FI_DEV_GPU_UTIL{instance=~"$instance", gpu=~"$gpu"}
DCGM_FI_DEV_GPU_UTIL{instance=~".*"}

Make sure to transform the query as shown in the following screenshots.

Then select “Table” as the visualization type.

Now you can finally save the dashboard.

You can also import this dashboard from the following GitHub gist.