Building a Custom NVIDIA DCGM Grafana Dashboard on OpenShift
Monitoring GPU performance is crucial when managing clusters for AI, machine learning, or HPC workloads. Thanks to the community-driven Grafana Operator and the powerful telemetry stack in OpenShift, setting up a custom Grafana dashboard to monitor NVIDIA GPUs using DCGM (Data Center GPU Manager) metrics has never been easier. In this post, I’ll walk through the process step by step.
Installing Grafana Operator
The first step is to install the Grafana Operator from the OpenShift OperatorHub. I chose version 5.6.0 by Grafana Labs for the latest features and compatibility.
Deploy the operator using the following manifest file. You can also deploy it from the OpenShift console.
subscription.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/grafana-operator.grafana-operator: ""
  name: grafana-operator
  namespace: grafana-operator
spec:
  channel: v5
  installPlanApproval: Automatic
  name: grafana-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
oc apply -f subscription.yaml
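The Subscription above assumes that the grafana-operator namespace already exists and contains an OperatorGroup. The console creates both for you during a guided install. When deploying purely from the CLI, a minimal sketch of those prerequisites might look like the following. The OperatorGroup name is a placeholder of mine, and the empty spec is an assumption that you want the operator to watch all namespaces, including the grafana namespace used later.

```yaml
# Namespace that the Subscription is created in
apiVersion: v1
kind: Namespace
metadata:
  name: grafana-operator
---
# OperatorGroup for the operator namespace (the name is a placeholder).
# An empty spec sets no targetNamespaces, which is assumed here to mean
# "watch every namespace".
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: grafana-operator-group
  namespace: grafana-operator
spec: {}
```

Apply this before subscription.yaml so that the Subscription has a namespace and an OperatorGroup to resolve against.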
Configuring the Grafana Instance
Next, deploy a Grafana instance by applying the following grafana.yaml file with oc apply -f grafana.yaml. The grafana namespace must exist before you apply it.
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  namespace: grafana
spec:
  config:
    auth:
      disable_login_form: 'false'
      disable_signout_menu: 'true'
    auth.anonymous:
      enabled: 'true'
    auth.basic:
      enabled: 'true'
    log:
      level: warn
      mode: console
  route:
    spec:
      tls:
        termination: edge
Logging in to Grafana
You should now be able to access the Grafana instance through its OpenShift route. You can retrieve the admin credentials from a secret in the grafana namespace.
# Get grafana routes
oc get routes -n grafana
# Command to extract Grafana admin credentials
oc extract secrets/grafana-admin-credentials -n grafana --to=-
Integrating Prometheus as a Data Source
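The goal here is to reuse the existing Prometheus instance in OpenShift as a Grafana data source. Prometheus runs in the openshift-monitoring namespace, so use https://prometheus-k8s.openshift-monitoring:9091 as the endpoint for Grafana's data source. For authentication, generate a service account token with the following command. Tokens created this way are time-limited, so the data source needs a fresh token once the current one expires; the --duration flag can request a longer lifetime.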
# Command to generate a token for Prometheus
oc create token prometheus-adapter -n openshift-monitoring
Now open the Grafana UI, navigate to Settings -> Data Sources, and configure the Prometheus data source as shown in the screenshots. Add the token as a custom HTTP header:
Header: Authorization
Value: Bearer <TOKEN_FROM_ABOVE_COMMAND>
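If you would rather keep the data source in version control than click through the UI, the Grafana Operator also provides a GrafanaDatasource resource. The following is a minimal sketch and not the configuration used in this post: the resource name, the dashboards: grafana label, and the tlsSkipVerify setting are my assumptions. In particular, the Grafana manifest shown earlier sets no labels, so you would need to add a matching label to it before instanceSelector can find the instance.

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: prometheus                 # hypothetical resource name
  namespace: grafana
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana          # assumed label; add it to the Grafana resource
  datasource:
    name: Prometheus
    type: prometheus
    access: proxy
    url: https://prometheus-k8s.openshift-monitoring:9091
    isDefault: true
    jsonData:
      httpHeaderName1: Authorization
      tlsSkipVerify: true          # assumption: skips certificate verification
    secureJsonData:
      # Paste the token generated by the oc create token command
      httpHeaderValue1: "Bearer <TOKEN_FROM_ABOVE_COMMAND>"
```

Because the token is embedded in the resource, it has to be updated here whenever the token is regenerated.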
Importing and Customizing the NVIDIA DCGM Dashboard
Start with the base NVIDIA DCGM dashboard JSON file, which can be fetched with:
# Download the NVIDIA DCGM Grafana dashboard JSON file
curl -LfO https://github.com/NVIDIA/dcgm-exporter/raw/main/grafana/dcgm-exporter-dashboard.json
Import this JSON file into your Grafana instance and customize it further.
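As a declarative alternative to importing through the UI, the Grafana Operator can load a dashboard from a URL through a GrafanaDashboard resource. This is a sketch under assumptions rather than the method used in this post: the resource name is hypothetical, and the dashboards: grafana label must be present on the Grafana resource, which the manifest shown earlier does not set.

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: nvidia-dcgm                # hypothetical resource name
  namespace: grafana
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana          # assumed label on the Grafana resource
  # Same JSON file as in the curl command; the operator fetches it for you
  url: https://github.com/NVIDIA/dcgm-exporter/raw/main/grafana/dcgm-exporter-dashboard.json
```

If the dashboard JSON declares a data source input, it may need to be mapped to your Prometheus data source through the resource's datasources field. Check the JSON file and the operator documentation for the exact names before relying on this.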
Add the following queries as shown in the screenshots, and make sure each one uses the “Table” format:
DCGM_FI_DEV_GPU_UTIL{instance=~"$instance", gpu=~"$gpu"}
DCGM_FI_DEV_GPU_UTIL{instance=~".*"}
Make sure to transform the query as shown in the following screenshots.
Then select “Table” as the visualization type.
Finally, save the dashboard.
You can also use the following GitHub gist to import this dashboard.