Setting Up NVIDIA Tesla T4 Time-Slicing for AI Workloads on Red Hat OpenShift (RHOAI) in AWS Cloud

Authors: Balkrishna Pandey, Monson Xavier, Puneet Marhatha

Feb 15, 2024

In this blog post, we will cover the following:

  • What time-slicing is and how it compares with parallel processing
  • Setting up Single Node OpenShift (SNO) on AWS
  • Adding a GPU worker node to OpenShift using MachineSets
  • Setting up the Node Feature Discovery Operator
  • Setting up the NVIDIA GPU Operator
  • Configuring time-slicing
  • Setting up and configuring Red Hat OpenShift AI (RHOAI)

Let’s begin this blog with a fundamental overview of time-slicing, diving into the basic concepts and exploring how it compares and contrasts with parallel processing.

Understanding GPU Time-Slicing Versus GPU Parallel Processing

Two key methodologies for managing and utilizing the computational capabilities of GPUs are time-slicing and parallel processing. Both strategies aim to optimize the use of GPU resources, but they approach the task in fundamentally different ways.

What is Time-Slicing?

Time-slicing is a technique used to divide the processing power of a GPU among multiple tasks or processes over time, rather than through the physical or permanent allocation of resources. This method allows for the concurrent processing of multiple tasks by sharing the GPU’s computational power in time-sequenced intervals.

  • Shared Processing Power: By dividing the GPU’s time among several tasks, time-slicing enables multiple applications to utilize the GPU simultaneously, enhancing resource sharing and multitasking.
  • Efficient Utilization: This approach ensures that the GPU’s capabilities are fully utilized, preventing idle time and maximizing efficiency and throughput.
  • Dynamic Allocation: GPU time is dynamically allocated to different tasks, with the GPU rapidly switching between them. This gives the impression of simultaneous processing, although tasks are handled in discrete time slices.
  • Context Switching: Involves saving the state of a current task before switching to the next, introducing some latency due to the time required for these context switches.
  • Balancing Act: Effective time-slicing requires careful balance in allocating time slices to tasks, considering their nature, computational needs, and priority.

GPU Parallel Processing

In contrast, GPU parallel processing divides tasks into smaller work units that are executed simultaneously across the GPU’s multiple processing cores. This method exploits the inherent parallel architecture of GPUs to achieve high throughput.

  • Concurrent Task Execution: Tasks are broken down and distributed across multiple cores, allowing for simultaneous execution and significantly increasing the processing capacity of the GPU.
  • High Throughput: By leveraging the parallel nature of GPU cores, this approach can handle a larger volume of computations in a shorter amount of time, ideal for intensive computational tasks.

Comparing Time-Slicing and Parallel Processing

  • Sequential vs. Concurrent Execution: Time-slicing allocates GPU resources to tasks sequentially in time-discrete slices, ensuring tasks are processed turn by turn. Parallel processing, however, executes parts of tasks concurrently across multiple cores.
  • Resource Sharing Over Time vs. Space: Time-slicing shares GPU resources over time, switching between tasks. Parallel processing divides tasks spatially across the GPU’s cores for simultaneous processing.
  • Use Cases: Time-slicing is beneficial in environments where tasks vary significantly in nature and computational requirements, requiring dynamic allocation for efficient processing. Parallel processing shines in scenarios where tasks can be easily parallelized, maximizing throughput for computationally intensive operations.

Now that we’ve grasped the essentials of time-slicing and explored how it differs from parallel processing, let’s move forward. Our next step involves creating a single-node OpenShift Cluster on AWS to test time-slicing.

Create Single Node OpenShift Cluster on AWS

Prerequisite

There are some prerequisites before we start installing the OpenShift Cluster on AWS.

Generate ssh-key pair

When installing the OpenShift Container Platform, we need to provide the SSH public key to the installation program.

To generate an SSH key pair for authentication onto your cluster nodes, particularly if you don’t have an existing pair, follow these steps on a Linux-based system:

  • Generate SSH Key Pair: Use the command ssh-keygen -t ed25519 -N '' -f <path>/<file_name>, replacing <path>/<file_name> with your desired path and file name, such as ~/.ssh/id_ed25519. This command creates a new SSH key of type ed25519.
  • View the Public SSH Key: To see your public key, use cat <path>/<file_name>.pub. For example, cat ~/.ssh/id_ed25519.pub will display the public key of the file you just created.
  • Add SSH Private Key to SSH Agent: If the ssh-agent isn’t running for your user, start it using eval $(ssh-agent -s). This will manage your SSH private key, enabling password-less SSH authentication to your cluster nodes. Add your SSH private key to the ssh-agent with ssh-add <path>/<file_name>, specifying the same path and file name used earlier.

This process ensures secure, password-less authentication to your cluster nodes and is necessary for specific commands like ./openshift-install gather.
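
For reference, assuming ~/.ssh/id_ed25519 as the key path (adjust to your own), the steps above condense to the following commands:

ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519   # generate the key pair
cat ~/.ssh/id_ed25519.pub                          # view the public key
eval $(ssh-agent -s)                               # start ssh-agent if it is not already running
ssh-add ~/.ssh/id_ed25519                          # add the private key to the agent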

Subscribe to AWS Marketplace for AMI Image

Complete the OpenShift Container Platform subscription from the AWS Marketplace, if you haven’t done so already.

Download the Installation program for AWS

  • Visit the OpenShift Cluster Manager site. Log in with your Red Hat account, or create one if you don’t have an account.
  • Download the installation program suitable for your operating system and architecture, and save it in the directory where you plan to keep your installation configuration files.
  • After downloading, extract the files using the tar command. For example, if you’ve downloaded the OpenShift client (openshift-client-mac.tar.gz) and the OpenShift installer (openshift-install-mac.tar.gz), extract them with the following commands:
tar -zxvf openshift-client-mac.tar.gz
tar -zxvf openshift-install-mac.tar.gz

This will extract files like README.md, oc, kubectl, and openshift-install.

  • Move the extracted command-line tools (oc, kubectl, and openshift-install) to a directory in your system’s PATH for easy execution. Typically, this is /usr/local/bin/. Use the mv command as follows:
mv oc /usr/local/bin/oc
mv kubectl /usr/local/bin/
mv openshift-install /usr/local/bin/

By doing this, you make these tools accessible from any directory in your terminal, facilitating the further steps of your OpenShift installation on AWS.

Configure AWS CLI

Before installing OpenShift on AWS, ensure your AWS CLI is configured:

  • Install AWS CLI: Download and install from the AWS CLI website.
  • Run AWS Configure:
    - Execute aws configure in the terminal.
    - Enter your AWS Access Key ID, Secret Access Key, default region name (e.g., us-east-2), and output format (e.g., JSON).
  • Check IAM Permissions:
    - Verify that your AWS IAM user has adequate permissions, ideally AdministratorAccess, for the installation.
  • Security Note:
    - Handle AWS credentials securely and never share them. Rotate keys regularly.

This setup prepares your AWS account for the OpenShift installation.
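
As a quick, optional sanity check that the CLI is pointed at the intended account and region (these are standard AWS CLI commands, not OpenShift-specific):

aws sts get-caller-identity   # shows the account and IAM identity the installer will use
aws configure get region      # shows the default region the installer will pick up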

Create Installation Configuration File

If all prerequisites are met, let’s move on to the next steps. We will create an installation configuration file by executing the following command:

openshift-install create install-config --dir install-config

This process involves a series of interactive prompts where you need to provide specific details about your OpenShift cluster setup. Here’s a breakdown of each step:

  • SSH Public Key: You’ll be asked to provide the path to your SSH public key, e.g., ~/.ssh/id_rsa.pub. This key is used for secure access to your OpenShift nodes.
  • Platform: Specify the platform for your OpenShift cluster. Since you’re setting up on AWS, you’ll select aws.
  • AWS Credentials: The installer will automatically load credentials from your AWS configuration file, typically found at ~/.aws/credentials. This step confirms that the installer has successfully loaded the “default” AWS profile.
  • Region: Choose the AWS region where you want to deploy your OpenShift cluster, such as us-east-2.
  • Base Domain: Provide the base domain for your cluster, like example.com. This domain is used to create fully qualified domain names for your cluster resources.
  • Cluster Name: Decide on a name for your OpenShift cluster, for example, demogpu. This name will be part of the domain structure for your cluster resources.
  • Pull Secret: Enter your pull secret, which is a unique code that authenticates your access to OpenShift images and repositories. This is typically a long string of characters, which you can get from your Red Hat account.

Once all these details are provided, the installer generates the install-config.yaml file in the specified directory (install-config). This file contains all the configuration details for your OpenShift cluster on AWS. The creation of this file is a crucial step in setting up your OpenShift cluster.
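
One optional but useful habit: the installer consumes install-config.yaml when it creates the cluster, so keep a backup copy if you expect to recreate or tweak the cluster later. For example:

cp install-config/install-config.yaml install-config.yaml.bak   # keep a copy; the installer removes the original during deployment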

Optional: Adjust install-config.yaml for SNO OpenShift

Note: This step is optional. Since we’re setting up a test cluster, it can help reduce costs.

When installing a single-node OpenShift cluster on AWS, there are specific requirements and adjustments to consider, which differ from the standard high-availability cluster setup typically described in AWS documentation:

  • Reduced Number of Nodes: Unlike a high-availability cluster, which requires a temporary bootstrap machine, three control plane machines, and at least two compute machines, a single-node cluster on AWS only needs a temporary bootstrap machine and one AWS instance to serve as the control plane node. No separate worker nodes are required. Take a look at the following screenshots; two machines are created during installation. Once the installation is complete, the bootstrap node is deleted.
  • Increased Resource Requirements: For a single-node cluster, the minimum resource requirements are higher compared to a standard control plane node. You’ll need an instance with at least 8 vCPU cores and 120GB of storage, as opposed to the standard 4 vCPUs and 100GB storage.
  • Configuration in install-config.yaml: Adjustments need to be made in the install-config.yaml file:
    - Set controlPlane.replicas to 1. This indicates that there will be only one control plane node in your cluster.
    - Set compute.replicas to 0. This configuration makes the single control plane node also act as a worker node, making it schedulable.

These modifications ensure that your single-node OpenShift cluster on AWS is configured correctly with the necessary resources and node setup, differing from the typical multi-node, high-availability cluster setup.

Here is the adjusted configuration; we have also specified the AWS zones, AMI image, root volume, instance type, etc.

additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: example.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 0
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    aws:
      zones:
      - us-east-2a
      - us-east-2b
      - us-east-2c
      rootVolume:
        iops: 4000
        size: 500
        type: io1
      metadataService:
        authentication: Optional
      type: m5.4xlarge
  replicas: 1
metadata:
  creationTimestamp: null
  name: demogpu
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  aws:
    region: us-east-2
    propagateUserTags: true
    userTags:
      adminContact: bpandey
    amiID: ami-08210c197869e16cd
fips: false
publish: External
pullSecret: '<pull-secrets>'
sshKey: |
  <pub-key>

Deploy Cluster

To continue with the installation process of your single-node OpenShift cluster on AWS, use the following command:

openshift-install create cluster --dir install-dir/ --log-level=info

Output:

INFO Credentials loaded from the "default" profile in file "/Users/bpandey/.aws/credentials"
WARNING Making control-plane schedulable by setting MastersSchedulable to true for Scheduler cluster settings
INFO Consuming Install Config from target directory
INFO Creating infrastructure resources…
INFO Waiting up to 20m0s (until 2:23PM MST) for the Kubernetes API at https://api.demogpu.goglides.com:6443...
INFO API v1.27.9+5c56cc3 up
INFO Waiting up to 30m0s (until 2:35PM MST) for bootstrapping to complete…
INFO Destroying the bootstrap resources…
INFO Waiting up to 40m0s (until 2:58PM MST) for the cluster at https://api.demogpu.goglides.com:6443 to initialize…
INFO Checking to see if there is a route at openshift-console/console…
INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/Users/bpandey/Documents/ocp-demo-gpu/install-dir/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.demogpu.goglides.com
INFO Login to the console with user: "kubeadmin", and password: "4s7fp-JDrVy-FgVDK-XYZss"
INFO Time elapsed: 27m46s

Verify installation

  • Upon successful completion, the terminal will display instructions for accessing your cluster, including a link to the web console and credentials for the kubeadmin user.
  • Credential information is also output to <installation_directory>/.openshift_install.log.
  • Do not delete the installation program or its created files, as they are necessary for cluster deletion.
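
Before moving on, it is worth confirming that the cluster has settled. The path below assumes the install-dir/ directory used in the deploy command:

export KUBECONFIG=install-dir/auth/kubeconfig
oc get nodes              # the single node should be Ready
oc get clusteroperators   # all cluster operators should report Available=True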

Setting Up an AWS GPU Node in OpenShift SNO

Next, we’ll add a GPU node using the instance type g4dn.4xlarge. This addition will be facilitated through the configuration of a MachineSet to incorporate a single GPU node into our cluster.

Step 1: Fetch Cluster-Specific Values

First, we’ll create a Bash script to retrieve essential values like the infrastructure ID and the AMI ID. Here’s the script:

#!/bin/bash
machineset_name="gpu001"
# Fetch infrastructure ID
infrastructure_id=$(oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster)
echo "Infrastructure ID: $infrastructure_id"
# Define role, zone, and region
role=worker
zone=us-east-2b
region=us-east-2
# Fetch AMI ID
ami_id=$(oc -n openshift-machine-api -o jsonpath='{.spec.template.spec.providerSpec.value.ami.id}{"\n"}' get machineset/${infrastructure_id}-${role}-${zone})
echo "AMI ID: $ami_id"
instanceType="g4dn.4xlarge"

Use the following template to create the MachineSet. Replace the placeholders with the values from the script output above and save the file as machine-set.yaml:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: <infrastructure_id>
  name: <infrastructure_id>-<role>-<zone>-<machineset_name>
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infrastructure_id>
      machine.openshift.io/cluster-api-machineset: <infrastructure_id>-<role>-<zone>-<machineset_name>
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: <infrastructure_id>
        machine.openshift.io/cluster-api-machine-role: <role>
        machine.openshift.io/cluster-api-machine-type: <role>
        machine.openshift.io/cluster-api-machineset: <infrastructure_id>-<role>-<zone>-<machineset_name>
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/<role>: ""
          # Add your node labels here
          # Example: key: "value"
      providerSpec:
        value:
          ami:
            id: <ami_id>
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          blockDevices:
          - ebs:
              iops: 0
              kmsKey:
                arn: ''
              volumeSize: 120
              volumeType: gp2
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: <infrastructure_id>-worker-profile
          instanceType: <instanceType>
          kind: AWSMachineProviderConfig
          placement:
            availabilityZone: <zone>
            region: <region>
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - <infrastructure_id>-worker-sg
          subnet:
            filters:
            - name: tag:Name
              values:
              - <infrastructure_id>-private-<zone>
          tags:
          - name: kubernetes.io/cluster/<infrastructure_id>
            value: owned
          # Add your custom tags here
          # Example:
          # - name: "tag_name"
          #   value: "tag_value"
          userDataSecret:
            name: worker-user-data
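
If you prefer not to edit the template by hand, a small substitution sketch like the following works, assuming the variables from the fetch script above are still set in your shell and the template is saved as machine-set-template.yaml (an illustrative filename):

sed -e "s/<infrastructure_id>/${infrastructure_id}/g" \
    -e "s/<role>/${role}/g" \
    -e "s/<zone>/${zone}/g" \
    -e "s/<region>/${region}/g" \
    -e "s/<machineset_name>/${machineset_name}/g" \
    -e "s/<ami_id>/${ami_id}/g" \
    -e "s/<instanceType>/${instanceType}/g" \
    machine-set-template.yaml > machine-set.yaml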

Now you can apply the config using oc apply -f machine-set.yaml or use the OpenShift console to apply it as shown in the following screenshot.

This will create a new machine.

You can also use the oc command to verify.
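
For instance, you can watch the new Machine being provisioned (names will differ in your cluster):

oc get machinesets -n openshift-machine-api   # the new MachineSet should report DESIRED/CURRENT/READY of 1
oc get machines -n openshift-machine-api      # the GPU machine's PHASE moves from Provisioning to Running

Once the machine is running, the GPU node appears alongside the SNO node: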


bash-4.4 ~ $ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-156-130.us-east-2.compute.internal Ready worker 19d v1.26.11+8cfd402
ip-10-0-170-95.us-east-2.compute.internal Ready control-plane,master,worker 30d v1.26.11+8cfd402

Setup Node Feature Discovery Operator (NFD)

The following step involves setting up the Node Feature Discovery (NFD) Operator. The NFD Operator automates the detection of hardware features and capabilities on Kubernetes nodes, enabling the cluster to optimize workload placement and scheduling decisions based on the actual hardware attributes. This ensures that workloads are assigned to nodes that best match their hardware requirements, enhancing overall performance and efficiency.

Install the NFD Operator as follows:

  1. Navigate to Operators -> OperatorHub
  2. Search for “Node Feature Discovery Operator” in OperatorHub
  3. Click “Install” on the “Node Feature Discovery Operator” page.
  4. From the installation page, select the default values and click “Install”

5. Now navigate to “Operators -> Installed Operators” and search for “Node Feature Discovery Operator” in the “openshift-nfd” namespace.

6. Click on the operator, navigate to the “NodeFeatureDiscovery” tab, and click the “Create NodeFeatureDiscovery” button. Then click “Create” to create the “NodeFeatureDiscovery” object.

7. Verify that the Node Feature Discovery Operator is running:

oc get pods -n openshift-nfd
NAME READY STATUS RESTARTS AGE
nfd-controller-manager-587fb5f4f9-s8xpw 2/2 Running 0 2d14h
nfd-master-5fbd664646-jxv4q 1/1 Running 0 11d
nfd-worker-c24td 1/1 Running 0 151m

8. Verify that labels are present on the node. The Node Feature Discovery Operator identifies node hardware using vendor PCI IDs, such as NVIDIA’s 10de:

oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true
NAME STATUS ROLES AGE VERSION
ip-10-0-156-130.us-east-2.compute.internal Ready worker 152m v1.26.11+8cfd402

To check if the GPU device (PCI ID 10de) is recognized on the GPU node, run the command:

oc describe node | egrep 'Roles|pci' | grep -v master

This filters node descriptions for GPU roles and PCI information, excluding master nodes.

Roles:             worker
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-1d0f.present=true
feature.node.kubernetes.io/pci-1d0f.present=true

Setup NVIDIA GPU operator

It is possible to install the NVIDIA GPU Operator using the OpenShift console as well; here we are going to show you how to install the operator using the CLI.

If you are installing using the UI, search for the “NVIDIA GPU Operator”, as shown below.

  1. Create a namespace
---
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator

2. Create an operator group

---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator

3. Create Subscription

First, let's gather the information required to create the Subscription:

CHANNEL=$(oc get packagemanifest gpu-operator-certified -n openshift-marketplace -o jsonpath='{.status.defaultChannel}')
oc get packagemanifests/gpu-operator-certified -n openshift-marketplace -ojson | jq -r '.status.channels[] | select(.name == "'$CHANNEL'") | .currentCSV'
echo $CHANNEL

# Values returned at the time of writing (yours may differ)
channel="v23.9"
startingCSV="gpu-operator-certified.v23.9.1"

Use these values in the Subscription:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: "v23.9" # CHANNEL from above
  installPlanApproval: Manual
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: "gpu-operator-certified.v23.9.1" # startingCSV from above output

4. Approve the install plan using the CLI, or approve it in the UI:

INSTALL_PLAN=$(oc get installplan -n nvidia-gpu-operator -oname)
oc patch $INSTALL_PLAN -n nvidia-gpu-operator --type merge --patch '{"spec":{"approved":true }}'
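
The workload pods shown in the next step only appear once a ClusterPolicy exists (the time-slicing section later patches one named gpu-cluster-policy). If your installation does not already have one, the NVIDIA documentation generates it from the CSV's alm-examples annotation; a sketch, assuming the startingCSV used above:

# Wait for the operator CSV to reach the Succeeded phase
oc get csv -n nvidia-gpu-operator

# Extract the default ClusterPolicy from the CSV's alm-examples annotation and apply it
oc get csv gpu-operator-certified.v23.9.1 -n nvidia-gpu-operator \
  -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json
oc apply -f clusterpolicy.json -n nvidia-gpu-operator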

5. Verify that all components are up and running:

bash-4.4 ~ $ oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-drlvt 2/2 Running 0 4h7m
gpu-operator-79787ccdfb-qcb7n 1/1 Running 6 30d
nvidia-container-toolkit-daemonset-7hm2h 1/1 Running 0 4h7m
nvidia-cuda-validator-v6w8r 0/1 Completed 0 4h6m
nvidia-dcgm-exporter-7ns26 1/1 Running 0 4h7m
nvidia-dcgm-kgp8q 1/1 Running 0 4h7m
nvidia-device-plugin-daemonset-sxjms 2/2 Running 0 4h7m
nvidia-driver-daemonset-413.92.202312131705-0-hf852 2/2 Running 12 19d
nvidia-node-status-exporter-q569f 1/1 Running 6 19d
nvidia-operator-validator-rmkj6 1/1 Running 0 4h7m

6. Also verify that the GPU is available:

POD=$(oc get pods -l app.kubernetes.io/component=nvidia-driver -n nvidia-gpu-operator -o name)
oc exec -it $POD -n nvidia-gpu-operator -- nvidia-smi

You should see output similar to this,

Wed Feb 14 22:06:24 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 28C P8 15W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

Setup and configure Red Hat OpenShift AI (RHOAI)

You can install the Red Hat OpenShift Data Science operator using OperatorHub as follows:

  1. Navigate to Operators -> OperatorHub
  2. Search for “Red Hat OpenShift Data Science” in OperatorHub
  3. Click “Install” on the “Red Hat OpenShift Data Science” page.

4. From the installation page select the default values and click “Install”

5. Once the operator installation is complete, create a Data Science Cluster (a declarative sketch of this resource follows these steps).

6. After some time, you will see the “Red Hat OpenShift AI” navigation menu activated.

7. Navigate to the “Red Hat OpenShift AI” link and log in with your standard OpenShift credentials; you should see the following dashboard.
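
For reference, the Data Science Cluster from step 5 can also be created declaratively. The sketch below assumes the DataScienceCluster API shipped with the operator at the time of writing; confirm the exact apiVersion and component list against your installed operator (the console's default resource works equally well):

apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    dashboard:
      managementState: Managed     # the RHOAI dashboard used in the steps below
    workbenches:
      managementState: Managed     # Jupyter workbenches used later to verify time-slicing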

Configure Time Slicing

The NVIDIA GPU Operator uses an OpenShift ConfigMap to store the possible device configurations. Create the ConfigMap as follows:

cat <<EOF | oc apply -f -
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
  namespace: nvidia-gpu-operator
data:
  Tesla-T4: |-
    version: v1
    sharing:
      timeSlicing:
        # renameByDefault: true
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4
EOF

Now patch the devicePlugin section of the ClusterPolicy with the name of the ConfigMap created above:

oc patch clusterpolicy gpu-cluster-policy \
-n nvidia-gpu-operator --type merge \
-p '{"spec": {"devicePlugin": {"config": {"name": "device-plugin-config"}}}}'

Verify that the configuration was applied:

oc get clusterpolicy -o yaml | grep devicePlugin -A 3

devicePlugin:
  config:
    default: ""
    name: device-plugin-config

Label the node to enable time-slicing

oc label --overwrite node \
--selector=nvidia.com/gpu.product=Tesla-T4 \
nvidia.com/device-plugin.config=Tesla-T4
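
Labeling the node causes the device plugin and GPU feature discovery pods on that node to restart with the Tesla-T4 configuration; you can watch them cycle before checking capacity:

oc get pods -n nvidia-gpu-operator   # the device-plugin and gpu-feature-discovery pods on the labeled node restart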

Verify whether time-slicing is enabled or not

oc get node --selector=nvidia.com/gpu.product=Tesla-T4-SHARED -o json | jq '.items[0].status.capacity'
{
  "attachable-volumes-aws-ebs": "39",
  "cpu": "16",
  "ephemeral-storage": "104266732Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "65016996Ki",
  "nvidia.com/gpu": "8",
  "pods": "250"
}

Sample application to verify time-slicing

Use the following sample application to verify that time-slicing is working. The configuration is adjusted for deployment on an OpenShift cluster (a ServiceAccount bound to the hostaccess SCC is added so the pods can run with hostPID).

## Reference: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.9.1/gpu-sharing.html
apiVersion: v1
kind: ServiceAccount
metadata:
  creationTimestamp: null
  name: time-slicing-verification
---
# oc adm policy add-scc-to-user hostaccess -z time-slicing-verification --dry-run=client -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: null
  name: system:openshift:scc:hostaccess
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:scc:hostaccess
subjects:
- kind: ServiceAccount
  name: time-slicing-verification
  namespace: time-slicing
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: time-slicing-verification
  labels:
    app: time-slicing-verification
spec:
  replicas: 5
  selector:
    matchLabels:
      app: time-slicing-verification
  template:
    metadata:
      labels:
        app: time-slicing-verification
    spec:
      serviceAccountName: time-slicing-verification
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      hostPID: true
      containers:
      - name: cuda-sample-vector-add
        image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
        command: ["/bin/bash", "-c", "--"]
        args:
        - while true; do /cuda-samples/vectorAdd; done
        resources:
          limits:
            nvidia.com/gpu: 1
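
A minimal way to exercise the manifest above, assuming you save it as time-slicing-verification.yaml (an illustrative filename) and use a namespace named time-slicing to match the ClusterRoleBinding subject:

oc new-project time-slicing
oc apply -n time-slicing -f time-slicing-verification.yaml
oc get pods -n time-slicing   # all five replicas should reach Running

All five pods request nvidia.com/gpu: 1, yet they can run on a single physical Tesla T4 because the node advertises time-sliced replicas (8 in the capacity output above) rather than physical devices.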

Verify Time Slicing using the Jupyter Notebook Server

Time-slicing is now configured and has been tested with the sample application. Let's see whether the change is visible from a Jupyter notebook server. Set up a notebook server as follows:

  1. Navigate to Applications -> Enabled from the “Red Hat OpenShift AI” dashboard

2. Click on the “Launch application” link. You should see the following screen, where you should be able to select “NVIDIA GPU” as the accelerator from the drop-down menu.

3. Now try to change the “Number of accelerators” to more than 4. You should see a warning message saying “Only 4 accelerators detected”; this is because of the following time-slicing configuration (see the note after these steps for how to raise the limit):

Tesla-T4: |-
  version: v1
  sharing:
    timeSlicing:
      # renameByDefault: true
      failRequestsGreaterThanOne: false
      resources:
      - name: nvidia.com/gpu
        replicas: 4

4. Now change the number of accelerators to 4 and launch the server. Once the server has started, you will see the following screen.

5. Click on “Access notebook server”. It will open a notebook server, where you have to authorize it (first time only) by clicking “Allow selected permissions”.

6. You should see the following Jupyter Notebook dashboard
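
If you need more than four accelerators per notebook, one option is to raise the replica count in the device-plugin ConfigMap and let the device plugin reload it; a hedged sketch:

oc edit configmap device-plugin-config -n nvidia-gpu-operator   # e.g. change replicas: 4 to replicas: 8

# Confirm the node now advertises the new capacity
oc get node --selector=nvidia.com/gpu.product=Tesla-T4-SHARED -o json | \
  jq '.items[0].status.capacity."nvidia.com/gpu"'

The spawner page should reflect the new maximum the next time it is loaded.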

Cleanup

Follow this step for a thorough cleanup; destroying the cluster ensures your AWS environment is free of unnecessary resources, keeping it cost-effective and tidy.

openshift-install destroy cluster --dir install-dir/

You should see output listing the AWS resources being deleted, ending with a time-elapsed message once the teardown is complete.