A system administrator needs to collect the information below:
GPU behavior monitoring
GPU configuration management
GPU policy oversight
GPU health and diagnostics
GPU accounting and process statistics
NVSwitch configuration and monitoring
What single tool should be used?
Correct Answer: C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
The NVIDIA Data Center GPU Manager (DCGM) is the one tool that covers every item on the list: GPU behavior monitoring, configuration management, policy oversight, health and diagnostics, accounting and process statistics, and NVSwitch configuration and monitoring. DCGM is designed for managing GPUs at scale in data centers and AI clusters, and it provides detailed telemetry and control over NVIDIA GPUs and NVSwitches. The other tools each cover only part of the list. nvidia-smi provides GPU monitoring but lacks full policy and NVSwitch management. The CUDA Toolkit is for GPU programming and development. Nsight Systems is focused on performance profiling. Therefore, DCGM is the single tool that meets all the listed requirements.
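As a rough illustration of how the listed requirements line up with DCGM's command-line client, the sketch below maps each requirement to one or more `dcgmi` subcommands and prints the corresponding help invocation. The mapping is an assumption made for illustration, not an official table: in particular, `nvlink` is used here as the nearest subcommand for NVSwitch status, and the `--help` flag is assumed to be accepted by each subcommand. The script only builds command lines and does not run them.

```python
import shlex

# Assumed mapping from the requirements in the question to dcgmi subcommands.
# This pairing is illustrative; check the DCGM documentation for your version.
DCGMI_SUBCOMMANDS = {
    "GPU behavior monitoring": ["dmon"],
    "GPU configuration management": ["config"],
    "GPU policy oversight": ["policy"],
    "GPU health and diagnostics": ["health", "diag"],
    "GPU accounting and process statistics": ["stats"],
    "NVSwitch configuration and monitoring": ["nvlink"],
}


def help_commands(requirement):
    """Return the `dcgmi <subcommand> --help` command lines for one requirement."""
    return [["dcgmi", sub, "--help"] for sub in DCGMI_SUBCOMMANDS[requirement]]


if __name__ == "__main__":
    # Print one help command per subcommand so an administrator can explore each area.
    for requirement in DCGMI_SUBCOMMANDS:
        for cmd in help_commands(requirement):
            print(f"{requirement}: {shlex.join(cmd)}")
```

Running the printed commands on a host with DCGM installed is one way to confirm which options each subcommand actually supports before relying on them.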
Question 2
A system administrator notices that jobs are failing intermittently on Base Command Manager due to incorrect GPU configurations in Slurm. The administrator needs to ensure that jobs utilize GPUs correctly. How should they troubleshoot this issue?
Correct Answer: B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Misconfiguration related to Multi-Instance GPU (MIG) mode can cause Slurm to allocate GPUs improperly, leading to job failures. The administrator should verify whether MIG has been enabled on the GPUs and ensure that Slurm’s GPU resource (GRES) configuration matches the hardware setup. If MIG is enabled, Slurm must be configured to recognize and schedule MIG partitions correctly to avoid resource conflicts.
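The first check described above, whether MIG is enabled on each GPU, can be sketched as follows. This is a minimal example that assumes `nvidia-smi` is on the PATH of the compute node and that the `mig.mode.current` query field is available with the installed driver. The helper names `parse_mig_modes` and `gpus_with_mig` are hypothetical, and comparing the result against the Slurm configuration is left to the administrator.

```python
import subprocess

# Query each GPU's index, name, and current MIG mode as headerless CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,mig.mode.current",
    "--format=csv,noheader",
]


def parse_mig_modes(csv_text):
    """Map GPU index -> True when the MIG mode column reads 'Enabled'."""
    modes = {}
    for line in csv_text.strip().splitlines():
        parts = [field.strip() for field in line.split(",")]
        # The index is the first field and the MIG mode is the last field.
        modes[int(parts[0])] = parts[-1] == "Enabled"
    return modes


def gpus_with_mig(csv_text):
    """Return the sorted indices of GPUs that report MIG as enabled."""
    return sorted(i for i, enabled in parse_mig_modes(csv_text).items() if enabled)


if __name__ == "__main__":
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    enabled = gpus_with_mig(output)
    if enabled:
        print(f"MIG is enabled on GPUs {enabled}; check that Slurm schedules MIG instances.")
    else:
        print("MIG is not enabled on any GPU; Slurm should be configured for whole GPUs.")
```

If the list of MIG-enabled GPUs differs from what the Slurm GPU definitions for that node expect, that mismatch is a likely source of the intermittent failures.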
Question 3
You are tasked with deploying a DOCA service on an NVIDIA BlueField DPU in an air-gapped data center environment. The DPU has the required BlueField OS version (3.9.0 or higher) installed, and you have access to the necessary container image from NVIDIA's NGC catalog. However, you need to ensure that the deployment process is successful without an internet connection. Which of the following steps should you take to deploy the DOCA service on the DPU?
Correct Answer: C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
In an air-gapped environment where the DPU has no internet connectivity, direct pulling of container images from NVIDIA’s NGC catalog is not possible. The recommended approach is to manually download the required container image and YAML deployment files from a connected system, then transfer these files to the DPU. Deployment is then performed using Kubernetes with a standalone Kubelet on the DPU, which can deploy the preloaded container image offline. This ensures the deployment proceeds successfully without internet access.
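The offline workflow described above can be sketched as two command lists: one for the internet-connected system and one for the DPU. This is an illustration under several assumptions: Docker is used on the connected system to pull and save the image, containerd's `ctr` client is available on the DPU to import it into the `k8s.io` namespace, and the standalone Kubelet picks up YAML files from `/etc/kubelet.d`. The image name, tarball name, and YAML file name are hypothetical placeholders. The function only builds the commands and does not execute them.

```python
import shlex


def offline_deploy_steps(image, tarball, yaml_file, kubelet_dir="/etc/kubelet.d"):
    """Build the command lists for an offline container deployment.

    kubelet_dir is an assumed default for the directory the standalone
    Kubelet watches; adjust it to match the DPU's actual configuration.
    """
    connected_host = [
        ["docker", "pull", image],                 # fetch the image while online
        ["docker", "save", "-o", tarball, image],  # export it to a tar archive
    ]
    # Transfer the tarball and YAML file to the DPU by any approved offline
    # method (for example scp over the management network or removable media).
    dpu = [
        ["ctr", "--namespace", "k8s.io", "images", "import", tarball],
        ["cp", yaml_file, kubelet_dir + "/"],      # Kubelet then starts the pod
    ]
    return {"connected_host": connected_host, "dpu": dpu}


if __name__ == "__main__":
    # Hypothetical names used purely for illustration.
    steps = offline_deploy_steps(
        image="nvcr.io/nvidia/doca/example_service:1.0.0",
        tarball="example_service.tar",
        yaml_file="example_service.yaml",
    )
    for location, commands in steps.items():
        print(f"# on {location}")
        for cmd in commands:
            print(shlex.join(cmd))
```

Keeping the two lists separate mirrors the constraint in the question: everything that needs network access happens on the connected system, and the DPU only works with files that were copied onto it.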