NVIDIA AI Infrastructure
Last Update: May 30, 2026
Total Questions: 123
During cluster validation, the Cable Validation Tool (CVT) reports "Underperforming (BER)" for an InfiniBand link. Which BER thresholds indicate a critical signal quality issue requiring cable replacement?
Which statement best explains why maintaining high cable signal quality is essential in modern high-speed data centers?
An administrator installs NVIDIA GPU drivers on a DGX H100 system with UEFI Secure Boot enabled. After reboot, the drivers fail to load. What is the first action to resolve this issue?
ClusterKit’s NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?
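When reading a figure like this, it helps to keep bits and bytes apart: a 400 Gb/s link carries at most 50 GB/s per port in each direction. The sketch below does that conversion and expresses a measured value as a fraction of an aggregate line rate. The port count of 8 is an assumption made only for illustration (the question does not say how many ports feed the measurement), and the helper names are hypothetical.

```python
def line_rate_gbytes(link_gbits: float) -> float:
    """Convert a link speed in Gb/s to GB/s (8 bits per byte)."""
    return link_gbits / 8.0


def fraction_of_line_rate(measured_gbytes: float, link_gbits: float, ports: int) -> float:
    """Measured throughput as a fraction of the aggregate rate of `ports` links."""
    aggregate = line_rate_gbytes(link_gbits) * ports
    return measured_gbytes / aggregate


per_port = line_rate_gbytes(400)             # one 400G port, in GB/s
share = fraction_of_line_rate(350, 400, 8)   # assumes 8 ports (illustrative)
print(per_port, share)
```

Under that 8-port assumption the measured value sits below the theoretical aggregate, which is expected once protocol overhead is taken into account. A different port count changes the ratio accordingly.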
You are evaluating the integration of NVIDIA BlueField DPUs into your data center's storage architecture to optimize AI workloads. The storage solution chosen has incorporated BlueField DPUs to enhance performance and efficiency. Which of the following benefits directly results from this integration?
Refer to the output:
~ $ sudo nvsm show healthinfo
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks
BIOS Revision [5.11].........................
DGX Serial Number [YSY72800016]..................
Verify installed DIMM memory sticks........................Healthy
...[output truncated]
Verify Ethernet controllers...........................Healthy
Verify installed GPU's..............................Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address '07:00.0'
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
...[output truncated]
What insights can a system administrator gain regarding the DGX system's health?
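A report in this shape can be scanned mechanically for any check that is not marked Healthy. The following is a minimal sketch that assumes each check line ends in a run of dots followed by a one-word status, as in the excerpt above. The function name and the regular expression are illustrative and not part of NVSM.

```python
import re

# A few lines copied from the report excerpt above.
SAMPLE = """\
Verify installed DIMM memory sticks........................Healthy
Verify Ethernet controllers...........................Healthy
Verify installed GPU's..............................Unhealthy
Missing GPU at PCI address '07:00.0'
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
"""

# Check name, then three or more dots, then a single status word.
CHECK_LINE = re.compile(r"^(?P<name>.+?)\.{3,}(?P<status>[A-Za-z]+)$")


def failed_checks(report: str) -> list[str]:
    """Return the names of checks whose status is anything other than Healthy."""
    failed = []
    for line in report.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match and match.group("status") != "Healthy":
            failed.append(match.group("name"))
    return failed


print(failed_checks(SAMPLE))
```

Detail lines such as the missing PCI address do not match the pattern, so they would need to be collected separately if the failure reason is also wanted.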
A customer is designing an AI Factory for enterprise-scale deployments and wants to ensure redundancy and load balancing for the management and storage networks. Which feature should be implemented on the Ethernet switches?
Two ports must be connected, but one is SFP and the other is QSFP: for example, a 25 GbE Host Channel Adapter must connect to a QSFP port capable of both 100 GbE and 25 GbE. Which solution would best meet this requirement?
After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?
An administrator needs to add additional GPUs to an existing server. What are the server requirements to check before installing new GPUs?
A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?
You are standing up an NVIDIA DGX system for enterprise production. Stakeholder teams require system reliability, performance consistency under load, and proper escalation processes before release. A recent system in another cluster experienced intermittent GPU failures attributed to missed early-stage validation. Which deployment and validation sequence best addresses production readiness and mitigates the risk of avoidable downtime or performance loss?
A network engineer is tasked with configuring the management, storage, and compute networks for a new DGX BasePOD deployment. Which statement best describes the network segmentation required for optimal operation?
An enterprise IT team has completed the physical installation of an AI Factory with a Spectrum-X Ethernet network connected to all GPU servers. They now need to ensure the environment is ready for scalable AI workload deployment. What is the recommended sequence of validation steps?
A user wants to restrict a Docker container to use only GPUs 0 and 2. Which command achieves this?
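For reference, Docker's `--gpus` flag accepts a device list, and the list must reach Docker wrapped in double quotes so that the comma is not read as an option separator. The sketch below only builds the argument vector and prints its shell form without running anything. The image name `ubuntu` and the helper name are placeholders chosen for illustration.

```python
import shlex


def docker_gpu_argv(gpu_ids, image, command):
    """Build a `docker run` argv that exposes only the listed GPU indices."""
    # The inner double quotes are part of the value passed to Docker.
    device_spec = '"device=' + ",".join(str(i) for i in gpu_ids) + '"'
    return ["docker", "run", "--rm", "--gpus", device_spec, image, *command]


argv = docker_gpu_argv([0, 2], "ubuntu", ["nvidia-smi"])
shell_form = shlex.join(argv)  # how the same command would be typed in a shell
print(shell_form)
```

Passing `argv` to `subprocess.run` would launch the container, and running `nvidia-smi` inside it is a quick way to confirm that only the two selected GPUs are visible.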
After updating BlueField-3 DPU BMC firmware via Redfish, the engineer observes “TaskState: Running” but no progress after 15 minutes. How should they track the update’s completion status?
A user needs to configure NGC CLI to access resources across multiple organizations. What is the recommended command syntax to achieve this?
A system administrator has upgraded the firmware of the DPU. What will be the state of the firmware after the upgrade?
As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware according to NVIDIA’s documentation and recommended operational steps?
An engineer needs to verify NVLink isolation on a single node with 8 GPUs. Which NCCL test configuration stresses switch bisection bandwidth?
A company has a registered NGC account and their server has NGC CLI installed. What step should be taken first to gain access to NGC?
An infrastructure engineer is preparing a new AI cluster for production use, relying on NVIDIA switches and high-speed optical transceivers for node connectivity. The team is finalizing network validation before launching large-scale training jobs. Why is it critical to confirm and align the firmware version on all switch transceivers prior to production?
A system administrator wants to configure MIG for seven slices on an H100 GPU in an NVIDIA HGX system. Which command should be used?
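As background, MIG partitioning with `nvidia-smi` is typically a two-step process: enable MIG mode on the GPU, then create GPU instances (with `-C` also creating the matching compute instances). The sketch below only assembles those two command lines. Profile ID `19` is assumed here to be the smallest single-slice profile, so it should be confirmed with `nvidia-smi mig -lgip` on the target system before use. The function name is hypothetical.

```python
def mig_seven_slice_commands(gpu_index: int, profile: str = "19"):
    """Return the two nvidia-smi argvs for enabling MIG and creating 7 instances.

    `profile` is an assumption: verify the ID with `nvidia-smi mig -lgip`.
    """
    enable = ["nvidia-smi", "-i", str(gpu_index), "-mig", "1"]
    create = [
        "nvidia-smi", "mig", "-i", str(gpu_index),
        "-cgi", ",".join([profile] * 7),  # seven identical GPU instances
        "-C",                             # also create compute instances
    ]
    return [enable, create]


for argv in mig_seven_slice_commands(0):
    print(" ".join(argv))
```

Enabling MIG mode may require a GPU reset before instances can be created, so the two commands are kept separate rather than chained.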
An engineer is tasked with configuring Out-of-Band management for a DGX BasePOD deployment. Which network design will best ensure secure and reliable Out-of-Band management operations?
A system administrator receives an alert about a potential hardware fault on an NVIDIA DGX A100. The GPU performance seems degraded, and the system fans are operating loudly. What step should be recommended to identify and troubleshoot the hardware fault?
When verifying network cable signal integrity during cluster deployment, which measurement result most strongly indicates a cable signal problem?
You are validating the environment of an NVIDIA GPU-accelerated data center during post-deployment checks. Which one action is essential to confirm that power and cooling are sufficient for the stable operation of NVIDIA DGX H100 systems?
One of the nodes in a cluster is not running as fast as the others and the system administrator needs to check the status of the GPUs on that system. What command should be used?
During East-West fabric validation on a 64-GPU cluster, an engineer runs all_reduce_perf and observes an algorithm bandwidth of 350 GB/s and bus bandwidth of 656 GB/s. What does this indicate about the fabric performance?
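The two figures reported by `all_reduce_perf` are related by a fixed factor. The nccl-tests documentation defines bus bandwidth for all-reduce as algorithm bandwidth multiplied by 2(n-1)/n, where n is the number of ranks. A small sketch of that relationship follows. The function name is illustrative.

```python
def allreduce_bus_bandwidth(alg_bw: float, ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw * 2 * (n - 1) / n, per nccl-tests."""
    if ranks < 2:
        raise ValueError("all-reduce needs at least two ranks")
    return alg_bw * 2 * (ranks - 1) / ranks


bus_64 = allreduce_bus_bandwidth(350, 64)  # factor 2 * 63 / 64
bus_16 = allreduce_bus_bandwidth(350, 16)  # factor 2 * 15 / 16
print(bus_64, bus_16)
```

Because the factor depends on the rank count, a reported algorithm/bus bandwidth pair is best checked against the number of ranks the test actually ran with. The bus bandwidth is the figure intended for comparison with hardware link speed.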
You are a network administrator responsible for configuring an East-West (E/W) Spectrum-X fabric using SuperNIC. The BlueField-3 devices in your network should be set to NIC mode with RoCE enabled to optimize data flow between servers. You have access to the Spectrum-X management tools and the necessary documentation, and you need to use specific configuration commands to achieve this setup. Which of the following steps and commands are necessary to configure the BlueField-3 devices in NIC mode for the E/W Spectrum-X fabric using SuperNIC? (Pick the 2 correct responses below.)
An administrator needs to perform a comprehensive pre-production stress test on a DGX H100 system. Which command validates GPU, CPU, memory, and storage components while following NVIDIA’s recommended procedure?
An InfiniBand server stops working, and a system administrator runs the "ibstat" command, which provides the following output:
CA 'mlx5_1'
CA type: MT4115
Number of ports: 2
Firmware version: 10.20.1010
Hardware version: 0
Node GUID: 0x0002c90300002f78
System image GUID: 0x0002c90300002f7b
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x0251086a
Port GUID: 0x0002c90300002f79
Link layer: InfiniBand
What is the cause of the issue?
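When reading `ibstat` output, the fields worth comparing are the logical state, the physical state, and the LID values. A port that is physically up but still Initializing, with a base LID and SM LID of 0, is the pattern usually seen when no subnet manager has configured the port yet. The sketch below parses a shortened copy of the output above and tests for that pattern. Both function names are hypothetical.

```python
SAMPLE = """\
CA 'mlx5_1'
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        SM lid: 0
        Link layer: InfiniBand
"""


def port_fields(text: str) -> dict:
    """Collect 'key: value' pairs, skipping headers that have no value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def waiting_for_subnet_manager(fields: dict) -> bool:
    """True when the link is physically up but no SM has assigned LIDs."""
    return (
        fields.get("State") == "Initializing"
        and fields.get("Physical state", "").lower() == "linkup"
        and fields.get("SM lid") == "0"
    )


print(waiting_for_subnet_manager(port_fields(SAMPLE)))
```

The sketch handles a single port. Output covering several ports or adapters would need the fields grouped per port before applying the check.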
An enterprise is deploying an AI Factory using NVIDIA DGX BasePOD architecture. The infrastructure team must ensure high availability and efficient data transfer between compute nodes. Which network topology should they implement for the InfiniBand fabric?
A system administrator noticed a failure on a DGX H100 server. After a reboot, only the BMC is available. What could be the reason for this behavior?
You are tasked with setting up High Availability (HA) for NVIDIA Base Command Manager (BCM) in a new GPU cluster. The cluster consists of a primary head node, a secondary head node, and several compute nodes. The requirements are automatic failover of BCM services, minimal disruption to workloads, and proper cluster health monitoring during and after installation. During your BCM HA installation and configuration process, which two of the following actions are mandatory for ensuring a robust and verified HA cluster configuration?
Pick the 2 correct responses below.