ExamsBrite Dumps

NVIDIA AI Infrastructure Question and Answers

Last Update May 30, 2026
Total Questions : 123

Free NCP-AII NVIDIA exam questions are available after sign-up. Prepare with the free NCP-AII questions first, then move on to the complete pool of NVIDIA AI Infrastructure test questions.

Questions 1

During cluster validation, the Cable Validation Tool (CVT) reports "Underperforming (BER)" for an InfiniBand link. Which BER thresholds indicate a critical signal quality issue requiring cable replacement?

Options:

A.  

Rx power variance > 3dB between lanes

B.  

Effective BER > 0 during the first 125 minutes of link operation

C.  

Raw BER > 1e-12 or Effective BER > 1.5E-254 for < 6hr measurements

D.  

Temperature > 85°C on transceiver module

Questions 2

Which statement best explains why maintaining high cable signal quality is essential in modern high-speed data centers?

Options:

A.  

High cable signal quality ensures that cable length and connector type do not play as big a role in deploying new infrastructure in the data center.

B.  

High cable signal quality minimizes bit error rates and supports reliable, high-throughput communication, reducing retransmissions and congestion across the network.

C.  

High cable signal quality reduces electromagnetic interference (EMI) and crosstalk, helping prevent unexpected packet drops during sustained workloads.

D.  

High cable signal quality enables effective use of Forward Error Correction (FEC), which is required for reliable operation at high data rates such as 200GbE and above.

Questions 3

An administrator installs NVIDIA GPU drivers on a DGX H100 system with UEFI Secure Boot enabled. After reboot, the drivers fail to load. What is the first action to resolve this issue?

Options:

A.  

Disable Secure Boot permanently in BIOS/UEFI settings.

B.  

Delete /etc/X11/xorg.conf to force driver reconfiguration.

C.  

Enroll the Machine Owner Key (MOK) during system reboot and enter the recorded password.

D.  

Reinstall drivers using apt-get install nvidia-driver-550 without rebooting.

Questions 4

ClusterKit’s NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

Options:

A.  

Critical failure; expected is greater than 390 GB/s for HDR InfiniBand.

B.  

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

C.  

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

D.  

Inconclusive; rerun with --stress=cpu to validate.

Questions 5

ClusterKit's NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

Options:

A.  

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

B.  

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

C.  

Critical failure; expected is > 390 GB/s for HDR InfiniBand.

D.  

Inconclusive; rerun with --stress=cpu to validate.

Questions 6

You are evaluating the integration of NVIDIA BlueField DPUs into your data center's storage architecture to optimize AI workloads. The storage solution chosen has incorporated BlueField DPUs to enhance performance and efficiency. Which of the following benefits directly results from this integration?

Options:

A.  

Unlimited scalability by adding more DPUs without architectural changes.

B.  

Elimination of latency issues in data processing tasks.

C.  

Reduced CPU load by offloading data processing tasks to DPUs.

D.  

Enhanced I/O performance with NVMe storage access speeds.

Questions 7

Refer to the output:

~ $ sudo nvsm show healthinfo
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks
BIOS Revision [5.11].........................
DGX Serial Number [YSY72800016]..................
Verify installed DIMM memory sticks........................Healthy
...[output truncated]...
Verify Ethernet controllers...........................Healthy
Verify installed GPU's..............................Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address '07:00.0'
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
...[output truncated]...

What insights can a system administrator gain regarding the DGX system's health?

Options:

A.  

A GPU tray upgrade failed.

B.  

A GPU is missing on the DGX system.

C.  

A GPU driver upgrade has failed.

D.  

The system has passed the hardware health check successfully.
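
When a health report like the one above runs to many lines, the failing checks can be pulled out with a short script. The following is a minimal sketch in Python: the line pattern (check name, a run of dots, then a status word) is an assumption inferred only from the sample output in this question, not a documented NVSM format, and `unhealthy_checks` is a hypothetical helper name.

```python
import re

# Sample lines adapted from the output shown in the question above.
SAMPLE = """\
Verify installed DIMM memory sticks........................Healthy
Verify Ethernet controllers...........................Healthy
Verify installed GPU's..............................Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address '07:00.0'
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
"""

# Assumed layout: "<check name><three or more dots><Healthy|Unhealthy>".
CHECK_LINE = re.compile(r"^(?P<name>.+?)\.{3,}(?P<status>Healthy|Unhealthy)\s*$")


def unhealthy_checks(report: str) -> dict[str, list[str]]:
    """Map each Unhealthy check name to the detail lines that follow it."""
    findings: dict[str, list[str]] = {}
    current = None  # unhealthy check currently collecting detail lines
    for raw in report.splitlines():
        line = raw.strip()
        match = CHECK_LINE.match(line)
        if match:
            # A new check line starts; only track it if it is Unhealthy.
            current = match["name"] if match["status"] == "Unhealthy" else None
            if current is not None:
                findings[current] = []
        elif current is not None and line:
            # Non-check lines after an Unhealthy check are treated as details.
            findings[current].append(line)
    return findings


for name, details in unhealthy_checks(SAMPLE).items():
    print(name, "->", details)
```

On the sample, only the GPU check is reported, together with the two detail lines that name the missing PCI address.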

Questions 8

A customer is designing an AI Factory for enterprise-scale deployments and wants to ensure redundancy and load balancing for the management and storage networks. Which feature should be implemented on the Ethernet switches?

Options:

A.  

Implement redundant switches with spanning tree protocol.

B.  

MLAG for bonded interfaces across redundant switches.

C.  

Use only one switch for all management and storage traffic.

D.  

Disable VLANs and use unmanaged switches.

Questions 9

If two ports must be connected, but one is SFP and one is QSFP, for example, to connect a 25 GbE Host Channel Adapter to a QSFP port capable of both 100 GbE and 25 GbE, which solution would best meet this requirement?

Options:

A.  

QSA adapter.

B.  

SFP connectors.

C.  

SFP-to-1G BASE-T RJ45 adapter.

D.  

Standard QSFP-to-QSFP DAC cable.

Questions 10

After NCCL burn-in reports "transport retry count exceeded," which corrective action addresses the underlying fabric issue?

Options:

A.  

Switch from Ring to Tree algorithms via NCCL_ALGO=TREE

B.  

Reduce message size to decrease network utilization

C.  

Increase NCCL_IB_TIMEOUT to tolerate longer latencies

D.  

Inspect InfiniBand link quality metrics (BER, symbol errors) and replace faulty cables

Questions 11

An administrator needs to add additional GPUs to an existing server. What are the server requirements to check before installing new GPUs?

Options:

A.  

Sufficient networking, water-cooled racks, adequate rack power, sufficient storage, and rack space.

B.  

Sufficient storage, sufficient networking, adequate rack power, and compatible hardware.

C.  

Sufficient CPU capacity, PCIe slot allocation, sufficient cooling in the data center, and rack space.

D.  

Sufficient cooling in the data center, adequate rack power, compatible hardware, and PCIe slot allocation.

Questions 12

A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?

Options:

A.  

Navigate to "Devices" > select a switch > "Cables" tab to see ASIC firmware and transceiver versions.

B.  

Use "Topology" view to visually inspect cable icons.

C.  

Run mlxlink -d lid-<LID> -m on each port manually.

D.  

Export all switch logs and grep for "FW Version".

Questions 13

You are standing up an NVIDIA DGX system for enterprise production. Stakeholder teams require system reliability, performance consistency under load, and proper escalation processes before release. A recent system in another cluster experienced intermittent GPU failures attributed to missed early-stage validation. Which deployment and validation sequence best addresses production readiness and mitigates the risk of avoidable downtime or performance loss?

Options:

A.  

Install latest OS images and drivers, confirm OS and container functionality, invite users for a monitored production trial, and collect workload feedback to plan any further diagnostics or updates.

B.  

Complete hardware and cabling, power on the system, update firmware and drivers, run full hardware health checks and stress diagnostics using NVSM, verify all GPU and system sensor logs, and validate GPU accessibility.

C.  

Update network topology, assign static IPs and DNS entries, register the system with NVIDIA, then conduct basic OS-level checks and enable user access after login testing is successful.

D.  

Power on the system, install all AI frameworks, configure the CUDA and library stack, set up user environments, then plan stress tests and diagnostics as part of ongoing routine operations.

Questions 14

A network engineer is tasked with configuring the management, storage, and compute networks for a new DGX BasePOD deployment. Which statement best describes the network segmentation required for optimal operation?

Options:

A.  

A single VLAN for all types of network traffic.

B.  

Two networks: one for management and one for compute.

C.  

Four networks: compute, storage, out-of-band, and management.

Questions 15

An enterprise IT team has completed the physical installation of an AI Factory with a Spectrum-X Ethernet network connected to all GPU servers. They now need to ensure the environment is ready for scalable AI workload deployment. What is the recommended sequence of validation steps?

Options:

A.  

Set up Active Directory and LDAP, configure role-based access controls and security settings first, install users, and skip network or hardware performance validation.

B.  

Perform application benchmarking first, use performance logs to identify bottlenecks, update switch and server firmware afterward, and then tune the network using performance tests.

C.  

Validate the software stack, test link connectivity and port health, run network benchmarks, run OSPF, ensure neighbors are exchanging route information, then stage AI workload tests.

D.  

Confirm switch and server firmware configuration, test link connectivity and port health, run network benchmarks, validate the software stack, then stage AI workload tests.

Questions 16

A user wants to restrict a Docker container to use only GPUs 0 and 2. Which command achieves this?

Options:

A.  

docker run --gpus '"device=0,2"' nvidia/cuda:12.1-base nvidia-smi

B.  

docker run -e NVIDIA_VISIBLE_DEVICES=0,2 nvidia/cuda:12.1-base nvidia-smi

C.  

docker run --gpus all nvidia/cuda:12.1-base nvidia-smi -id=0,2

D.  

docker run --device /dev/nvidia0,/dev/nvidia2 nvidia/cuda:12.1-base nvidia-smi
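
The nested quoting in the `--gpus` device form is easy to get wrong when a command is assembled by a script rather than typed. The sketch below, in Python, builds the argument list and prints the shell-quoted equivalent. The helper name `docker_gpu_command` is hypothetical, and the sketch only constructs the command; it does not run Docker or check that the listed GPUs exist.

```python
import shlex


def docker_gpu_command(gpu_ids, image, *container_args):
    """Build an argv list for `docker run` limited to the given GPU indices."""
    # The device list is wrapped in literal double quotes so that the comma
    # stays inside a single field of the --gpus value.
    device_spec = '"device=' + ",".join(str(i) for i in gpu_ids) + '"'
    return ["docker", "run", "--gpus", device_spec, image, *container_args]


argv = docker_gpu_command([0, 2], "nvidia/cuda:12.1-base", "nvidia-smi")

# shlex.join adds the outer single quotes a shell would need.
print(shlex.join(argv))
```

Passing `argv` as a list to `subprocess.run` would avoid the outer shell quoting altogether, since no shell parses the arguments in that case.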

Questions 17

After updating BlueField-3 DPU BMC firmware via Redfish, the engineer observes “TaskState: Running” but no progress after 15 minutes. How should they track the update’s completion status?

Options:

A.  

Check /var/log/messages on the DPU operating system for update logs.

B.  

Query the DPU BMC with the Task ID of the installation process.

C.  

Power cycle the DPU immediately to force a rollback.

D.  

Run bfrec --status on the DPU to view flash progress.

Questions 18

A user needs to configure NGC CLI to access resources across multiple organizations. What is the recommended command syntax to achieve this?

Options:

A.  

export NGC_CLI_ORG=org-name && ngc config set

B.  

ngc config list to manually edit the JSON configuration file.

C.  

ngc registry login --org org-name

D.  

ngc config set --org org-name --ace ace-name

Questions 19

A system administrator has upgraded the firmware of the DPU. What will be the state of the firmware after the upgrade?

Options:

A.  

The firmware is installed on the DPU.

B.  

The firmware is deleted from the DPU.

C.  

The firmware is copied to the DPU but not installed.

D.  

The firmware is waiting on reboot to become active.

Questions 20

As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware according to NVIDIA’s documentation and recommended operational steps?

Options:

A.  

Perform a software-driven restart on the operating system of every compute node, then use advanced tools to check firmware status and reissue update commands if any firmware appears inactive afterward.

B.  

Initiate the required cold reset or power cycle to activate updated firmware, reset the BMC using the recommended command, and perform an AC power cycle when required for EROT and CPLD firmware activation.

C.  

Initiate a cold power cycle on all node trays to activate firmware, follow with a DGX reboot procedure, and use the management interface to finish activating CPLD firmware on the host.

D.  

Execute a single operating system reboot on the DGX after the update process, then reset the software stack and verify status using diagnostic commands on each node.

Questions 21

An engineer needs to verify NVLink isolation on a single node with 8 GPUs. Which NCCL test configuration stresses switch bisection bandwidth?

Options:

A.  

Use NCCL_TESTS_SPLIT="DIV 8" with point-to-point tests

B.  

Use all_reduce_perf -b 8 -e 16G -f 2 -g 8 with NCCL_TESTS_SPLIT="AND 0x1"

C.  

Use reduce_scatter_perf -b 8 -e 16G -f 2 -g 4

D.  

Use all_reduce_perf -b 8 -e 16G -f 2 -g 8 without splits
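
In the nccl-tests binaries named in these options, `-b` and `-e` set the smallest and largest message size, `-f` is the multiplication factor between steps, and `-g` is the number of GPUs per thread. The Python sketch below enumerates the message sizes that `-b 8 -e 16G -f 2` would sweep. It assumes the `G` suffix means 1024^3 bytes; `parse_size` and `sweep_sizes` are hypothetical helper names, not part of nccl-tests.

```python
def parse_size(text: str) -> int:
    """Convert a size such as '8' or '16G' to bytes (binary multiples assumed)."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def sweep_sizes(begin: str, end: str, factor: int) -> list[int]:
    """List every message size from begin to end, multiplying by factor."""
    if factor < 2:
        raise ValueError("factor must be at least 2 for a multiplicative sweep")
    sizes = []
    size = parse_size(begin)
    limit = parse_size(end)
    while size <= limit:
        sizes.append(size)
        size *= factor
    return sizes


sizes = sweep_sizes("8", "16G", 2)
print(len(sizes), "sizes, from", sizes[0], "to", sizes[-1], "bytes")
```

Under these assumptions the sweep covers 32 sizes, doubling from 8 bytes up to 16 GiB, so a single run exercises both latency-bound and bandwidth-bound message sizes.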

Questions 22

A company has a registered NGC account and their server has NGC CLI installed. What step should be taken first to gain access to NGC?

Options:

A.  

ngc config get

B.  

ngc init

C.  

ngc config set

D.  

ngc config update

Questions 23

An infrastructure engineer is preparing a new AI cluster for production use, relying on NVIDIA switches and high-speed optical transceivers for node connectivity. The team is finalizing network validation before launching large-scale training jobs. Why is it critical to confirm and align the firmware version on all switch transceivers prior to production?

Options:

A.  

To guarantee that hardware inventory tools can report serial numbers and manufacturer codes for asset management, which is critical for future support and troubleshooting.

B.  

To ensure stability, bandwidth, and compatibility across the cluster, avoiding link issues and performance loss.

C.  

To allow the network operating system to automatically discover all connected transceivers with heterogeneous firmware.

D.  

To reduce GPU memory consumption during distributed training jobs.

Questions 24

A system administrator wants to configure MIG for seven slices on an H100 GPU in an NVIDIA HGX system. Which command should be used?

Options:

A.  

mig-parted

B.  

nvidia-smi

C.  

nvcc

D.  

nvlink-config

Questions 25

An engineer is tasked with configuring Out-of-Band management for a DGX BasePOD deployment. Which network design will best ensure secure and reliable Out-of-Band management operations?

Options:

A.  

Use a single VLAN for both Out-of-Band management and compute fabric to simplify network design.

B.  

Configure Out-of-Band management interfaces to be accessible from any subnet within the data center for maximum flexibility.

C.  

Connect Out-of-Band management ports to the same switch as user traffic for easier troubleshooting.

D.  

Place all BMC and management interfaces on an isolated Out-of-Band network with access restricted by firewall rules.

Questions 26

A system administrator receives an alert about a potential hardware fault on an NVIDIA DGX A100. The GPU performance seems degraded, and the system fans are operating loudly. What step should be recommended to identify and troubleshoot the hardware fault?

Options:

A.  

Run a deep learning workload to stress test the GPUs and check whether the issue persists.

B.  

Check the NVIDIA System Management Interface (nvidia-smi) for GPU status and temperatures.

C.  

Power drain then restart the DGX and check if the performance degradation resolves.

D.  

Increase the fan speed to maximum and check whether the performance improves.

Questions 27

When verifying network cable signal integrity during cluster deployment, which measurement result most strongly indicates a cable signal problem?

Options:

A.  

Repeated CRC errors and intermittent port flapping reported by switch counters.

B.  

Output of ifconfig showing link speed at the expected rate on both ends of the cable.

C.  

Network pings between all cluster nodes return responses with delays under 2 ms on a 100Gb network.

Questions 28

You are validating the environment of an NVIDIA GPU-accelerated data center during post-deployment checks. Which one action is essential to confirm that power and cooling are sufficient for the stable operation of NVIDIA DGX H100 systems?

Options:

A.  

Confirm the system fans are running at 100% under all workloads to prevent overheating.

B.  

Review the system BIOS to ensure GPU overclocking is enabled for maximum performance.

C.  

Use NVSM to disable unused PCIe devices to reduce overall system heat output.

D.  

Verify that each DGX system is connected to redundant, properly rated PDUs and that all power supplies are reporting nominal input.

Questions 29

One of the nodes in a cluster is not running as fast as the others and the system administrator needs to check the status of the GPUs on that system. What command should be used?

Options:

A.  

lspci | grep NVIDIA

B.  

nvidia-smi

C.  

nvidia-gpu-status

D.  

iblinkinfo

Questions 30

During East-West fabric validation on a 64-GPU cluster, an engineer runs all_reduce_perf and observes an algorithm bandwidth of 350 GB/s and bus bandwidth of 656 GB/s. What does this indicate about the fabric performance?

Options:

A.  

Inconclusive; rerun with point-to-point tests.

B.  

Optimal performance; bus bandwidth near theoretical peak for NDR InfiniBand.

C.  

Critical failure; bus bandwidth exceeds hardware capabilities.

D.  

Suboptimal performance; algorithm bandwidth should match bus bandwidth.
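
The nccl-tests documentation defines bus bandwidth for all-reduce as algorithm bandwidth multiplied by 2(n-1)/n, where n is the number of ranks, so the two figures are expected to differ. The Python sketch below applies that factor. The function name is hypothetical, and the rank counts are illustrative inputs chosen here, not values taken from a real run.

```python
def allreduce_bus_bandwidth(alg_bw_gbs: float, ranks: int) -> float:
    """Scale all-reduce algorithm bandwidth by 2(n-1)/n to get bus bandwidth."""
    if ranks < 2:
        raise ValueError("all-reduce needs at least two ranks")
    return alg_bw_gbs * 2 * (ranks - 1) / ranks


# Same algorithm bandwidth, two illustrative rank counts.
for ranks in (16, 64):
    bus_bw = allreduce_bus_bandwidth(350.0, ranks)
    print(f"ranks={ranks}: bus bandwidth = {bus_bw} GB/s")
```

The factor approaches 2 as the rank count grows, which is why bus bandwidth can be nearly double the algorithm bandwidth without anything being wrong with the fabric.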

Questions 31

You are a network administrator responsible for configuring an East-West (E/W) Spectrum-X fabric using SuperNIC. The BlueField-3 devices in your network should be set to NIC mode with RoCE enabled to optimize data flow between servers. You have access to the Spectrum-X management tools and the necessary documentation. You need to use specific configuration commands to achieve this setup. Which of the following steps and commands are necessary to configure the BlueField-3 devices in NIC mode for the E/W Spectrum-X fabric using SuperNIC? (Pick the 2 correct responses below.)

Options:

A.  

Use the command sudo mlxconfig -d /dev/mst/<device> set LINK_TYPE_P1=2 to enable Ethernet on the BlueField-3 devices.

B.  

Use the command sudo mlxconfig -d /dev/mst/<device> set DISABLE_SPECTRUM_X=1 to reduce overhead.

C.  

Use the command sudo mlxconfig -d /dev/mst/<device> set INTERNAL_CPU_OFFLOAD_ENGINE=1 to configure the SuperNIC to operate in NIC mode.

D.  

Use the command sudo mlxconfig -d /dev/mst/<device> set DPU_MODE=1 to set up the BlueField-3 devices in DPU mode.

Questions 32

An administrator needs to perform a comprehensive pre-production stress test on a DGX H100 system. Which command validates GPU, CPU, memory, and storage components while following NVIDIA’s recommended procedure?

Options:

A.  

nvidia-smi -q | grep "GPU Stress Test"

B.  

sudo nvsm stress-test --force

C.  

stress --cpu $(nproc) --io $(nproc) --timeout 600

D.  

./gpu_burn 60

Questions 33

An InfiniBand server stops working, and a system administrator runs the "ibstat" command that provides the following output:

CA 'mlx5_1'
    CA type: MT4115
    Number of ports: 2
    Firmware version: 10.20.1010
    Hardware version: 0
    Node GUID: 0x0002c90300002f78
    System image GUID: 0x0002c90300002f7b
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x0251086a
        Port GUID: 0x0002c90300002f79
        Link layer: InfiniBand

What is the cause of the issue?

Options:

A.  

The HCA port is faulty.

B.  

There is no running SM in the fabric.

C.  

The neighboring switch port is faulty.

D.  

The cable is disconnected.
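
A port whose physical state is up while its logical state stays at Initializing, with a zero SM LID, is usually read as a port that has a working link but has not yet been configured by a subnet manager. The Python sketch below checks for that combination in ibstat-style text. It is a simplification that assumes a single port per adapter, and `port_fields` and `waiting_for_subnet_manager` are hypothetical helper names.

```python
# Sample fields adapted from the ibstat output shown in the question above.
SAMPLE = """\
CA 'mlx5_1'
    Number of ports: 2
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        SM lid: 0
        Link layer: InfiniBand
"""


def port_fields(ibstat_text: str) -> dict[str, str]:
    """Collect 'key: value' lines into a dict (one port assumed)."""
    fields = {}
    for line in ibstat_text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Skip headers such as "Port 1:" and lines without a colon.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def waiting_for_subnet_manager(fields: dict[str, str]) -> bool:
    """True when the link is physically up but no SM has configured the port."""
    return (
        fields.get("State") == "Initializing"
        and fields.get("Physical state", "").lower() == "linkup"
        and fields.get("SM lid") == "0"
    )


print(waiting_for_subnet_manager(port_fields(SAMPLE)))
```

A disconnected cable or a failed port would normally show a physical state other than LinkUp, which is what makes the combination above distinctive.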

Questions 34

An enterprise is deploying an AI Factory using NVIDIA DGX BasePOD architecture. The infrastructure team must ensure high availability and efficient data transfer between compute nodes. Which network topology should they implement for the InfiniBand fabric?

Options:

A.  

Simple ring topology connecting all nodes in a loop.

B.  

Fat-Tree topology with rail-optimized design.

C.  

Single flat Ethernet network for all traffic.

D.  

Star topology with all nodes connected to a single central switch.

Questions 35

A system administrator noticed a failure on a DGX H100 server. After a reboot, only the BMC is available. What could be the reason for this behavior?

Options:

A.  

The network card has no link/connection.

B.  

A boot disk has failed.

C.  

Multiple GPUs have failed.

D.  

There are more than two failed power supplies.

Questions 36

You are tasked with setting up High Availability (HA) for NVIDIA Base Command Manager (BCM) in a new GPU cluster. The cluster consists of a primary head node, a secondary head node, and several compute nodes. The requirements are automatic failover of BCM services, minimal disruption to workloads, and proper cluster health monitoring during and after installation. During your BCM HA installation and configuration process, which two of the following actions are mandatory for ensuring a robust and verified HA cluster configuration?

Pick the 2 correct responses below.

Options:

A.  

Assign a floating Virtual IP address that can automatically migrate between the primary and secondary head nodes during failover.

B.  

Compute nodes must be powered on and performing work to initiate synchronization of the head nodes.

C.  

After configuration is complete, simulate a failover by stopping BCM services on the active head node to verify that all services are running on the secondary node with no interruption.

D.  

Configure both head nodes to use independent static IP addresses for BCM services instead of relying on a shared virtual IP address.

E.  

During configuration, explicitly synchronize both the configuration and state data directories from the primary to the secondary head node to ensure consistency.
