For a 48-hour NCCL burn-in test, which parameters ensure sustained fabric stress while detecting silent data corruption?
Correct Answer: B
Explanation:
The NVIDIA Collective Communications Library (NCCL) tests are the standard tool for validating the interconnect performance of a GPU cluster. For a long-duration burn-in (48 hours), the goal is not just to measure peak bandwidth but to keep the fabric under sustained load long enough to surface intermittent hardware failures and silent data corruption (SDC). The all_reduce_perf test is among the most demanding because every GPU both sends and receives data throughout the collective. The parameters in Option B work together, and their meanings below follow the upstream nccl-tests option list: -b 8G -e 32G restricts the message sizes to large buffers that keep the 400G InfiniBand links saturated; -c 1000 runs 1,000 iterations with correctness checking, comparing each result against the expected value, and is the setting that detects SDC (older builds treat -c as a simple on/off check switch); -z 1 makes each collective blocking, so the host waits and synchronizes after every operation; and -G 1000 captures the iterations as a CUDA graph and replays it 1,000 times, which lengthens the run so the switches and HCAs can reach thermal equilibrium. If a bit flips in transit because of a faulty transceiver, the correctness check catches the mismatch and reports a failure.
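A single invocation with these flags will not necessarily run for 48 hours, so a burn-in can be driven by a small wrapper that reruns the test until the time budget is spent. The sketch below is one minimal way to do that in Python, not a definitive harness. It assumes the binary sits at ./build/all_reduce_perf (the default nccl-tests build location), that each run prints a summary line of the form "# Out of bounds values : 0 OK", and that the flags are exactly those quoted above. The names build_command, pass_is_clean, and run_burn_in are hypothetical helpers introduced for illustration.

```python
import re
import subprocess
import time

# Flags as quoted in the explanation; confirm each against
# `all_reduce_perf --help` for the installed nccl-tests build.
BURN_IN_FLAGS = ["-b", "8G", "-e", "32G", "-c", "1000", "-z", "1", "-G", "1000"]

# Assumed summary line format, e.g. "# Out of bounds values : 0 OK".
SUMMARY_RE = re.compile(r"#\s*Out of bounds values\s*:\s*(\d+)\s+(OK|FAILED)")


def build_command(binary="./build/all_reduce_perf", launcher=()):
    """Return the argv list: optional launcher prefix, binary, then flags."""
    return [*launcher, binary, *BURN_IN_FLAGS]


def pass_is_clean(returncode, stdout):
    """A pass is clean only if it exited 0 and the summary reports 0 bad values."""
    match = SUMMARY_RE.search(stdout)
    if returncode != 0 or match is None:
        return False
    return int(match.group(1)) == 0 and match.group(2) == "OK"


def run_burn_in(hours=48.0, launcher=()):
    """Repeat the test until the time budget is spent; return (passes, failures)."""
    deadline = time.monotonic() + hours * 3600
    passes = failures = 0
    while time.monotonic() < deadline:
        result = subprocess.run(
            build_command(launcher=launcher), capture_output=True, text=True
        )
        if pass_is_clean(result.returncode, result.stdout):
            passes += 1
        else:
            failures += 1
            print(result.stdout[-2000:])  # keep the tail of the log for triage
    return passes, failures
```

On a multi-node cluster the launcher argument would carry the MPI or scheduler prefix, for example run_burn_in(launcher=("mpirun", "-np", "16")); the rank count here is only a placeholder. Treating a missing summary line as a failure is a deliberate choice: a hung or crashed run should count against the fabric rather than pass silently.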