2 changes: 2 additions & 0 deletions docs/source/getting_started/debugging.rst
@@ -122,6 +122,8 @@ If you are testing with multi-nodes, adjust ``--nproc-per-node`` and ``--nnodes``

If the script runs successfully, you should see the message ``sanity check is successful!``.

If the test script hangs or crashes, it usually means the hardware or drivers are broken in some way. Try contacting your system administrator or hardware vendor for further assistance. As a common workaround, you can try tuning some NCCL environment variables, such as ``export NCCL_P2P_DISABLE=1``, to see if that helps. Please check `the NCCL documentation <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html>`__ for more information. Only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script runs successfully.
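
For example, a minimal sketch of applying the workaround before re-running the sanity-check script (the script name ``test.py`` and the GPU count are placeholders; adjust them to your setup):

.. code-block:: console

    $ # Temporary workaround: disable NCCL peer-to-peer GPU communication.
    $ export NCCL_P2P_DISABLE=1
    $ # Re-run the sanity-check script (script name and GPU count are placeholders).
    $ torchrun --nproc-per-node=2 test.py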

.. note::

    A multi-node environment is more complicated than a single-node one. If you see errors such as ``torch.distributed.DistNetworkError``, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
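
    As a sketch, assuming two nodes with two GPUs each and using standard ``torchrun`` arguments (the script name, master IP, and port below are placeholders):

    .. code-block:: console

        $ # On the first node (rank 0); 10.0.0.1 is a placeholder for the master node's IP.
        $ torchrun --nnodes=2 --nproc-per-node=2 --node-rank=0 --master-addr=10.0.0.1 --master-port=29500 test.py
        $ # On the second node (rank 1), point at the same master address and port.
        $ torchrun --nnodes=2 --nproc-per-node=2 --node-rank=1 --master-addr=10.0.0.1 --master-port=29500 test.py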