vllm.distributed.device_communicators.all_reduce_utils ¶
can_actually_p2p ¶
Usually, whether P2P access is enabled can be checked with torch.cuda.can_device_access_peer(src, tgt). However, a broken driver can make torch.cuda.can_device_access_peer(src, tgt) return True even when P2P access is not actually possible. See https://github.com/vllm-project/vllm/issues/2728 and https://forums.developer.nvidia.com/t/direct-gpu-gpu-communication-does-not-seem-to-work-properly/283264/10. Therefore, we have to perform a real P2P access to check whether it actually works.
Note on p2p and cuda IPC:
Usually, one process uses one GPU:
GPU src --> cuda context src --> tensor src --> process src

We need to combine p2p and cuda IPC, so that:
GPU src --> cuda context src --> tensor src --> process src
                                    |shared|
GPU tgt --> cuda context tgt --> tensor tgt --> process tgt

That is to say, process src creates a tensor in GPU src, passes IPC handle to process tgt, and process tgt accesses the tensor in GPU tgt. Any operation on the tensor in process tgt will be reflected in the tensor in process src, because they are the same memory segment. It is important to note that process tgt accesses the tensor in GPU tgt, not GPU src. That's why we need p2p access.
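As a CPU-only analogy for the "same memory segment" point above (this is not vLLM's code, and no GPU or CUDA IPC is involved), Python's standard multiprocessing.shared_memory shows how a second handle opened by name sees writes made through the first. Both handles live in one process here for brevity, and the names owner and peer are illustrative.

```python
from multiprocessing import shared_memory

# "owner" plays the role of process src: it allocates the segment.
owner = shared_memory.SharedMemory(create=True, size=16)
try:
    # "peer" plays the role of process tgt: it attaches through a handle
    # (here, the segment name) instead of allocating its own memory.
    peer = shared_memory.SharedMemory(name=owner.name)
    try:
        peer.buf[:4] = b"ping"                # write through the peer handle
        seen_by_owner = bytes(owner.buf[:4])  # read through the owner handle
    finally:
        peer.close()
finally:
    owner.close()
    owner.unlink()  # free the segment once every handle is closed
```

The GPU case differs in one important way: the attaching process maps the memory from a different device, which is why P2P access must work in addition to the handle exchange.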
The most time-consuming part is process creation. To avoid creating processes for every pair of GPUs, we use batched testing: we create two processes that test all pairs of GPUs in one batch. The trick is to reset the device after each test, an operation that PyTorch does not expose.
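The bookkeeping side of batched testing can be sketched without any GPU: enumerate every ordered (src, tgt) pair once, hand the two lists to a single producer/consumer process pair, then map the flat result list back to pairs. The helper names build_pair_batch and index_results, the inclusion of self-pairs, and the "src->tgt" key format are assumptions made for illustration.

```python
from itertools import product


def build_pair_batch(num_devices):
    """Return parallel lists of source and target device ids for all ordered pairs."""
    pairs = list(product(range(num_devices), repeat=2))
    batch_src = [src for src, _ in pairs]
    batch_tgt = [tgt for _, tgt in pairs]
    return batch_src, batch_tgt


def index_results(batch_src, batch_tgt, results):
    """Map one boolean per tested pair back to a 'src->tgt' key."""
    if not (len(batch_src) == len(batch_tgt) == len(results)):
        raise ValueError("expected one result per (src, tgt) pair")
    return {f"{s}->{t}": bool(r) for s, t, r in zip(batch_src, batch_tgt, results)}


batch_src, batch_tgt = build_pair_batch(2)
# Placeholder outcomes; a real run would receive these from the two worker processes.
fake_results = [True, True, False, True]
table = index_results(batch_src, batch_tgt, fake_results)
```

Keeping the pairs in two parallel lists means both worker processes can iterate the same batch in lockstep, so only two processes are created no matter how many GPUs are tested.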
Source code in vllm/distributed/device_communicators/all_reduce_utils.py
gpu_p2p_access_check ¶
Check if GPU src can access GPU tgt.
Source code in vllm/distributed/device_communicators/all_reduce_utils.py
should_nccl_symm_mem_allreduce ¶
Determine if NCCL symmetric memory allreduce should be used.
Based on H100 and GB200 benchmarks, NCCL symm_mem is preferred for:
- Small tensors (≤16K): Lower latency than custom_AR
- Large tensors (≥128K for 8 GPUs, ≥512K for 4 GPUs): Better bandwidth
Custom_AR is preferred for mid-range sizes where its P2P approach has lower overhead than the symm_mem copy-in/copy-out pattern.
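The size rule described here can be written as a small predicate. This is a sketch under assumptions, not the function's actual body: the thresholds are copied from the description, sizes are taken to be in the same unit as those thresholds with K = 1024, the small-tensor rule is applied to every world size, world sizes other than 4 and 8 have no large-tensor threshold, and prefer_symm_mem is a hypothetical name.

```python
K = 1024  # assumption: "K" in the description means 1024

SMALL_MAX = 16 * K                    # at or below this, symm_mem has lower latency
LARGE_MIN = {4: 512 * K, 8: 128 * K}  # world size -> size where symm_mem wins on bandwidth


def prefer_symm_mem(size, world_size):
    """Return True when the described rule favors NCCL symm_mem over custom_AR."""
    if size <= SMALL_MAX:
        return True
    large_min = LARGE_MIN.get(world_size)
    # Assumption: world sizes not listed in the description fall back to
    # custom_AR for anything above the small-tensor limit.
    return large_min is not None and size >= large_min
```

For example, prefer_symm_mem(64 * K, 8) falls in the mid-range and returns False, which corresponds to the case where custom_AR is preferred.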