Learning PyTorch DDP Performance Tuning on a One-GPU Machine

How to build real intuition for DistributedDataParallel scaling, stragglers, communication, and synchronization even when you only have one GPU.

TL;DR: Most DDP performance problems are easier to understand than they first appear. In this post I built a small single-machine lab that uses CPU gloo processes to reproduce the parts of DDP reasoning that matter most:

- the slowest rank often sets the pace
- small per-rank work hurts scaling
- communication can dominate step time
- rank-0-only host work becomes everyone's problem once you synchronize

The important numbers from the lab were: ...
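The single-machine lab described above can be sketched roughly as follows: several CPU processes form a gloo process group, wrap a tiny model in DDP, and one rank is deliberately slowed down so the gradient all-reduce forces everyone to wait for it. The model shape, port, loop length, and sleep duration here are placeholders, not the post's actual code.

```python
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int) -> None:
    # Rendezvous settings for a single-machine gloo group (port is arbitrary).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # A tiny CPU model; DDP's backward hook does the gradient all-reduce.
    model = DDP(torch.nn.Linear(8, 8))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    start = time.perf_counter()
    for _ in range(3):
        if rank == 1:
            # Simulated straggler: extra host work on one rank only.
            time.sleep(0.05)
        loss = model(torch.randn(4, 8)).sum()
        opt.zero_grad()
        loss.backward()  # all-reduce here makes rank 0 wait for rank 1
        opt.step()
    if rank == 0:
        # Despite doing no extra work itself, rank 0 pays for the straggler.
        print(f"rank 0 step loop took {time.perf_counter() - start:.2f}s")

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

Even in this toy form, rank 0's loop time tracks the slow rank's sleep, which is the "slowest rank sets the pace" effect in miniature.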

February 18, 2026 · 15 min · Duo An

Profiling a PyTorch Training Job End to End

How to decide whether a training job is blocked on data loading, PyTorch host-side overhead, or a hot CUDA kernel, using torch.profiler, Nsight Systems, and Nsight Compute in the right order. TL;DR: When a PyTorch training job feels slow, the most expensive mistake is starting at the wrong layer. In this case study, I built a small synthetic training lab and used it to force three common bottlenecks: ...
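The first layer of that triage order can be sketched with a minimal torch.profiler run: profile a few training steps, then read the operator table to see where CPU time actually goes before reaching for Nsight. The model and tensor sizes here are placeholders for illustration, not the post's actual lab code.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# A small CPU-only stand-in for a training step.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
x = torch.randn(32, 64)

# Profile a handful of forward/backward steps on CPU.
# On a GPU machine you would add ProfilerActivity.CUDA to `activities`.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        loss = model(x).sum()
        loss.backward()

# Aggregate per-operator timings, sorted by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

If the table is dominated by data-loading or Python-side ops rather than `aten::` kernels, that already tells you which layer to investigate next.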

January 16, 2026 · 12 min · Duo An