Learning PyTorch DDP Performance Tuning on a One-GPU Machine

How to build real intuition for DistributedDataParallel scaling, stragglers, communication, and synchronization even when you only have one GPU.

TL;DR: Most DDP performance problems are easier to understand than they first appear. In this post I built a small single-machine lab that uses CPU gloo processes to reproduce the parts of DDP reasoning that matter most:

- the slowest rank often sets the pace
- small per-rank work hurts scaling
- communication can dominate step time
- rank-0-only host work becomes everyone's problem once you synchronize

The important numbers from the lab were: ...

February 18, 2026 · 15 min · Duo An
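The excerpt above describes a CPU-only gloo lab. Here is a minimal sketch of that idea, assuming a setup like the one the post summarizes; the world size, tensor size, and sleep durations are illustrative assumptions, not the post's actual values. It makes rank 0 an artificial straggler and shows every other rank paying for it at the `all_reduce`.

```python
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # Each process plays one "rank"; the gloo backend runs entirely on CPU.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    tensor = torch.ones(1 << 20)  # stand-in for per-rank gradients
    for step in range(3):
        start = time.perf_counter()
        # Simulated per-rank compute: rank 0 is an artificial straggler.
        time.sleep(0.20 if rank == 0 else 0.05)
        # The collective blocks until every rank arrives, so the
        # slowest rank sets the step time for everyone.
        dist.all_reduce(tensor)
        print(f"rank {rank} step {step}: {time.perf_counter() - start:.3f}s")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Running it, the fast ranks report roughly the straggler's step time rather than their own, which is the core observation the lab is built around.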

From Scaling Laws to Cluster Size: A Practical Guide to Planning Large-Scale Model Training

Why Capacity Planning Is the Hardest Part of Large Model Training

Before you write a single line of training code, you must answer a few brutal questions:

- How many tokens do I actually need?
- What sequence length should I train on?
- How many GPUs will this take?
- How long will it run?
- What parallelism strategy makes this feasible?

Most teams get this wrong, not because they lack theory, but because they never connect scaling laws → systems constraints. ...

February 2, 2025 · 9 min · Duo An
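As a taste of the kind of back-of-envelope arithmetic such planning involves, here is a sketch using two standard approximations: the Chinchilla rule of thumb of roughly 20 tokens per parameter, and the training-compute estimate C ≈ 6·N·D FLOPs. The model size, peak throughput, utilization, and cluster size below are illustrative assumptions, not figures from the post.

```python
# Back-of-envelope capacity planning. Assumes the Chinchilla
# rule of thumb (~20 tokens per parameter) and C ≈ 6 * N * D.
# All hardware numbers are illustrative assumptions.

params = 7e9                 # N: target model size (assumed)
tokens = 20 * params         # D: Chinchilla-style data budget
flops = 6 * params * tokens  # C: total training compute

peak_flops = 312e12          # assumed per-GPU peak (e.g. A100 BF16)
mfu = 0.40                   # assumed model FLOPs utilization
gpus = 256                   # candidate cluster size

seconds = flops / (gpus * peak_flops * mfu)
print(f"tokens needed: {tokens:.2e}")
print(f"total compute: {flops:.2e} FLOPs")
print(f"wall clock on {gpus} GPUs: {seconds / 86400:.1f} days")
```

With these numbers the run comes out to roughly two days, and the exercise makes the coupling visible: halve the utilization or double the token budget and the schedule moves in lockstep, which is exactly the scaling-laws-to-systems connection the post argues most teams skip.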