Why Variable Sequence Length Breaks DDP Throughput

How to reproduce, measure, and fix token skew in transformer training with length bucketing and token-budget batching.

TL;DR: In transformer training, DDP can look balanced by sample count while being badly imbalanced by actual work. I built a small one-machine lab that uses a tiny transformer-like model with variable sequence lengths and four distributed ranks. The headline result was simple:

- uniform 128-token batches: 250,959 tokens/s
- variable lengths with fixed sample count: 122,006 tokens/s
- variable lengths with length bucketing: 208,668 tokens/s
- variable lengths with token-budget batching: 193,289 tokens/s

The bad case was not a kernel problem. It was a batching problem: ...

March 12, 2026 · 8 min · Duo An
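The token-budget batching idea above can be sketched in a few lines. This is a minimal illustration, not the post's actual code: it sorts samples by length (so batches stay length-homogeneous, as in bucketing) and then greedily packs them until the padded token count, batch size times the longest sequence in the batch, would exceed a fixed budget.

```python
def token_budget_batches(lengths, budget):
    """Group sample indices into batches whose padded token count
    (batch size * max length in batch) stays within `budget`."""
    # Sort indices by sequence length so each batch wastes little padding.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, cur, cur_max = [], [], 0
    for i in order:
        new_max = max(cur_max, lengths[i])
        # Flush the current batch if adding this sample would blow the budget.
        if cur and new_max * (len(cur) + 1) > budget:
            batches.append(cur)
            cur, new_max = [], lengths[i]
        cur.append(i)
        cur_max = new_max
    if cur:
        batches.append(cur)
    return batches

# Example: five variable-length samples, 256-token budget per batch.
lengths = [32, 128, 64, 16, 96]
print(token_budget_batches(lengths, 256))
```

Because every rank's batch now carries roughly the same number of padded tokens rather than the same number of samples, per-step work is far more even across ranks.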

From Scaling Laws to Cluster Size: A Practical Guide to Planning Large-Scale Model Training

Why Capacity Planning Is the Hardest Part of Large Model Training

Before you write a single line of training code, you must answer a few brutal questions:

- How many tokens do I actually need?
- What sequence length should I train on?
- How many GPUs will this take?
- How long will it run?
- What parallelism strategy makes this feasible?

Most teams get this wrong, not because they lack theory, but because they never connect scaling laws → systems constraints. ...

February 2, 2025 · 9 min · Duo An
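The "how many GPUs, how long" questions above can be connected by a back-of-the-envelope estimate. The sketch below uses the widely cited 6·N·D FLOPs approximation for dense transformer training; the specific numbers (A100 BF16 peak throughput, 40% MFU, a 7B-parameter model on 2T tokens) are illustrative assumptions, not figures from the post:

```python
def training_days(n_params, n_tokens, n_gpus, peak_flops_per_gpu, mfu):
    """Estimate wall-clock training time using the ~6*N*D FLOPs rule
    for dense transformers, given cluster size and utilization."""
    total_flops = 6 * n_params * n_tokens          # forward + backward cost
    effective_flops = n_gpus * peak_flops_per_gpu * mfu
    return total_flops / effective_flops / 86_400  # seconds -> days

# Illustrative: 7B params, 2T tokens, 256 A100s (~312 TFLOPS BF16), 40% MFU.
days = training_days(7e9, 2e12, 256, 312e12, 0.40)
print(f"{days:.1f} days")
```

Running the estimate backward, fixing a wall-clock deadline and solving for `n_gpus`, is how a token budget from scaling laws turns into a cluster-size requirement.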