Optimizing VLM Training on One GPU: A Five-Layer Recipe

How I got SiQ-VL from 14,713 to 100,923 real tokens per second on a single Blackwell GPU, and the four places that surprised me along the way.

TL;DR: I trained a small vision-language model (SigLIP-2 vision tower + Qwen2.5 LLM, projector-aligned) on one NVIDIA RTX PRO 6000 Blackwell, ran a 48-configuration sweep across two model sizes and both training stages, and ended up with a recipe that compounds five optimization layers. ...

May 24, 2026 · 14 min · Duo An

Why Variable Sequence Length Breaks DDP Throughput

How to reproduce, measure, and fix token skew in transformer training with length bucketing and token-budget batching.

TL;DR: In transformer training, DDP can look balanced by sample count while being badly imbalanced by actual work. I built a small one-machine lab that uses a tiny transformer-like model with variable sequence lengths and four distributed ranks. The headline result was simple:

- uniform 128-token batches: 250,959 tokens/s
- variable lengths with fixed sample count: 122,006 tokens/s
- variable lengths with length bucketing: 208,668 tokens/s
- variable lengths with token-budget batching: 193,289 tokens/s

The bad case was not a kernel problem. It was a batching problem: ...

March 12, 2026 · 8 min · Duo An
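
The teaser above contrasts fixed-sample-count batching with token-budget batching. As a rough illustration of the difference (not the post's actual lab code), here is a minimal pure-Python sketch. The function names, the example lengths, and the 256-token budget are all hypothetical, and the cost model assumes each batch is padded to its longest sequence.

```python
from typing import List


def fixed_count_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Group sequence lengths into batches with a fixed number of samples."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def token_budget_batches(lengths: List[int], max_tokens: int) -> List[List[int]]:
    """Sort by length, then fill each batch until its padded size would exceed max_tokens."""
    batches: List[List[int]] = []
    current: List[int] = []
    for n in sorted(lengths):
        # Input is sorted, so n is the longest sequence in the candidate batch.
        # Padded cost of the candidate batch = (number of samples) * n.
        if current and (len(current) + 1) * n > max_tokens:
            batches.append(current)
            current = []
        current.append(n)
    if current:
        batches.append(current)
    return batches


def padded_tokens(batch: List[int]) -> int:
    """Tokens actually processed when the batch is padded to its longest sequence."""
    return len(batch) * max(batch)


# Hypothetical sequence lengths, chosen only for illustration.
lengths = [16, 128, 32, 128, 16, 64, 32, 128]

fixed = fixed_count_batches(lengths, batch_size=2)
budget = token_budget_batches(lengths, max_tokens=256)

fixed_cost = [padded_tokens(b) for b in fixed]
budget_cost = [padded_tokens(b) for b in budget]

print("fixed count :", fixed, fixed_cost)
print("token budget:", budget, budget_cost)
```

With a fixed sample count, the padded work per batch depends on whichever long sequence happens to land in it. With a token budget, short sequences are grouped into larger batches and every batch stays under the same padded-token cap. A real training loop would also need to shuffle batches and distribute them across ranks, which this sketch leaves out.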

Learning PyTorch DDP Performance Tuning on a One-GPU Machine

How to build real intuition for DistributedDataParallel scaling, stragglers, communication, and synchronization even when you only have one GPU.

TL;DR: Most DDP performance problems are easier to understand than they first look. In this post I built a small single-machine lab that uses CPU gloo processes to reproduce the part of DDP reasoning that matters most:

- the slowest rank often sets the pace
- small per-rank work hurts scaling
- communication can dominate step time
- rank-0-only host work becomes everyone’s problem once you synchronize

The important numbers from the lab were: ...

February 18, 2026 · 15 min · Duo An
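
The first claim in the teaser above, that the slowest rank often sets the pace, can be put into numbers with a toy model. This is a sketch under a simplifying assumption: every rank waits at a synchronization point each step and communication cost is ignored. The function names and timings are hypothetical and are not taken from the post.

```python
from typing import List


def synchronized_step_time(per_rank_seconds: List[float]) -> float:
    """With a sync point every step, the step takes as long as the slowest rank."""
    return max(per_rank_seconds)


def busy_fraction(per_rank_seconds: List[float]) -> float:
    """Share of total rank-time spent computing rather than waiting for the slowest rank."""
    step = synchronized_step_time(per_rank_seconds)
    return sum(per_rank_seconds) / (len(per_rank_seconds) * step)


# Hypothetical per-rank compute times in seconds; rank 3 is the straggler.
balanced = [0.10, 0.10, 0.10, 0.10]
skewed = [0.10, 0.10, 0.10, 0.20]

print("balanced step:", synchronized_step_time(balanced), "busy:", busy_fraction(balanced))
print("skewed step  :", synchronized_step_time(skewed), "busy:", busy_fraction(skewed))
```

In this toy model one rank that takes twice as long doubles the step time for all four ranks, even though three of them finished early and spent the rest of the step idle.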

Profiling a PyTorch Training Job End to End

How to decide whether a training job is blocked on data, PyTorch overhead, or a hot CUDA kernel, using torch.profiler, Nsight Systems, and Nsight Compute in the right order.

TL;DR: When a PyTorch training job feels slow, the most expensive mistake is starting at the wrong layer. In this case study, I built a small synthetic training lab and used it to force three common bottlenecks: ...

January 16, 2026 · 12 min · Duo An