PyTorch

torch.compile path from Python bytecode to Triton kernels

torch.compile: The Mental Model That Actually Matters

Most writeups of torch.compile are either a flag cheat-sheet or a file-by-file museum tour. Neither helps when a training step is only 1.2× faster and you do not know whether to blame graph breaks, recompiles, or Inductor. The useful model is simpler: compile is specialization under recorded assumptions. Dynamo captures a region, Inductor emits kernels tuned to that region, and guards decide whether the specialization still applies. Everything else (FX, AOTAutograd, Triton) is machinery in service of that contract. ...
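To make "specialization under recorded assumptions" concrete, here is a toy sketch in plain Python. It is an analogy only and does not reflect how Dynamo or Inductor are implemented: `ToyCompiler`, its input-length guard, and `double_all` are all hypothetical names invented for illustration.

```python
# Toy analogy only: NOT how torch.compile, Dynamo, or Inductor work internally.
# It illustrates one idea: cache a specialized function per recorded assumption
# (the "guard"), and re-specialize when a call violates every cached assumption.

class ToyCompiler:
    def __init__(self, fn):
        self.fn = fn
        self.cache = {}      # guard value -> specialized callable
        self.compiles = 0    # how many times we had to (re)specialize

    def _specialize(self, assumed_len):
        # Stand-in for "capture a region and emit tuned code".
        # Here the only assumption recorded is the input length.
        self.compiles += 1
        fn = self.fn

        def specialized(xs):
            return fn(xs)

        return specialized

    def __call__(self, xs):
        guard = len(xs)                  # the assumption to check on every call
        if guard not in self.cache:      # guard miss: specialize again
            self.cache[guard] = self._specialize(guard)
        return self.cache[guard](xs)     # guard hit: reuse the specialization


def double_all(xs):
    return [2 * x for x in xs]


compiled = ToyCompiler(double_all)
compiled([1, 2, 3])    # first call: specializes for length 3
compiled([4, 5, 6])    # same length: reuses the cached specialization
compiled([1, 2])       # new length: guard miss, specializes again
print(compiled.compiles)
```

In this toy, the counter goes up only when the recorded assumption stops holding, which is the same question to ask of a real compiled region: did the guard hold, or did the call trigger a recompile?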

Cumulative speedup across five VLM training optimization layers

Optimizing VLM Training on One GPU: A Five-Layer Recipe

How I got SiQ-VL from 14,713 to 100,923 real tokens per second on a single Blackwell GPU, and the four places that surprised me along the way. TL;DR: I trained a small vision-language model (SigLIP-2 vision tower + Qwen2.5 LLM, projector-aligned) on one NVIDIA RTX PRO 6000 Blackwell, ran a 48-configuration sweep across two model sizes and both training stages, and ended up with a recipe that compounds five optimization layers. ...

DDP throughput under straggler and communication pathologies

Learning PyTorch DDP Performance Tuning on a One-GPU Machine

How to build real intuition for DistributedDataParallel scaling, stragglers, communication, and synchronization even when you only have one GPU. TL;DR: Most DDP performance problems are easier to understand than they first look. In this post I built a small single-machine lab that uses CPU gloo processes to reproduce the part of DDP reasoning that matters most: the slowest rank often sets the pace; small per-rank work hurts scaling; communication can dominate step time; and rank-0-only host work becomes everyone’s problem once you synchronize. The important numbers from the lab were: ...
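The "slowest rank sets the pace" point can be sketched with a back-of-the-envelope model in plain Python. This is a simplification under stated assumptions, not the lab from the post: it assumes each synchronous step costs the slowest rank's compute time plus a fixed communication time, the helper names `step_time` and `scaling_efficiency` are hypothetical, and every number is made up for illustration rather than taken from the post's measurements.

```python
# Simplified model (an assumption, not a measurement): in a synchronous step,
# every rank waits for the slowest one, then all ranks pay the communication cost.

def step_time(per_rank_compute_s, comm_s):
    """Wall-clock time of one synchronous step under the simplified model."""
    return max(per_rank_compute_s) + comm_s


def scaling_efficiency(single_rank_s, per_rank_compute_s, comm_s):
    """Ideal step time (work split evenly, no comm) divided by modeled step time."""
    n = len(per_rank_compute_s)
    ideal = single_rank_s / n
    return ideal / step_time(per_rank_compute_s, comm_s)


# Illustrative numbers only: 1.0 s of work split across 4 ranks.
balanced = [0.25, 0.25, 0.25, 0.25]
straggler = [0.25, 0.25, 0.25, 0.5]   # one rank takes twice as long

print(scaling_efficiency(1.0, balanced, comm_s=0.0))    # perfectly balanced, free comm
print(scaling_efficiency(1.0, straggler, comm_s=0.0))   # one straggler
print(scaling_efficiency(1.0, balanced, comm_s=0.25))   # comm as large as compute
```

Under this model a single straggler and a communication cost equal to per-rank compute hurt efficiency by the same amount, which is why separating the two in measurements matters.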

End-to-end PyTorch training data pipeline stages

Profiling a PyTorch Training Job End to End

How to decide whether a training job is blocked on data, PyTorch overhead, or a hot CUDA kernel, using torch.profiler, Nsight Systems, and Nsight Compute in the right order. TL;DR: When a PyTorch training job feels slow, the most expensive mistake is starting at the wrong layer. In this case study, I built a small synthetic training lab and used it to force three common bottlenecks: ...