Welcome to My Tech Blog

👋 Hi! Welcome to my tech blog.

  • Here I share technical notes, development experiences, learning insights, and engineering thoughts.
  • Topics cover software development, machine learning, system design, algorithms, and various technical domains.
  • Feel free to explore and discuss! 💡

Why Variable Sequence Length Breaks DDP Throughput

How to reproduce, measure, and fix token skew in transformer training with length bucketing and token-budget batching.

TL;DR: In transformer training, DDP can look balanced by sample count while being badly imbalanced by actual work. I built a small one-machine lab that uses a tiny transformer-like model with variable sequence lengths and four distributed ranks. The headline result was simple:

  • uniform 128-token batches: 250,959 tokens/s
  • variable lengths with fixed sample count: 122,006 tokens/s
  • variable lengths with length bucketing: 208,668 tokens/s
  • variable lengths with token-budget batching: 193,289 tokens/s

The bad case was not a kernel problem. It was a batching problem: ...

March 12, 2026 · 8 min · Duo An
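The token-budget batching idea above can be illustrated with a short sketch. This is not the post's actual code; the helper name `token_budget_batches` and the greedy sort-then-pack strategy are illustrative assumptions. The point it shows is that capping each batch by its padded token count (batch size times the longest sequence in the batch), rather than by sample count, keeps the work per batch roughly constant even when sequence lengths vary:

```python
def token_budget_batches(lengths, max_tokens):
    """Hypothetical sketch: greedily group sample indices so each
    batch's padded token count (batch_size * max_len_in_batch)
    stays under max_tokens. Sorting by length first keeps padding
    waste low, since neighbors in a batch have similar lengths."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, cur_max = [], [], 0
    for i in order:
        new_max = max(cur_max, lengths[i])
        # Flush the current batch if adding this sample would
        # push the padded token count over the budget.
        if current and (len(current) + 1) * new_max > max_tokens:
            batches.append(current)
            current, cur_max = [], 0
            new_max = lengths[i]
        current.append(i)
        cur_max = new_max
    if current:
        batches.append(current)
    return batches
```

With a budget of 200 tokens, `token_budget_batches([5, 100, 7, 90, 6], 200)` packs the three short samples together and the two long ones together, so no batch mixes a 5-token sample with a 100-token one and pays for 95 padding tokens per row.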

Learning PyTorch DDP Performance Tuning on a One-GPU Machine

How to build real intuition for DistributedDataParallel scaling, stragglers, communication, and synchronization even when you only have one GPU.

TL;DR: Most DDP performance problems are easier to understand than they first look. In this post I built a small single-machine lab that uses CPU gloo processes to reproduce the part of DDP reasoning that matters most:

  • the slowest rank often sets the pace
  • small per-rank work hurts scaling
  • communication can dominate step time
  • rank-0-only host work becomes everyone’s problem once you synchronize

The important numbers from the lab were: ...

February 18, 2026 · 15 min · Duo An
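The "slowest rank sets the pace" observation above follows from simple arithmetic, sketched here with hypothetical helpers (`ddp_step_time`, `scaling_efficiency` are illustrative names, not from the post): synchronous DDP ends every step with a gradient all-reduce that acts as a barrier, so one straggler rank delays all of them.

```python
def ddp_step_time(per_rank_ms, allreduce_ms):
    # Synchronous DDP: the gradient all-reduce is a barrier, so the
    # step lasts as long as the slowest rank's compute, plus comm.
    return max(per_rank_ms) + allreduce_ms

def scaling_efficiency(per_rank_ms, allreduce_ms):
    # Ideal step time assumes perfectly balanced work and free
    # communication; the ratio shows how much a straggler costs.
    ideal = sum(per_rank_ms) / len(per_rank_ms)
    return ideal / ddp_step_time(per_rank_ms, allreduce_ms)
```

For example, four ranks doing (10, 10, 10, 25) ms of compute with a 3 ms all-reduce take 28 ms per step, under 50% efficiency, even though three of the four ranks are idle more than half the time.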

Profiling a PyTorch Training Job End to End

How to decide whether a training job is blocked on data, PyTorch overhead, or a hot CUDA kernel, using torch.profiler, Nsight Systems, and Nsight Compute in the right order.

TL;DR: When a PyTorch training job feels slow, the most expensive mistake is starting at the wrong layer. In this case study, I built a small synthetic training lab and used it to force three common bottlenecks: ...

January 16, 2026 · 12 min · Duo An
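The three-way decision above (data vs. PyTorch overhead vs. CUDA kernel) can be sketched as a triage heuristic. This is an illustrative assumption, not the post's method, and the 30% data-loading threshold is arbitrary; the real posts rely on torch.profiler and the Nsight tools rather than a hand-rolled classifier:

```python
def classify_bottleneck(dataloader_ms, cpu_op_ms, gpu_kernel_ms, step_ms):
    """Hypothetical triage sketch based on where a step's wall time
    goes, mirroring the order a profiler investigation follows."""
    if dataloader_ms > 0.3 * step_ms:
        # The GPU is starving for batches: look at the input
        # pipeline (DataLoader workers, IO, decode) first.
        return "data"
    if cpu_op_ms > gpu_kernel_ms:
        # CPU-side op dispatch dominates: suspect framework
        # overhead (many tiny ops, Python-side launch cost).
        return "pytorch-overhead"
    # Otherwise the time really is in GPU compute: drill into
    # the hot kernel with a kernel-level profiler.
    return "cuda-kernel"
```

The value of the sketch is the ordering: rule out the data path before blaming the framework, and rule out the framework before spending time on kernel-level analysis.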

Building a Truly Scalable Multimodal Data Pipeline: A Streaming-First View

Most “Scalable” Multimodal Pipelines Don’t Survive Foundation-Model Scale

A lot of multimodal pipelines claim to scale. In practice, they often depend on at least one of the following:

  • global shuffles (groupBy/join/repartition)
  • materializing massive intermediate datasets
  • centralized coordination that becomes a bottleneck
  • brittle recovery logic (rerun-the-world on failure)

That works for demos. It breaks at foundation-model scale. This series is about a different design point: a streaming-first multimodal pipeline that scales linearly with data and hardware, with no global shuffle, and resumable at partition granularity. ...

December 22, 2025 · 4 min · Duo An

A Minimal Recipe for Training VLMs Under Compute Constraints

The Problem Nobody Likes to Admit

Most Vision-Language Models assume one thing: you can always add more GPUs. But many of us can’t. When compute is the hard constraint, the question changes from “How do we scale?” to: “What actually still works?” Based on building SiQ-VL, here is a minimal, battle-tested recipe for training VLMs when GPUs are scarce.

The Minimal Recipe (TL;DR)

If you remember nothing else, remember this: ...

December 15, 2025 · 3 min · Duo An