Archive

2026 ¹⁰

July ⁵

Foundation Model Kernel Optimization in 2026: A Field Guide Across Dense, MoE, Multimodal, and Diffusion

July 16, 2026 · 36 min · Duo An

Paper Reading: ARGUS — Always-On Tracing at 10,000+ GPU Scale

July 13, 2026 · 15 min · Duo An

The GPU Optimization Playbook: Architecture, Memory, and Balance

July 12, 2026 · 10 min · Duo An

Roofline: The First Step of Any Performance Optimization

July 11, 2026 · 12 min · Duo An

Large MoE Performance: The Three Walls After Sparsity

July 4, 2026 · 12 min · Duo An

May ²

torch.compile: The Mental Model That Actually Matters

May 31, 2026 · 9 min · Duo An

Optimizing VLM Training on One GPU: A Five-Layer Recipe

May 24, 2026 · 14 min · Duo An

March ¹

Why Variable Sequence Length Breaks DDP Throughput

March 12, 2026 · 8 min · Duo An

February ¹

Learning PyTorch DDP Performance Tuning on a One-GPU Machine

February 18, 2026 · 15 min · Duo An

January ¹

Profiling a PyTorch Training Job End to End

January 16, 2026 · 12 min · Duo An

2025 ³

December ²

Building a Truly Scalable Multimodal Data Pipeline: A Streaming-First View

December 22, 2025 · 4 min · Duo An

SiQ-VL: A Curriculum for Small VLMs When Compute Is the Hard Constraint

December 15, 2025 · 6 min · Duo An

February ¹

From Scaling Laws to Cluster Size: Capacity Planning That Survives Contact With GPUs

February 2, 2025 · 4 min · Duo An

