Performance

Four layers of foundation-model kernel co-design: on-chip dataflow, operator graph, multi-GPU dataflow, model structure

Foundation Model Kernel Optimization in 2026: A Field Guide Across Dense, MoE, Multimodal, and Diffusion

Most “kernel optimization” conversations still start with FLOPs. In 2026 that is usually the wrong first question. Foundation-model runtime is dominated by HBM traffic, KV cache, temporary tensors, collectives, dynamic permutation, and launch overhead. FlashAttention, fused linear–cross-entropy, paged KV, MoE grouped GEMM, and DeepEP-style dispatch all share one essence: do not materialize intermediates, or make each byte travel once. ...

ARGUS progressive diagnosis levels from iteration time to kernel stats

Paper Reading: ARGUS — Always-On Tracing at 10,000+ GPU Scale

What Tencent built to catch fail-slow training jobs on 10k+ GPU clusters with under 2% overhead, plus Modal remeasurements of CUPTI Activity API / torch.profiler / nsys overhead, case-study reproductions, and the KDE + W₁ detection path. Paper: ARGUS: Production-Scale Tracing and Performance Diagnosis for over 10,000-GPU Clusters (Zhou et al., Tencent, arXiv 2606.20374, submitted to ATC 2026). TL;DR: Large LLM training jobs are synchronous: one slow rank, link, or host-side stall can waste thousands of GPU-hours without triggering a hard failure. Existing tools split into two camps: ...

A GPU is a throughput coprocessor hanging off the host over PCIe

The GPU Optimization Playbook: Architecture, Memory, and Balance

Most “GPU optimization” advice is a bag of tricks: coalesce here, unroll there, add __restrict__ and pray. Tricks are the output of optimization, not the method. The method is smaller and more durable: understand the machine, find the resource that is actually saturated, and rebalance work toward the resources that are idle. Almost every GPU kernel is limited by data movement, not arithmetic. Once you internalize that, the whole catalog of techniques collapses into three questions. ...

Roofline: compute roof meets bandwidth slope at the ridge point

Roofline: The First Step of Any Performance Optimization

When MFU sits at 20%, most people open a profiler and hunt for a slow kernel. That often starts at the wrong layer. The first question is not which kernel is hot; it is which ceiling you are hitting: compute or memory bandwidth. TL;DR: Every GPU has two hard ceilings: peak FLOP/s and peak bandwidth. Arithmetic intensity I = FLOPs / Bytes decides which one binds first. MFU answers “are we compute-bound?” for training. MBU answers “are we bandwidth-bound?” for decode. Both are Roofline ratios, not vibes. Shape matters more than op name: the same matmul can be compute-bound at M=N=K=8192 and memory-bound at M=1. That is why training and decode feel like different worlds. Count MFU/MBU by instrumentation (FlopCounterMode + bytes), not PaLM 6PT; that formula is an LLM shortcut. ResNet / ViT work the same way as any other nn.Module. Model-level Roofline is useful when traffic is homogeneous (decode, dense training GEMMs). It is misleading when time is dominated by a mix of memory-bound and compute-bound ops; in that case go per-op, then profiler. Reproducible Modal measurements (ops, LLM decode/train sweeps, ResNet/ViT MFU·MBU) live in this page bundle; code in playground/roofline_modal.py. The Most Expensive Mistake: The costly failure mode in performance work is not missing the optimal kernel. It is optimizing in the wrong direction. ...
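The shape-dependence claim in the excerpt above (the same matmul is compute-bound at M=N=K=8192 and memory-bound at M=1) can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the post: the function names, the 2-byte element size, and the peak FLOP/s and bandwidth figures are all illustrative assumptions, and the byte count assumes each operand is read once and the output written once.

```python
def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity I = FLOPs / Bytes for C[m,n] = A[m,k] @ B[k,n].

    Assumption: A and B are each read once and C is written once
    (ideal on-chip reuse), with 2-byte elements by default.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def classify(intensity: float, peak_flops: float, peak_bw: float) -> str:
    """Compare intensity against the ridge point peak_flops / peak_bw."""
    ridge = peak_flops / peak_bw  # FLOPs per byte where the two roofs meet
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical accelerator, NOT a real datasheet: 1e15 FLOP/s, 3e12 B/s.
PEAK_FLOPS = 1e15
PEAK_BW = 3e12

for shape in [(8192, 8192, 8192), (1, 8192, 8192)]:
    intensity = matmul_intensity(*shape)
    print(shape, round(intensity, 2), classify(intensity, PEAK_FLOPS, PEAK_BW))
```

Under these assumed peaks the ridge point sits near 333 FLOPs per byte. The square matmul lands far above it and the M=1 case lands near 1 FLOP per byte, which is the training-versus-decode split the excerpt describes.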

MoE training’s three coupled walls: memory, communication, compute

Large MoE Performance: The Three Walls After Sparsity

Sparsity made MoE cheap on paper. At production scale it made training harder than dense: total parameters grow with E, per-token FLOPs grow with k, and the gap between those two numbers is exactly where systems break. The useful framing is not “optimize the MoE kernel.” It is the one NVIDIA’s Megatron-Core MoE report uses (arXiv:2603.07685): Memory, Communication, and Compute Efficiency are three coupled walls. Push on one and pressure shows up in another. ByteDance’s MegaScale-MoE (arXiv:2505.11432) proves the same thesis from the other direction: on 1,440 Hoppers, communication was ~44% of forward time before their redesign, and fixing parallelism + overlap delivered 1.88× over Megatron-LM. ...
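The gap between "parameters grow with E" and "per-token FLOPs grow with k" in the excerpt above can be made concrete with a back-of-the-envelope count. This is a rough sketch under stated assumptions rather than anything from the post or the cited reports: it counts only the expert FFN weights of one layer, assumes a plain two-matrix FFN per expert, and uses hypothetical sizes and a hypothetical function name.

```python
def moe_ffn_counts(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Count expert-FFN parameters and forward FLOPs per token for one MoE layer.

    Assumption: each expert is a two-matrix FFN (up and down projection);
    router weights, attention, and gated variants are ignored.
    """
    per_expert = 2 * d_model * d_ff        # up projection + down projection
    total_params = num_experts * per_expert  # what memory has to hold
    active_params = top_k * per_expert       # what one token actually touches
    flops_per_token = 2 * active_params      # multiply + add per active weight
    return total_params, active_params, flops_per_token


# Hypothetical layer sizes chosen only for illustration.
total, active, flops = moe_ffn_counts(d_model=4096, d_ff=14336, num_experts=64, top_k=2)
print(f"total params:  {total:,}")
print(f"active params: {active:,}")
print(f"total/active:  {total // active}x")
```

Doubling `num_experts` in this sketch doubles `total_params` while leaving `flops_per_token` unchanged, so memory and communication pressure scale with E even though compute per token scales only with k.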