VLM | Duo's Tech Blog

Four layers of foundation-model kernel co-design: on-chip dataflow, operator graph, multi-GPU dataflow, model structure

Foundation Model Kernel Optimization in 2026: A Field Guide Across Dense, MoE, Multimodal, and Diffusion

Most “kernel optimization” conversations still start with FLOPs. In 2026 that is usually the wrong first question. Foundation-model runtime is dominated by HBM traffic, KV cache, temporary tensors, collectives, dynamic permutation, and launch overhead. FlashAttention, fused linear–cross-entropy, paged KV, MoE grouped GEMM, and DeepEP-style dispatch all share one essence: do not materialize intermediates, or make each byte travel once. ...

Cumulative speedup across five VLM training optimization layers

Optimizing VLM Training on One GPU: A Five-Layer Recipe

How I got SiQ-VL from 14,713 to 100,923 real tokens per second on a single Blackwell GPU, and the four places that surprised me along the way. TL;DR: I trained a small vision-language model (SigLIP-2 vision tower + Qwen2.5 LLM, projector-aligned) on one NVIDIA RTX PRO 6000 Blackwell, ran a 48-configuration sweep across two model sizes and both training stages, and ended up with a recipe that compounds five optimization layers. ...

Streaming-first multimodal data pipeline

Building a Truly Scalable Multimodal Data Pipeline: A Streaming-First View

Most “Scalable” Multimodal Pipelines Don’t Survive Foundation-Model Scale. A lot of multimodal pipelines claim to scale. In practice, they often depend on at least one of the following: global shuffles (groupBy/join/repartition), materializing massive intermediate datasets, centralized coordination that becomes a bottleneck, or brittle recovery logic (rerun-the-world on failure). That works for demos. It breaks at foundation-model scale. This series is about a different design point: a streaming-first multimodal pipeline that scales linearly with data and hardware, with no global shuffle, and resumable at partition granularity. ...

Three-stage SiQ-VL curriculum: alignment, instruction, offline CoT

SiQ-VL: A Curriculum for Small VLMs When Compute Is the Hard Constraint

Most VLM writeups assume a cluster. SiQ-VL started from the opposite constraint: one (or a few) GPUs, and the question was which design choices still buy capability when you cannot buy FLOPs. This post is the consolidated field guide for that project, covering architecture, token economics, staged training, and offline Chain-of-Thought (CoT) distillation; it replaces three earlier notes that said the same thing three ways. Kernel-level throughput (how we pushed Stage-1 from ~15K to ~100K real tokens/s on Blackwell) lives in the companion post: Optimizing VLM Training on One GPU. ...