Optimizing VLM Training on One GPU: A Five-Layer Recipe

How I got SiQ-VL from 14,713 to 100,923 real tokens per second on a single Blackwell GPU, and the four places that surprised me along the way.
TL;DR: I trained a small vision-language model (SigLIP-2 vision tower + Qwen2.5 LLM, projector-aligned) on one NVIDIA RTX PRO 6000 Blackwell, ran a 48-configuration sweep across two model sizes and both training stages, and ended up with a recipe that compounds five optimization layers. ...

May 24, 2026 · 14 min · Duo An

Building a Truly Scalable Multimodal Data Pipeline: A Streaming-First View

Most “Scalable” Multimodal Pipelines Don’t Survive Foundation-Model Scale
A lot of multimodal pipelines claim to scale. In practice, they often depend on at least one of the following: global shuffles (groupBy/join/repartition), materialized massive intermediate datasets, centralized coordination that becomes a bottleneck, or brittle recovery logic (rerun-the-world on failure). That works for demos. It breaks at foundation-model scale.
This series is about a different design point: a streaming-first multimodal pipeline that scales linearly with data and hardware, uses no global shuffle, and is resumable at partition granularity. ...

December 22, 2025 · 4 min · Duo An

A Minimal Recipe for Training VLMs Under Compute Constraints

The Problem Nobody Likes to Admit
Most Vision-Language Models assume one thing: you can always add more GPUs. But many of us can’t. When compute is the hard constraint, the question changes from “How do we scale?” to “What actually still works?” Based on building SiQ-VL, here is a minimal, battle-tested recipe for training VLMs when GPUs are scarce.
The Minimal Recipe (TL;DR)
If you remember nothing else, remember this: ...

December 15, 2025 · 3 min · Duo An

What Offline CoT Distillation Taught Us About Small Vision-Language Models

Reasoning Is Expensive, But It Doesn’t Have to Be
Reasoning is one of the most expensive capabilities to train in Vision-Language Models. Most recent approaches rely on very large models, long context windows, online teacher–student setups, or reinforcement learning. All of these assume abundant compute. In the SiQ-VL project, we had none of that. What we did have was a question: can a small VLM learn to reason if we only change how we train it? ...

December 11, 2025 · 4 min · Duo An

SiQ-VL: Training a Reasoning-Capable VLM When You’re GPU-Poor

When You Can’t Afford Scale
Most modern Vision-Language Models (VLMs) are built under one assumption: you have access to massive GPU clusters. But what if you don’t? SiQ-VL started from a very practical question: how far can we push a small Vision-Language Model when compute is the hard constraint? Instead of scaling parameters or training end-to-end, we focused on freezing aggressively, training in stages, and injecting reasoning via offline Chain-of-Thought (CoT) distillation. The result is a lightweight VLM that demonstrates emergent reasoning behavior under strict compute limits. ...

December 5, 2025 · 3 min · Duo An