What Offline CoT Distillation Taught Us About Small Vision-Language Models

Reasoning Is Expensive — But It Doesn’t Have to Be

Reasoning is one of the most expensive capabilities to train in Vision-Language Models. Most recent approaches rely on very large models, long context windows, online teacher–student setups, or reinforcement learning. All of these assume abundant compute. In the SiQ-VL project, we had none of that. What we did have was a question: Can a small VLM learn to reason if we only change how we train it? ...

December 11, 2025 · 4 min · Duo An
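To make "offline CoT distillation" concrete: instead of querying a teacher model during student training (an online setup), the teacher's reasoning traces are generated once ahead of time and cached, then reused as ordinary supervised targets. The snippet below is a minimal sketch of that precompute step; the `teacher_generate` stub, the prompt format, and the JSONL layout are illustrative assumptions, not details taken from the post.

```python
import json
from pathlib import Path

# Offline CoT distillation, step 1: run the (large) teacher ONCE over the
# training questions and cache its reasoning traces to disk. Training the
# small student later only reads this file; no teacher is needed online.
# `teacher_generate` is a placeholder for whatever large VLM you can query.

def teacher_generate(image_path: str, question: str) -> str:
    # Stub standing in for an API/model call that returns a CoT-style answer.
    return "Let's think step by step. <reasoning...> Final answer: <answer>"

def build_offline_cot_dataset(samples, out_file: str) -> None:
    with Path(out_file).open("w", encoding="utf-8") as f:
        for image_path, question in samples:
            trace = teacher_generate(image_path, question)
            record = {
                "image": image_path,   # input image for the student VLM
                "question": question,  # original prompt
                "target": trace,       # teacher's reasoning trace plus answer
            }
            f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    build_offline_cot_dataset(
        [("img_000.jpg", "How many red objects are in the picture?")],
        "cot_traces.jsonl",
    )
```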

SiQ-VL: Training a Reasoning-Capable VLM When You’re GPU-Poor

When You Can’t Afford Scale

Most modern Vision-Language Models (VLMs) are built under one assumption: you have access to massive GPU clusters. But what if you don’t? SiQ-VL started from a very practical question: How far can we push a small Vision-Language Model when compute is the hard constraint? Instead of scaling parameters or training end-to-end, we focused on freezing aggressively, training in stages, and injecting reasoning via offline Chain-of-Thought (CoT) distillation. The result is a lightweight VLM that demonstrates emergent reasoning behavior under strict compute limits. ...

December 5, 2025 · 3 min · Duo An
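As a rough illustration of the freeze-heavy, staged recipe described in the excerpt above: freeze the vision encoder and the language model, train only a small projector first, and supervise the student with the pre-generated CoT traces. Everything below (module shapes, dummy components, hyperparameters) is a hypothetical PyTorch sketch under those assumptions, not SiQ-VL's actual architecture or code.

```python
import torch
import torch.nn as nn

# Illustrative placeholders, not SiQ-VL's real components.
class DummyVisionEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Linear(3 * 32 * 32, dim)

    def forward(self, images):              # images: (B, 3, 32, 32)
        return self.net(images.flatten(1))  # (B, dim)

class DummyLanguageModel(nn.Module):
    def __init__(self, dim=128, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, prefix, tokens):      # prefix: (B, dim), tokens: (B, T)
        h = self.embed(tokens) + prefix.unsqueeze(1)
        return self.head(h)                 # (B, T, vocab) logits

class TinyVLM(nn.Module):
    def __init__(self, vis_dim=64, txt_dim=128):
        super().__init__()
        self.vision = DummyVisionEncoder(vis_dim)      # frozen
        self.lm = DummyLanguageModel(txt_dim)          # frozen in stage 1
        self.projector = nn.Linear(vis_dim, txt_dim)   # only trainable piece in stage 1

    def forward(self, images, tokens):
        prefix = self.projector(self.vision(images))
        return self.lm(prefix, tokens)

model = TinyVLM()

# Stage 1: freeze both backbones; only the projector receives gradients.
for module in (model.vision, model.lm):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# One training step on an offline CoT-distilled example: the target tokens are
# a teacher-written reasoning trace plus answer, generated once ahead of time,
# so no teacher model is needed at student-training time.
images = torch.randn(2, 3, 32, 32)
cot_tokens = torch.randint(0, 1000, (2, 16))   # pre-tokenized teacher trace
logits = model(images, cot_tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), cot_tokens[:, 1:].reshape(-1)
)
loss.backward()
optimizer.step()
```

A later stage could unfreeze part of the language model and fine-tune on the same cached traces; the point of the sketch is only that the trainable parameter count in the first stage is tiny compared to the full model.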