SiQ-VL: Training a Reasoning-Capable VLM When You’re GPU-Poor

When You Can’t Afford Scale

Most modern Vision-Language Models (VLMs) are built under one assumption: you have access to massive GPU clusters. But what if you don’t? SiQ-VL started from a very practical question: how far can we push a small Vision-Language Model when compute is the hard constraint? Instead of scaling parameters or training end-to-end, we focused on three things: freezing aggressively, training in stages, and injecting reasoning via offline Chain-of-Thought (CoT) distillation. The result is a lightweight VLM that demonstrates emergent reasoning behavior under strict compute limits. ...
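To make the "freeze aggressively, train in stages" idea concrete, here is a minimal sketch in PyTorch. All module names and dimensions (`TinyVLM`, `freeze_for_stage`, the stage split) are illustrative assumptions for this post, not SiQ-VL's actual architecture or schedule: the point is only that each stage unfreezes a small subset of parameters, so the optimizer state and gradients stay tiny on limited GPUs.

```python
# Hypothetical sketch: freeze the vision encoder and LLM, train only a small
# projector in stage 1, then additionally unfreeze the LLM in stage 2.
# These modules are toy stand-ins, not the real SiQ-VL components.
import torch.nn as nn


class TinyVLM(nn.Module):
    def __init__(self, vis_dim=32, llm_dim=64, vocab=100):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 16 * 16, vis_dim)  # stand-in for a ViT
        self.projector = nn.Linear(vis_dim, llm_dim)           # the cheap part we train first
        self.llm = nn.Linear(llm_dim, vocab)                   # stand-in for the language model

    def forward(self, pixels):
        return self.llm(self.projector(self.vision_encoder(pixels)))


def freeze_for_stage(model: TinyVLM, stage: int) -> None:
    """Stage 1: train the projector only. Stage 2: also unfreeze the LLM."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    if stage >= 2:
        for p in model.llm.parameters():
            p.requires_grad = True


model = TinyVLM()
freeze_for_stage(model, stage=1)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"stage 1 trainable params: {trainable}/{total}")
```

In this toy setup, stage 1 trains only the projector's weights, a small fraction of the model; passing `filter(lambda p: p.requires_grad, model.parameters())` to the optimizer then keeps both gradient memory and optimizer state proportional to that fraction.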

December 5, 2025 · 3 min · Duo An