Optimizing VLM Training on One GPU: A Five-Layer Recipe

Sun, 24 May 2026 00:00:00 +0000

How I got SiQ-VL from 14,713 to 100,923 real tokens per second on a single Blackwell GPU, and the four places that surprised me along the way.

TL;DR

I trained a small vision-language model (SigLIP-2 vision tower + Qwen2.5 LLM, projector-aligned) on one NVIDIA RTX PRO 6000 Blackwell, ran a 48-configuration sweep across two model sizes and both training stages, and ended up with a recipe that compounds five optimization layers.

BF16 on Duo's Tech Blog

TL;DR