<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>BF16 on Duo&#39;s Tech Blog</title>
    <link>https://duoan.github.io/tags/bf16/</link>
    <description>Recent content in BF16 on Duo&#39;s Tech Blog</description>
    <image>
      <title>Duo&#39;s Tech Blog</title>
      <url>https://duoan.github.io/images/papermod-cover.png</url>
      <link>https://duoan.github.io/images/papermod-cover.png</link>
    </image>
    <generator>Hugo -- 0.153.1</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 24 May 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://duoan.github.io/tags/bf16/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Optimizing VLM Training on One GPU: A Five-Layer Recipe</title>
      <link>https://duoan.github.io/posts/optimizing-vlm-training-on-one-gpu/</link>
      <pubDate>Sun, 24 May 2026 00:00:00 +0000</pubDate>
      <guid>https://duoan.github.io/posts/optimizing-vlm-training-on-one-gpu/</guid>
      <description>&lt;h1 id=&#34;optimizing-vlm-training-on-one-gpu-a-five-layer-recipe&#34;&gt;Optimizing VLM Training on One GPU: A Five-Layer Recipe&lt;/h1&gt;
&lt;p&gt;How I got SiQ-VL from &lt;code&gt;14,713&lt;/code&gt; to &lt;code&gt;100,923&lt;/code&gt; real tokens per second (roughly a 6.9x speedup) on a single Blackwell GPU, and the four places that surprised me along the way.&lt;/p&gt;
&lt;h2 id=&#34;tldr&#34;&gt;TL;DR&lt;/h2&gt;
&lt;p&gt;I trained a small vision-language model (SigLIP-2 vision tower + Qwen2.5 LLM, projector-aligned) on one NVIDIA RTX PRO 6000 Blackwell, ran a 48-configuration sweep across two model sizes and both training stages, and ended up with a recipe that compounds five optimization layers.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
