Welcome to My Tech Blog

👋 Hi! Welcome to my tech blog.

  • Here I share technical notes, development experiences, learning insights, and engineering thoughts.
  • Topics cover software development, machine learning, system design, algorithms, and various technical domains.
  • Feel free to explore and discuss! 💡

Building a Truly Scalable Multimodal Data Pipeline: A Streaming-First View

Most “Scalable” Multimodal Pipelines Don’t Survive Foundation-Model Scale

A lot of multimodal pipelines claim to scale. In practice, they often depend on at least one of the following: global shuffles (groupBy/join/repartition), materializing massive intermediate datasets, centralized coordination that becomes a bottleneck, or brittle recovery logic (rerun-the-world on failure). That works for demos. It breaks at foundation-model scale. This series is about a different design point: a streaming-first multimodal pipeline that scales linearly with data and hardware — with no global shuffle, and resumable at partition granularity. ...
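As a rough illustration of what partition-granularity resumability can look like, here is a minimal sketch, not the pipeline from the series: the shard layout, file names, and process_record transform are placeholders I am assuming for the example.

  # Sketch: stream each partition independently and checkpoint per partition,
  # so a failed run resumes by skipping partitions whose output already exists.
  import json
  from pathlib import Path

  def process_record(record: dict) -> dict:
      # Placeholder per-record transform (decode, filter, tokenize, ...).
      return {"id": record["id"], "ok": True}

  def run_partition(in_path: Path, out_dir: Path) -> None:
      out_path = out_dir / (in_path.stem + ".done.jsonl")
      if out_path.exists():          # already finished: resume for free
          return
      tmp_path = out_path.with_suffix(".tmp")
      with in_path.open() as src, tmp_path.open("w") as dst:
          for line in src:           # pure streaming: O(1) memory, no shuffle
              dst.write(json.dumps(process_record(json.loads(line))) + "\n")
      tmp_path.rename(out_path)      # atomic commit marks the partition done

  def run_all(in_dir: str, out_dir: str) -> None:
      out = Path(out_dir)
      out.mkdir(parents=True, exist_ok=True)
      for shard in sorted(Path(in_dir).glob("*.jsonl")):
          run_partition(shard, out)  # embarrassingly parallel across workers

Because each partition commits its own output atomically, workers can be added or restarted freely and recovery never requires rerunning the whole job.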

December 22, 2025 · 4 min · Duo An

A Minimal Recipe for Training VLMs Under Compute Constraints

The Problem Nobody Likes to Admit

Most Vision-Language Models assume one thing: you can always add more GPUs. But many of us can’t. When compute is the hard constraint, the question changes from “How do we scale?” to “What actually still works?” Based on building SiQ-VL, here is a minimal, battle-tested recipe for training VLMs when GPUs are scarce.

The Minimal Recipe (TL;DR)

If you remember nothing else, remember this: ...

December 15, 2025 · 3 min · Duo An

What Offline CoT Distillation Taught Us About Small Vision-Language Models

Reasoning Is Expensive — But It Doesn’t Have to Be

Reasoning is one of the most expensive capabilities to train in Vision-Language Models. Most recent approaches rely on very large models, long context windows, online teacher–student setups, or reinforcement learning. All of these assume abundant compute. In the SiQ-VL project, we had none of that. What we did have was a question: can a small VLM learn to reason if we only change how we train it? ...
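For readers new to the term, the offline half of CoT distillation boils down to: query the teacher once per example, cache its full reasoning trace, then fine-tune the small model on the cached traces with an ordinary supervised objective. A minimal sketch of that data-preparation step, where the teacher callable, prompt fields, and file format are assumptions rather than the SiQ-VL setup:

  # Sketch: cache teacher reasoning traces once, offline, for later student SFT.
  import json
  from pathlib import Path
  from typing import Callable

  def build_cot_dataset(samples: list[dict],
                        teacher: Callable[[str, str], str],
                        out_file: str) -> None:
      with Path(out_file).open("w") as f:
          for s in samples:
              trace = teacher(s["image"], s["question"])  # expensive, but run only once
              f.write(json.dumps({
                  "image": s["image"],
                  "question": s["question"],
                  # The student is later trained to reproduce this trace, so
                  # reasoning supervision adds no extra cost at student-training time.
                  "target": trace,
              }) + "\n")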

December 11, 2025 · 4 min · Duo An

SiQ-VL: Training a Reasoning-Capable VLM When You’re GPU-Poor

When You Can’t Afford Scale

Most modern Vision-Language Models (VLMs) are built under one assumption: you have access to massive GPU clusters. But what if you don’t? SiQ-VL started from a very practical question: how far can we push a small Vision-Language Model when compute is the hard constraint? Instead of scaling parameters or training end-to-end, we focused on freezing aggressively, training in stages, and injecting reasoning via offline Chain-of-Thought (CoT) distillation. The result is a lightweight VLM that demonstrates emergent reasoning behavior under strict compute limits. ...
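As a generic illustration of “freeze aggressively, train in stages,” here is a sketch in plain PyTorch; the module names (vision_encoder, projector, language_model) are assumptions about a typical VLM layout, not SiQ-VL’s actual code.

  # Sketch: stage-wise training with aggressive freezing. Stage 1 trains only a
  # small projector between frozen encoder and frozen LM; stage 2 unfreezes the LM.
  import torch.nn as nn

  def set_trainable(module: nn.Module, trainable: bool) -> None:
      for p in module.parameters():
          p.requires_grad = trainable

  def configure_stage(model: nn.Module, stage: int) -> list[nn.Parameter]:
      set_trainable(model.vision_encoder, False)        # frozen in every stage
      set_trainable(model.projector, True)              # always trained
      set_trainable(model.language_model, stage >= 2)   # unfrozen only in later stages
      return [p for p in model.parameters() if p.requires_grad]

  # Usage: optimizer = torch.optim.AdamW(configure_stage(model, stage=1), lr=1e-4)

The payoff is that early stages fit on far less memory, since frozen modules need no optimizer state or gradients.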

December 5, 2025 · 3 min · Duo An

From Scaling Laws to Cluster Size: A Practical Guide to Planning Large-Scale Model Training

Why Capacity Planning Is the Hardest Part of Large Model Training

Before you write a single line of training code, you must answer a few brutal questions: How many tokens do I actually need? What sequence length should I train on? How many GPUs will this take? How long will it run? What parallelism strategy makes this feasible? Most teams get this wrong — not because they lack theory, but because they never connect scaling laws → systems constraints. ...
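As a back-of-the-envelope sketch of connecting scaling laws to cluster size (not necessarily the planning flow from the post), this uses the common C ≈ 6·N·D compute approximation and a Chinchilla-style ~20 tokens per parameter; the peak-throughput and MFU defaults are assumptions you should replace with your own numbers.

  # Back-of-the-envelope capacity planning: tokens needed, total FLOPs, wall-clock days.
  def plan_training(params_B: float, n_gpus: int,
                    tokens_per_param: float = 20.0,   # Chinchilla-style ratio (assumption)
                    peak_tflops: float = 989.0,       # e.g. H100 BF16 dense peak
                    mfu: float = 0.40):               # assumed model FLOPs utilization
      n_params = params_B * 1e9
      tokens = tokens_per_param * n_params            # D ≈ 20 * N
      total_flops = 6.0 * n_params * tokens           # C ≈ 6 * N * D
      cluster_flops_per_s = n_gpus * peak_tflops * 1e12 * mfu
      days = total_flops / cluster_flops_per_s / 86400
      return {"tokens": tokens, "total_flops": total_flops, "days": days}

  # Example: a 7B model on 256 GPUs at 40% MFU needs ~1.4e11 tokens,
  # ~5.9e21 FLOPs, and roughly two-thirds of a day of wall-clock time.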

February 2, 2025 · 9 min · Duo An