LLM DCAI Survey

A data-centric AI Survey
Published October 4, 2025

            <section id="introduction" class="content-section active">
                <div class="text-center">
                    <h1 class="text-4xl font-bold tracking-tight text-gray-900 sm:text-5xl">A Survey of Data Practices for Multimodal Large Language Models</h1>
                    <p class="mt-6 text-lg leading-8 text-gray-600">An interactive overview of the data-centric techniques shaping the frontier of multimodal AI, with a focus on research from the last three years.</p>
                </div>

                <div class="mt-16">
                    <h2 class="text-2xl font-bold text-gray-900">Abstract</h2>
                    <p class="mt-4 text-gray-600">The performance of Multimodal Large Language Models (MLLMs) is critically dependent on the quality, scale, and processing of their training data. This survey provides a comprehensive analysis of the data pipelines used to train state-of-the-art models, synthesizing findings from academic papers, technical reports, and open-source contributions. We introduce a taxonomy of data practices—from sourcing and cleaning to advanced techniques like semantic deduplication, curriculum learning, and synthetic data generation. By examining models from leading global research labs, we identify emerging best practices, persistent challenges, and key open problems, offering a structured perspective on the data-centric AI paradigm in the multimodal era.</p>
                </div>

                <div class="mt-16">
                    <h2 class="text-2xl font-bold text-gray-900">Literature Map: The MLLM Landscape</h2>
                    <p class="mt-4 text-gray-600">The MLLM ecosystem can be categorized by modality and training stage. This map provides a high-level overview of the major model archetypes and their primary data inputs. The field is rapidly evolving from single-modality expertise to truly omni-modal understanding.</p>
                    <div class="mt-8 grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-6">
                        <div class="card p-6">
                            <h3 class="font-semibold text-lg text-gray-800">Image-Text Models</h3>
                            <p class="text-sm text-gray-500 mt-2">Core modality. Models like CLIP, Flamingo, and LLaVA established foundational techniques for aligning vision and language representations through contrastive and generative pre-training.</p>
                            <div class="mt-4 flex flex-wrap gap-2">
                                <span class="tag">Base Models</span><span class="tag">Instruction-Tuned</span>
                            </div>
                        </div>
                         <div class="card p-6">
                            <h3 class="font-semibold text-lg text-gray-800">Video-Text Models</h3>
                            <p class="text-sm text-gray-500 mt-2">Extends image-text understanding to the temporal dimension. Challenges include efficient frame sampling, audio track integration, and modeling long-range dependencies. Examples include Video-LLaMA and InternVideo.</p>
                            <div class="mt-4 flex flex-wrap gap-2">
                                <span class="tag">Temporal Reasoning</span><span class="tag">Audio-Visual</span>
                            </div>
                        </div>
                         <div class="card p-6">
                            <h3 class="font-semibold text-lg text-gray-800">Code-Multimodal Models</h3>
                            <p class="text-sm text-gray-500 mt-2">A frontier integrating natural language, source code, and visual interfaces (e.g., screenshots, GUIs). These models aim to automate complex software engineering tasks. DeepSeek-Coder and an increasing number of vision models are exploring this domain.</p>
                             <div class="mt-4 flex flex-wrap gap-2">
                                <span class="tag">GUI Navigation</span><span class="tag">Code Generation</span>
                            </div>
                        </div>
                    </div>
                </div>
            </section>

            <section id="pipeline" class="content-section">
                <h1 class="text-4xl font-bold tracking-tight text-gray-900">The Multimodal Data Pipeline</h1>
                <p class="mt-6 text-lg leading-8 text-gray-600">Optimizing MLLMs is fundamentally an exercise in data engineering. The following taxonomy breaks down the end-to-end process, from raw data acquisition to the final training batches. Each stage presents unique challenges and opportunities for performance improvement. Click a stage to learn more about the associated techniques.</p>

                <div class="mt-12">
                    <div class="chart-container" style="height: 500px; max-height: 60vh;">
                         <canvas id="pipelineChart"></canvas>
                    </div>
                </div>

                <div id="pipeline-details" class="mt-16 space-y-12">
                    <div class="card p-6">
                        <h3 class="text-xl font-semibold text-gray-800">1. Sourcing & Cleaning</h3>
                        <p class="mt-3 text-gray-600">The foundation of any model. Data is aggregated from web crawls (e.g., Common Crawl), academic datasets (e.g., ImageNet), and licensed private sources. Initial cleaning involves removing low-quality content, NSFW filtering (often using smaller, specialized models), and PII redaction. The scale is massive, but quality is paramount. For example, LAION-5B was built on Common Crawl but required extensive CLIP-based filtering to achieve its quality.</p>
                    </div>
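
A minimal sketch of the CLIP-score filtering step described above, using the Hugging Face `transformers` CLIP model. The 0.28 threshold is an illustrative value; real pipelines tune it per subset and language.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.28) -> bool:
    # Drop image-text pairs whose alignment falls below the (illustrative) threshold.
    return clip_score(image, caption) >= threshold
```
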
                    <div class="card p-6">
                        <h3 class="text-xl font-semibold text-gray-800">2. Enrichment & Filtering</h3>
                        <p class="mt-3 text-gray-600">Raw data is enriched with better metadata. This includes running Optical Character Recognition (OCR) on images to extract text, Automatic Speech Recognition (ASR) on audio, and improving image captions using more powerful captioning models. Filtering becomes more sophisticated here, with models like Qwen-VL and InternVL using aesthetic classifiers and text-image alignment scores to prune the data, a key technique discussed in the Data-Juicer library.</p>
                    </div>
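
A hedged sketch of the enrichment step: attach OCR text and simple quality signals to each sample, then filter on them. `pytesseract` stands in for whatever OCR engine a production pipeline would use, and the thresholds are illustrative.

```python
from dataclasses import dataclass, field
from PIL import Image
import pytesseract  # example OCR engine; large pipelines often use stronger models

@dataclass
class Sample:
    image_path: str
    caption: str
    metadata: dict = field(default_factory=dict)

def enrich(sample: Sample) -> Sample:
    """Attach OCR text and simple quality signals as metadata."""
    image = Image.open(sample.image_path).convert("RGB")
    sample.metadata["ocr_text"] = pytesseract.image_to_string(image).strip()
    sample.metadata["resolution"] = image.size          # (width, height)
    sample.metadata["caption_len"] = len(sample.caption.split())
    return sample

def passes_filters(sample: Sample, min_side: int = 256, min_caption_words: int = 3) -> bool:
    # Illustrative thresholds; real pipelines also apply aesthetic classifiers and
    # image-text alignment scores (see the CLIP-score sketch above).
    w, h = sample.metadata["resolution"]
    return min(w, h) >= min_side and sample.metadata["caption_len"] >= min_caption_words
```
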
                    <div class="card p-6">
                        <h3 class="text-xl font-semibold text-gray-800">3. Multilinguality & Deduplication</h3>
                        <p class="mt-3 text-gray-600">To build globally competent models, multilingual data is essential. This stage involves language identification and balancing. More critically, deduplication is performed to improve efficiency and reduce memorization. This has evolved from near-duplicate removal (e.g., MinHash) to advanced semantic deduplication, where embeddings are used to find and remove conceptually similar image-text pairs across different languages.</p>
                    </div>
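
A minimal semantic-deduplication sketch: embed captions (or images) with any encoder, then greedily drop items whose nearest kept neighbour exceeds a cosine-similarity threshold. The 0.9 threshold and the brute-force search are illustrative; at web scale this is done with approximate nearest-neighbour indexes (e.g., FAISS) and cluster-then-dedup strategies.

```python
import numpy as np

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Return indices of items to keep, greedily dropping near-duplicates.

    embeddings: (n, d) array from any image or text encoder.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(normed)):
        if not kept:
            kept.append(i)
            continue
        sims = normed[kept] @ normed[i]       # cosine similarity to all kept items
        if sims.max() < threshold:
            kept.append(i)
    return kept

# Usage: keep_idx = semantic_dedup(caption_embeddings, threshold=0.9)
```
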
                     <div class="card p-6">
                        <h3 class="text-xl font-semibold text-gray-800">4. Pre-training Tasks & Objectives</h3>
                        <p class="mt-3 text-gray-600">The processed data is formatted for specific pre-training objectives. Common objectives include next-token prediction (generative), masked language/image modeling (BERT-style), and contrastive alignment (CLIP-style). Models like Qwen-VL and GLM-Vision use a mix of these tasks to learn robust, general-purpose representations.</p>
                    </div>
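
A sketch of the CLIP-style contrastive objective listed above (symmetric InfoNCE over a batch of paired image and text embeddings); the generative objective is simply cross-entropy over shifted tokens, so it is omitted here.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matching image/text pairs share the same batch index."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)             # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)         # text -> image
    return (loss_i2t + loss_t2i) / 2
```
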
                     <div class="card p-6">
                        <h3 class="text-xl font-semibold text-gray-800">5. Curriculum & Augmentation</h3>
                        <p class="mt-3 text-gray-600">Instead of feeding data randomly, a curriculum is often designed. This can involve starting with simple, low-resolution images and gradually increasing complexity and resolution (as seen in the InternVL report). Augmentation techniques like rewriting captions for better detail or randomly masking parts of an image are also applied to make the model more robust.</p>
                    </div>
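
A toy sketch of a resolution curriculum of the kind described above: the training image resolution grows with progress through training. The breakpoints and sizes are made up for illustration and are not InternVL's actual schedule.

```python
def resolution_for_step(step: int, total_steps: int,
                        stages=((0.3, 224), (0.7, 448), (1.0, 896))) -> int:
    """Return the square image resolution to train at for a given step.

    `stages` maps a fraction of training progress to a resolution; values are illustrative.
    """
    progress = step / max(total_steps, 1)
    for frac, res in stages:
        if progress <= frac:
            return res
    return stages[-1][1]

# e.g. resolution_for_step(10_000, 100_000) -> 224  (early, low resolution)
#      resolution_for_step(90_000, 100_000) -> 896  (late, high resolution)
```
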
                    <div class="card p-6">
                        <h3 class="text-xl font-semibold text-gray-800">6. Synthetic Data Generation</h3>
                        <p class="mt-3 text-gray-600">A major recent trend is using powerful models (like GPT-4) to generate high-quality training data. This includes self-instruct methods for creating instruction-following examples, using LLMs as judges to rate and filter data (RLAIF - Reinforcement Learning from AI Feedback), and generating synthetic captions, code, or structured data like charts and tables. This is crucial for bootstrapping capabilities in niche domains where data is scarce.</p>
                    </div>
                </div>
            </section>
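
A skeleton of a self-instruct-style loop with an LLM-as-judge filter. `call_teacher` is a placeholder for whatever LLM API or local checkpoint you use, and the prompts and 1-10 scoring rule are simplified illustrations, not any lab's actual recipe.

```python
import json
import random

def call_teacher(prompt: str) -> str:
    """Placeholder for an LLM call (hosted API or local model); not a real client."""
    raise NotImplementedError

def generate_instructions(seed_examples: list[dict], n_new: int = 5) -> list[dict]:
    """Self-instruct style: show the teacher a few seeds, ask for new pairs as JSON lines."""
    demos = "\n".join(json.dumps(ex) for ex in random.sample(seed_examples, k=3))
    prompt = ("Here are instruction-response examples:\n"
              f"{demos}\n"
              f"Write {n_new} new, diverse examples as JSON lines with the same keys.")
    lines = call_teacher(prompt).splitlines()
    return [json.loads(line) for line in lines if line.strip()]

def judge_keep(example: dict, min_score: int = 7) -> bool:
    """LLM-as-judge filter: keep examples the teacher rates highly (1-10 scale)."""
    prompt = ("Rate the quality of this instruction-response pair from 1 to 10. "
              "Answer with a single integer.\n" + json.dumps(example))
    try:
        return int(call_teacher(prompt).strip()) >= min_score
    except ValueError:
        return False
```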

            <section id="datasets" class="content-section">
                 <h1 class="text-4xl font-bold tracking-tight text-gray-900">Key Open-Source Datasets</h1>
                 <p class="mt-6 text-lg leading-8 text-gray-600">The following datasets are cornerstones of MLLM pre-training and fine-tuning. Understanding their scale, modality, and licensing is crucial for researchers and developers.</p>
                 <div class="mt-12 grid grid-cols-1 md:grid-cols-2 gap-6">
                    <div class="card p-6">
                        <h3 class="font-semibold text-lg text-gray-800">Vision-Text</h3>
                        <ul class="mt-4 space-y-2 text-gray-600 list-disc list-inside">
                            <li><strong>LAION-5B:</strong> 5.85B image-text pairs. The largest open dataset, critical for many foundational models. License: CC-BY 4.0 for text, varies for images.</li>
                            <li><strong>COYO-700M:</strong> 700M image-text pairs. A large-scale alternative to LAION.</li>
                            <li><strong>WebVid-10M:</strong> 10.7M video-text pairs. Widely used for video-language pre-training.</li>
                        </ul>
                    </div>
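
Corpora at LAION scale are usually streamed rather than downloaded outright. A sketch with the Hugging Face `datasets` library; the repository id and column names are assumptions, so check the dataset card you actually use.

```python
from datasets import load_dataset

# Stream a LAION subset without downloading the full corpus.
# The repo id and the "url"/"caption" column names are illustrative and vary
# across LAION releases; confirm them against the dataset card.
stream = load_dataset("laion/laion400m", split="train", streaming=True)

for i, sample in enumerate(stream):
    print(sample.get("url"), sample.get("caption"))
    if i >= 4:   # just peek at the first few records
        break
```
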
                     <div class="card p-6">
                        <h3 class="font-semibold text-lg text-gray-800">Document & Chart</h3>
                        <ul class="mt-4 space-y-2 text-gray-600 list-disc list-inside">
                            <li><strong>DocVQA:</strong> Question answering on document images. Crucial for layout understanding.</li>
                            <li><strong>ChartQA:</strong> Question answering on charts and plots. Tests reasoning over structured graphics.</li>
                            <li><strong>AI2D:</strong> Diagram parsing and question answering.</li>
                        </ul>
                    </div>
                    <div class="card p-6">
                        <h3 class="font-semibold text-lg text-gray-800">Code & Code-Multimodal</h3>
                        <ul class="mt-4 space-y-2 text-gray-600 list-disc list-inside">
                            <li><strong>The Stack v2:</strong> A 6.4TB dataset of source code in 600+ languages. Foundation for models like StarCoder.</li>
                            <li><strong>SWE-Bench:</strong> Evaluating code generation on real-world software engineering issues from GitHub.</li>
                            <li><strong>RepoBench:</strong> A benchmark for repository-level code understanding and completion.</li>
                        </ul>
                    </div>
                     <div class="card p-6">
                        <h3 class="font-semibold text-lg text-gray-800">Audio & Speech</h3>
                         <ul class="mt-4 space-y-2 text-gray-600 list-disc list-inside">
                            <li><strong>LibriSpeech:</strong> 1000 hours of English speech. A standard for ASR benchmarking.</li>
                            <li><strong>AudioSet:</strong> 2M human-labeled 10-second sound clips drawn from YouTube videos.</li>
                            <li><strong>Common Voice:</strong> A massive multilingual transcribed speech corpus from Mozilla.</li>
                        </ul>
                    </div>
                 </div>
            </section>

            <section id="benchmarks" class="content-section">
                <h1 class="text-4xl font-bold tracking-tight text-gray-900">Benchmarks & Leaderboards</h1>
                <p class="mt-6 text-lg leading-8 text-gray-600">Evaluating MLLMs is a complex, multi-faceted challenge. A diverse set of benchmarks is used to probe different capabilities, from general perception to specialized reasoning. Leaderboards aggregate these results, providing a competitive snapshot of the field.</p>
                
                <div class="mt-12">
                     <div class="flex items-center justify-between">
                        <h2 class="text-2xl font-bold text-gray-900">VLM Benchmark Performance</h2>
                        <select id="benchmarkSelector" class="rounded-md border-gray-300 shadow-sm focus:border-amber-500 focus:ring-amber-500">
                            <option value="mmbench">MMBench</option>
                            <option value="seed">SEED-Bench</option>
                            <option value="mmmu">MMMU</option>
                        </select>
                     </div>
                    <div class="mt-6 chart-container">
                        <canvas id="benchmarkChart"></canvas>
                    </div>
                    <p class="text-center text-sm text-gray-500 mt-2">Note: Scores are illustrative and based on publicly reported results circa late 2024. Performance varies with model versions and evaluation settings.</p>
                </div>

                <div class="mt-16 grid grid-cols-1 md:grid-cols-2 gap-8">
                    <div>
                       <h3 class="font-semibold text-lg text-gray-800">Key VLM Benchmarks</h3>
                       <ul class="mt-4 space-y-2 text-gray-600 list-disc list-inside">
                           <li><strong>MMBench / MME:</strong> Multi-choice QA benchmarks testing perception and reasoning across many domains.</li>
                           <li><strong>MMMU:</strong> A challenging benchmark requiring college-level subject knowledge and deliberate reasoning.</li>
                           <li><strong>SEED-Bench:</strong> Evaluates complex spatial and temporal understanding in images and videos.</li>
                           <li><strong>MathVista:</strong> Focuses on visual mathematical reasoning, a key challenge for MLLMs.</li>
                       </ul>
                    </div>
### Prominent Leaderboards

- **Hugging Face Open LLM/VLM Leaderboards:** Track performance on a suite of automated benchmarks.
- **LMSYS Chatbot Arena:** Crowdsourced human evaluation via pairwise model comparisons aggregated into Elo-style ratings (see the update sketch below).
- **OpenCompass:** A comprehensive evaluation framework from Shanghai AI Laboratory, featuring benchmarks like CMMLU.
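
A minimal version of the Elo update behind the Arena-style ranking mentioned above, applied after a single pairwise vote. The K-factor of 32 and the starting ratings are conventional illustrative choices; the production leaderboard's aggregation details differ.

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """One Elo update after a pairwise comparison between models A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Two models starting at 1000; A wins one vote:
# elo_update(1000, 1000, a_wins=True) -> (1016.0, 984.0)
```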
            
            <section id="analysis" class="content-section">
                <h1 class="text-4xl font-bold tracking-tight text-gray-900">Comparative Analysis of Data Practices</h1>
                <p class="mt-6 text-lg leading-8 text-gray-600">While most top-performing models use similar data pipeline stages, their specific implementation choices reveal different philosophies and priorities. This section provides a comparative look at the data practices of several major multimodal models.</p>

                <div class="mt-12">
                    <div class="overflow-x-auto bg-white rounded-lg shadow">
                        <table class="min-w-full divide-y divide-gray-200">
                            <thead class="bg-gray-50">
                                <tr>
                                    <th class="px-6 py-3 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Model</th>
                                    <th class="px-6 py-3 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Key Data Source(s)</th>
                                    <th class="px-6 py-3 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Noteworthy Data Technique</th>
                                </tr>
                            </thead>
                            <tbody class="bg-white divide-y divide-gray-200">
                                <tr>
                                    <td class="px-6 py-4 whitespace-nowrap font-medium text-gray-900">GPT-4o (OpenAI)</td>
                                    <td class="px-6 py-4 whitespace-nowrap text-gray-600">Proprietary mix of web data, licensed data (text, images, audio).</td>
                                    <td class="px-6 py-4 whitespace-nowrap text-gray-600">Massive scale, rejection sampling, and likely use of highly advanced synthetic data generation for reasoning.</td>
                                </tr>
                                <tr>
                                    <td class="px-6 py-4 whitespace-nowrap font-medium text-gray-900">Qwen-VL (Alibaba)</td>
                                    <td class="px-6 py-4 whitespace-nowrap text-gray-600">Public web data, internal e-commerce data.</td>
                                    <td class="px-6 py-4 whitespace-nowrap text-gray-600">Strong focus on high-resolution image understanding and OCR-heavy document data.</td>
                                </tr>
                                <tr>
                                    <td class="px-6 py-4 whitespace-nowrap font-medium text-gray-900">InternVL (Shanghai AI Lab)</td>
                                    <td class="px-6 py-4 whitespace-nowrap text-gray-600">Web-scale public datasets, academic datasets.</td>
                                    <td class="px-6 py-4 whitespace-nowrap text-gray-600">Progressive curriculum learning (from low to high resolution) and dynamic image resolution scaling.</td>
                                </tr>
                                <tr>
                                    <td class="px-6 py-4 whitespace-nowrap font-medium text-gray-900">DeepSeek-VL</td>
                                    <td class="px-6 py-4 whitespace-nowrap text-gray-600">Public web and code datasets.</td>
                                    <td class="px-6 py-4 whitespace-nowrap text-gray-600">Heavy emphasis on code-multimodal data, mixing code with text and images for a developer-focused model.</td>
                                </tr>
                                <tr>
                                    <td class="px-6 py-4 whitespace-nowrap font-medium text-gray-900">LLaMA 3 (Meta)</td>
                                    <td class="px-6 py-4 whitespace-nowrap text-gray-600">Public web data (15T+ tokens), carefully filtered.</td>
                                    <td class="px-6 py-4 whitespace-nowrap text-gray-600">Extensive use of smaller models as data filters and judges, aggressive deduplication, and a focus on high-quality text.</td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                </div>
            </section>

             <section id="china" class="content-section">
                <h1 class="text-4xl font-bold tracking-tight text-gray-900">Focus: Data Practices in China</h1>
                <p class="mt-6 text-lg leading-8 text-gray-600">Research labs and companies in China are major contributors to the MLLM landscape. Their work often involves unique data sources, multilingual considerations, and specialized benchmarks reflecting regional context.</p>

                <div class="mt-12 grid grid-cols-1 md:grid-cols-2 gap-6">
                    <div class="card p-6">
                        <h3 class="font-semibold text-lg text-gray-800">Key Models & Labs</h3>
                        <ul class="mt-4 space-y-2 text-gray-600 list-disc list-inside">
                           <li><strong>Qwen Series (Alibaba):</strong> Known for strong vision capabilities and multilingualism.</li>
                           <li><strong>InternVL (Shanghai AI Lab):</strong> Pushing the state-of-the-art in academic, open models.</li>
                           <li><strong>GLM Series (Zhipu AI / Tsinghua):</strong> Pioneers in bilingual (Chinese-English) models.</li>
                           <li><strong>Yi-VL (01.AI), Baichuan-Omni, DeepSeek-VL:</strong> Other major players with powerful open models.</li>
                        </ul>
                    </div>
                     <div class="card p-6">
                        <h3 class="font-semibold text-lg text-gray-800">Unique Data Considerations</h3>
                        <ul class="mt-4 space-y-2 text-gray-600 list-disc list-inside">
                            <li><strong>Multilingualism:</strong> A primary focus on building models that are natively proficient in both Chinese and English, requiring careful data balancing.</li>
                            <li><strong>Domain-Specific Data:</strong> Leveraging large-scale, proprietary datasets from e-commerce (Alibaba), social media/short video (ByteDance), and other local industries.</li>
                            <li><strong>Cultural Nuance:</strong> Curation of data to reflect cultural context, essential for performance on benchmarks like CMMLU and C-Eval.</li>
                            <li><strong>Legal & Licensing:</strong> Navigating China's data governance laws, which influences data sourcing and sharing practices.</li>
                       </ul>
                    </div>
                </div>
            </section>
            
            <section id="future" class="content-section">
                <h1 class="text-4xl font-bold tracking-tight text-gray-900">Open Problems & Future Directions</h1>
                <p class="mt-6 text-lg leading-8 text-gray-600">Despite rapid progress, significant challenges remain in the data-centric development of MLLMs. Addressing these issues will define the next generation of models.</p>

                <div class="mt-12 space-y-8">
                    <div>
                       <h3 class="font-semibold text-lg text-gray-800">Key Challenges</h3>
                       <ul class="mt-4 space-y-3 text-gray-600 list-disc list-inside">
                           <li><strong>Data Scarcity in a Sea of Data:</strong> While web-scale data is abundant, high-quality, domain-specific, and culturally diverse data remains a bottleneck.</li>
                           <li><strong>The Cost of Scale:</strong> Processing petabytes of multimodal data (especially video and audio) is computationally and financially prohibitive for many.</li>
                           <li><strong>Benchmark Contamination:</strong> As models train on more of the web, the risk of inadvertently training on benchmark test sets (data leakage) becomes a major threat to reliable evaluation.</li>
                           <li><strong>Synthetic Data Bias:</strong> Over-reliance on synthetic data can lead to models that inherit and amplify the biases and factual inaccuracies of the generator model, creating a feedback loop.</li>
                           <li><strong>Copyright and Licensing:</strong> The legal landscape for web-scraped training data is uncertain and poses a significant risk to both open and proprietary model development.</li>
                       </ul>
                    </div>
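
One common (if imperfect) contamination check referenced in the list above is n-gram overlap between training documents and benchmark test items. A minimal sketch; the choice of n = 8, whitespace tokenization, and the "any shared n-gram" decision rule are all illustrative, and reported pipelines vary on each.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Whitespace-tokenized n-grams of a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc: str, test_items: list[str], n: int = 8) -> bool:
    """Flag a training document if it shares any n-gram with a benchmark item."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in test_items)
```
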
### Future Directions

- **Automated Data Selection:** Developing methods to automatically identify and prioritize the most valuable data points for training, moving beyond simple filtering to active data selection (a toy scoring sketch follows this list).
- **Data Governance and Traceability:** Creating frameworks to track data provenance, licensing, and quality throughout the pipeline, enabling more responsible and reliable model building.
- **Online Data Flywheels:** Building systems where models can continuously learn from new, real-world interaction data in a safe and efficient manner.
- **Robust Evaluation:** Moving beyond static benchmarks to more dynamic, interactive, and adversarial evaluation methods that better reflect real-world performance, especially for complex tasks in code and multimodality.
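
A deliberately naive instance of the automated data selection direction above: score candidate texts with a small reference model's perplexity and keep the lowest-scoring fraction. Published methods use importance resampling, influence estimates, or learned raters instead; GPT-2 here is just a convenient stand-in for a small scorer.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # small reference scorer
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of a text under the reference model (lower = more 'natural')."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = model(**ids, labels=ids["input_ids"])
    return math.exp(out.loss.item())

def select_top_fraction(candidates: list[str], keep_frac: float = 0.3) -> list[str]:
    """Keep the lowest-perplexity fraction of candidates (a naive 'value' proxy)."""
    scored = sorted(candidates, key=perplexity)
    return scored[: max(1, int(len(scored) * keep_frac))]
```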
