daVinci-LLM: Towards the Science of Pretraining

daVinci-LLM is an open pretraining research project. We train models from scratch and release everything: data, training process, ablation results, and failed experiments, so you can build on our findings, not repeat our mistakes.

Current release: daVinci-3B matches OLMo3-7B, demonstrating that systematic, evidence-based methodology can unlock greater capability from smaller models.

🚀 Ongoing project: We’re continuously exploring new frontiers and will release models, data, and insights as they mature.

🎁 What We Release

Resource	Description
🤖 Model	daVinci-LLM-3B final checkpoint + all intermediate checkpoints
📊 Training Data	7.5T+ tokens of fully traceable, high-quality pretraining corpus
📄 Technical Report	Complete exploration process: data decisions, training dynamics, systematic ablations, and failed experiments
🔧 Pretraining Pipeline (Coming soon)	Integrated pipeline for data processing, training, and evaluation

🏛️ Three Pillars of Full Openness

daVinci-LLM is structured around three pillars, each contributing to transparency and reproducibility:

1. 📊 Data Transparency — The Data Darwinism Framework

We adopt the Data Darwinism framework to systematically organize data processing from L0 (raw acquisition) to L9 (full synthesis). Our 7.5T+ token corpus combines publicly available datasets with our own processed and openly released data—every source is annotated with its Darwin Level, making processing decisions transparent and enabling researchers to assess quality depth and reuse our data assets.

Level	Operation	What It Does
L0	Data Acquisition	Collect raw data from diverse sources
L1	Format Normalization	Convert heterogeneous formats into unified text
L2	Rule-Based Filtering	Remove duplicates, malformed text, non-target languages
L3	Model-Based Filtering	Assess educational value and domain relevance via classifiers
L4	Generative Refinement	Remove structural noise and repair content while preserving semantics
L5	Cognitive Completion	Make implicit reasoning explicit (e.g., expand compressed logical steps)
L6–L9	Higher-Order Synthesis	Contextual/environment/ecosystem synthesis (theoretical frontier)

📖 For the complete Data Darwinism framework: See Data Darwinism

2. 🎓 Training Transparency — Adaptive Two-Stage Curriculum

daVinci-LLM uses a dynamically monitored, adaptively adjusted two-stage curriculum:

Stage 1 (6T tokens): Builds broad foundations. Continuous evaluation reveals that general knowledge saturates early (~1T tokens) while code and science reasoning sustain growth beyond 4T—prompting progressive reallocation toward reasoning-intensive domains.
Stage 2 (2T tokens): Introduces structured QA data in a progressive curriculum. Stage 2-1 balances across domains to establish stability; Stage 2-2 intensifies QA concentration for targeted reasoning amplification—yielding a +12.14 gain.

3. 🧪 Scientific Transparency — 200+ Controlled Ablations

We transformed key pretraining decisions into systematically verifiable research questions. Through 200+ controlled experiments, we investigated:

📌 Does deeper data processing actually improve capabilities?

L3 filtering: Modest gains on basic tasks (+3.4 on MBPP)
L4 refinement: Substantial gains on complex reasoning (+7.0 on MATH)
L5 synthesis: Strong domain alignment but limited transfer
Insight: Processing depth is a complementary dimension to data volume scaling

📌 How should training adapt as capabilities mature differently?

General knowledge plateaus at ~1T tokens; reasoning grows past 4T
Domain rebalancing works initially, but hits limits
Format shift (introducing QA) unlocks further growth
Insight: No single mixture suffices—monitor and adapt

📌 Can we intensify reasoning without catastrophic forgetting?

Extreme specialization triggers collapse
Progressive strategy: balanced foundation (equal parts QA/code/science) → targeted intensification (70% QA)
Insight: Balance first, then intensify

📌 Are our evaluation metrics reliable?

PPL vs. generative evaluation can produce ranking reversals
High-QA models show protocol-specific artifacts
Insight: Report multiple protocols for complete capability profiles

💡 Full ablation details, configurations, and negative results: See Section 4 of our technical report

📊 Key Results: 3B Matches 7B

Our daVinci-LLM-3B achieves an overall score of 51.72, matching OLMo-3 7B despite having less than half the parameters. Notably, it substantially outperforms on complex reasoning tasks like MATH (62.80 vs. OLMo-3’s 39.60), demonstrating the value of systematic, evidence-based pretraining.

Capability Dimension	daVinci-3B	OLMo-3 7B	LLaMA-3.2-3B	Qwen-2.5-3B
Overall Perfomance	51.72	51.65	37.58	51.44
General Knowledge	52.96	55.13	51.08	55.16
Code Generation	55.99	54.42	32.40	56.13
Scientific Reasoning	48.30	45.98	22.45	44.65
MATH	62.80	39.60	9.00	37.20

📚 Citation

If you find this work helpful, please consider citing:

@misc{qin2026davincillmtowardssciencepretraining,
      title={daVinci-LLM:Towards the Science of Pretraining},
      author={Yiwei Qin and Yixiu Liu and Tiantian Mi and Muhang Xie and Zhen Huang and Weiye Si and Pengrui Lu and Siyuan Feng and Xia Wu and Liming Liu and Ye Luo and Jinlong Hou and Qipeng Guo and Yu Qiao and Pengfei Liu},
      year={2026},
      eprint={2603.27164},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.27164},
}

If you use the Data Darwinism framework, please also cite:

@misc{qin2026datadarwinismiunlocking,
      title={Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training}, 
      author={Yiwei Qin and Zhen Huang and Tiantian Mi and Weiye Si and Chenyang Zhou and Qipeng Guo and Siyuan Feng and Pengfei Liu},
      year={2026},
      eprint={2602.07824},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.07824}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
fig		fig
LICENSE		LICENSE
README.md		README.md
paper.pdf		paper.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

daVinci-LLM: Towards the Science of Pretraining

🎁 What We Release

🏛️ Three Pillars of Full Openness

1. 📊 Data Transparency — The Data Darwinism Framework

2. 🎓 Training Transparency — Adaptive Two-Stage Curriculum

3. 🧪 Scientific Transparency — 200+ Controlled Ablations

📊 Key Results: 3B Matches 7B

📚 Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

daVinci-LLM: Towards the Science of Pretraining

🎁 What We Release

🏛️ Three Pillars of Full Openness

1. 📊 Data Transparency — The Data Darwinism Framework

2. 🎓 Training Transparency — Adaptive Two-Stage Curriculum

3. 🧪 Scientific Transparency — 200+ Controlled Ablations

📊 Key Results: 3B Matches 7B

📚 Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages