daVinci-Agency is a novel long-horizon data-synthesis paradigm grounded in the iterative, evolutionary process of real-world software development. Unlike existing methods constrained by teacher bounds or prohibitive annotation costs, daVinci-Agency mines the structured supervision signals intrinsic to chains of PRs to construct interaction trajectories that explicitly embody key meta-skills: task decomposition via continuous submissions, long-term consistency through unified objectives, and iterative refinement based on authentic bug-fix records.
This repository contains the official implementation of the daVinci-Agency pipeline, capable of synthesizing trajectories that average 85k tokens and 116 tool invocations, enabling models to internalize complex agentic workflows and unlock their intrinsic long-horizon potential.
- **Modeling Project-Level Evolution**: Beyond simple instruction following, daVinci-Agency explicitly models the software evolution process. By leveraging the natural structure of chains of PRs, it systematically captures high-level supervision for task decomposition, long-term functional consistency, and authentic iterative refinement. This paradigm enables agents to move from solving isolated tasks to sustaining the full-cycle development of complex, evolving projects.
- **Extreme Scale & Complexity**: The pipeline synthesizes substantial interaction trajectories—averaging 85k tokens and 116 tool invocations—designed to stress-test and enhance the context management and planning capabilities of modern LLMs.
- **Proven Effectiveness**: Models trained on daVinci-Agency demonstrate superior generalization. Fine-tuning GLM-4.6 on merely 239 daVinci-Agency samples yields broad improvements across general benchmarks, notably achieving a 47% relative gain on Toolathlon.
- **[2026-02-03] 🚀 Paper Released**: We are excited to introduce daVinci-Agency, a new paradigm for long-horizon agent training. The paper details our findings on training/inference scaling laws specific to long-horizon tasks.
- **[2026-02-03] 📈 SOTA Results**: Our experiments show that daVinci-Agency models significantly outperform baselines, verifying that interaction trajectories embodying cross-stage evolution are decisive for long-horizon agents.
Training Data Comparison

| Training Data | Samples | SWE-bench (SWE-agent) | Toolathlon | | Overall Avg. |
|---|---|---|---|---|---|
| GLM-4.6 | - | 0.608 | 0.157 | 0.675 | 0.441 |
| SWE-Smith | 66000 | 0.404 | 0.093 | 0.586 | 0.373 |
| CC-bench | 260 | 0.618 | 0.000 | 0.697 | 0.436 |
| daVinci-Agency | 239 | 0.632 | 0.231 | 0.707 | 0.475 |
Baseline Models Comparison
| Model | SWE-bench (SWE-agent) | Toolathlon | AgencyBench | Overall Avg. |
|---|---|---|---|---|
| DeepSeek-v3.2 | 0.456 | 0.250 | 11.6 | 0.366 |
| Qwen3-235B | 0.504 | 0.046 | 4.6 | 0.309 |
| Kimi-K2-Thinking | 0.318 | 0.213 | 11.8 | 0.404 |
| GLM-4.6 | 0.608 | 0.157 | 11.9 | 0.441 |
| GLM-4.6-daVinci-Agency | 0.632 | 0.231 | 15.9 | 0.475 |
We employ SWE-agent as the evaluation scaffold for SWE-bench and utilize GLM-4.6 as the user agent. daVinci-Agency consistently outperforms the aforementioned training datasets across all benchmarks. By introducing cross-stage evolution through the modeling of real-world development tasks, daVinci-Agency uniquely achieves significant gains on Toolathlon while maintaining robustness on SWE-bench.
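The 47% relative gain on Toolathlon follows directly from the table above: the trained model scores 0.231 against the GLM-4.6 baseline's 0.157. A quick arithmetic check:

```python
# Relative gain of GLM-4.6-daVinci-Agency over the GLM-4.6 baseline
# on Toolathlon (scores taken from the tables above).
baseline = 0.157
trained = 0.231

relative_gain = (trained - baseline) / baseline
print(f"{relative_gain:.1%}")  # → 47.1%
```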
```shell
uv venv
source .venv/bin/activate
uv sync
```

After syncing, prefer `uv run ...` for commands (e.g., `uv run pytest`).
- Populate `config/pipeline.yaml` with your GitHub token (or export `GITHUB_TOKEN`/`GH_TOKEN`). Use `--offline` to skip API calls.
- Point `pr_combiner.input_path` to your enhanced PR dataset (JSONL).
- Run the combiner:
```shell
uv run python -m src.cli.run_combiner \
  --config config/pipeline.yaml \
  --input data/raw_data/results/xxx.jsonl \
  --output data/processed/pr_combiner/xxx
```

Key flags:
- `--limit N`: Process only the first N PRs (omit to process all).
- `--resume`: Continue from checkpoints stored in `data/processed/pr_combiner/.../.checkpoints`.
- `--linking-mode content-linking`: Force a specific linking strategy.
Outputs land in `data/processed/pr_combiner/` (chains, metadata, failed records). A review queue is created when content-linking is enabled.
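The chain outputs are plain JSONL, so they are easy to inspect with a few lines of Python. A minimal sketch (the record schema is whatever the combiner emits; nothing here assumes specific fields):

```python
import json

def load_jsonl(path):
    """Read one JSON object per non-empty line (e.g., a combiner chains file)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Example (path is illustrative):
# chains = load_jsonl("data/processed/pr_combiner/xxx/chains.jsonl")
# print(f"{len(chains)} chains synthesized")
```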
Generate rollout queries from the synthesized chains:
```shell
uv run python -m src.cli.query_constructor_cli \
  --chains data/processed/pr_combiner/enhanced_data_XXXX_chains.jsonl \
  --staged-output data/synthetic/queries/staged \
  --query-output data/synthetic/queries \
  --template rollout/default \
  --concurrency 4
```

Key flags:
- `--chains`: PR chains JSONL from the combiner.
- `--staged-output`: Directory for staged query artifacts grouped by repo.
- `--query-output`: Directory for per-chain query artifacts (used by the rollout executor).
- `--max-chains N`: Optional cap on processed chains; `--concurrency` speeds up generation.
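Since staged artifacts are grouped by repo, the same grouping can be reproduced for ad-hoc analysis of a queries file. A sketch, assuming each query record carries a `repo` field (a hypothetical key; substitute whatever your records actually use):

```python
import json
from collections import defaultdict

def group_queries_by_repo(jsonl_path):
    """Group query records by repository. Assumes a 'repo' field
    (illustrative only; adapt to the real record schema)."""
    groups = defaultdict(list)
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                groups[record.get("repo", "unknown")].append(record)
    return dict(groups)
```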
Automated rollout validation uses the `sii-agent-sdk` bridge:
- Install Node.js ≥ 20.
- Install the bridge globally and export its path:

```shell
npm install --global sii-agent-bridge
export SII_BRIDGE_PATH=$(which sii-agent-bridge)
```
Configure agent profiles under `sii_agent` in `config/pipeline.yaml`. When the bridge is missing, the executor falls back to simulator mode for safe local runs.
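The bridge-or-simulator fallback described above can be sketched as follows (illustrative only; the executor's actual detection logic may differ):

```python
import os
import shutil

def resolve_bridge():
    """Return the execution mode: use the real bridge if its binary is
    reachable, otherwise fall back to simulator mode (sketch)."""
    path = os.environ.get("SII_BRIDGE_PATH") or shutil.which("sii-agent-bridge")
    if path and os.path.isfile(path):
        return {"mode": "bridge", "path": path}
    return {"mode": "simulator", "path": None}
```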
Run staged rollouts over a queries file (sii scaffold, default):
```shell
uv run python -m src.cli.rollout_executor_cli \
  --queries data/synthetic/queries/sample.jsonl \
  --config config/pipeline.yaml \
  --staged-session \
  --score-threshold 0.8
```

Run the lite/mini scaffold (no bridge, supports local concurrency):
```shell
uv run python -m src.cli.rollout_executor_cli \
  --queries data/synthetic/queries/sample.jsonl \
  --scaffold mini \
  --concurrency 4 \
  --staged-session \
  --score-threshold 0.8
```

Outputs (transcripts, registry) are written under `data/synthetic/rollouts/rollout_<repo>_<timestamp>/`. Use `--run-dir` to override the destination and `--rollouts-path` to point at an existing registry when resuming.
- **Dataset missing**: Ensure `data/raw_data/<file>.jsonl` exists; re-sync artifacts if absent.
- **Permission errors**: Confirm `data/processed/pr_combiner/` is writable; clean stale outputs if needed.
This project is licensed under the MIT License - see the `LICENSE` file for details.
If you find daVinci-Agency useful, please cite our work:
```bibtex
@article{jiang2026davinci,
  title={daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently},
  author={Mohan Jiang and Dayuan Fu and Junhao Shi and Ji Zeng and Weiye Si and Keyu Li and Xuefeng Li and Yang Xiao and Wenjie Li and Dequan Wang and Pengfei Liu},
  journal={arXiv preprint arXiv:2602.02619},
  year={2026}
}
```


