daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently

daVinci-Agency is a long-horizon data synthesis paradigm grounded in the iterative, evolutionary process of real-world software development. Existing methods are either bounded by the capability of a teacher model or require prohibitively expensive annotation. daVinci-Agency instead mines the structured supervision signals intrinsic to chains of PRs and uses them to construct interaction trajectories that explicitly embody three meta-skills: task decomposition via continuous submissions, long-term consistency through unified objectives, and iterative refinement based on authentic bug-fix records.

This repository contains the official implementation of the daVinci-Agency pipeline, capable of synthesizing trajectories that average 85k tokens and 116 tool invocations, enabling models to internalize complex agentic workflows and unlock their intrinsic long-horizon potential.

Highlights

Modeling Project-Level Evolution: Beyond simple instruction following, daVinci-Agency explicitly models the software evolution process. By leveraging the natural structure of PR chains, it systematically captures high-level supervision for task decomposition, long-term functional consistency, and authentic iterative refinement. This paradigm enables agents to move from solving isolated tasks to sustaining the full-cycle development of complex, evolving projects.
Extreme Scale & Complexity: The pipeline synthesizes long interaction trajectories, averaging 85k tokens and 116 tool invocations, that are designed to stress-test and strengthen the context management and planning capabilities of modern LLMs.
Proven Effectiveness: Models trained on daVinci-Agency demonstrate superior generalization. Fine-tuning GLM-4.6 on merely 239 daVinci-Agency samples yields broad improvements across general benchmarks, notably achieving a 47% relative gain on Toolathlon.

News

[2026-02-03] 🚀 Paper Released: We are excited to introduce daVinci-Agency, a new paradigm for long-horizon agent training. The paper details our findings on training/inference scaling laws specific to long-horizon tasks.
[2026-02-03] 📈 SOTA Results: Our experiments show that daVinci-Agency models significantly outperform baselines, verifying that interaction trajectories embodying cross-stage evolution are decisive for long-horizon agents.

Performance

Training Data Comparison

Training Data	Samples	SWE-bench (SWE-agent)	Toolathlon	$\tau^2$-bench	Overall Avg.
GLM-4.6	-	0.608	0.157	0.675	0.441
SWE-Smith	66000	0.404	0.093	0.586	0.373
CC-bench	260	0.618	0.000	0.697	0.436
daVinci-Agency	239	0.632	0.231	0.707	0.475

Baseline Models Comparison

Model	SWE-bench (SWE-agent)	Toolathlon	AgencyBench	Overall Avg.
DeepSeek-v3.2	0.456	0.250	11.6	0.366
Qwen3-235B	0.504	0.046	4.6	0.309
Kimi-K2-Thinking	0.318	0.213	11.8	0.404
GLM-4.6	0.608	0.157	11.9	0.441
GLM-4.6-daVinci-Agency	0.632	0.231	15.9	0.475

We use SWE-agent as the evaluation scaffold for SWE-bench and GLM-4.6 as the user agent for $\tau^2$-bench. Despite using fewer samples than either SWE-Smith (66,000) or CC-bench (260), daVinci-Agency scores highest among the three training sets on every benchmark shown. It is also the only training set that improves Toolathlon over the GLM-4.6 base model (0.157 to 0.231); SWE-Smith and CC-bench both fall below the base score there. At the same time it still improves SWE-bench (0.608 to 0.632). We attribute these gains to the cross-stage evolution captured by modeling real-world development tasks.
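The relative gains quoted in this README can be recomputed from the table. The sketch below is only a worked check: the scores are copied from the Training Data Comparison table, and the helper name `relative_gain` is ours, not part of the repository.

```python
# Scores copied from the Training Data Comparison table (GLM-4.6 base vs.
# GLM-4.6 fine-tuned on daVinci-Agency). The dictionary keys are labels only.
BASELINE = {"SWE-bench": 0.608, "Toolathlon": 0.157, "tau2-bench": 0.675}
FINETUNED = {"SWE-bench": 0.632, "Toolathlon": 0.231, "tau2-bench": 0.707}


def relative_gain(before: float, after: float) -> float:
    """Return the improvement of `after` over `before` as a fraction of `before`."""
    if before == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (after - before) / before


if __name__ == "__main__":
    for name in BASELINE:
        gain = relative_gain(BASELINE[name], FINETUNED[name])
        print(f"{name}: {gain:+.1%}")
```

For Toolathlon this gives (0.231 - 0.157) / 0.157, which is roughly 47%, matching the figure in Highlights.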

Quick Start

Environment Setup (uv)

uv venv
source .venv/bin/activate
uv sync

After syncing, prefer uv run ... for commands (e.g., uv run pytest).

Generate Chain of PRs

Populate config/pipeline.yaml with your GitHub token (or export GITHUB_TOKEN / GH_TOKEN). Use --offline to skip API calls.
Point pr_combiner.input_path to your enhanced PR dataset (jsonl).
Run the combiner:

uv run python -m src.cli.run_combiner \
  --config config/pipeline.yaml \
  --input data/raw_data/results/xxx.jsonl \
  --output data/processed/pr_combiner/xxx

Key flags:

--limit N: Process only the first N PRs (omit to process all).
--resume: Continue from checkpoints stored in data/processed/pr_combiner/.../.checkpoints.
--linking-mode content-linking: Force a specific linking strategy (here, content-linking).

Outputs land in data/processed/pr_combiner/ (chains, metadata, failed records). A review queue is created when content-linking is enabled.
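Before running the combiner online, it can help to confirm that a token is actually visible in the environment. The following is a minimal sketch, not code from this repository: the variable names come from the step above, while the helper `find_github_token` and the order in which the two variables are checked are assumptions.

```python
import os
from typing import Mapping, Optional

# Variable names come from the setup step; checking GITHUB_TOKEN before
# GH_TOKEN is an assumption of this sketch, not documented pipeline behavior.
TOKEN_VARS = ("GITHUB_TOKEN", "GH_TOKEN")


def find_github_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty token found in `env`, or None."""
    for name in TOKEN_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return None


if __name__ == "__main__":
    if find_github_token() is None:
        print("No token in the environment; set one in config/pipeline.yaml or pass --offline.")
    else:
        print("GitHub token found in the environment.")
```

This check does not read config/pipeline.yaml, so a missing environment variable is not necessarily an error.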

Query Construction

Generate rollout queries from the synthesized chains:

uv run python -m src.cli.query_constructor_cli \
  --chains data/processed/pr_combiner/enhanced_data_XXXX_chains.jsonl \
  --staged-output data/synthetic/queries/staged \
  --query-output data/synthetic/queries \
  --template rollout/default \
  --concurrency 4

Key flags:

--chains: PR chains JSONL from the combiner.
--staged-output: Directory for staged query artifacts grouped by repo.
--query-output: Directory for per-chain query artifacts (used by rollout executor).
--max-chains N: Optional cap on processed chains; --concurrency speeds up generation.
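When choosing a value for --max-chains, it is useful to know how many chains the combiner produced. The sketch below counts records in a JSONL file without assuming anything about the chain schema; the function name `count_jsonl_records` is hypothetical and not part of the pipeline.

```python
import json
from pathlib import Path


def count_jsonl_records(path: Path) -> int:
    """Count non-blank lines that parse as JSON; raise on a malformed line."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines between records
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc.msg})") from exc
            count += 1
    return count
```

Call it with the path passed to --chains, for example `count_jsonl_records(Path("chains.jsonl"))`, where the file name is a placeholder.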

Rollout Execution

Automated rollout validation uses the sii-agent-sdk bridge:

Install Node.js ≥ 20.
npm install --global sii-agent-bridge
export SII_BRIDGE_PATH=$(which sii-agent-bridge)

Configure agent profiles under sii_agent in config/pipeline.yaml. When the bridge is missing, the executor falls back to simulator mode for safe local runs.
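Because a missing bridge results in simulator mode, a quick pre-flight check can tell you which mode to expect. This is a sketch under assumptions: how the executor itself detects the bridge is not documented here, and `locate_bridge` is a hypothetical helper that only mirrors the `SII_BRIDGE_PATH` export and the `which sii-agent-bridge` lookup shown above.

```python
import os
import shutil
from typing import Mapping, Optional


def locate_bridge(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a path to an executable bridge binary, or None if none is found."""
    configured = env.get("SII_BRIDGE_PATH", "").strip()
    if configured and os.path.isfile(configured) and os.access(configured, os.X_OK):
        return configured
    # Otherwise search PATH, mirroring `which sii-agent-bridge`.
    return shutil.which("sii-agent-bridge", path=env.get("PATH"))


if __name__ == "__main__":
    bridge = locate_bridge()
    if bridge is None:
        print("Bridge not found; expect the executor to use simulator mode.")
    else:
        print(f"Bridge found at {bridge}")
```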

Run staged rollouts over a queries file (sii scaffold, default):

uv run python -m src.cli.rollout_executor_cli \
  --queries data/synthetic/queries/sample.jsonl \
  --config config/pipeline.yaml \
  --staged-session \
  --score-threshold 0.8

Run the lite/mini scaffold (no bridge, supports local concurrency):

uv run python -m src.cli.rollout_executor_cli \
  --queries data/synthetic/queries/sample.jsonl \
  --scaffold mini \
  --concurrency 4 \
  --staged-session \
  --score-threshold 0.8

Outputs (transcripts, registry) are written under data/synthetic/rollouts/rollout_<repo>_<timestamp>/. Use --run-dir to override the destination and --rollouts-path to point at an existing registry when resuming.
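When resuming, you first need to find the run directory to point at. The sketch below picks the most recently modified `rollout_*` directory. It relies on modification time because the timestamp format in the directory name is not specified here; `latest_rollout_dir` is a hypothetical helper, not part of the executor.

```python
from pathlib import Path
from typing import Optional


def latest_rollout_dir(root: Path) -> Optional[Path]:
    """Return the most recently modified `rollout_*` directory under `root`, or None."""
    candidates = [p for p in root.glob("rollout_*") if p.is_dir()]
    if not candidates:
        return None
    # Compare modification times rather than parsing the name's timestamp.
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    print(latest_rollout_dir(Path("data/synthetic/rollouts")))
```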

Troubleshooting

Dataset missing: Ensure data/raw_data/<file>.jsonl exists; re-sync artifacts if absent.
Permission errors: Confirm data/processed/pr_combiner/ is writable; clean stale outputs if needed.
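Both issues can be checked before launching a long run. The following is a minimal sketch: `preflight` is a hypothetical helper, and the paths in the usage line are placeholders taken from the examples in this README.

```python
import os
from pathlib import Path
from typing import List


def preflight(input_path: Path, output_dir: Path) -> List[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    if not input_path.is_file():
        problems.append(f"dataset missing: {input_path}")
    # The output directory may not exist yet, so test the nearest existing ancestor.
    probe = output_dir
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    if not os.access(probe, os.W_OK):
        problems.append(f"not writable: {probe}")
    return problems


if __name__ == "__main__":
    for problem in preflight(Path("data/raw_data/results/xxx.jsonl"),
                             Path("data/processed/pr_combiner")):
        print(problem)
```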

License

This project is licensed under the MIT License - see LICENSE for details.

Citation

If you find daVinci-Agency useful, please cite our work:

@article{jiang2026davinci,
  title={daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently},
  author={Mohan Jiang and Dayuan Fu and Junhao Shi and Ji Zeng and Weiye Si and Keyu Li and Xuefeng Li and Yang Xiao and Wenjie Li and Dequan Wang and Pengfei Liu},
  journal={arXiv preprint arXiv:2602.02619},
  year={2026}
}
