
GAIR-NLP/daVinci-Agency


daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently

Paper (arXiv)   |   Dataset   |   Models

daVinci-Agency is a novel long-horizon data synthesis paradigm grounded in the iterative, evolutionary process of real-world software development. Unlike existing methods, which are bounded by the teacher model's ability or require prohibitively expensive annotation, daVinci-Agency mines the structured supervision signals intrinsic to chains of PRs to construct interaction trajectories that explicitly embody key meta-skills: task decomposition via continuous submissions, long-term consistency through unified objectives, and iterative refinement based on authentic bug-fix records.

This repository contains the official implementation of the daVinci-Agency pipeline, capable of synthesizing trajectories that average 85k tokens and 116 tool invocations, enabling models to internalize complex agentic workflows and unlock their intrinsic long-horizon potential.

Highlights

  • Modeling Project-Level Evolution: Beyond simple instruction following, daVinci-Agency explicitly models the software evolution process. By leveraging the natural structure of chains of PRs, it systematically captures high-level supervision for task decomposition, long-term functional consistency, and authentic iterative refinement. This paradigm enables agents to move from solving isolated tasks to sustaining the full-cycle development of complex, evolving projects.

  • Extreme Scale & Complexity: The pipeline synthesizes substantial interaction trajectories—averaging 85k tokens and 116 tool invocations—designed to stress-test and enhance the context management and planning capabilities of modern LLMs.

  • Proven Effectiveness: Models trained on daVinci-Agency demonstrate superior generalization. Fine-tuning GLM-4.6 on merely 239 daVinci-Agency samples yields broad improvements across general benchmarks, notably achieving a 47% relative gain on Toolathlon.

News

  • [2026-02-03] 🚀 Paper Released: We are excited to introduce daVinci-Agency, a new paradigm for long-horizon agent training. The paper details our findings on training/inference scaling laws specific to long-horizon tasks.

  • [2026-02-03] 📈 SOTA Results: Our experiments show that daVinci-Agency models significantly outperform baselines, verifying that interaction trajectories embodying cross-stage evolution are decisive for long-horizon agents.

Performance

Training Data Comparison

| Training Data | Samples | SWE-bench (SWE-agent) | Toolathlon | $\tau^2$-bench | Overall Avg. |
|---|---|---|---|---|---|
| GLM-4.6 | - | 0.608 | 0.157 | 0.675 | 0.441 |
| SWE-Smith | 66000 | 0.404 | 0.093 | 0.586 | 0.373 |
| CC-bench | 260 | 0.618 | 0.000 | 0.697 | 0.436 |
| daVinci-Agency | 239 | 0.632 | 0.231 | 0.707 | 0.475 |

Baseline Models Comparison

| Model | SWE-bench (SWE-agent) | Toolathlon | AgencyBench | Overall Avg. |
|---|---|---|---|---|
| DeepSeek-v3.2 | 0.456 | 0.250 | 11.6 | 0.366 |
| Qwen3-235B | 0.504 | 0.046 | 4.6 | 0.309 |
| Kimi-K2-Thinking | 0.318 | 0.213 | 11.8 | 0.404 |
| GLM-4.6 | 0.608 | 0.157 | 11.9 | 0.441 |
| GLM-4.6-daVinci-Agency | 0.632 | 0.231 | 15.9 | 0.475 |

We use SWE-agent as the evaluation scaffold for SWE-bench and GLM-4.6 as the user agent for $\tau^2$-bench. Despite using the fewest training samples (239, versus 260 for CC-bench and 66000 for SWE-Smith), the model fine-tuned on daVinci-Agency outperforms the models trained on the other datasets on every benchmark in the training data comparison. By introducing cross-stage evolution through the modeling of real-world development tasks, daVinci-Agency is the only training set here that yields a significant gain on Toolathlon while also improving on SWE-bench.
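The 47% relative gain on Toolathlon quoted in the Highlights can be reproduced from the table values. The snippet below is only a worked check of that arithmetic; the helper name `relative_gain` is ours and is not part of the repository.

```python
def relative_gain(baseline: float, finetuned: float) -> float:
    """Relative improvement of `finetuned` over `baseline`, as a fraction."""
    return (finetuned - baseline) / baseline


# Toolathlon scores copied from the tables: GLM-4.6 before and after
# fine-tuning on daVinci-Agency.
glm_base = 0.157
glm_davinci = 0.231

gain = relative_gain(glm_base, glm_davinci)
print(f"Toolathlon relative gain: {gain:.1%}")  # prints "Toolathlon relative gain: 47.1%"
```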

Quick Start

Environment Setup (uv)

uv venv
source .venv/bin/activate
uv sync

After syncing, prefer uv run ... for commands (e.g., uv run pytest).

Generate Chain of PRs

  1. Populate config/pipeline.yaml with your GitHub token (or export GITHUB_TOKEN / GH_TOKEN). Use --offline to skip API calls.
  2. Point pr_combiner.input_path to your enhanced PR dataset (jsonl).
  3. Run the combiner:
uv run python -m src.cli.run_combiner \
  --config config/pipeline.yaml \
  --input data/raw_data/results/xxx.jsonl \
  --output data/processed/pr_combiner/xxx

Key flags:

  • --limit N: Process only the first N PRs (omit to process all).
  • --resume: Continue from checkpoints stored in data/processed/pr_combiner/.../.checkpoints.
  • --linking-mode content-linking: Force a specific linking strategy (here, content-linking).

Outputs land in data/processed/pr_combiner/ (chains, metadata, failed records). A review queue is created when content-linking is enabled.
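Step 1 accepts the GitHub token from either GITHUB_TOKEN or GH_TOKEN. As a pre-flight check before a long run, a small script along these lines can confirm that one of them is set. This is a minimal sketch: `resolve_github_token` is a hypothetical helper, and the precedence shown (GITHUB_TOKEN first) is an assumption, not the pipeline's documented behavior.

```python
import os
from typing import Mapping, Optional


def resolve_github_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty token among GITHUB_TOKEN and GH_TOKEN, else None.

    Hypothetical helper; the lookup order is an assumption for illustration.
    """
    for name in ("GITHUB_TOKEN", "GH_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


if __name__ == "__main__":
    if resolve_github_token() is None:
        print("No token in the environment: set one in config/pipeline.yaml or pass --offline.")
    else:
        print("GitHub token found in the environment.")
```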

Query Construction

Generate rollout queries from the synthesized chains:

uv run python -m src.cli.query_constructor_cli \
  --chains data/processed/pr_combiner/enhanced_data_XXXX_chains.jsonl \
  --staged-output data/synthetic/queries/staged \
  --query-output data/synthetic/queries \
  --template rollout/default \
  --concurrency 4

Key flags:

  • --chains: PR chains JSONL from the combiner.
  • --staged-output: Directory for staged query artifacts grouped by repo.
  • --query-output: Directory for per-chain query artifacts (used by rollout executor).
  • --max-chains N: Optional cap on the number of chains processed.
  • --concurrency: Speeds up generation.
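Before choosing a value for --max-chains, it can help to know how many chains the combiner produced. The sketch below counts records in a JSONL file and checks that each line parses. It assumes only the JSONL convention of one JSON object per line and makes no assumption about the chain schema; `count_jsonl_records` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path


def count_jsonl_records(path: str) -> int:
    """Count non-empty lines in a JSONL file, checking that each one parses as JSON."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
            count += 1
    return count
```

Pointing it at the same file you pass to --chains gives the upper bound for --max-chains.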

Rollout Execution

Automated rollout validation uses the sii-agent-sdk bridge:

  1. Install Node.js ≥ 20.
  2. npm install --global sii-agent-bridge
  3. export SII_BRIDGE_PATH=$(which sii-agent-bridge)

Configure agent profiles under sii_agent in config/pipeline.yaml. When the bridge is missing, the executor falls back to simulator mode for safe local runs.
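Because a missing bridge leads to simulator mode, it is worth confirming that the bridge is discoverable before a real run. The check below mirrors the setup steps: it looks at SII_BRIDGE_PATH first, then searches PATH, which is the Python equivalent of `which sii-agent-bridge`. It is a sketch under assumptions: `find_bridge` is a hypothetical helper, and how the executor itself detects the bridge is not documented here.

```python
import os
import shutil
from typing import Optional


def find_bridge() -> Optional[str]:
    """Return a usable path to the bridge executable, or None if it cannot be found.

    Hypothetical helper; the executor's own detection logic may differ.
    """
    configured = os.environ.get("SII_BRIDGE_PATH", "").strip()
    if configured and os.path.isfile(configured) and os.access(configured, os.X_OK):
        return configured
    # Fall back to a PATH lookup, like `which sii-agent-bridge`.
    return shutil.which("sii-agent-bridge")


if __name__ == "__main__":
    bridge = find_bridge()
    if bridge:
        print(f"Bridge found at {bridge}")
    else:
        print("Bridge not found; expect the simulator fallback.")
```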

Run staged rollouts over a queries file (sii scaffold, default):

uv run python -m src.cli.rollout_executor_cli \
  --queries data/synthetic/queries/sample.jsonl \
  --config config/pipeline.yaml \
  --staged-session \
  --score-threshold 0.8

Run the lite/mini scaffold (no bridge, supports local concurrency):

uv run python -m src.cli.rollout_executor_cli \
  --queries data/synthetic/queries/sample.jsonl \
  --scaffold mini \
  --concurrency 4 \
  --staged-session \
  --score-threshold 0.8

Outputs (transcripts, registry) are written under data/synthetic/rollouts/rollout_<repo>_<timestamp>/. Use --run-dir to override the destination and --rollouts-path to point at an existing registry when resuming.
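When resuming, you need the path of an earlier run directory. A minimal sketch for locating the most recently modified one follows. It relies only on the `rollout_` directory prefix described above and sorts by modification time, because the timestamp format in the directory name is not specified here; `latest_rollout_dir` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def latest_rollout_dir(root: str = "data/synthetic/rollouts") -> Optional[Path]:
    """Return the most recently modified rollout_* directory under `root`, or None."""
    base = Path(root)
    if not base.is_dir():
        return None
    candidates = [p for p in base.glob("rollout_*") if p.is_dir()]
    # Sort by mtime rather than by name: the timestamp format is an unknown here.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```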

Troubleshooting

  • Dataset missing: Ensure data/raw_data/<file>.jsonl exists; re-sync artifacts if absent.
  • Permission errors: Confirm data/processed/pr_combiner/ is writable; clean stale outputs if needed.
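Both issues can be caught before launching the combiner. The following sketch checks that the input file exists and that the output location is writable; if the output directory does not exist yet, it checks the nearest existing ancestor. `preflight` is a hypothetical helper and the messages are illustrative, not output of the pipeline.

```python
import os
from pathlib import Path


def preflight(input_path: str, output_dir: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means both checks passed."""
    problems = []
    if not Path(input_path).is_file():
        problems.append(f"Dataset missing: {input_path}")

    # Walk up to the nearest existing ancestor: that is the directory
    # that must be writable for the output directory to be created.
    probe = Path(output_dir)
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    if not os.access(probe, os.W_OK):
        problems.append(f"Permission error: {probe} is not writable")
    return problems
```

For example, `preflight("data/raw_data/<file>.jsonl", "data/processed/pr_combiner")` with your actual file name substituted returns an empty list when both conditions hold.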

License

This project is licensed under the MIT License - see LICENSE for details.

Citation

If you find daVinci-Agency useful, please cite our work:

@article{jiang2026davinci,
  title={daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently},
  author={Mohan Jiang and Dayuan Fu and Junhao Shi and Ji Zeng and Weiye Si and Keyu Li and Xuefeng Li and Yang Xiao and Wenjie Li and Dequan Wang and Pengfei Liu},
  journal={arXiv preprint arXiv:2602.02619},
  year={2026}
}
