Version 0.1.0 (Beta). AutoData is currently in beta; expect rapid iteration and occasional breaking changes while we harden the multi-agent orchestration stack.
AutoData is a multi-agent system for collecting data from the open web. Accepted at NeurIPS 2025, AutoData automates crawler generation and data extraction across diverse online sources, including dynamic pages that are hard to handle with hand-written scripts.
Traditional web scraping often requires manual script maintenance and struggles with dynamic content. AutoData overcomes these challenges by employing a sophisticated Supervisor-Squad architecture, where specialized agents collaborate to plan, navigate, and extract data efficiently.
- 🤖 Multi-Agent Architecture: Orchestrated by a Supervisor Agent managing specialized Research and Development squads.
- 🧠 OHCache (Oriented Message Hypergraph): A novel context management system that optimizes information flow between agents, reducing token usage and noise.
- 🌐 Open Web Adaptability: Capable of handling complex, dynamic websites using browser automation and intelligent observation.
- 🛠️ Automated Blueprinting: Synthesizes research findings into executable Python crawling code.
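To make the OHCache idea more concrete, here is a toy sketch of an oriented message hypergraph: each hyperedge points a group of messages at the one agent that should read them, so that agent's context excludes unrelated chatter. This is an illustrative assumption about how such a structure could look, not AutoData's implementation; `ToyMessageHypergraph`, `Message`, and every method name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    """A single agent message; the fields are illustrative, not AutoData's schema."""
    msg_id: int
    sender: str
    text: str


@dataclass
class ToyMessageHypergraph:
    """Hypothetical stand-in for OHCache, for illustration only."""
    messages: dict[int, Message] = field(default_factory=dict)
    # Each oriented hyperedge links a set of message ids to one receiving agent.
    edges: list[tuple[frozenset[int], str]] = field(default_factory=list)

    def add_message(self, sender: str, text: str) -> int:
        msg_id = len(self.messages)
        self.messages[msg_id] = Message(msg_id, sender, text)
        return msg_id

    def orient(self, msg_ids: set[int], receiver: str) -> None:
        """Direct a group of messages toward the agent that should read them."""
        self.edges.append((frozenset(msg_ids), receiver))

    def context_for(self, receiver: str) -> list[Message]:
        """Return only the messages oriented toward `receiver`, in insertion order."""
        wanted: set[int] = set()
        for ids, target in self.edges:
            if target == receiver:
                wanted |= ids
        return [self.messages[i] for i in sorted(wanted)]


cache = ToyMessageHypergraph()
plan = cache.add_message("plan", "Collect paper titles from the listing page")
obs = cache.add_message("browser", "Titles are in <h2 class='title'> elements")
noise = cache.add_message("tool", "Retrying request after timeout")
cache.orient({plan, obs}, "blueprint")

# The blueprint agent sees the plan and the observation, not the retry chatter.
context = [m.text for m in cache.context_for("blueprint")]
```

In this sketch the token savings come purely from filtering: an agent receives the messages routed to it rather than the full conversation history.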
The core of AutoData lies in its hierarchical agent design and the OHCache mechanism.
- Supervisor Agent: The central coordinator that manages workflow and hand-offs.
- Research Squad:
  - Plan Agent: Formulates high-level strategies.
  - Tool Agent: Manages tool utilization.
  - Browser Agent: Navigates and observes the web.
  - Blueprint Agent: Creates development blueprints from findings.
- Development Squad:
  - Engineer Agent: Implements the crawling logic.
  - Test Agent: Validates the code against target sites.
  - Validate Agent: Ensures data quality and correctness.
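The hierarchy above can be sketched as plain data. The agent and squad names come from the list; the `Squad` class, the `handoff_order` helper, and the strictly linear visiting order are assumptions made for illustration, since the real Supervisor Agent may route work dynamically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Squad:
    """A named group of agents (hypothetical structure, not AutoData's API)."""
    name: str
    agents: tuple[str, ...]


RESEARCH = Squad("research", ("plan", "tool", "browser", "blueprint"))
DEVELOPMENT = Squad("development", ("engineer", "test", "validate"))


def handoff_order(squads):
    """Flatten squads into one possible visiting order (assumed linear)."""
    return [f"{squad.name}/{agent}" for squad in squads for agent in squad.agents]


# Research runs first so its blueprint is available to the development squad.
order = handoff_order([RESEARCH, DEVELOPMENT])
```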
- Python 3.11+
- uv (for dependency management)
- Clone the repository:
  git clone https://github.com/Tianyi-Billy-Ma/AutoData.git
  cd AutoData
- Install dependencies and environment:
  uv sync
- Install browser binaries:
  playwright install
  playwright install-deps
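Before running the steps above, a short preflight script can confirm that the prerequisites are in place. This is a minimal sketch using only the standard library; the `preflight` function is hypothetical and not part of AutoData.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)  # taken from the prerequisites list


def preflight(required_tools=("uv", "git")):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {found}")
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```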
AutoData uses a flexible configuration system. You can set up your environment variables and YAML configs as follows.
Set your API keys in your environment:
# Standard Providers
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"
# OR for OpenRouter
export OPENROUTER_API_KEY="your-openrouter-key"
export OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"
Prefer saving keys once in a .env file (cp .env.example .env) instead of exporting them in every terminal. AutoData automatically loads .env, .env.local, and ~/.autodata/.env before parsing configs, so uv run python -m autodata.main ... always finds your credentials.
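The layered .env loading described above could work roughly as follows. This is a standard-library sketch, not AutoData's loader: `parse_env_file` and `load_env_files` are hypothetical names, and the precedence rule (variables that are already set are never overridden, so earlier files win) is an assumption.

```python
import os
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines; skips blanks, comments, and malformed lines."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Drop surrounding quotes so KEY="abc" and KEY=abc behave the same.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_files(paths, environ=None):
    """Apply each existing file in order without overriding variables already set.

    The no-override precedence is an assumption, not documented AutoData behavior.
    """
    environ = os.environ if environ is None else environ
    for path in paths:
        path = Path(path).expanduser()
        if path.is_file():
            for key, value in parse_env_file(path).items():
                environ.setdefault(key, value)
    return environ
```

A call such as `load_env_files([".env", ".env.local", "~/.autodata/.env"])` would then populate `os.environ`, silently skipping any file that does not exist.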
Configure your model in configs/default.yaml:
llm_config:
  model: "gpt-4o"
  temperature: 0.0
To run a sample task using the default configuration:
uv run python -m autodata.main --config configs/default.yaml
Results are saved in the outputs/ directory:
ls outputs/default_run/
# ├── summary.json (Metadata & Dataset reference)
# └── artifacts/...
- Plugin ecosystem hardening: finalize the third-party plugin spec (typed hooks, richer tool manifests) so integrators can ship domain packs without forking.
- Automated evaluation harness: release reproducible browser traces + artifact expectations to benchmark agent reasoning quality between versions.
- Hosted orchestrator mode: add multi-run scheduling, workspace quotas, and audit logs for teams that want to operate AutoData as a shared service.
We welcome contributions! Please see our CONTRIBUTING.md for guidelines on how to get involved.
Ensure your code meets our standards before committing:
uv run ruff format
uv run ruff check .
uv run pytest
If you use AutoData in your research, please cite our NeurIPS 2025 paper:
@inproceedings{autodata2025,
title={AutoData: A Multi-Agent System for Open Web Data Collection},
author={Ma, Tianyi and Qian, Yiyue and Zhang, Zheyuan and Wang, Zehong and Qian, Xiaoye and Bai, Feifan and Ding, Yifan and Luo, Xuwei and Zhang, Shinan and Murugesan, Keerthiram and others},
booktitle={NeurIPS},
year={2025},
}