AutoData: A Multi-Agent System for Open Web Data Collection

Read the Paper | Documentation | Contributing

Version 0.1.0 (Beta). AutoData is currently in beta; expect rapid iteration and occasional breaking changes while we harden the multi-agent orchestration stack.

📖 Introduction

AutoData is a multi-agent system for collecting data from the open web. Accepted at NeurIPS 2025, it automates both crawler generation and data extraction across diverse online sources.

Traditional web scraping often requires manual script maintenance and struggles with dynamic content. AutoData addresses both problems with a Supervisor-Squad architecture in which specialized agents collaborate to plan, navigate, and extract data.

Key Features

- 🤖 Multi-Agent Architecture: Orchestrated by a Supervisor Agent managing specialized Research and Development squads.
- 🧠 OHCache (Oriented Message Hypergraph): A novel context management system that optimizes information flow between agents, reducing token usage and noise.
- 🌐 Open Web Adaptability: Capable of handling complex, dynamic websites using browser automation and intelligent observation.
- 🛠️ Automated Blueprinting: Synthesizes research findings into executable Python crawling code.

🏗️ Framework

The core of AutoData lies in its hierarchical agent design and the OHCache mechanism.
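
To give a rough feel for why routing messages along oriented hyperedges can cut context size, here is a toy sketch. It is not AutoData's OHCache: the class names (`Message`, `ToyMessageCache`), the agent labels, and the rule "an agent only sees messages addressed to it" are all assumptions made for illustration, and the design described in the paper may differ substantially.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    """One oriented hyperedge: a single sender pointing at a set of recipients."""

    sender: str
    recipients: frozenset[str]
    content: str


@dataclass
class ToyMessageCache:
    """Hypothetical stand-in for an oriented message hypergraph (illustration only)."""

    messages: list[Message] = field(default_factory=list)

    def publish(self, sender: str, recipients: set[str], content: str) -> None:
        # Store the message together with the agents it is oriented towards.
        self.messages.append(Message(sender, frozenset(recipients), content))

    def context_for(self, agent: str) -> list[str]:
        # Return only the messages addressed to `agent`, in publication order.
        return [m.content for m in self.messages if agent in m.recipients]


cache = ToyMessageCache()
cache.publish("plan", {"browser", "tool"}, "Visit the listing page first.")
cache.publish("browser", {"blueprint"}, "Listing page paginates via ?page=N.")
cache.publish("plan", {"tool"}, "Prefer the sitemap if one exists.")

browser_context = cache.context_for("browser")
blueprint_context = cache.context_for("blueprint")
print(browser_context)
print(blueprint_context)
```

In this toy version each agent's prompt is assembled from a filtered subset of the conversation instead of the full history, which is the general intuition behind reducing token usage and noise.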

Agent Hierarchy

Supervisor Agent: The central coordinator that manages workflow and hand-offs.
Research Squad:
- Plan Agent: Formulates high-level strategies.
- Tool Agent: Manages tool utilization.
- Browser Agent: Navigates and observes the web.
- Blueprint Agent: Creates development blueprints from findings.
Development Squad:
- Engineer Agent: Implements the crawling logic.
- Test Agent: Validates the code against target sites.
- Validate Agent: Ensures data quality and correctness.
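
The hierarchy above can be written down as plain data, which is handy when reasoning about hand-offs. This is only a descriptive sketch: the `Agent` and `Squad` dataclasses and the `describe` helper are hypothetical names and do not mirror classes in the `autodata` package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    role: str


@dataclass(frozen=True)
class Squad:
    name: str
    agents: tuple[Agent, ...]


SUPERVISOR = Agent("Supervisor Agent", "Manages workflow and hand-offs")

RESEARCH = Squad(
    "Research Squad",
    (
        Agent("Plan Agent", "Formulates high-level strategies"),
        Agent("Tool Agent", "Manages tool utilization"),
        Agent("Browser Agent", "Navigates and observes the web"),
        Agent("Blueprint Agent", "Creates development blueprints from findings"),
    ),
)

DEVELOPMENT = Squad(
    "Development Squad",
    (
        Agent("Engineer Agent", "Implements the crawling logic"),
        Agent("Test Agent", "Validates the code against target sites"),
        Agent("Validate Agent", "Ensures data quality and correctness"),
    ),
)


def describe(supervisor: Agent, squads: tuple[Squad, ...]) -> str:
    """Render the hierarchy as an indented tree, supervisor first."""
    lines = [supervisor.name]
    for squad in squads:
        lines.append(f"  {squad.name}")
        for agent in squad.agents:
            lines.append(f"    {agent.name}: {agent.role}")
    return "\n".join(lines)


tree = describe(SUPERVISOR, (RESEARCH, DEVELOPMENT))
print(tree)
```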

🚀 Getting Started

Prerequisites

- Python 3.11+
- uv (for dependency management)

Installation

Clone the repository:

```
git clone https://github.com/Tianyi-Billy-Ma/AutoData.git
cd AutoData
```

Install dependencies and environment:

```
uv sync
```

Install browser binaries:

```
playwright install
playwright install-deps
```

⚙️ Configuration

AutoData reads API credentials from environment variables and run settings from YAML config files. Set both up as follows.

LLM Provider Setup

Set your API keys in your environment:

```
# Standard Providers
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"

# OR for OpenRouter
export OPENROUTER_API_KEY="your-openrouter-key"
export OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"
```

Prefer saving keys once in a .env file (cp .env.example .env) instead of exporting them in every terminal. AutoData automatically loads .env, .env.local, and ~/.autodata/.env before parsing configs, so uv run python -m autodata.main ... always finds your credentials.
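
For readers curious how layered .env loading generally works, here is a minimal standard-library sketch. It is not AutoData's loader: the function names (`parse_env_file`, `load_env_files`) are hypothetical, and the precedence rule shown (variables already set in the shell win, earlier files win over later ones) is an assumption, since the actual precedence is not documented here.

```python
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


def parse_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Tolerate shell-style "export KEY=value" lines.
        line = line.removeprefix("export ").lstrip()
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_env_files(paths: list[Path], environ: MutableMapping[str, str]) -> list[Path]:
    """Apply each existing file in order without overwriting keys already set.

    Returns the files that were actually found and read.
    """
    loaded: list[Path] = []
    for path in paths:
        if not path.is_file():
            continue
        for key, value in parse_env_file(path).items():
            environ.setdefault(key, value)  # assumption: existing values win
        loaded.append(path)
    return loaded


# Demo against a throwaway directory so no real credentials are touched.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        '# demo values\nexport OPENAI_API_KEY="from-file"\nGOOGLE_API_KEY=g-key\n',
        encoding="utf-8",
    )
    environ = {"OPENAI_API_KEY": "from-shell"}
    loaded = load_env_files([env_path, Path(tmp) / ".env.local"], environ)
    loaded_names = [p.name for p in loaded]

print(environ, loaded_names)
```

To try the same idea against real files, pass `os.environ` together with `Path(".env")`, `Path(".env.local")`, and `Path.home() / ".autodata" / ".env"`.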

Configure your model in configs/default.yaml:

```
llm_config:
  model: "gpt-4o"
  temperature: 0.0
```
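
If you want to sanity-check these values before a run, the sketch below validates the mapping a YAML parser would typically produce for the `llm_config` section. The `LLMConfig` dataclass, the `llm_config_from_mapping` helper, and the 0.0 to 2.0 temperature bound are assumptions for illustration; AutoData's own config schema may accept more fields or different ranges.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class LLMConfig:
    model: str
    temperature: float = 0.0

    def __post_init__(self) -> None:
        if not self.model:
            raise ValueError("llm_config.model must be a non-empty string")
        # Assumed bound; providers differ in the range they accept.
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("llm_config.temperature must be between 0.0 and 2.0")


def llm_config_from_mapping(raw: dict[str, Any]) -> LLMConfig:
    """Build an LLMConfig from a parsed config mapping."""
    section = raw.get("llm_config") or {}
    return LLMConfig(
        model=str(section.get("model", "")),
        temperature=float(section.get("temperature", 0.0)),
    )


# Equivalent of the YAML snippet once parsed into Python objects.
parsed = {"llm_config": {"model": "gpt-4o", "temperature": 0.0}}
config = llm_config_from_mapping(parsed)
print(config)
```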

🏃 Usage

To run a sample task using the default configuration:

```
uv run python -m autodata.main --config configs/default.yaml
```

Inspecting Outputs

Results are saved in the outputs/ directory:

```
ls outputs/default_run/
# ├── summary.json  (Metadata & Dataset reference)
# └── artifacts/...
```
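
A run directory can also be inspected programmatically. The sketch below lists the top-level keys of summary.json and the files under artifacts/ without assuming anything about the schema. The `summarize_run` helper is hypothetical, and the keys written in the demo (`status`, `dataset`) are placeholders, not AutoData's real output format.

```python
import json
import tempfile
from pathlib import Path


def summarize_run(run_dir: Path) -> dict[str, list[str]]:
    """Report the top-level keys of summary.json and the files under artifacts/."""
    summary = json.loads((run_dir / "summary.json").read_text(encoding="utf-8"))
    artifacts_dir = run_dir / "artifacts"
    artifacts: list[str] = []
    if artifacts_dir.is_dir():
        artifacts = sorted(
            p.relative_to(artifacts_dir).as_posix()
            for p in artifacts_dir.rglob("*")
            if p.is_file()
        )
    return {"summary_keys": sorted(summary), "artifacts": artifacts}


# Demo on a throwaway directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "default_run"
    (run_dir / "artifacts").mkdir(parents=True)
    (run_dir / "summary.json").write_text(
        json.dumps({"status": "ok", "dataset": "artifacts/data.csv"}),
        encoding="utf-8",
    )
    (run_dir / "artifacts" / "data.csv").write_text("id\n1\n", encoding="utf-8")
    report = summarize_run(run_dir)

print(report)
```

For a real run, call `summarize_run(Path("outputs/default_run"))` after the command above has finished.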

🧭 Roadmap

- Plugin ecosystem hardening: finalize the third-party plugin spec (typed hooks, richer tool manifests) so integrators can ship domain packs without forking.
- Automated evaluation harness: release reproducible browser traces + artifact expectations to benchmark agent reasoning quality between versions.
- Hosted orchestrator mode: add multi-run scheduling, workspace quotas, and audit logs for teams that want to operate AutoData as a shared service.

🤝 Contributing

We welcome contributions! Please see our CONTRIBUTING.md for guidelines on how to get involved.

Development Tools

Ensure your code meets our standards before committing:

```
uv run ruff format
uv run ruff check .
uv run pytest
```

🖊️ Citation

If you use AutoData in your research, please cite our NeurIPS 2025 paper:

```
@inproceedings{autodata2025,
  title={AutoData: A Multi-Agent System for Open Web Data Collection},
  author={Ma, Tianyi and Qian, Yiyue and Zhang, Zheyuan and Wang, Zehong and Qian, Xiaoye and Bai, Feifan and Ding, Yifan and Luo, Xuwei and Zhang, Shinan and Murugesan, Keerthiram and others},
  booktitle={NeurIPS},
  year={2025},
}
```

Built with ❤️ by the AutoData Team.
