
🏥 MedAgentBoard

🎉 Our paper has been accepted to the NeurIPS 2025 Datasets & Benchmarks Track! 🎉


📄 Read the Paper → Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

Authors: Yinghao Zhu, Ziyi He, Haoran Hu, Xiaochen Zheng, Xichen Zhang, Zixiang Wang, Junyi Gao, Liantao Ma, Lequan Yu

All logs, datasets, and experimental results are open-sourced via a Google Drive link. For Task 4 (Clinical Workflow Automation), please refer to the MedAgentBoard-WorkflowAutomation repository. Please note that MIMIC-related data and results are not included, because access requires PhysioNet authorization.

Overview

MedAgentBoard is a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional (non-LLM) approaches across diverse medical tasks. The rapid advancement of Large Language Models (LLMs) has spurred interest in multi-agent collaboration for complex medical challenges. However, the practical advantages of these multi-agent systems are not yet well understood. Existing evaluations often lack generalizability to diverse real-world clinical tasks and frequently omit rigorous comparisons against both advanced single-LLM baselines and established conventional methods.

MedAgentBoard addresses this critical gap by introducing a benchmark suite covering four distinct medical task categories, utilizing varied data modalities including text, medical images, and structured Electronic Health Records (EHRs):

  1. Medical (Visual) Question Answering: Evaluating systems on answering questions from medical texts and/or medical images.
  2. Lay Summary Generation: Assessing the ability to convert complex medical texts into easily understandable summaries for patients.
  3. Structured EHR Predictive Modeling: Benchmarking predictions of clinical outcomes (e.g., mortality, readmission) using structured patient data.
  4. Clinical Workflow Automation: Evaluating the automation of multi-step clinical data analysis workflows, from data extraction to reporting.
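To make the "conventional (non-LLM)" side of the comparison concrete, the sketch below trains a plain logistic-regression classifier of the kind used as a conventional baseline for structured EHR prediction (Task 3). This is a minimal, self-contained illustration only: the features and data are invented, and the paper's actual EHR baselines are established ML models, not this toy implementation.

```python
import math

def train_logistic(X, y, lr=0.1, epochs=500):
    """Gradient-descent logistic regression on a list of feature vectors.

    Illustrative stand-in for a conventional EHR baseline; not code
    from the MedAgentBoard repository.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            err = p - yi                     # gradient of log-loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    """Threshold the linear score at 0 (i.e., probability 0.5)."""
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 if z > 0 else 0

# Toy, hypothetical "patient features" (e.g., scaled age, a lab value)
# with binary outcomes; only the first feature is predictive here.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
```

Baselines like this are cheap, interpretable, and, as the benchmark's results indicate, often remain competitive with or superior to LLM-based approaches on structured prediction tasks.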

Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios (e.g., enhancing task completeness in clinical workflow automation), it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods, which generally maintain superior performance in tasks like medical VQA and EHR-based prediction.

MedAgentBoard serves as a vital resource, offering actionable insights for researchers and practitioners. It underscores the necessity of a task-specific, evidence-based approach when selecting and developing AI solutions in medicine, highlighting that the inherent complexity and overhead of multi-agent systems must be carefully weighed against tangible performance gains.

All code, datasets, detailed prompts, and experimental results are open-sourced! If you have any questions about this paper, please feel free to contact Yinghao Zhu, yhzhu99@gmail.com.

Key Features & Contributions

  • Comprehensive Benchmark: Provides a platform for rigorous evaluation and extensive comparative analysis of multi-agent collaboration, single LLMs, and conventional methods across diverse medical tasks and data modalities.
  • Addresses Critical Gaps: Directly tackles limitations in current research concerning generalizability and the completeness of baselines by synthesizing prior work with LLM-era evaluations.
  • Clarity on Multi-Agent Efficacy: Offers a unified framework for adjudicating the often conflicting claims about the true advantages of multi-agent approaches in the rapidly evolving field of medical AI.
  • Actionable Insights: Distills experimental findings into practical guidance for researchers and practitioners to make informed decisions about selecting, developing, and deploying AI solutions in various medical settings.

Related Multi-Agent Frameworks and Baselines

The MedAgentBoard benchmark evaluates various approaches, including adaptations or implementations based on principles from the following (and other) influential multi-agent frameworks and related research. The project structure reflects implementations for some of these:

Associated Repositories

Project Structure

medagentboard/
├── ehr/                     # EHR-related multi-agent implementations
│   ├── multi_agent_colacare.py
│   ├── multi_agent_medagent.py
│   ├── multi_agent_reconcile.py
│   ├── preprocess_dataset.py
│   └── run.sh
├── laysummary/              # Lay summary generation components
│   ├── evaluation.py
│   ├── multi_agent_agentsimp.py
│   ├── preprocess_datasets.py
│   ├── run.sh
│   └── single_llm.py
├── medqa/                   # Medical QA system implementations
│   ├── evaluate.py
│   ├── multi_agent_colacare.py
│   ├── multi_agent_mdagents.py
│   ├── multi_agent_medagent.py
│   ├── multi_agent_reconcile.py
│   ├── preprocess_datasets.py
│   ├── run.sh
│   └── single_llm.py
└── utils/                   # Shared utility functions
    ├── encode_image.py
    ├── json_utils.py
    ├── llm_configs.py
    └── llm_scoring.py

Getting Started

Prerequisites

  1. Python 3.10 or higher
  2. uv package manager

Installation

# Install dependencies from uv.lock
uv sync

Environment Setup

Please set up the `.env` file with your API keys:

DEEPSEEK_API_KEY=sk-xxx
DASHSCOPE_API_KEY=sk-xxx
ARK_API_KEY=sk-xxx
# Add other API keys as needed (e.g., for GPT-4, Gemini, etc.)
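Once the keys are exported into the environment (for example via a `.env` loader), they can be read with the standard library. The helper below is a hypothetical illustration, not the repository's own loading logic (which lives in `medagentboard/utils/llm_configs.py`):

```python
import os

def load_api_key(name: str) -> str:
    """Fetch an API key from the environment, failing fast if it is missing.

    Hypothetical helper for illustration; adjust to match how your
    provider client expects to receive credentials.
    """
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key

# Example: load_api_key("DEEPSEEK_API_KEY")
```

Failing fast on a missing key surfaces configuration problems before any experiment runs, rather than midway through a batch of API calls.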

Usage

Running Medical QA

# Run all MedQA tasks (example from paper, may need specific setup)
bash medagentboard/medqa/run.sh

# Run specific MedQA task
python -m medagentboard.medqa.multi_agent_colacare --dataset PubMedQA --qa_type mc
# Refer to medqa/run.sh and run_colacare_diverse_llms.sh for more examples

Note: Clinical Workflow Automation tasks involve more complex setups; please refer to the paper and codebase for detailed instructions on reproducing those experiments.

Running Lay Summary Generation

python -m medagentboard.laysummary.multi_agent_agentsimp
# Refer to laysummary/run.sh for more examples

Running EHR Components

python -m medagentboard.ehr.multi_agent_colacare
# Refer to ehr/run.sh for more examples

Citation

If you find MedAgentBoard useful in your research, please cite our paper:

@article{zhu2025medagentboard,
  title={{MedAgentBoard}: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks},
  author={Zhu, Yinghao and He, Ziyi and Hu, Haoran and Zheng, Xiaochen and Zhang, Xichen and Wang, Zixiang and Gao, Junyi and Ma, Liantao and Yu, Lequan},
  journal={arXiv preprint arXiv:2505.12371},
  year={2025}
}
