Dayeon Ki, Rachel Rudinger, Tianyi Zhou, Marine Carpuat
University of Maryland
This repository contains the code and dataset for our ACL 2025 Main paper
Multiple LLM Agents Debate for Equitable Cultural Alignment.
While previous efforts in cultural alignment have focused on single-model, single-turn approaches, we propose to exploit the complementary strengths of multiple LLMs to promote cultural adaptability. We introduce a Multi-Agent Debate framework, where two LLM-based agents debate over a cultural scenario and collaboratively reach a final decision, which improves both (i) overall accuracy and (ii) cultural group parity over single-model baselines.
- 2025-07-10: Our paper has been selected for an oral presentation (top 8% of accepted papers)!
- 2025-05-15: Our paper is accepted to ACL 2025! See you in Vienna!
How can multiple LLMs collaborate toward equitable alignment across cultures? We investigate a common form of multi-LLM collaboration: debate. We propose a Multi-Agent Debate framework, where two LLM agents debate over the given scenario and collaboratively arrive at a final decision with a judge LLM. We introduce two key variants as illustrated in the above figure:
- Debate-Only: multiple LLM agents exclusively engage in debate with a discussant
- Self-Reflect+Debate: each LLM agent dynamically chooses between self-reflection and debating during its turn
For a more comprehensive comparison, we investigate two additional single-LLM strategies:
- Single Model: a single LLM generates outputs
- Self-Reflection: an LLM generates verbal self-reflections on its own outputs and incorporates them in subsequent iterations
For evaluation, we use the NORMAD-ETI dataset, a benchmark designed to assess the cultural adaptability of LLMs. The dataset contains 2.6K stories reflecting social and cultural norms from 75 countries, derived from the social-etiquette norms outlined in the Cultural Atlas. Each story is associated with a country, a rule-of-thumb, and a ternary ground truth label in {Yes, No, Neither} as shown in the figure above. We categorize the 75 countries according to the Inglehart-Welzel cultural map and show the label and country distribution for each bin.
- Raw data: data/normad_raw.csv
- Country distribution: data/normad_country_dist.csv
- Refined data: data/normad.jsonl
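Each record in the refined data pairs a story with its country, rule-of-thumb, and ternary label. The sketch below uses hypothetical field names (story, country, rule_of_thumb, gold_label; the actual keys in data/normad.jsonl may differ) to show how one might tally the label distribution:

```python
from collections import Counter

# Hypothetical records mirroring the format described above; the actual
# key names in data/normad.jsonl may differ.
records = [
    {"story": "A guest removes their shoes before entering the host's home.",
     "country": "Japan",
     "rule_of_thumb": "It is customary to remove shoes indoors.",
     "gold_label": "Yes"},
    {"story": "A diner leaves a 20% tip at a restaurant.",
     "country": "South Korea",
     "rule_of_thumb": "Tipping is not expected.",
     "gold_label": "No"},
]

# Tally the ternary label distribution over {Yes, No, Neither}.
label_counts = Counter(r["gold_label"] for r in records)
print(dict(label_counts))
```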
We first investigate the effect of adding relevant cultural context on the cultural alignment of LLMs. We test two variants: without and with rule-of-thumb (RoT) information in the prompts. (single_llm/single_model/)
For running without RoT prompting,
python -u single_llm/single_model/{$LLM}.py \
--input_path $PATH_TO_INPUT_FILE \
--output_path $PATH_TO_OUTPUT_FILE \
--type without_rot

For running with RoT prompting,
python -u single_llm/single_model/{$LLM}.py \
--input_path $PATH_TO_INPUT_FILE \
--output_path $PATH_TO_OUTPUT_FILE \
--type with_rot

Arguments for the prompting code are as follows:
- $LLM: Name of the LLM (specific names can be found in the directory).
- --input_path: Path to the input data file (data/normad.jsonl).
- --output_path: Save path of the output file.
- --type: Without or with RoT information (without_rot or with_rot).
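The two prompt variants differ only in whether the rule-of-thumb is included. A minimal sketch of such a prompt builder (the wording is illustrative, not the exact prompt used in the paper):

```python
from typing import Optional

def build_prompt(story: str, rule_of_thumb: Optional[str], with_rot: bool) -> str:
    """Build a ternary cultural-acceptability prompt (illustrative wording)."""
    prompt = ("Is the behavior in the following story socially acceptable "
              "in the given country? Answer Yes, No, or Neither.\n\n")
    if with_rot and rule_of_thumb:
        # The with_rot variant adds the relevant cultural rule-of-thumb.
        prompt += f"Rule of thumb: {rule_of_thumb}\n"
    prompt += f"Story: {story}\nAnswer:"
    return prompt

print(build_prompt("A guest brings a gift.", "Gift-giving is customary.", with_rot=True))
```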
Building on previous work showing that LLMs can evaluate their outputs and learn from their own feedback, we explore self-reflection for each LLM. (single_llm/self_reflection/)
python -u single_llm/self_reflection/{$LLM}.py \
--input_path $PATH_TO_INPUT_FILE \
--output_path $PATH_TO_OUTPUT_FILE

Arguments for the prompting code are as follows:
- $LLM: Name of the LLM (specific names can be found in the directory).
- --input_path: Path to the input data file (data/normad.jsonl).
- --output_path: Save path of the output file.
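At a high level, self-reflection alternates generation, critique, and revision. A minimal sketch with `llm` standing in for any prompt-to-completion callable (a hypothetical stub here; the reflection wording is illustrative):

```python
def self_reflect(llm, prompt: str, n_iters: int = 2) -> str:
    """Generate an answer, then iteratively critique and revise it.

    `llm` is any callable mapping a prompt string to a completion string.
    """
    answer = llm(prompt)
    for _ in range(n_iters):
        # Ask the model to critique its own previous answer.
        critique = llm(f"{prompt}\nYour previous answer: {answer}\n"
                       "Reflect on possible cultural mistakes in this answer.")
        # Incorporate the self-reflection into the next iteration.
        answer = llm(f"{prompt}\nPrevious answer: {answer}\nReflection: {critique}\n"
                     "Give a revised final answer (Yes, No, or Neither):")
    return answer

# Toy stub so the sketch runs end to end.
print(self_reflect(lambda p: "Neither", "Is this behavior acceptable?"))
```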
LLMs often exhibit varying knowledge coverage, with the potential to complement each other due to differences in training data distributions and alignment processes. We tap into this knowledge complementarity through a form of multi-LLM collaboration, debate, in which two LLM-based agents debate and collaboratively evaluate the given scenario.
python -u multi_llm/{$FIRST_LLM}_{$SECOND_LLM}.py \
--input_path $PATH_TO_INPUT_FILE \
--output_path $PATH_TO_OUTPUT_FILE

Arguments for the prompting code are as follows:
- $FIRST_LLM: Name of the first participant LLM (specific names can be found in the directory).
- $SECOND_LLM: Name of the second participant LLM (specific names can be found in the directory).
- --input_path: Path to the input data file (data/normad.jsonl).
- --output_path: Save path of the output file.
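The debate loop alternates turns between the two agents, then a judge issues the final ternary decision. A minimal sketch with `agent_a`, `agent_b`, and `judge` as hypothetical prompt-to-completion callables (prompt wording is illustrative):

```python
def debate(agent_a, agent_b, judge, scenario: str, n_rounds: int = 2) -> str:
    """Two agents exchange arguments over a scenario; a judge decides.

    All three callables map a prompt string to a completion string.
    """
    transcript = []
    for _ in range(n_rounds):
        for name, agent in (("A", agent_a), ("B", agent_b)):
            history = "\n".join(transcript)
            turn = agent(f"Scenario: {scenario}\nDebate so far:\n{history}\n"
                         "Give your argument and current answer (Yes/No/Neither):")
            transcript.append(f"Agent {name}: {turn}")
    # The judge reads the full transcript and issues the final decision.
    return judge(f"Scenario: {scenario}\n" + "\n".join(transcript) +
                 "\nFinal decision (Yes, No, or Neither):")

# Toy stubs so the sketch runs end to end.
print(debate(lambda p: "Yes", lambda p: "No", lambda p: "Neither",
             "A diner leaves a 20% tip."))
```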
- For evaluating single-LLM baselines, use evaluate/accuracy_single.py. Add the model names to test in the MODEL_NAMES variable and run: python evaluate/accuracy_single.py
- For evaluating multi-LLM baselines, use evaluate/accuracy_multi.py. Add the name of the first model as FIRST_MODEL and the name of the second model as SECOND_MODEL, then run: python evaluate/accuracy_multi.py
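Beyond overall accuracy, the paper reports cultural group parity across the Inglehart-Welzel bins. One plausible way to compute per-bin accuracy and a min/max parity ratio (the exact parity definition may differ from the one used in the evaluation scripts):

```python
from collections import defaultdict

def group_accuracies(examples):
    """Accuracy per cultural bin; `examples` holds (bin, gold, prediction) triples."""
    correct, total = defaultdict(int), defaultdict(int)
    for bin_name, gold, pred in examples:
        total[bin_name] += 1
        correct[bin_name] += int(gold == pred)
    return {b: correct[b] / total[b] for b in total}

def parity(accs):
    """Min/max accuracy ratio across bins: 1.0 means perfectly equitable."""
    return min(accs.values()) / max(accs.values())

# Hypothetical predictions over two cultural bins.
examples = [
    ("Confucian", "Yes", "Yes"), ("Confucian", "No", "No"),
    ("Latin America", "Yes", "No"), ("Latin America", "No", "No"),
]
accs = group_accuracies(examples)
print(accs, parity(accs))
```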
If you find our work useful in your research, please consider citing:
@inproceedings{ki-etal-2025-multiple,
title = "Multiple {LLM} Agents Debate for Equitable Cultural Alignment",
author = "Ki, Dayeon and
Rudinger, Rachel and
Zhou, Tianyi and
Carpuat, Marine",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.1210/",
doi = "10.18653/v1/2025.acl-long.1210",
pages = "24841--24877",
ISBN = "979-8-89176-251-0",
}
For questions, issues, or collaborations, please reach out to dayeonki@umd.edu.


