APO is a research tool for automatically optimizing prompts used in creating persona-based Large Language Model (LLM) agents. This repository contains the implementation for our paper submission to the Commonsense Persona-grounded Dialogue Challenge 2025 at EMNLP 2025.
This project addresses the challenge of automatically optimizing prompts for persona-based LLM agents through an iterative optimization framework that leverages gradient-based feedback from language models. The goal is to improve the performance of agents in persona-grounded dialogue tasks without manual prompt engineering.
Task 1 (Function Calling): Agents must correctly identify and call the appropriate functions for user requests in persona-grounded scenarios, executing the necessary functions for each situation while maintaining a natural conversation flow. The optimization focuses on improving function selection accuracy and parameter extraction.
Task 2 (Dialogue Generation): Agents must generate contextually appropriate, persona-consistent responses in multi-turn conversations that require no function execution. The optimization targets response quality, persona adherence, and contextual relevance for natural, human-like interactions.
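As a rough illustration of what the Task 1 metric rewards, the sketch below scores a predicted function call against a gold call: credit requires choosing the right function, and partial credit tracks how many gold parameters were extracted correctly. The field names (`name`, `arguments`) are hypothetical; the actual schema comes from the competition starter pack.

```python
# Hypothetical scoring sketch for Task 1: function selection plus parameter
# extraction. Field names are illustrative, not the starter pack's schema.
def score_function_call(predicted: dict, gold: dict) -> float:
    if predicted.get("name") != gold.get("name"):
        return 0.0  # wrong function selected
    gold_args = gold.get("arguments", {})
    if not gold_args:
        return 1.0  # right function, no parameters to extract
    predicted_args = predicted.get("arguments", {})
    matched = sum(1 for key, value in gold_args.items() if predicted_args.get(key) == value)
    return matched / len(gold_args)
```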
- Automatically generates improvement feedback from failed samples
- Uses language models to analyze performance gaps
- Provides actionable suggestions for prompt refinement
- Task 1: Function calling optimization for persona-based agents
- Task 2: Dialogue generation optimization
- Support for optimizing both tasks simultaneously
- Beam Search: Explores multiple prompt candidates at each iteration
- Monte Carlo Selection: Stochastic sampling for robust optimization (both strategies are sketched after this feature list)
- Gradient Memory: Maintains optimization history across iterations
- Mini-batch Processing: Efficient handling of large datasets
- Automated evaluation using language model judges
- Comprehensive scoring across multiple dimensions
- Checkpointing for long-running optimizations
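A minimal sketch of how beam search and Monte Carlo selection can combine, assuming placeholder `expand` (proposes edited prompt variants) and `evaluate` (scores a prompt on a randomly sampled mini-batch) callables rather than the repository's actual interfaces:

```python
from typing import Callable, Sequence

def beam_step(
    beam: Sequence[str],
    expand: Callable[[str], list[str]],
    evaluate: Callable[[str, int], float],
    beam_width: int = 3,
    mc_samples: int = 8,
) -> list[str]:
    """One illustrative step: expand each prompt in the beam, score every
    candidate on a sampled mini-batch (Monte Carlo), keep the top beam_width."""
    candidates = [variant for prompt in beam for variant in expand(prompt)]
    scored = sorted(candidates, key=lambda c: evaluate(c, mc_samples), reverse=True)
    return scored[:beam_width]
```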
```
apo/
├── configs/                      # Configuration files for experiments
│   └── function_calling.yaml     # Main experiment configuration
├── data/                         # Competition dataset (from starter pack)
│   ├── task1_sample.json         # Task 1 sample data
│   ├── task1_train.json          # Task 1 training data
│   ├── task2_sample.json         # Task 2 sample data
│   └── task2_train.json          # Task 2 training data
├── src/
│   ├── optimization/             # Core optimization framework
│   │   ├── optimizer.py          # Main optimization orchestrator
│   │   ├── gradient_generator.py # Gradient generation from failed samples
│   │   ├── prompt_editor.py      # Prompt editing based on gradients
│   │   ├── config.py             # Configuration management
│   │   ├── checkpoint_manager.py # Optimization state persistence
│   │   └── report_generator.py   # Results and analysis generation
│   ├── optimize_prompts.py       # CLI entry point for optimization
│   ├── agents/                   # LLM agent implementations
│   ├── function_calls/           # Function calling utilities
│   ├── tasks/                    # Task runners (from competition starter pack)
│   └── npcdataset/               # Dataset utilities (from starter pack)
├── env.example                   # Environment configuration template
├── pyproject.toml                # Project dependencies and metadata
└── README.md                     # This file
```
- Python 3.12 or higher
- uv for dependency management
- OpenAI API key (or other supported LLM provider)
```bash
# Clone the repository
git clone https://github.com/scb-10x/apo.git
cd apo

# Install dependencies using uv
uv sync
```

```bash
# Copy the example environment file
cp env.example .env

# Edit .env with your actual API keys and configuration
# Required: OPENAI_API_KEY
# Optional: Adjust model settings, optimization parameters, and other configurations
```

**Important:** Never commit your .env file to version control. The env.example file contains all available configuration options with sensible defaults.
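A quick way to confirm the key is visible before launching a run; this sketch assumes python-dotenv is available, which may differ from how the project actually loads its settings:

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # read .env from the current working directory
if not os.getenv("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set; copy env.example to .env and fill it in.")
```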
```bash
# Run optimization with default configuration
uv run src/optimize_prompts.py run --task task1 --data data/task1_train.json --output results/

# Run with custom configuration file
uv run src/optimize_prompts.py run --config configs/function_calling.yaml
```

The optimization proceeds in five stages (sketched in code after this list):

- Evaluation: Assess current prompt performance on the target task
- Gradient Generation: Analyze failed samples to identify improvement areas
- Prompt Editing: Apply gradient-based feedback to refine prompts
- Validation: Test improved prompts and measure performance gains
- Iteration: Repeat until convergence or maximum iterations reached
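A minimal sketch of that loop, assuming illustrative callables (`evaluate`, `failed_cases`, `generate_gradients`, `edit_prompt`) rather than the actual interfaces in `src/optimization/`; the default thresholds mirror the configuration shown further below:

```python
from typing import Callable, Sequence

def optimize(
    prompt: str,
    samples: Sequence[dict],
    evaluate: Callable[[str, Sequence[dict]], float],
    failed_cases: Callable[[str, Sequence[dict]], list[dict]],
    generate_gradients: Callable[[str, list[dict]], list[str]],
    edit_prompt: Callable[[str, list[str]], str],
    score_threshold: float = 0.95,
    max_iterations: int = 30,
    min_improvement: float = 0.01,
) -> str:
    """Illustrative skeleton of the evaluate -> gradient -> edit -> validate loop."""
    best_prompt = prompt
    best_score = evaluate(best_prompt, samples)
    for _ in range(max_iterations):
        if best_score >= score_threshold:
            break                                              # converged
        failures = failed_cases(best_prompt, samples)          # evaluation
        gradients = generate_gradients(best_prompt, failures)  # gradient generation
        candidate = edit_prompt(best_prompt, gradients)        # prompt editing
        candidate_score = evaluate(candidate, samples)         # validation
        if candidate_score - best_score >= min_improvement:
            best_prompt, best_score = candidate, candidate_score
    return best_prompt
```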
- Language Model Gradients: Uses LLMs to generate improvement feedback (an example of such a request follows this list)
- Multi-Strategy Optimization: Combines beam search, Monte Carlo sampling, and gradient memory
- Automated Evaluation: Eliminates need for manual prompt assessment
- Checkpointing: Enables resumption of long-running optimizations
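For illustration only, here is one way a textual-gradient request could be assembled from failed samples; the real prompts live in `gradient_generator.py` and are not reproduced here, and the sample keys (`input`, `expected`, `predicted`) are hypothetical:

```python
def build_gradient_request(current_prompt: str, failures: list[dict]) -> str:
    """Assemble an LLM request asking why the current prompt failed and how to fix it."""
    failure_report = "\n".join(
        f"- input: {case['input']!r}\n  expected: {case['expected']!r}\n  got: {case['predicted']!r}"
        for case in failures
    )
    return (
        "The following system prompt was used by a persona-based agent:\n"
        f"{current_prompt}\n\n"
        "It produced incorrect outputs on these samples:\n"
        f"{failure_report}\n\n"
        "Explain which parts of the prompt caused these failures and list concrete, "
        "actionable edits that would fix them."
    )
```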
The optimization can be configured through YAML files or command-line arguments:
```yaml
# configs/function_calling.yaml
task: "task1"
data_path: "data/task1_train.json"
output_path: "results/function_calling"

# Optimization parameters
score_threshold: 0.95
max_iterations: 30
min_improvement_threshold: 0.01

# Model configuration
gradient_model: "gpt-4.1-mini"
editor_model: "gpt-4.1"
evaluator_model: "gpt-4.1-mini"

# Advanced features
enable_beam_search: true
beam_width: 3
enable_gradient_memory: true
```
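For reference, a hedged sketch of how such a file could be loaded into a typed configuration object; the repository's own `config.py` handles this, so the dataclass below is only illustrative and assumes PyYAML:

```python
from dataclasses import dataclass

import yaml  # assumption: PyYAML is installed


@dataclass
class OptimizationConfig:
    task: str
    data_path: str
    output_path: str
    score_threshold: float = 0.95
    max_iterations: int = 30
    min_improvement_threshold: float = 0.01
    gradient_model: str = "gpt-4.1-mini"
    editor_model: str = "gpt-4.1"
    evaluator_model: str = "gpt-4.1-mini"
    enable_beam_search: bool = True
    beam_width: int = 3
    enable_gradient_memory: bool = True


def load_config(path: str) -> OptimizationConfig:
    with open(path, "r", encoding="utf-8") as handle:
        return OptimizationConfig(**(yaml.safe_load(handle) or {}))
```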
```bash
# View all available options
uv run src/optimize_prompts.py run --help

# Run with specific parameters
uv run src/optimize_prompts.py run \
  --task task1 \
  --data data/task1_train.json \
  --output results/ \
  --score-threshold 0.9 \
  --max-iterations 20 \
  --gradient-model gpt-4o-mini
```

The project uses environment variables for configuration. Copy env.example to .env and customize the settings:
- Required: OPENAI_API_KEY for LLM access
- Optional: Model selection, optimization parameters, logging settings
- Advanced: Performance tuning, feature toggles, and debugging options
See env.example for the complete list of available configuration options.
Additional YAML options cover multi-task prompt optimization, advanced optimization features, and per-stage model selection:

```yaml
prompt_types:
  - "function"   # Function calling prompts
  - "dialogue"   # Dialogue generation prompts

# Enable advanced optimization features
enable_beam_search: true
enable_gradient_memory: true
enable_prompt_candidates: true
enable_gradient_mini_batch: true

# Different models for different optimization stages
gradient_model: "gpt-4o-mini"   # For gradient generation
editor_model: "gpt-4o"          # For prompt editing
evaluator_model: "gpt-4o-mini"  # For evaluation
```

The optimization process generates comprehensive reports that include:
- Performance metrics across iterations
- Gradient analysis and improvement patterns
- Prompt evolution tracking
- Failure case analysis
- Optimization convergence statistics
Results are saved in the specified output directory with detailed logging and checkpoint files.
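A minimal sketch of the kind of state a checkpoint needs to carry for resumption; the actual on-disk format of `checkpoint_manager.py` is not documented here, so the file name and JSON layout below are hypothetical:

```python
import json
from pathlib import Path


def save_checkpoint(output_dir: str, iteration: int, best_prompt: str, best_score: float) -> None:
    """Persist just enough state to resume an interrupted optimization run."""
    path = Path(output_dir) / "checkpoint.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    state = {"iteration": iteration, "best_prompt": best_prompt, "best_score": best_score}
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


def load_checkpoint(output_dir: str) -> dict | None:
    """Return the saved state, or None when starting a fresh run."""
    path = Path(output_dir) / "checkpoint.json"
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else None
```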
This project is licensed under the MIT License - see the LICENSE file for details.
- Commonsense Persona-grounded Dialogue Challenge 2025: For organizing this shared task at EMNLP 2025 and providing the starter pack and dataset
- Microsoft LMOps: For the original Prompt Optimization with Textual Gradients (ProTeGi) implementation that inspired some of our optimization approaches
Note: The data/ directory and src/tasks/, src/agents/, src/function_calls/, and src/npcdataset/ modules are from the competition's starter pack. The core research contribution is in the src/optimization/ module and src/optimize_prompts.py.