- 2025.07: We released this GitHub repo to collect papers related to self-evolving agents. Feel free to cite our survey or open pull requests.
- A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
- 1. Introduction
- 2. Definitions and Foundations
- 3. What to Evolve?
- 4. When to Evolve?
- 5. How to Evolve?
- 6. Where to Evolve?
- 7. Evaluation of Self-evolving Agents
- 8. Future Directions
- Curriculum Learning for Cooperation in Multi-Agent Reinforcement Learning
- Lifelong Learning of Large Language Model-based Agents: A Roadmap
- Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
- Reflexion: Language Agents with Verbal Reinforcement Learning
- AdaPlanner: Adaptive Planning from Feedback with Language Models
- Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments
- DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning
- Self-evolving Agents with reflective and memory-augmented abilities
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
- Large Language Models Are Semi-Parametric Reinforcement Learning Agents
- Automatic Prompt Optimization with "Gradient Descent" and Beam Search
- PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization
- REVOLVE: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization
- DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
- Voyager: An Open-Ended Embodied Agent with Large Language Models
- Advanced Tool Learning and Selection System (ATLASS): A Closed-Loop Framework Using LLM
- From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
- AgentSquare: Automatic LLM Agent Search in Modular Design Space
- Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models
- Towards Completeness-Oriented Tool Retrieval for Large Language Models
- AgentSquare: Automatic LLM Agent Search in Modular Design Space
- Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement
- AlphaEvolve: A coding agent for scientific and algorithmic discovery
- Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies
- AutoFlow: Automated Workflow Generation for Large Language Model Agents
- ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization
- ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning
- AdaPlanner: Adaptive Planning from Feedback with Language Models
- Reflexion: Language Agents with Verbal Reinforcement Learning
- Test-Time Training on Nearest Neighbors for Large Language Models
- Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
- LADDER: Self-Improving LLMs Through Recursive Problem Decomposition
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
- SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning
- RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
- DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning
- WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
- Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
- Training Language Models to Self-Correct via Reinforcement Learning
- PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
- Self-ensemble: Mitigating Confidence Distortion for Large Language Models
- Scalable Best-of-N Selection for Large Language Models via Self-Certainty
- Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
- Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning
- SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development
- Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback
- AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning
- Reward Is Enough: LLMs Are In-Context Reinforcement Learners
- Generalist Reward Models: Found Inside Large Language Models
- AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension
- Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
- SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning
- Bridging the Gap: Self-Optimized Fine-Tuning for LLM-based Recommender Systems
- Recursive Introspection: Teaching Language Model Agents How to Self-Improve
- Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models
- Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
- Nature-Inspired Population-Based Evolution of Large Language Models
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
- SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning
- Language Models can Self-Improve at State-Value Estimation for Better Search
- Self-Evolving Multi-Agent Collaboration Networks for Software Development
- Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions
- Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
- WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
- WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model
- MobileSteward: Integrating Multiple App-Oriented Agents with Self-Evolution
- Intelligent Virtual Assistants with LLM-based Process Automation
- UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
- Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance
- AlphaEvolve: A Learning Framework to Discover Novel Alphas in Quantitative Investment
- Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions
- LLMs Can Simulate Standardized Patients via Agent Coevolution
- SEW: Self-Evolving Agentic Workflows for Automated Code Generation
- AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
- QuantAgent: Seeking Holy Grail in Trading by Self-Improving Large Language Model
- Voyager: An Open-Ended Embodied Agent with Large Language Models
- Learning to Be A Doctor: Searching for Effective Medical Agent Architectures
- Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents
- One Size Doesn't Fit All: A Personalized Conversational Tutoring Agent for Mathematics Instruction
- MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
- DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? (ICLR 2025)
- ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery (ICLR 2025) [Code]
- AppBench: Planning of Multiple APIs from Various APPs for Complex User Instruction (EMNLP 2025)
- MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
- WebArena: A Realistic Web Environment for Building Autonomous Agents
- ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
- xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations
- BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
- LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners
- TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [Code]
- Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark
- T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step (ACL 2024)
- StoryBench: A Multifaceted Benchmark for Continuous Story Visualization
- MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents (ACL 2025)
- Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models
- AutoPal: Autonomous Adaptation to Users for Personal AI Companionship
- Bleu: a Method for Automatic Evaluation of Machine Translation
- Position: Scaling LLM Agents Requires Asymptotic Analysis with LLM Primitives
- Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios
- Foundational Challenges in Assuring Alignment and Safety of Large Language Models
- Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios
- AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents
- Foundational Challenges in Assuring Alignment and Safety of Large Language Models
- Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents
- Self-Evolving Multi-Agent Collaboration Networks for Software Development
- MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents
To cite the survey, you can use the following BibTeX entry.
@misc{gao2025surveyselfevolvingagentspath,
title={A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence},
author={Huan-ang Gao and Jiayi Geng and Wenyue Hua and Mengkang Hu and Xinzhe Juan and Hongzhang Liu and Shilong Liu and Jiahao Qiu and Xuan Qi and Yiran Wu and Hongru Wang and Han Xiao and Yuhang Zhou and Shaokun Zhang and Jiayi Zhang and Jinyu Xiang and Yixiong Fang and Qiwen Zhao and Dongrui Liu and Qihan Ren and Cheng Qian and Zhenghailong Wang and Minda Hu and Huazheng Wang and Qingyun Wu and Heng Ji and Mengdi Wang},
year={2025},
eprint={2507.21046},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2507.21046},
}
