Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory

Yexiang Liu1,2  Zekun Li3  Zhi Fang1,2  Nan Xu1,4  Ran He1,2*  Tieniu Tan1,2,5
1MAIS, Institute of Automation, Chinese Academy of Sciences
2School of Artificial Intelligence, University of Chinese Academy of Sciences
3University of California, Santa Barbara
4Beijing Wenge Technology Co., Ltd  5Nanjing University
*Corresponding Author
ACL 2025 Main 🏆 Outstanding Paper Award

Conference · arXiv · License: MIT · Python 3.11+

📑 Brief Introduction

Abstract

Recently, scaling test-time compute on Large Language Models (LLMs) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as compute scales. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks. Experimental results consistently show that as sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling times, eliminating the need for resource-intensive inference processes in practical applications. Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research encourages re-examining the role of complicated prompting, unleashes the potential of simple prompting strategies, and provides new insights for enhancing test-time scaling performance.

Contributions

  1. Comprehensive experiments. Our study covers a wide range of settings (6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks), providing sufficient evidence and context to fully support our claims.
  2. Valuable findings that break conventional wisdom. Our extensive experiments consistently demonstrate that a complex prompting strategy with higher pass@1 accuracy may not stay ahead as test-time compute scales, while simple CoT/DiP gradually dominates even with inferior initial performance.
  3. Rigorous theoretical analysis. We provide an in-depth, probability-theoretic explanation of what leads to more rapid improvements with scale.
    • 3.1 Definition of easy and hard questions by answer distribution. The difficulty of a question is not only related to pass@1 accuracy but is determined by the probability distribution over all possible answer outputs. With scaling, accuracy on easy questions increases while accuracy on hard questions decreases.
    • 3.2 Disturbed peaks of the wrong-answer distribution. Scaling performance is affected by the full answer distribution, and we quantify this effect with our theory.
  4. A practical $O(1)$ approach to predict scaling performance without resource-intensive inference; the sketch after this list illustrates the underlying intuition.
  5. Two effective and general methods that significantly improve scaling performance, verified on multiple models and datasets. Combining the two methods yields much larger improvements, e.g., raising Majority@10 accuracy from 15.2% to 61.0% with LLaMA-3-8B-Instruct on MATH-500.
    • 5.1 Adaptively scaling based on question difficulty.
    • 5.2 Dynamically selecting the optimal prompting strategy based on our theory.
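
To make the intuition behind contributions 3 and 4 concrete, here is a minimal Monte Carlo sketch (not the paper's $O(1)$ predictor; the answer distributions below are made up for illustration) of how Majority@N behaves once a question's answer distribution is known: when the correct answer is the unique mode, accuracy climbs with more samples, and when a wrong answer is the mode, accuracy falls.

```python
import numpy as np

def estimate_majority_at_n(prob, n, correct_idx=0, trials=20_000, seed=0):
    """Monte Carlo estimate of Majority@N for one question, given the
    probability distribution `prob` over its distinct final answers.
    Ties for the top vote count are counted as failures for simplicity."""
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(n, prob, size=trials)            # vote tallies over N samples
    top = counts.max(axis=1)
    unique_top = (counts == top[:, None]).sum(axis=1) == 1    # a single answer holds the lead
    wins = unique_top & (counts[:, correct_idx] == top)       # ...and it is the correct one
    return wins.mean()

# Hypothetical answer distributions (index 0 = correct answer):
easy = [0.45, 0.30, 0.25]   # correct answer is the unique mode -> accuracy rises with N
hard = [0.35, 0.40, 0.25]   # a wrong answer is the mode        -> accuracy falls with N

for name, p in [("easy", easy), ("hard", hard)]:
    print(name, [round(estimate_majority_at_n(p, n), 3) for n in (1, 5, 25)])
```

The paper's predictor avoids this sampling loop entirely, but the qualitative behavior (easy questions improve, hard questions degrade, and the gap widens with N) is the same.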

๐Ÿ” Features

๐Ÿ› ๏ธ Installation

  1. Clone the repository:

```bash
git clone https://github.com/MraDonkey/rethinking_prompting.git
cd rethinking_prompting
```

  2. Create a conda environment and install the dependencies:

```bash
conda create -n rethinking_prompting python=3.11
conda activate rethinking_prompting
pip install -r requirements.txt
```

โš™๏ธ Configuration

Before running the framework, you need to set up API keys for the LLM providers you use:

  • For vLLM models:
    • You may need to log in to Hugging Face to gain access to some gated LLMs.
  • For OpenAI models or other OpenAI-compatible API-based models:
    • Set the API key openai_api_key.
    • Set the base URL openai_base_url.
  • For Google Gemini:
    • Set google_api_key.

Also fill in the variables hf_token in main.py and base_path in dataset.py.
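
For concreteness, the assignments might look like the following (variable names are taken from the list above; the values are placeholders, and the meaning of base_path as the local data root is an assumption):

```python
# Hugging Face token (hf_token in main.py) and dataset root (base_path in dataset.py):
hf_token  = "hf_..."                              # placeholder: your Hugging Face token
base_path = "/path/to/your/data"                  # assumed: local root for datasets/results

# API credentials for hosted models (set wherever the repository defines these variables):
openai_api_key  = "sk-..."                        # OpenAI or OpenAI-compatible key
openai_base_url = "https://api.openai.com/v1"     # or your provider's endpoint
google_api_key  = "..."                           # Google Gemini API key
```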

🪛 Usage

For example, to get the inference results of all prompting strategies with Qwen2.5-7B-Instruct on GSM8K, you can run this script:

```bash
bash scripts/Qwen_GSM8K.sh
```

You can further customize hyperparameters to suit your specific requirements.

🔬 Evaluation

To evaluate the performance of all tested prompting strategies:

```bash
python eval_csv_N.py    --model_name "your_model" --dataset "your_dataset"
python eval_csv_cost.py --model_name "your_model" --dataset "your_dataset"
```

You can customize the variable sampling_times to adjust the sampling-time points plotted in the figure, in the style of Figures 1 and 2 in our paper.
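
As a rough, self-contained illustration of what sampling_times controls (this is not the repository's evaluation code, and the recorded answers below are made up), Majority@n at each point of the curve can be estimated by repeatedly subsampling n answers from the outputs saved for a question:

```python
import random
from collections import Counter

def majority_vote(answers):
    """Most frequent answer; ties are broken arbitrarily by Counter ordering."""
    return Counter(answers).most_common(1)[0][0]

def majority_at_n(samples, gold, n, repeats=500, seed=0):
    """Estimate Majority@n by subsampling n answers from the recorded samples."""
    rng = random.Random(seed)
    hits = sum(majority_vote(rng.sample(samples, n)) == gold for _ in range(repeats))
    return hits / repeats

# Hypothetical recorded outputs for a single question whose gold answer is "42":
samples = ["42"] * 20 + ["40"] * 8 + ["41"] * 4
sampling_times = [1, 3, 5, 9, 17, 32]            # x-axis points of the scaling curve
curve = [majority_at_n(samples, "42", n) for n in sampling_times]
print(list(zip(sampling_times, curve)))
```

Averaging such per-question estimates over a benchmark gives one accuracy-versus-sampling-times curve per prompting strategy, which is how the figures can be read.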


โœ’๏ธ Citation

Should you find our work beneficial to your research, we would appreciate citations to our paper and GitHub stars to support ongoing development. ⭐

@inproceedings{liu-etal-2025-rethinking,
    title = "Rethinking the Role of Prompting Strategies in {LLM} Test-Time Scaling: A Perspective of Probability Theory",
    author = "Liu, Yexiang  and
      Li, Zekun  and
      Fang, Zhi  and
      Xu, Nan  and
      He, Ran  and
      Tan, Tieniu",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.1356/",
    doi = "10.18653/v1/2025.acl-long.1356",
    pages = "27962--27994",
    ISBN = "979-8-89176-251-0"
}
