Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory
2School of Artificial Intelligence, University of Chinese Academy of Sciences
3University of California, Santa Barbara
4Beijing Wenge Technology Co., Ltd  5Nanjing University
*Corresponding Author
Recently, scaling test-time compute on Large Language Models (LLMs) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as compute scales. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks.
- Comprehensive experiments. Our study covers a wide range: 6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks, providing sufficient evidence and context to fully support our claims.
- Valuable findings that break conventional wisdom. Our extensive experiments consistently demonstrate that a complex prompting strategy with higher pass@1 accuracy is not always better under test-time scaling, while simple CoT/DiP gradually dominate even when their initial performance is inferior.
- Rigorous theoretical analysis. We provide an in-depth, probability-theoretic explanation of what leads to more rapid improvements with scale.
  - 3.1 Definition of easy and hard questions by answer distribution. The difficulty of a question is not only related to its pass@1 accuracy but is determined by the probability distribution over all possible answer outputs. Accuracy on easy questions increases with scaling, while accuracy on hard questions decreases (see the simulation sketch after this list).
  - 3.2 Disturbed peaks of the wrong-answer distribution. Scaling performance is affected by how peaked the wrong-answer distribution is, and we quantify this effect with our theory.
- A practical $O(1)$ approach to predict scaling performance without resource-intensive inference.
- Two effective and general methods that significantly improve scaling performance, verified on multiple models and datasets. Combining the two methods leads to much larger improvements, e.g., improving Majority@10 accuracy from 15.2% to 61.0% with LLaMA-3-8B-Instruct on MATH-500.
  - 5.1 Adaptively scaling based on question difficulty.
  - 5.2 Dynamically selecting the optimal prompting strategy based on our theory.
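The behavior described in 3.1 can be illustrated with a short, self-contained Monte-Carlo sketch. This is not code from this repository, and the answer distributions below are invented purely for illustration: under majority voting, a question whose answer distribution peaks on the correct answer approaches 100% accuracy as the number of samples grows, while a question whose distribution peaks on a wrong answer decays toward 0%, even if their pass@1 accuracies are similar.

```python
# Illustrative simulation of Majority@N under two invented answer distributions.
import random
from collections import Counter

def majority_at_n(answer_probs, correct, n, trials=20000, seed=0):
    """Monte-Carlo estimate of Majority@N accuracy for a single question."""
    rng = random.Random(seed)
    answers, probs = zip(*answer_probs.items())
    hits = 0
    for _ in range(trials):
        samples = rng.choices(answers, weights=probs, k=n)
        vote, _ = Counter(samples).most_common(1)[0]  # majority-voted answer
        hits += (vote == correct)
    return hits / trials

easy = {"A": 0.40, "B": 0.30, "C": 0.20, "D": 0.10}  # correct answer "A" is the mode
hard = {"A": 0.30, "B": 0.40, "C": 0.20, "D": 0.10}  # a wrong answer "B" is the mode

for n in [1, 3, 9, 27, 81]:
    print(f"N={n:2d}  easy={majority_at_n(easy, 'A', n):.3f}  hard={majority_at_n(hard, 'A', n):.3f}")
```

As N grows, the "easy" column climbs toward 1 while the "hard" column falls toward 0, which is why pass@1 accuracy alone does not predict how a strategy behaves under test-time scaling.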
- Support for multiple LLM backends (vLLM, Gemini, OpenAI, and other API-based models)
  - You can specify any model according to your needs.
- Various reasoning prompting strategies (illustrative prompt sketches follow this feature list):
  - Non-Iterative:
    - DiP: Direct Prompting
    - CoT: Chain-of-Thought Prompting
    - L2M: Least-to-Most Prompting
    - SBP: Step-Back Prompting
    - AnP: Analogical Prompting
  - Iterative:
    - ToT: Tree of Thoughts
    - S-RF: Self-Refine
    - MAD: Multi-Agent Debate
- Extensive dataset support (including GSM8K and MATH-500)
- Two different budgets for evaluation:
  - Sampling times (number of samples)
  - Computation overhead (cost)
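As a rough illustration of the non-iterative strategies listed above, the sketch below contrasts a DiP prompt with a CoT prompt. These templates are hypothetical examples for orientation only, not the prompts defined in this repository.

```python
# Hypothetical prompt templates -- the repository defines its own prompts;
# these only illustrate the difference between DiP and CoT.
def dip_prompt(question: str) -> str:
    # Direct Prompting: ask for the answer with no explicit reasoning instruction.
    return f"Question: {question}\nAnswer:"

def cot_prompt(question: str) -> str:
    # Chain-of-Thought: elicit step-by-step reasoning before the final answer.
    return (f"Question: {question}\n"
            "Let's think step by step, then state the final answer.")
```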
- Clone the repository:

  ```bash
  git clone https://github.com/MraDonkey/rethinking_prompting.git
  cd rethinking_prompting
  ```

- Create a conda environment and install dependencies:

  ```bash
  conda create -n rethinking_prompting python=3.11
  conda activate rethinking_prompting
  pip install -r requirements.txt
  ```

Before running the framework, you need to set up your API keys for the different LLM providers:
- For vLLM models:
  - You may need to log in to Hugging Face to get access to some LLMs.
- For OpenAI models or other OpenAI-compatible API-based models:
  - Set the API key: `openai_api_key`
  - Set the base URL: `openai_base_url`
- For Google Gemini:
  - Set `google_api_key`

Complete the variables `hf_token` in `main.py` and `base_path` in `dataset.py`.
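As a quick reference, the variables mentioned above might be filled in roughly as follows. The values are placeholders, and the role of `base_path` as a local data directory is an assumption; check `main.py` and `dataset.py` for where each variable actually lives.

```python
# Placeholder values only -- substitute your own credentials and paths.
hf_token = "hf_xxx"                            # Hugging Face access token (in main.py)
openai_api_key = "sk-xxx"                      # OpenAI or OpenAI-compatible API key
openai_base_url = "https://api.openai.com/v1"  # or your provider's endpoint
google_api_key = "xxx"                         # Google Gemini API key
base_path = "/path/to/data"                    # in dataset.py; its exact meaning is defined there
```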
For example, to get the inference results of all prompting strategies with Qwen2.5-7B-Instruct on GSM8K, you can run this script:

```bash
bash scripts/Qwen_GSM8K.sh
```

You can further customize hyperparameters to suit your specific requirements.
To evaluate the performance of all tested prompting strategies:

```bash
python eval_csv_N.py --model_name "your_model" --dataset "your_dataset"
python eval_csv_cost.py --model_name "your_model" --dataset "your_dataset"
```

You can customize the variable `sampling_times` to adjust the points in the figure, in the style of Figures 1 and 2 in our paper.
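For orientation, the sketch below shows the kind of computation behind a Majority@N-versus-N curve. It is not the code of `eval_csv_N.py`, and treating `sampling_times` as a list of sample counts is an assumption.

```python
# Illustrative only: tally Majority@N over the first N sampled answers per question.
from collections import Counter

sampling_times = [1, 2, 4, 8, 16]  # assumed format: the sample counts to plot

def majority_accuracy(samples_per_question, gold_answers, n):
    correct = 0
    for samples, gold in zip(samples_per_question, gold_answers):
        vote, _ = Counter(samples[:n]).most_common(1)[0]  # majority vote over first n samples
        correct += (vote == gold)
    return correct / len(gold_answers)

# Toy data: two questions with 16 sampled answers each.
samples = [["4"] * 10 + ["5"] * 6, ["7"] * 5 + ["8"] * 11]
gold = ["4", "7"]
for n in sampling_times:
    print(n, majority_accuracy(samples, gold, n))
```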
Should you find our work beneficial to your research, we would appreciate citations to our paper and GitHub stars to support ongoing development. ⭐
@inproceedings{liu-etal-2025-rethinking,
title = "Rethinking the Role of Prompting Strategies in {LLM} Test-Time Scaling: A Perspective of Probability Theory",
author = "Liu, Yexiang and
Li, Zekun and
Fang, Zhi and
Xu, Nan and
He, Ran and
Tan, Tieniu",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.1356/",
doi = "10.18653/v1/2025.acl-long.1356",
pages = "27962--27994",
ISBN = "979-8-89176-251-0"
}