Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory
2School of Artificial Intelligence, University of Chinese Academy of Sciences
3University of California, Santa Barbara
4Beijing Wenge Technology Co., Ltd  5Nanjing University
*Corresponding Author
Recently, scaling test-time compute on Large Language Models (LLMs) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as compute scales. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks.
- Comprehensive experiments. Our study covers a wide range: 6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks, providing sufficient evidence and context to fully support our claims.
- Valuable findings that break conventional wisdom. Our extensive experiments consistently demonstrate that a complex prompting strategy with higher pass@1 accuracy is not always better under test-time scaling, while simple CoT/DiP gradually dominate even when their initial performance is inferior.
- Rigorous theoretical analysis. We provide an in-depth, probability-theoretic explanation of what leads to more rapid improvements with scale.
  - 3.1 Definition of easy and hard questions by answer distribution. The difficulty of a question is not only related to its pass@1 accuracy but is determined by the probability distribution over all possible answer outputs. Accuracy on easy questions increases with scaling, while accuracy on hard questions decreases (see the simulation sketch after this list).
  - 3.2 Disturbed peaks of the wrong-answer distribution. Scaling performance is affected by how peaked the wrong-answer distribution is, and we quantify this effect with our theory.
- A practical $O(1)$ approach to predict scaling performance without resource-intensive inference.
- Two effective and general methods that significantly improve scaling performance, verified on multiple models and datasets. Combining the two methods leads to much larger improvements, e.g., improving Majority@10 accuracy from 15.2% to 61.0% with LLaMA-3-8B-Instruct on MATH-500.
  - 5.1 Adaptively scaling based on question difficulty.
  - 5.2 Dynamically selecting the optimal prompting strategy based on our theory.
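The behavior described in 3.1 can be illustrated with a short, self-contained Monte-Carlo sketch. This is not code from this repository, and the answer distributions below are invented purely for illustration: under majority voting, a question whose answer distribution peaks on the correct answer approaches 100% accuracy as the number of samples grows, while a question whose distribution peaks on a wrong answer decays toward 0%, even if their pass@1 accuracies are similar.

```python
# Illustrative simulation of Majority@N under two invented answer distributions.
import random
from collections import Counter

def majority_at_n(answer_probs, correct, n, trials=20000, seed=0):
    """Monte-Carlo estimate of Majority@N accuracy for a single question."""
    rng = random.Random(seed)
    answers, probs = zip(*answer_probs.items())
    hits = 0
    for _ in range(trials):
        samples = rng.choices(answers, weights=probs, k=n)
        vote, _ = Counter(samples).most_common(1)[0]  # majority-voted answer
        hits += (vote == correct)
    return hits / trials

easy = {"A": 0.40, "B": 0.30, "C": 0.20, "D": 0.10}  # correct answer "A" is the mode
hard = {"A": 0.30, "B": 0.40, "C": 0.20, "D": 0.10}  # a wrong answer "B" is the mode

for n in [1, 3, 9, 27, 81]:
    print(f"N={n:2d}  easy={majority_at_n(easy, 'A', n):.3f}  hard={majority_at_n(hard, 'A', n):.3f}")
```

As N grows, the "easy" column climbs toward 1 while the "hard" column falls toward 0, which is why pass@1 accuracy alone does not predict how a strategy behaves under test-time scaling.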
- Support for multiple LLM backends (vLLM, Gemini, OpenAI, and other API-based models)
  - You can specify any model according to your needs.
- Various reasoning prompting strategies (illustrative prompt sketches follow this feature list):
  - Non-Iterative:
    - DiP: Direct Prompting
    - CoT: Chain-of-Thought Prompting
    - L2M: Least-to-Most Prompting
    - SBP: Step-Back Prompting
    - AnP: Analogical Prompting
  - Iterative:
    - ToT: Tree of Thoughts
    - S-RF: Self-Refine
    - MAD: Multi-Agent Debate
- Extensive dataset support (including GSM8K and MATH-500)
- Two different budgets for evaluation:
  - Sampling times (number of samples)
  - Computation overhead (cost)
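As a rough illustration of the non-iterative strategies listed above, the sketch below contrasts a DiP prompt with a CoT prompt. These templates are hypothetical examples for orientation only, not the prompts defined in this repository.

```python
# Hypothetical prompt templates -- the repository defines its own prompts;
# these only illustrate the difference between DiP and CoT.
def dip_prompt(question: str) -> str:
    # Direct Prompting: ask for the answer with no explicit reasoning instruction.
    return f"Question: {question}\nAnswer:"

def cot_prompt(question: str) -> str:
    # Chain-of-Thought: elicit step-by-step reasoning before the final answer.
    return (f"Question: {question}\n"
            "Let's think step by step, then state the final answer.")
```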
- Clone the repository:

  ```bash
  git clone https://github.com/MraDonkey/rethinking_prompting.git
  cd rethinking_prompting
  ```

- Create a conda environment and install dependencies:

  ```bash
  conda create -n rethinking_prompting python=3.11
  conda activate rethinking_prompting
  pip install -r requirements.txt
  ```

Before running the framework, you need to set up your API keys for the different LLM providers:
- For vLLM models:
  - You may need to log in to Hugging Face to get access to some LLMs.
- For OpenAI models or other OpenAI-compatible API-based models:
  - Set the API key: `openai_api_key`
  - Set the base URL: `openai_base_url`
- For Google Gemini:
  - Set `google_api_key`

Complete the variables `hf_token` in `main.py` and `base_path` in `dataset.py`.
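As a quick reference, the variables mentioned above might be filled in roughly as follows. The values are placeholders, and the role of `base_path` as a local data directory is an assumption; check `main.py` and `dataset.py` for where each variable actually lives.

```python
# Placeholder values only -- substitute your own credentials and paths.
hf_token = "hf_xxx"                            # Hugging Face access token (in main.py)
openai_api_key = "sk-xxx"                      # OpenAI or OpenAI-compatible API key
openai_base_url = "https://api.openai.com/v1"  # or your provider's endpoint
google_api_key = "xxx"                         # Google Gemini API key
base_path = "/path/to/data"                    # in dataset.py; its exact meaning is defined there
```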
For example, to get the inference results of all prompting strategies with Qwen2.5-7B-Instruct on GSM8K, you can run this script:

```bash
bash scripts/Qwen_GSM8K.sh
```

You can further customize hyperparameters to suit your specific requirements.
To evaluate the performance of all tested prompting strategies:

```bash
python eval_csv_N.py --model_name "your_model" --dataset "your_dataset"
python eval_csv_cost.py --model_name "your_model" --dataset "your_dataset"
```

You can customize the variable `sampling_times` to adjust the points in the figure, in the style of Figures 1 and 2 in our paper.
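For orientation, the sketch below shows the kind of computation behind a Majority@N-versus-N curve. It is not the code of `eval_csv_N.py`, and treating `sampling_times` as a list of sample counts is an assumption.

```python
# Illustrative only: tally Majority@N over the first N sampled answers per question.
from collections import Counter

sampling_times = [1, 2, 4, 8, 16]  # assumed format: the sample counts to plot

def majority_accuracy(samples_per_question, gold_answers, n):
    correct = 0
    for samples, gold in zip(samples_per_question, gold_answers):
        vote, _ = Counter(samples[:n]).most_common(1)[0]  # majority vote over first n samples
        correct += (vote == gold)
    return correct / len(gold_answers)

# Toy data: two questions with 16 sampled answers each.
samples = [["4"] * 10 + ["5"] * 6, ["7"] * 5 + ["8"] * 11]
gold = ["4", "7"]
for n in sampling_times:
    print(n, majority_accuracy(samples, gold, n))
```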
Should you find our work beneficial to your research, we would appreciate citations to our paper and GitHub stars to support ongoing development. ⭐
@inproceedings{liu-etal-2025-rethinking,
title = "Rethinking the Role of Prompting Strategies in {LLM} Test-Time Scaling: A Perspective of Probability Theory",
author = "Liu, Yexiang and
Li, Zekun and
Fang, Zhi and
Xu, Nan and
He, Ran and
Tan, Tieniu",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.1356/",
doi = "10.18653/v1/2025.acl-long.1356",
pages = "27962--27994",
ISBN = "979-8-89176-251-0"
}