PolyMath is a curated dataset of 11,090 high-difficulty mathematical problems designed for training reasoning models, built for the AIMO Math Corpus Prize. Existing math datasets (NuminaMath-1.5, OpenMathReasoning) suffer from high noise rates in their hardest samples and from largely unusable proof-based problems. PolyMath addresses both issues through:
- Data scraping: problems sourced from official competition PDFs absent from popular datasets, using a human-in-the-loop pipeline
- Proof-to-answer conversion: automated pipeline converting proof-based math problems into verifiable final-answer format
- Apex filtering: multi-round solve-and-filter pipeline and manual inspection to remove easy problems and noise
- Problem revision: automated pipeline introducing background stories that increase complexity and reduce memorization effects
The dataset is curated from nvidia/OpenMathReasoning, AI-MO/NuminaMath-1.5, and >1.9k original contributions.
We use UV to manage dependencies via pyproject.toml. Run the scripts with `uv run`. Install UV through:
- macOS and Linux:
  ```shell
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```
- Windows:
  ```shell
  powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
  ```
The dataset is constructed in four stages. All stages can be run automatically using bash scripts/filter.sh, which runs all the necessary scripts in the correct order. Below is a brief overview of each stage and how to run them individually. Importantly, you should set the following environment variables for LLM queries in the pipeline: OPENAI_API_KEY, GOOGLE_API_KEY, and TOGETHER_API_KEY.
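For example, on macOS/Linux you can export the keys before running the pipeline (the values below are placeholders for your own keys):

```shell
export OPENAI_API_KEY="sk-your-openai-key"
export GOOGLE_API_KEY="your-google-key"
export TOGETHER_API_KEY="your-together-key"
```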
The data collection process downloads and cleans data from NuminaMath-1.5 and OpenMathReasoning, and combines them into a single dataset together with our original contributions. The cleaned and combined dataset is stored in outputs/current.jsonl.
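Conceptually, the combine step concatenates the cleaned per-source JSONL files and drops exact duplicates. A minimal sketch (the `problem` field name is an assumption; see scripts/data_collection/combine.py for the real logic):

```python
import json

def combine_jsonl(paths, out_path):
    """Merge several JSONL files, keeping the first copy of each problem."""
    seen = set()
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as f:
                for line in f:
                    rec = json.loads(line)
                    key = rec["problem"]  # assumed field name for the statement
                    if key in seen:
                        continue
                    seen.add(key)
                    out.write(json.dumps(rec) + "\n")
```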
```shell
uv run scripts/data_collection/clean_numina.py
uv run scripts/data_collection/clean_openmath.py
uv run scripts/data_collection/combine.py
```
Convert proof-based problems into a standardized format with parseable LaTeX answers:
```shell
uv run python scripts/problem_conversion/run.py
cp outputs/conversion/converted.jsonl outputs/current.jsonl
```
Then, run problem revision to increase difficulty and diversity:
```shell
uv run scripts/revision/revise.py
```
The revised problems are stored in outputs/revision.jsonl.
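The revision step is prompt-driven. A hypothetical, simplified template (not the actual prompt used by scripts/revision/revise.py) might look like:

```python
# Hypothetical revision prompt: wraps a problem in a background story
# while preserving its mathematics and final answer.
REVISION_PROMPT = """\
Rewrite the math problem below by embedding it in a short background story.
Keep the underlying mathematics and the final answer unchanged, but change
names, surface details, and phrasing so the problem is harder to recognize.

Problem:
{problem}

Return only the rewritten problem statement.
"""

def build_revision_prompt(problem: str) -> str:
    return REVISION_PROMPT.format(problem=problem)
```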
Solves problems with LLMs and grades the outputs to remove easy samples, then applies LLM-assisted filters to remove broken problems.
The full solve-filter loop is automated in scripts/apex_filter/run_pipeline.sh.
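In spirit, each round samples several solutions per problem and keeps only problems that models fail often enough. A minimal sketch (the solve-rate threshold and field names are assumptions, not the pipeline's actual values):

```python
def apex_filter(problems, solve, attempts=4, max_solve_rate=0.5):
    """Keep problems whose empirical solve rate is at most max_solve_rate.

    `solve` is any callable that takes a problem statement and returns a
    final answer string (e.g. a wrapper around an LLM call).
    """
    kept = []
    for prob in problems:
        correct = sum(
            solve(prob["problem"]) == prob["answer"] for _ in range(attempts)
        )
        if correct / attempts <= max_solve_rate:
            kept.append(prob)
    return kept
```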
Additionally, several other filters are executed:
- Parser check:
  ```shell
  scripts/apex_filter/parser_check.py
  ```
- Equal wrong check:
  ```shell
  scripts/apex_filter/equal_wrong_check.py
  ```
The first runs an LLM judge to make sure the automated parser did not produce any false negatives; flagged problems are filtered further. The second checks whether the wrong solutions are all the same, which indicates that there might be a problem with the problem statement.
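The equal-wrong heuristic can be sketched as follows (a simplification; the real script in scripts/apex_filter/ may differ):

```python
def all_wrong_answers_agree(attempts, reference):
    """Return True if every wrong attempt produced the same answer.

    When several independent samples are wrong in exactly the same way,
    the reference answer (or the statement itself) is likely at fault.
    """
    wrong = [a for a in attempts if a != reference]
    return len(wrong) >= 2 and len(set(wrong)) == 1
```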
We built a Flask app for browsing, inspecting, and editing datasets. It displays all problems in the dataset, supporting filters across all fields. The app runs locally on port 7860.
```shell
uv run app --dataset /path/to/data.jsonl --output-path /path/to/edited.jsonl
```
This deep manual inspection led to several rounds of additional filtering:
- `uv run python scripts/data_filter/llm_check.py`: runs an LLM check to further filter out answers that do not match the problem statement and cannot possibly be correct.
- `uv run python scripts/data_filter/dedup.py`: a simple deduplication script based on fuzzy matching.
- `uv run python scripts/data_filter/second_llm_check.py`: runs a second round of LLM checking to ensure that the solution and the answer say the same thing, which is a common issue we observed in the dataset.
- `uv run python scripts/data_filter/decontaminate.py`: decontamination against our own evaluation suite.
- `uv run python scripts/data_filter/identify_multipart.py`: identifies and removes multipart problems that ask for multiple answers, which are not suitable for training current LLMs.
- `uv run python scripts/data_filter/final_filter.py`: final round of manual inspection to remove any remaining noise.
- `uv run python scripts/data_filter/merge_datasets.py`: ensures the revisions also go through the same filtering process.
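Fuzzy deduplication can be approximated with the standard library. A minimal sketch (scripts/data_filter/dedup.py may use a different similarity measure and threshold):

```python
from difflib import SequenceMatcher

def fuzzy_dedup(problems, threshold=0.9):
    """Drop problems that are near-duplicates of an already-kept problem."""
    kept = []
    for text in problems:
        is_dup = any(
            SequenceMatcher(None, text, prev).ratio() >= threshold
            for prev in kept
        )
        if not is_dup:
            kept.append(text)
    return kept
```

Note that this pairwise comparison is O(n²); at dataset scale, real pipelines typically bucket candidates with a cheap hash or n-gram signature first.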
The final datasets are stored in outputs/current.jsonl and outputs/revisions.jsonl.
In our pipeline, you need to query LLMs. src/corpus_prize/api_client.py supports most common API queries. You can add or edit the model configs in configs/models. Before running the pipeline, set environment variables for the LLM providers you use (OPENAI_API_KEY, GOOGLE_API_KEY, ANTHROPIC_API_KEY, TOGETHER_API_KEY, OPENROUTER_API_KEY, XAI_API_KEY, DEEPSEEK_API_KEY, GLM_API_KEY). For local vLLM, set VLLM_API_KEY=EMPTY.
When using local vLLM, you can use the following command to start a vLLM server:
```shell
uv run --extra vllm vllm serve openai/gpt-oss-120b \
  --host 127.0.0.1 \
  --port 8000
```
Evaluate a model on a specific dataset. Results are persisted, and a summary of solve rates across problems is reported. Model configs live under configs/models. The dataset path can be a local path to a .jsonl file or a HuggingFace repo (like AIMO-Corpus/PolyMath-eval).
```shell
bash scripts/eval/run_eval.sh --dataset /path/to/data.jsonl --output-path /path/to/output/dir --model configs/models/path/to/config.yaml --attempts 4
```
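If you want to aggregate per-problem solve rates from the persisted results yourself, the computation is a simple group-by. A sketch, assuming a JSONL results file with `problem_id` and `correct` fields (the actual output schema may differ):

```python
import json
from collections import defaultdict

def solve_rates(results_path):
    """Map each problem id to its fraction of correct attempts."""
    stats = defaultdict(lambda: [0, 0])  # problem_id -> [correct, total]
    with open(results_path) as f:
        for line in f:
            rec = json.loads(line)
            stats[rec["problem_id"]][0] += int(rec["correct"])
            stats[rec["problem_id"]][1] += 1
    return {pid: c / n for pid, (c, n) in stats.items()}
```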
Key locations:
- scripts/: data pipeline and filter scripts
- src/: core library code for dataset processing and LLM utilities
- configs/: pipeline and model configuration files
- outputs/: generated datasets and intermediate artifacts
