PolyMath is a curated dataset of 11,090 high-difficulty mathematical problems designed for training reasoning models, built for the AIMO Math Corpus Prize. Existing math datasets (NuminaMath-1.5, OpenMathReasoning) suffer from high noise rates in their hardest samples and from largely unusable proof-based problems. PolyMath addresses both issues through:
- Data scraping: problems sourced from official competition PDFs absent from popular datasets, using a human-in-the-loop pipeline
- Proof-to-answer conversion: automated pipeline converting proof-based math problems into verifiable final-answer format
- Apex filtering: multi-round solve-and-filter pipeline and manual inspection to remove easy problems and noise
- Problem revision: automated pipeline introducing background stories that increase complexity and reduce memorization effects
The dataset is curated from nvidia/OpenMathReasoning, AI-MO/NuminaMath-1.5, and >1.9k original contributions.
We use UV to manage dependencies via pyproject.toml. Run the scripts with `uv run`. Install UV through:
- macOS and Linux:
  ```shell
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```
- Windows:
  ```shell
  powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
  ```
The dataset is constructed in four stages. All stages can be run automatically using bash scripts/filter.sh, which runs all the necessary scripts in the correct order. Below is a brief overview of each stage and how to run them individually. Importantly, you should set the following environment variables for LLM queries in the pipeline: OPENAI_API_KEY, GOOGLE_API_KEY, and TOGETHER_API_KEY.
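For example, on macOS/Linux you can export the keys before running the pipeline (the values below are placeholders for your own keys):

```shell
export OPENAI_API_KEY="sk-your-openai-key"
export GOOGLE_API_KEY="your-google-key"
export TOGETHER_API_KEY="your-together-key"
```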
The data collection process downloads and cleans data from NuminaMath-1.5 and OpenMathReasoning, and combines them into a single dataset together with our original contributions. The cleaned and combined dataset is stored in outputs/current.jsonl.
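Conceptually, the combine step concatenates the cleaned per-source JSONL files and drops exact duplicates. A minimal sketch (the `problem` field name is an assumption; see scripts/data_collection/combine.py for the real logic):

```python
import json

def combine_jsonl(paths, out_path):
    """Merge several JSONL files, keeping the first copy of each problem."""
    seen = set()
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as f:
                for line in f:
                    rec = json.loads(line)
                    key = rec["problem"]  # assumed field name for the statement
                    if key in seen:
                        continue
                    seen.add(key)
                    out.write(json.dumps(rec) + "\n")
```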
```shell
uv run scripts/data_collection/clean_numina.py
uv run scripts/data_collection/clean_openmath.py
uv run scripts/data_collection/combine.py
```
Convert proof-based problems into a standardized format with parseable LaTeX answers:
```shell
uv run python scripts/problem_conversion/run.py
cp outputs/conversion/converted.jsonl outputs/current.jsonl
```
Then, run problem revision to increase difficulty and diversity:
```shell
uv run scripts/revision/revise.py
```
The revised problems are stored in outputs/revision.jsonl.
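The revision step is prompt-driven. A hypothetical, simplified template (not the actual prompt used by scripts/revision/revise.py) might look like:

```python
# Hypothetical revision prompt: wraps a problem in a background story
# while preserving its mathematics and final answer.
REVISION_PROMPT = """\
Rewrite the math problem below by embedding it in a short background story.
Keep the underlying mathematics and the final answer unchanged, but change
names, surface details, and phrasing so the problem is harder to recognize.

Problem:
{problem}

Return only the rewritten problem statement.
"""

def build_revision_prompt(problem: str) -> str:
    return REVISION_PROMPT.format(problem=problem)
```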
Solves problems with LLMs and grades the outputs to remove easy samples, then applies LLM-assisted filters to remove broken problems.
The full solve-filter loop is automated in scripts/apex_filter/run_pipeline.sh.
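In spirit, each round samples several solutions per problem and keeps only problems that models fail often enough. A minimal sketch (the solve-rate threshold and field names are assumptions, not the pipeline's actual values):

```python
def apex_filter(problems, solve, attempts=4, max_solve_rate=0.5):
    """Keep problems whose empirical solve rate is at most max_solve_rate.

    `solve` is any callable that takes a problem statement and returns a
    final answer string (e.g. a wrapper around an LLM call).
    """
    kept = []
    for prob in problems:
        correct = sum(
            solve(prob["problem"]) == prob["answer"] for _ in range(attempts)
        )
        if correct / attempts <= max_solve_rate:
            kept.append(prob)
    return kept
```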
Additionally, several other filters are executed:
- Parser check:
  ```shell
  scripts/apex_filter/parser_check.py
  ```
- Equal wrong check:
  ```shell
  scripts/apex_filter/equal_wrong_check.py
  ```
The first runs an LLM judge to make sure the automated parser did not produce any false negatives; flagged problems are filtered further. The second checks whether the wrong solutions are all the same, which indicates that there might be a problem with the problem statement.
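The equal-wrong heuristic can be sketched as follows (a simplification; the real script in scripts/apex_filter/ may differ):

```python
def all_wrong_answers_agree(attempts, reference):
    """Return True if every wrong attempt produced the same answer.

    When several independent samples are wrong in exactly the same way,
    the reference answer (or the statement itself) is likely at fault.
    """
    wrong = [a for a in attempts if a != reference]
    return len(wrong) >= 2 and len(set(wrong)) == 1
```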
We built a Flask app for browsing, inspecting, and editing datasets. It displays all problems in the dataset, supporting filters across all fields. The app runs locally on port 7860.
```shell
uv run app --dataset /path/to/data.jsonl --output-path /path/to/edited.jsonl
```
This deep manual inspection led to several rounds of additional filtering:
- `uv run python scripts/data_filter/llm_check.py`: runs an LLM check to further filter out answers that do not match the problem statement and cannot possibly be correct.
- `uv run python scripts/data_filter/dedup.py`: a simple deduplication script based on fuzzy matching.
- `uv run python scripts/data_filter/second_llm_check.py`: runs a second round of LLM checking to ensure that the solution and the answer say the same thing, which is a common issue we observed in the dataset.
- `uv run python scripts/data_filter/decontaminate.py`: decontamination against our own evaluation suite.
- `uv run python scripts/data_filter/identify_multipart.py`: identifies and removes multipart problems that ask for multiple answers, which are not suitable for training current LLMs.
- `uv run python scripts/data_filter/final_filter.py`: final round of manual inspection to remove any remaining noise.
- `uv run python scripts/data_filter/merge_datasets.py`: ensures the revisions also go through the same filtering process.
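Fuzzy deduplication can be approximated with the standard library. A minimal sketch (scripts/data_filter/dedup.py may use a different similarity measure and threshold):

```python
from difflib import SequenceMatcher

def fuzzy_dedup(problems, threshold=0.9):
    """Drop problems that are near-duplicates of an already-kept problem."""
    kept = []
    for text in problems:
        is_dup = any(
            SequenceMatcher(None, text, prev).ratio() >= threshold
            for prev in kept
        )
        if not is_dup:
            kept.append(text)
    return kept
```

Note that this pairwise comparison is O(n²); at dataset scale, real pipelines typically bucket candidates with a cheap hash or n-gram signature first.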
The final datasets are stored in outputs/current.jsonl and outputs/revisions.jsonl.
In our pipeline, you need to query LLMs. src/corpus_prize/api_client.py supports most common API queries. You can add or edit the model configs in configs/models. Before running the pipeline, set environment variables for the LLM providers you use (OPENAI_API_KEY, GOOGLE_API_KEY, ANTHROPIC_API_KEY, TOGETHER_API_KEY, OPENROUTER_API_KEY, XAI_API_KEY, DEEPSEEK_API_KEY, GLM_API_KEY). For local vLLM, set VLLM_API_KEY=EMPTY.
When using local vLLM, you can use the following command to start a vLLM server:
```shell
uv run --extra vllm vllm serve openai/gpt-oss-120b \
  --host 127.0.0.1 \
  --port 8000
```
Evaluate a model on a specific dataset. Results are persisted, and a summary of solve rates across problems is reported. Model configs live under configs/models. The dataset path can be a local path to a .jsonl file or a HuggingFace repo (like AIMO-Corpus/PolyMath-eval).
```shell
bash scripts/eval/run_eval.sh --dataset /path/to/data.jsonl --output-path /path/to/output/dir --model configs/models/path/to/config.yaml --attempts 4
```
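If you want to aggregate per-problem solve rates from the persisted results yourself, the computation is a simple group-by. A sketch, assuming a JSONL results file with `problem_id` and `correct` fields (the actual output schema may differ):

```python
import json
from collections import defaultdict

def solve_rates(results_path):
    """Map each problem id to its fraction of correct attempts."""
    stats = defaultdict(lambda: [0, 0])  # problem_id -> [correct, total]
    with open(results_path) as f:
        for line in f:
            rec = json.loads(line)
            stats[rec["problem_id"]][0] += int(rec["correct"])
            stats[rec["problem_id"]][1] += 1
    return {pid: c / n for pid, (c, n) in stats.items()}
```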
Key locations:
- scripts/: data pipeline and filter scripts
- src/: core library code for dataset processing and LLM utilities
- configs/: pipeline and model configuration files
- outputs/: generated datasets and intermediate artifacts
