A machine learning solution for the Kaggle AI Mathematical Olympiad (AIMO) competition, focused on solving complex mathematical problems using AI.
This repository contains a submission for the AIMO Progress Prize competition on Kaggle. The solution implements mathematical problem-solving capabilities with LaTeX processing and symbolic computation.
- Python 3.8+
- macOS, Linux, or Windows
- Clone this repository:

  ```shell
  git clone <repository-url>
  cd ai-mathematical-olympiad-progress-prize-3
  ```

- Create a virtual environment:

  ```shell
  python3 -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- For Mac M-series (Metal GPU acceleration), install llama-cpp-python with Metal support:

  ```shell
  CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
  ```

- Create a `.env` file with your HuggingFace token (optional, for gated models):

  ```shell
  echo "HF_TOKEN=your_token_here" > .env
  ```

The project supports three backends: GGUF (quantized), Transformers (full precision), and Remote (HTTP inference).
This is the default configuration. It uses ~4.4 GB of RAM and runs on Mac M-series via Metal.

In `src/config.py`:

```python
MODEL_NAME = "bartowski/NuminaMath-7B-TIR-GGUF"
MODEL_FILE = "NuminaMath-7B-TIR-Q4_K_M.gguf"
MODEL_BACKEND = "gguf"
```

For systems with a dedicated GPU and sufficient VRAM.

In `src/config.py`:

```python
MODEL_NAME = "AI-MO/NuminaMath-7B-TIR"
MODEL_BACKEND = "transformers"
```

Use this if you host the model remotely (e.g., on Scaleway) and expose an OpenAI-compatible API.
Set environment variables (recommended via `.env`):

```shell
MODEL_BACKEND=remote
REMOTE_BASE_URL=https://<your-host>/v1
REMOTE_MODEL=<served-model-name>
REMOTE_API_KEY=<optional>
# Optional: defaults shown
REMOTE_API_STYLE=chat  # or completions
REMOTE_TIMEOUT=120
REMOTE_VERIFY_SSL=true
```

Notes:

- For `REMOTE_API_STYLE=chat`, the code calls `POST {REMOTE_BASE_URL}/chat/completions`.
- For `REMOTE_API_STYLE=completions`, it calls `POST {REMOTE_BASE_URL}/completions`.
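The mapping between the two API styles can be sketched with a small helper. `build_remote_request` is an illustrative name, not part of `src/`; it only shows how the `REMOTE_*` variables translate into an OpenAI-compatible request:

```python
import os


def build_remote_request(prompt, style=None):
    """Build the URL and JSON payload for the remote backend.

    Hypothetical helper: mirrors the REMOTE_* environment variables
    described above, with illustrative defaults for local testing.
    """
    base_url = os.environ.get("REMOTE_BASE_URL", "http://localhost:8000/v1")
    model = os.environ.get("REMOTE_MODEL", "numina-math-7b-tir")
    style = style or os.environ.get("REMOTE_API_STYLE", "chat")

    if style == "chat":
        url = f"{base_url}/chat/completions"
        payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    else:  # completions
        url = f"{base_url}/completions"
        payload = {"model": model, "prompt": prompt}
    return url, payload
```

The returned URL and payload can then be sent with any HTTP client (e.g. `requests.post`), passing `REMOTE_API_KEY` as a bearer token if the host requires one.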
The project uses Python logging for progress/debug output.

- Logs print to stdout.
- Logs also write to `logs/app.log` by default.

Environment variables:

```shell
LOG_LEVEL=INFO
LOG_TO_FILE=true
LOG_DIR=logs
LOG_FILE=app.log
```

When running local evaluation with `save_traces=True`, each per-problem trace JSON also includes an `info_logs` field containing the INFO logs captured during that solve.
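The env-driven setup can be sketched as below. This is a hypothetical reconstruction, not the actual contents of `src/logger.py`, which may differ in naming and handler formatting:

```python
import logging
import os
import sys
from pathlib import Path


def setup_logger(name="aimo", level=None):
    """Configure a logger from the LOG_* environment variables.

    Illustrative sketch: stdout handler always, plus a file handler
    under LOG_DIR/LOG_FILE when LOG_TO_FILE is true.
    """
    level = (level or os.environ.get("LOG_LEVEL", "INFO")).upper()
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on re-import
        logger.addHandler(logging.StreamHandler(sys.stdout))
        if os.environ.get("LOG_TO_FILE", "true").lower() == "true":
            log_dir = Path(os.environ.get("LOG_DIR", "logs"))
            log_dir.mkdir(parents=True, exist_ok=True)
            log_file = os.environ.get("LOG_FILE", "app.log")
            logger.addHandler(logging.FileHandler(log_dir / log_file))
    return logger
```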
When running local evaluation, correct solves are also exported as JSONL under `training_data/correct_attempts.jsonl`.

Each JSONL line contains the problem text, the expected and predicted answers, and a subset of attempts that produced the correct final answer (preferably code-backed or boxed), for later fine-tuning.

Notes:

- This export can be enabled even when `save_traces=False` (it will still run the trace-enabled solver to collect attempt texts, but won't write per-problem trace JSON files).
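A few lines of Python suffice to load the export back for fine-tuning. The field names in the toy record below (`problem`, `expected_answer`, `attempts`) are illustrative; the actual JSONL schema is defined by the evaluation code:

```python
import json
from pathlib import Path


def load_correct_attempts(path="training_data/correct_attempts.jsonl"):
    """Yield one dict per exported solve from the JSONL file."""
    with Path(path).open() as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```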
- numpy - Numerical computing
- pandas - Data manipulation and analysis
- sympy - Symbolic mathematics
- scikit-learn - Machine learning utilities
- matplotlib - Data visualization
- jupyter - Interactive notebooks
- requests - HTTP library for API calls
- transformers - HuggingFace model loading
- llama-cpp-python - GGUF quantized model support
- huggingface_hub - Model downloading
```
.
├── app.py                        # Main entry point
├── requirements.txt              # Python dependencies
├── reference.csv                 # Reference data for problems
├── test.csv                      # Test dataset
├── sample_submission.csv         # Sample submission format
├── src/                          # Source code modules
│   ├── __init__.py               # Package exports
│   ├── config.py                 # Configuration settings
│   ├── model.py                  # Model loading (GGUF & Transformers)
│   ├── solver.py                 # Core problem solver
│   ├── prompt.py                 # Prompt templates
│   ├── answer.py                 # Answer extraction utilities
│   ├── voting.py                 # Majority voting for self-consistency
│   ├── code_executor.py          # Python code execution (TIR)
│   ├── latex.py                  # LaTeX processing
│   ├── data.py                   # Data loading utilities
│   ├── logger.py                 # Logging configuration
│   ├── validation.py             # Answer validation and constraints
│   ├── classification.py         # Problem type classification
│   └── evaluation.py             # Local testing utilities
├── kaggle_evaluation/            # Kaggle evaluation module
│   ├── aimo_3_gateway.py
│   ├── aimo_3_inference_server.py
│   └── core/                     # Core evaluation logic
└── README.md                     # This file
```
Main solver with multi-attempt weighted voting:
- `solve_math_problem(problem)` - Solves a problem with classification, validation, and weighted voting
- `solve_single_attempt(problem)` - Single solution attempt with code execution and metadata
Model loading supporting two backends:
- GGUF - Quantized models via llama-cpp-python (recommended for 16GB RAM)
- Transformers - Full precision HuggingFace models
Tool-Integrated Reasoning (TIR) support:
- Extracts Python code blocks from model output
- Safely executes code with SymPy for verification
- Returns computed answers from code execution
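The extraction step can be sketched as follows. `extract_code_blocks` is an illustrative name, not the actual API of `src/code_executor.py`:

```python
import re

FENCE = "`" * 3  # a markdown code fence, built up to render cleanly here

# Match fenced blocks, optionally tagged "python", capturing the body
CODE_BLOCK_RE = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(model_output):
    """Return the contents of fenced code blocks in the model output."""
    return [block.strip() for block in CODE_BLOCK_RE.findall(model_output)]
```

Each extracted block would then be run in a restricted subprocess with SymPy available and a timeout, and its printed output parsed as a candidate answer.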
Prompt templates optimized for math olympiad problems:

- `build_prompt_tir()` - Optimized for NuminaMath-TIR models with optional hints
- `build_prompt_classified()` - Type-specific prompts based on classification
- `build_prompt()` - Standard math solving prompt
- `build_prompt_with_cot()` - Chain-of-thought prompt
Self-consistency via majority and weighted voting:

- `majority_vote()` - Simple majority voting
- `weighted_majority_vote()` - Confidence-based weighted voting:
  - Code execution success (+0.5 weight)
  - Code and text answer agreement (+0.3)
  - Greedy decoding first attempt (+0.2)
  - Boxed answer found (+0.2)
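Under that scheme, a weighted vote might look like the sketch below, assuming a base weight of 1.0 per attempt and illustrative metadata keys (the real `src/voting.py` may use different names):

```python
from collections import defaultdict


def weighted_majority_vote(attempts):
    """Pick the answer with the highest total confidence weight.

    attempts: iterable of (answer, metadata) pairs; metadata flags and
    the base weight of 1.0 are assumptions for illustration.
    """
    scores = defaultdict(float)
    for answer, meta in attempts:
        weight = 1.0  # base vote per attempt (assumed)
        if meta.get("code_success"):
            weight += 0.5
        if meta.get("code_text_agree"):
            weight += 0.3
        if meta.get("greedy_first"):
            weight += 0.2
        if meta.get("boxed"):
            weight += 0.2
        scores[answer] += weight
    return max(scores, key=scores.get)
```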
Problem type classification for specialized strategies:
- Detects: number_theory, algebra, combinatorics, geometry, sequence, probability
- `classify_problem()` - Returns the problem type
- `get_problem_hint()` - Returns specialized hints for prompts
- `get_difficulty_estimate()` - Estimates problem complexity
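A keyword-based classifier along these lines is one plausible implementation; the keyword lists and the fallback to `algebra` below are assumptions, not the actual contents of `src/classification.py`:

```python
TYPE_KEYWORDS = {
    # Illustrative keyword lists, not the project's real ones
    "number_theory": ["divisible", "prime", "modulo", "remainder", "integer"],
    "geometry": ["triangle", "circle", "angle", "perimeter"],
    "combinatorics": ["ways", "arrange", "choose", "permutation"],
    "probability": ["probability", "random", "dice"],
    "sequence": ["sequence", "term", "recurrence"],
}


def classify_problem(problem):
    """Return the type whose keywords match most often; default to algebra."""
    text = problem.lower()
    best, best_hits = "algebra", 0
    for ptype, words in TYPE_KEYWORDS.items():
        hits = sum(1 for w in words if w in text)
        if hits > best_hits:
            best, best_hits = ptype, hits
    return best
```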
Answer validation and constraint checking:
- `validate_and_fix_answer()` - Validates and fixes answers
- `extract_constraints()` - Extracts modulo/range constraints from problems
- `check_answer_agreement()` - Measures agreement across attempts
Answer extraction with multiple strategies:
- `\boxed{}` notation (highest priority)
- "Final answer" statements
- Last integer fallback
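The first and last strategies can be sketched with two small regex helpers (illustrative, not the actual `src/answer.py` API):

```python
import re


def extract_boxed_answer(text):
    """Return the content of the last \\boxed{...} in the text, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def extract_last_integer(text):
    """Fallback: return the last integer mentioned in the text, or None."""
    nums = re.findall(r"-?\d+", text)
    return int(nums[-1]) if nums else None
```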
LaTeX processing utilities:
- `clean_latex(text)` - Normalizes mathematical expressions
- reference.csv - Reference problems and solutions
- test.csv - Test dataset for evaluation
- sample_submission.csv - Template for submission format
The AI Mathematical Olympiad (AIMO) challenges AI systems to solve difficult mathematical problems across multiple domains, including:
- Algebra
- Geometry
- Number Theory
- Combinatorics
- Analysis
- Data Preparation - Load and preprocess problem data from CSV files
- LaTeX Processing - Clean and normalize mathematical expressions
- Problem Solving - Apply mathematical algorithms and ML models
- Evaluation - Validate solutions against test cases
- Submission - Generate formatted submission file
```python
from src import solve_math_problem, clean_latex

# Solve a math problem
problem = r"Find the sum of all positive integers $n$ such that $n^2 + 12n - 2007$ is a perfect square."
answer = solve_math_problem(problem)
print(f"Answer: {answer}")

# Clean LaTeX expressions
equation = r"$$\begin{equation}x^2 + y^2 = z^2\end{equation}$$"
cleaned = clean_latex(equation)
print(cleaned)  # Output: $$x^2 + y^2 = z^2$$
```

- Problem Input - Receives the math problem in LaTeX format
- Problem Classification - Automatically classifies into type (algebra, number theory, etc.)
- Model Loading - Uses cached model for efficiency across multiple problems
- Prompt Building - Creates optimized prompt with type-specific hints
- Multi-Attempt Generation - Generates 3 solutions (configurable)
  - First attempt: greedy decoding (deterministic, +0.2 weight)
  - Subsequent attempts: temperature sampling (diverse)
- Code Execution - If the model outputs Python code, executes it with SymPy (+0.5 weight)
  - Enhanced retry: automatically fixes common errors and retries
- Answer Extraction - Extracts the answer from `\boxed{}` (+0.2 weight) or code output
- Answer Validation - Validates against problem constraints (modulo, range)
- Iterative Refinement - If agreement < 50%, adds refinement attempts with feedback
- Weighted Voting - Returns highest-confidence answer using metadata weights
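The refinement trigger in the workflow above depends on an agreement score across attempts; a minimal version (illustrative, not the actual `check_answer_agreement` implementation) is:

```python
from collections import Counter


def answer_agreement(answers):
    """Fraction of attempts that agree with the most common answer."""
    if not answers:
        return 0.0
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)
```

If this score falls below `REFINEMENT_THRESHOLD` (0.5 by default), the solver adds further attempts that include feedback from the earlier ones.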
Test the solver on validation problems:

```shell
python app.py
```

This runs `test_locally(num_samples=5)` by default, which:

- Loads problems from `reference.csv`
- Splits them into train/validation sets
- Tests on 5 validation problems
- Reports accuracy

For the actual competition submission, modify `app.py`:

```python
if __name__ == "__main__":
    main()  # Instead of test_locally()
```

Then run:

```shell
python app.py
```

Edit `src/config.py` to tune the solver:
```python
# Number of solution attempts per problem (majority voting)
NUM_ATTEMPTS = 3

# Temperature for diverse sampling (0 = greedy, >0 = sampling)
TEMPERATURE = 0.7

# Maximum tokens to generate
MAX_NEW_TOKENS = 2048

# Enable/disable Python code execution
ENABLE_CODE_EXECUTION = True

# Code execution timeout (seconds)
CODE_TIMEOUT = 10

# Enhanced code retry with automatic error fixing
ENABLE_CODE_RETRY = True
CODE_MAX_RETRIES = 2

# Iterative refinement when answers disagree
ENABLE_ITERATIVE_REFINEMENT = True
REFINEMENT_THRESHOLD = 0.5  # Trigger if agreement below 50%
MAX_REFINEMENT_ATTEMPTS = 2
```

The model is automatically cached after the first problem to improve performance when solving multiple problems:
```python
from src import get_cached_model_and_tokenizer, clear_model_cache

# Get cached model (loads once, reuses thereafter)
model, tokenizer = get_cached_model_and_tokenizer()

# Clear cache if needed (e.g., to free memory)
clear_model_cache()
```

For questions or issues regarding this solution, please refer to the competition guidelines on Kaggle.
This project is submitted as part of the Kaggle AIMO Progress Prize competition.
Last Updated: December 2025