- Python 3.12. Install with
pip install -e .(orpip install .). - Dependencies:
openai,tiktoken,python-dotenv,pymupdf,pymupdf4llm,pymupdf-layout,python-docx,beautifulsoup4,lxml. - Optional:
pip install openaireview[mistral]for Mistral OCR,pip install openaireview[deepseek]for DeepSeek OCR. - Dev dependencies (for benchmarks):
pytest. Install withpip install -e ".[dev]". - API key and model overrides in
.env(see.env.example).
__init__.py—__version__ = "0.1.0".cli.py— CLI entry point (openaireview review <file>,openaireview serve).parsers.py— Document parsers (PDF, DOCX, TEX, TXT/MD) returning(title, full_text).serve.py— Local HTTP server for review visualization.models.py—Comment(withparagraph_index) andReviewResultdataclasses.client.py— Multi-provider LLM client (OpenRouter, OpenAI, Anthropic, Gemini, Mistral). Auto-detects from API keys or explicit--provider/REVIEW_PROVIDER.utils.py—count_tokens,chunk_text,parse_comments_from_response,locate_comment_in_document,assign_paragraph_indices.method_zero_shot.py— single-prompt review; chunks paper if > 100K tokens.method_local.py— deep-checks each chunk with surrounding window context.method_progressive.py— sequential processing with running summary + consolidation.evaluate.py— recall/precision/F1 via fuzzy quote similarity, location-based recall, cost estimation.viz/index.html— Visualization UI (served byserve.py).
openaireview review paper.pdf # default: progressive, auto-detect provider
openaireview review paper.pdf --provider anthropic # explicit provider
openaireview review paper.pdf --method zero_shot
openaireview serve --results-dir ./review_results --port 8080Five providers supported, all via OpenAI-compatible endpoints (no extra SDKs needed for chat):
| Provider | Env var | Base URL | Model prefix stripped |
|---|---|---|---|
| OpenRouter | OPENROUTER_API_KEY |
openrouter.ai/api/v1 |
(none — passes model as-is) |
| OpenAI | OPENAI_API_KEY |
(default) | (none) |
| Anthropic | ANTHROPIC_API_KEY |
api.anthropic.com/v1/ |
anthropic/ |
| Gemini | GEMINI_API_KEY |
generativelanguage.googleapis.com/v1beta/openai/ |
google/ |
| Mistral | MISTRAL_API_KEY |
api.mistral.ai/v1 |
mistralai/ |
Resolution order (4 tiers):
--providerCLI flag (explicit)REVIEW_PROVIDERenv var (persistent config in.env)- Model-prefix match:
anthropic/claude-*→ Anthropic API if key available - Fallback: first available key in priority order (openrouter > openai > anthropic > gemini > mistral)
Reasoning tokens are provider-specific:
- OpenRouter:
extra_body.reasoning.max_tokens - Anthropic:
extra_body.thinking(budget_tokens) - OpenAI:
reasoning_effortstring - Gemini:
extra_body.thinking(budget_tokens) - Mistral: not supported
Empty response handling: if reasoning consumes all output tokens, auto-retries up to 3x with doubled max_tokens.
OCR is separate: Mistral OCR uses MISTRAL_API_KEY via mistral-ocr-cli (not through client.py). Controlled by --ocr flag, not --provider.
SKILL.md— Skill definition for/openaireviewcommand in Claude Code. Runs a multi-agent pipeline with section-level sub-agents and cross-cutting agents, producing severity-tiered findings (major/moderate/minor).scripts/prepare_workspace.py— Parses paper, splits into sections, writes workspace files.scripts/consolidate_comments.py— Merges sub-agent comment JSONs, deduplicates, and assigns severity tiers.scripts/save_viz_json.py— Builds viz-compatible JSON foropenaireview serve.references/criteria.md— Review criteria copied into workspace for sub-agents.references/subagent_templates.md— Prompt templates for sub-agents.- Install with
openaireview install-skill, then use/openaireview <path-or-url>in Claude Code.
benchmarks/data/— Raw HTML pages from refine.ink + parsedbenchmark.jsonl.benchmarks/scripts/— Parsing, experiment, and evaluation scripts.benchmarks/run_benchmark.py— Benchmark runner.benchmarks/REPORT.md— Experiment results.benchmarks/viz_data/— Pre-generated viz data for benchmark papers.- 52 ground-truth comments across 4 papers (43 technical, 9 logical).
- Mathematical / formula errors (9.6%): Wrong formulas, sign errors, missing factors, incorrect derivations.
- Notation inconsistencies (11.5%): Symbols used differently than defined, undefined notation introduced without explanation.
- Logical / conceptual inconsistencies (7.7%): Statements that contradict each other or the paper's own framework.
- Incorrect cross-references (only when numbered references exist): E.g. theorem/equation number mismatches.
- Insufficient justification / missing proof steps (15.4%): Compressed or hand-waved proof steps.
- Parameter / numerical inconsistencies (9.6%): Stated values contradict what can be derived elsewhere.
- Inconsistency between text and formal definitions (7.7%): Prose disagrees with math.
- Questionable claims / overclaiming (9.6%): Misleadingly strong or imprecise statements.
- Ambiguity / unclear exposition (7.7%): Could mislead a careful reader.
- Sign / direction errors (5.8%): Wrong sign conventions or reversed inequality directions.
- Subscript / index errors (3.8%): Wrong subscript in a specific term.
- Recall is low across all methods (4-6%) using similarity-based matching.
- The main issue is the matching metric: models find the correct passages but paraphrase them.
- RAG generates more comments but at lower precision; the filtering step is too permissive.
- Defs-scoped variants (rag_nofilter_defs): avg_loc_recall=0.58, avg_llm_recall=0.19, avg_wide_recall=0.37.