datalab-to · VikParuchuri · Jan 29, 2025 · Jan 27, 2025 · Jan 27, 2025 · Jan 28, 2025
diff --git a/.github/workflows/benchmarks.yml b/.github/workflows/benchmarks.yml
@@ -37,4 +37,8 @@ jobs:
       - name: Run table recognition benchmark
         run: |
           poetry run python benchmark/table_recognition.py --max_rows 5
-          poetry run python benchmark/utils/verify_benchmark_scores.py results/benchmark/table_rec_bench/results.json --bench_type table_recognition
+          poetry run python benchmark/utils/verify_benchmark_scores.py results/benchmark/table_rec_bench/results.json --bench_type table_recognition
+      - name: Run texify benchmark
+        run: |
+          poetry run python benchmark/texify.py --max_rows 5
+          poetry run python benchmark/utils/verify_benchmark_scores.py results/benchmark/texify_bench/results.json --bench_type texify
diff --git a/.github/workflows/scripts.yml b/.github/workflows/scripts.yml
@@ -25,10 +25,16 @@ jobs:
       - name: Test detection
         run: poetry run surya_detect benchmark_data/pdfs/switch_trans.pdf --page_range 0
       - name: Test OCR
+        env:
+          RECOGNITION_MAX_TOKENS: 25
         run: poetry run surya_ocr benchmark_data/pdfs/switch_trans.pdf --page_range 0
       - name: Test layout
         run: poetry run surya_layout benchmark_data/pdfs/switch_trans.pdf --page_range 0
       - name: Test table
         run: poetry run surya_table benchmark_data/pdfs/switch_trans.pdf --page_range 0
+      - name: Test texify
+        env:
+          TEXIFY_MAX_TOKENS: 25
+        run: poetry run surya_latex_ocr benchmark_data/pdfs/switch_trans.pdf --page_range 0
       - name: Test detection folder
         run: poetry run surya_detect benchmark_data/pdfs --page_range 0
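The workflow steps above all pass `--page_range 0`. The README describes the flag as accepting a single number, a comma separated list, a range, or comma separated ranges (for example `0,5-10,20`). A minimal sketch of how such a spec could be expanded into page indices follows. `parse_page_range` is a hypothetical helper, not surya's own parser, and it assumes ranges are inclusive on both ends.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page indices.

    Hypothetical helper for illustration; assumes "5-10" includes both 5 and 10.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0"))          # [0]
print(parse_page_range("0,5-10,20"))  # [0, 5, 6, 7, 8, 9, 10, 20]
```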
diff --git a/README.md b/README.md
@@ -7,6 +7,7 @@ Surya is a document OCR toolkit that does:
 - Layout analysis (table, image, header, etc detection)
 - Reading order detection
 - Table recognition (detecting rows/columns)
+- LaTeX OCR
 
 It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmarks) for more details).
 
@@ -19,9 +20,9 @@ It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmar
 |:------------------------------------------------------------------:|:--------------------------------------------------------------------------:|
 | <img src="static/images/excerpt_layout.png" width="500px"/> | <img src="static/images/excerpt_reading.jpg" width="500px"/> |
 
-|                       Table Recognition                       |     |
-|:-------------------------------------------------------------:|:----------------:|
-| <img src="static/images/scanned_tablerec.png" width="500px"/> | <img width="500px"/> |
+|                       Table Recognition                       |                       LaTeX OCR                        |
+|:-------------------------------------------------------------:|:------------------------------------------------------:|
+| <img src="static/images/scanned_tablerec.png" width="500px"/> | <img src="static/images/latex_ocr.png" width="500px"/> |
 
 
 Surya is named for the [Hindu sun god](https://en.wikipedia.org/wiki/Surya), who has universal vision.
@@ -284,10 +285,48 @@ from surya.table_rec import TableRecPredictor
 image = Image.open(IMAGE_PATH)
 table_rec_predictor = TableRecPredictor()
 
-# list of dicts, one per image
 table_predictions = table_rec_predictor([image])
 ```
 
+## LaTeX OCR
+
+This command writes out a JSON file with the LaTeX of the equations.  The input images must already be cropped to the equations; one way to do this is to run the layout model first, then crop.
+
+```shell
+surya_latex_ocr DATA_PATH
+```
+
+- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
+- `--output_dir` specifies the directory to save results to instead of the default
+- `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
+
+The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions.  Each value will be a list of dictionaries, one per page of the input document.  Each page dictionary contains:
+
+- `text` - the detected LaTeX, in KaTeX-compatible syntax, with `<math display="block">...</math>` and `<math>...</math>` as delimiters.
+- `confidence` - the prediction confidence from 0-1.
+- `page` - the page number in the file
+
+### From python
+
+```python
+from PIL import Image
+from surya.texify import TexifyPredictor
+
+image = Image.open(IMAGE_PATH)
+predictor = TexifyPredictor()
+
+predictor([image])
+```
+
+### Interactive app
+
+You can also run an interactive app that lets you select equations and OCR them (similar to Mathpix Snip) with:
+
+```shell
+pip install streamlit==1.40 streamlit-drawable-canvas-jsretry
+texify_gui
+```
+
 # Limitations
 
 - This is specialized for document OCR.  It will likely not work on photos or other images.
@@ -413,6 +452,14 @@ Higher is better for intersection, which the percentage of the actual row/column
 
 The benchmark uses a subset of [Fintabnet](https://developer.ibm.com/exchanges/data/all/fintabnet/) from IBM.  It has labeled rows and columns.  After table recognition is run, the predicted rows and columns are compared to the ground truth.  There is an additional penalty for predicting too many or too few rows/columns.
 
+## LaTeX OCR
+
+| Method | edit ⬇   | time taken (s) ⬇ |
+|--------|----------|------------------|
+| texify | 0.122617 | 35.6345          |
+
+This runs texify on a set of images with ground truth LaTeX, then computes the average normalized edit distance between each prediction and its reference.  The metric is a bit noisy, since two LaTeX strings that render identically can use different symbols.
+
 ## Running your own benchmarks
 
 You can benchmark the performance of surya on your machine.  
@@ -482,6 +529,15 @@ python benchmark/table_recognition.py --max_rows 1024 --tatr
 - `--results_dir` will let you specify a directory to save results to instead of the default one
 - `--tatr` specifies whether to also run table transformer
 
+**LaTeX OCR**
+
+```shell
+python benchmark/texify.py --max_rows 128
+```
+
+- `--max_rows` controls how many images to process for the benchmark
+- `--results_dir` will let you specify a directory to save results to instead of the default one
+
 # Training
 
 Text detection was trained on 4x A6000s for 3 days.  It used a diverse set of images as training data.  It was trained from scratch using a modified efficientvit architecture for semantic segmentation.
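The README text added in this diff documents `results.json` as a dictionary keyed by input filename, with one dictionary per page holding `text`, `confidence`, and `page`. A minimal sketch of consuming that structure is shown here. The sample data, the `extract_equations` helper, and the `min_confidence` cutoff are all assumptions made for illustration; only the key names and the `<math>` delimiters come from the README.

```python
import re

# Matches both delimiters described in the README:
# <math>...</math> and <math display="block">...</math>
MATH_RE = re.compile(
    r'<math(?P<block> display="block")?>(?P<latex>.*?)</math>', re.DOTALL
)


def extract_equations(results, min_confidence=0.5):
    """Collect LaTeX snippets from a results.json-style dict (hypothetical helper)."""
    equations = []
    for name, pages in results.items():
        for page in pages:
            if page["confidence"] < min_confidence:
                continue  # skip low-confidence pages
            for match in MATH_RE.finditer(page["text"]):
                equations.append({
                    "file": name,
                    "page": page["page"],
                    "display": match.group("block") is not None,
                    "latex": match.group("latex").strip(),
                })
    return equations


# Made-up sample in the documented shape; a real file would be read with json.load.
sample = {
    "equations": [
        {"text": '<math display="block">E = mc^2</math>', "confidence": 0.97, "page": 0},
        {"text": "where <math>m</math> is mass", "confidence": 0.42, "page": 1},
    ]
}

for eq in extract_equations(sample):
    print(eq["file"], eq["page"], eq["display"], eq["latex"])
```

Lowering `min_confidence` to `0.0` would also return the inline `m` from the second page.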

diff --git a/benchmark/texify.py b/benchmark/texify.py
@@ -0,0 +1,95 @@
+import argparse
+import os.path
+import random
+import re
+import time
+from functools import partial
+from pathlib import Path
+from typing import List
+
+import click
+import datasets
+from tabulate import tabulate
+from bs4 import BeautifulSoup
+
+from surya.settings import settings
+from surya.texify import TexifyPredictor, TexifyResult
+import json
+import io
+from rapidfuzz.distance import Levenshtein
+
+def normalize_text(text):
+    soup = BeautifulSoup(text, "html.parser")
+    text = soup.get_text()
+    text = re.sub(r"\n", " ", text)
+    text = re.sub(r"\s+", " ", text)
+    return text.strip()
+
+
+def score_text(predictions, references):
+    lev_dist = []
+    for p, r in zip(predictions, references):
+        p = normalize_text(p)
+        r = normalize_text(r)
+        lev_dist.append(Levenshtein.normalized_distance(p, r))
+
+    return sum(lev_dist) / len(lev_dist)
+
+
+def inference_texify(source_data, predictor):
+    texify_predictions: List[TexifyResult] = predictor([sd["image"] for sd in source_data])
+    out_data = [
+        {"text": texify_predictions[i].text, "equation": source_data[i]["equation"]}
+        for i in range(len(texify_predictions))
+    ]
+
+    return out_data
+
+
+def image_to_bmp(image):
+    img_out = io.BytesIO()
+    image.save(img_out, format="BMP")
+    return img_out
+
+@click.command(help="Benchmark the performance of texify.")
+@click.option("--ds_name", type=str, help="Path to dataset file with source images/equations.", default=settings.TEXIFY_BENCHMARK_DATASET)
+@click.option("--results_dir", type=str, help="Path to directory to save benchmark results.", default=os.path.join(settings.RESULT_DIR, "benchmark"))
+@click.option("--max_rows", type=int, help="Maximum number of images to benchmark.", default=None)
+def main(ds_name: str, results_dir: str, max_rows: int):
+    predictor = TexifyPredictor()
+    ds = datasets.load_dataset(ds_name, split="train")
+
+    if max_rows:
+        ds = ds.filter(lambda x, idx: idx < max_rows, with_indices=True)
+
+    start = time.time()
+    predictions = inference_texify(ds, predictor)
+    time_taken = time.time() - start
+
+    text = [p["text"] for p in predictions]
+    references = [p["equation"] for p in predictions]
+    scores = score_text(text, references)
+
+    write_data = {
+        "scores": scores,
+        "text": [{"prediction": p, "reference": r} for p, r in zip(text, references)]
+    }
+
+    score_table = [
+        ["texify", write_data["scores"], time_taken]
+    ]
+    score_headers = ["edit", "time taken (s)"]
+    score_dirs = ["⬇", "⬇"]
+
+    score_headers = [f"{h} {d}" for h, d in zip(score_headers, score_dirs)]
+    print()
+    print(tabulate(score_table, headers=["Method", *score_headers]))
+
+    result_path = Path(results_dir) / "texify_bench"
+    result_path.mkdir(parents=True, exist_ok=True)
+    with open(result_path / "results.json", "w") as f:
+        json.dump(write_data, f, indent=4)
+
+
+if __name__ == "__main__":
+    main()
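The benchmark above scores predictions with `Levenshtein.normalized_distance` from rapidfuzz and averages the results. The sketch below reimplements the idea in plain Python to show why the metric is noisy for LaTeX. It is not rapidfuzz's implementation, and it assumes normalization by the length of the longer string with unit costs for insert, delete, and substitute.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# "x^2" and "x^{2}" render the same, yet they differ by two inserted braces.
pairs = [(r"\frac{a}{b}", r"\frac{a}{b}"), ("x^2", "x^{2}")]
distances = [normalized_distance(p, r) for p, r in pairs]
print(distances)                        # [0.0, 0.4]
print(sum(distances) / len(distances))  # 0.2
```

The second pair shows the noise mentioned in the README: a prediction that renders identically to its reference is still penalized.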
diff --git a/benchmark/utils/verify_benchmark_scores.py b/benchmark/utils/verify_benchmark_scores.py
@@ -37,6 +37,11 @@ def verify_table_rec(data):
     if row_score < 0.75 or col_score < 0.75:
         raise ValueError("Scores do not meet the required threshold")
 
+def verify_texify(data):
+    edit_dist = data["scores"]
+    if edit_dist > .2:
+        raise ValueError("Scores do not meet the required threshold")
+
 
 @click.command(help="Verify benchmark scores")
 @click.argument("file_path", type=str)
@@ -55,6 +60,8 @@ def main(file_path, bench_type):
         verify_order(data)
     elif bench_type == "table_recognition":
         verify_table_rec(data)
+    elif bench_type == "texify":
+        verify_texify(data)
     else:
         raise ValueError("Invalid benchmark type")
 

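To make the CI gate concrete, the sketch below writes a file in the same shape as the one `benchmark/texify.py` produces (a top-level `scores` float plus a `text` list) and applies the 0.2 cutoff from `verify_texify`. The helper names `write_results` and `passes_texify_check` are hypothetical, and the check returns a boolean rather than raising `ValueError` as the real script does.

```python
import json
import tempfile
from pathlib import Path

EDIT_THRESHOLD = 0.2  # same cutoff as verify_texify in this diff


def write_results(directory, score):
    """Write a results.json with the layout the benchmark uses (hypothetical helper)."""
    path = Path(directory) / "results.json"
    path.write_text(json.dumps({"scores": score, "text": []}, indent=4))
    return path


def passes_texify_check(results_file) -> bool:
    """Return True when the average edit distance is within the threshold."""
    data = json.loads(Path(results_file).read_text())
    return data["scores"] <= EDIT_THRESHOLD


with tempfile.TemporaryDirectory() as tmp:
    ok = passes_texify_check(write_results(tmp, 0.122617))  # score reported in the README
    bad = passes_texify_check(write_results(tmp, 0.25))     # made-up failing score

print(ok, bad)
```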
diff --git a/detect_layout.py b/detect_layout.py
@@ -1,4 +1,4 @@
-from surya.scripts import detect_layout_cli
+from surya.scripts.detect_layout import detect_layout_cli
 
 if __name__ == "__main__":
     detect_layout_cli()
diff --git a/detect_text.py b/detect_text.py
@@ -1,4 +1,4 @@
-from surya.scripts import detect_text_cli
+from surya.scripts.detect_text import detect_text_cli
 
 if __name__ == "__main__":
     detect_text_cli()

diff --git a/ocr_app.py b/ocr_app.py
@@ -1,4 +1,4 @@
-from surya.scripts import streamlit_app_cli
+from surya.scripts.run_streamlit_app import streamlit_app_cli
 
 if __name__ == "__main__":
     streamlit_app_cli()
diff --git a/ocr_latex.py b/ocr_latex.py
@@ -0,0 +1,4 @@
+from surya.scripts.ocr_latex import ocr_latex_cli
+
+if __name__ == "__main__":
+    ocr_latex_cli()
diff --git a/ocr_text.py b/ocr_text.py
@@ -1,4 +1,4 @@
-from surya.scripts import ocr_text_cli
+from surya.scripts.ocr_text import ocr_text_cli
 
 if __name__ == "__main__":
     ocr_text_cli()