Merged
6 changes: 5 additions & 1 deletion .github/workflows/benchmarks.yml
@@ -37,4 +37,8 @@ jobs:
      - name: Run table recognition benchmark
        run: |
          poetry run python benchmark/table_recognition.py --max_rows 5
          poetry run python benchmark/utils/verify_benchmark_scores.py results/benchmark/table_rec_bench/results.json --bench_type table_recognition
      - name: Run texify benchmark
        run: |
          poetry run python benchmark/texify.py --max_rows 5
          poetry run python benchmark/utils/verify_benchmark_scores.py results/benchmark/texify_bench/results.json --bench_type texify
6 changes: 6 additions & 0 deletions .github/workflows/scripts.yml
@@ -25,10 +25,16 @@ jobs:
      - name: Test detection
        run: poetry run surya_detect benchmark_data/pdfs/switch_trans.pdf --page_range 0
      - name: Test OCR
        env:
          RECOGNITION_MAX_TOKENS: 25
        run: poetry run surya_ocr benchmark_data/pdfs/switch_trans.pdf --page_range 0
      - name: Test layout
        run: poetry run surya_layout benchmark_data/pdfs/switch_trans.pdf --page_range 0
      - name: Test table
        run: poetry run surya_table benchmark_data/pdfs/switch_trans.pdf --page_range 0
      - name: Test texify
        env:
          TEXIFY_MAX_TOKENS: 25
        run: poetry run surya_latex_ocr benchmark_data/pdfs/switch_trans.pdf --page_range 0
      - name: Test detection folder
        run: poetry run surya_detect benchmark_data/pdfs --page_range 0
64 changes: 60 additions & 4 deletions README.md
@@ -7,6 +7,7 @@ Surya is a document OCR toolkit that does:
- Layout analysis (table, image, header, etc detection)
- Reading order detection
- Table recognition (detecting rows/columns)
- LaTeX OCR

It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmarks) for more details).

@@ -19,9 +20,9 @@ It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmar
|:------------------------------------------------------------------:|:--------------------------------------------------------------------------:|
| <img src="static/images/excerpt_layout.png" width="500px"/> | <img src="static/images/excerpt_reading.jpg" width="500px"/> |

| Table Recognition | |
|:-------------------------------------------------------------:|:----------------:|
| <img src="static/images/scanned_tablerec.png" width="500px"/> | <img width="500px"/> |
| Table Recognition | LaTeX OCR |
|:-------------------------------------------------------------:|:------------------------------------------------------:|
| <img src="static/images/scanned_tablerec.png" width="500px"/> | <img src="static/images/latex_ocr.png" width="500px"/> |


Surya is named for the [Hindu sun god](https://en.wikipedia.org/wiki/Surya), who has universal vision.
@@ -284,10 +285,48 @@ from surya.table_rec import TableRecPredictor
image = Image.open(IMAGE_PATH)
table_rec_predictor = TableRecPredictor()

# list of dicts, one per image
table_predictions = table_rec_predictor([image])
```

## LaTeX OCR

This command writes out a JSON file containing the LaTeX for each equation. The input images must already be cropped to the equations. One way to do this is to run the layout model first, then crop the detected equation regions.

```shell
surya_latex_ocr DATA_PATH
```

- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
- `--output_dir` specifies the directory to save results to instead of the default
- `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
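
To make the `--page_range` syntax concrete, here is a minimal sketch of how a string such as `0,5-10,20` could be expanded into page indices. The function name `parse_page_range` is hypothetical and is not surya's own parser. It also assumes that a range like `5-10` includes both endpoints, which the description above does not state.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a spec like "0,5-10,20" into a sorted list of unique page indices.

    Assumption (not confirmed by the README): ranges include both endpoints.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))
```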

The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

- `text` - the detected LaTeX text - it will be in KaTeX compatible LaTeX, with `<math display="block">...</math>` and `<math>...</math>` as delimiters.
- `confidence` - the prediction confidence from 0-1.
- `page` - the page number in the file
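
As a rough illustration of the structure described above, the sketch below flattens a `results.json`-shaped dictionary into one row per page. The helper names `load_results` and `summarize_results`, the sample filename key, and the sample values are all hypothetical. Only the `text`, `confidence`, and `page` keys come from the description above.

```python
import json
from typing import Dict, List, Tuple


def load_results(path: str) -> Dict[str, List[dict]]:
    """Load a results.json file (the path is supplied by the caller)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def summarize_results(results: Dict[str, List[dict]]) -> List[Tuple[str, int, float, str]]:
    """Return (filename, page, confidence, text) for every page entry."""
    rows = []
    for name, pages in results.items():
        for page in pages:
            rows.append((name, page["page"], page["confidence"], page["text"]))
    return rows


# Hypothetical sample in the documented shape, used here instead of a real file.
sample = {
    "equation_crop": [
        {
            "text": r'<math display="block">\frac{a}{b}</math>',
            "confidence": 0.97,
            "page": 0,
        }
    ]
}

for name, page, confidence, text in summarize_results(sample):
    print(f"{name} page {page} ({confidence:.2f}): {text}")
```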

### From python

```python
from PIL import Image
from surya.texify import TexifyPredictor

image = Image.open(IMAGE_PATH)
predictor = TexifyPredictor()

predictor([image])
```
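
The predicted text wraps each equation in `<math>` or `<math display="block">` delimiters, so a downstream step often needs to pull the LaTeX back out. The following is a minimal sketch of one way to do that with a regular expression. The helper `extract_math` is hypothetical and not part of surya, and the sample string is made up. With the predictor above, the input would be the `text` field of a result.

```python
import re
from typing import List, Tuple

# Matches <math>...</math> and <math display="block">...</math>
MATH_TAG = re.compile(r"<math(?P<attrs>[^>]*)>(?P<latex>.*?)</math>", re.DOTALL)


def extract_math(text: str) -> List[Tuple[bool, str]]:
    """Return (is_block, latex) for each <math> segment found in text."""
    segments = []
    for match in MATH_TAG.finditer(text):
        is_block = 'display="block"' in match.group("attrs")
        segments.append((is_block, match.group("latex").strip()))
    return segments


# Made-up example input in the documented delimiter format.
example = r'Energy is <math>E = mc^2</math> and <math display="block">\int_0^1 x\,dx</math>'
for is_block, latex in extract_math(example):
    print("block" if is_block else "inline", latex)
```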

### Interactive app

You can also run a special interactive app that lets you select equations and OCR them (kind of like MathPix snip) with:

```shell
pip install streamlit==1.40 streamlit-drawable-canvas-jsretry
texify_gui
```

# Limitations

- This is specialized for document OCR. It will likely not work on photos or other images.
@@ -413,6 +452,14 @@ Higher is better for intersection, which the percentage of the actual row/column

The benchmark uses a subset of [Fintabnet](https://developer.ibm.com/exchanges/data/all/fintabnet/) from IBM. It has labeled rows and columns. After table recognition is run, the predicted rows and columns are compared to the ground truth. There is an additional penalty for predicting too many or too few rows/columns.

## LaTeX OCR

| Method | edit ⬇ | time taken (s) ⬇ |
|--------|----------|------------------|
| texify | 0.122617 | 35.6345 |

This runs texify on a set of images with ground-truth LaTeX, then computes the normalized edit distance between each prediction and its reference. The metric is somewhat noisy, since two LaTeX strings that render identically can use different symbols.
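
To show why the metric is noisy, here is a small pure-Python sketch of a normalized edit distance. It is a stand-in for illustration only, not the library routine the benchmark script calls, and it assumes the distance is normalized by the length of the longer string. The two sample strings are made up: both typeset the same fraction, yet their distance is well above zero.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Two spellings of the same fraction that render identically.
prediction = r"\frac{a}{b}"
reference = r"{a \over b}"
print(normalized_distance(prediction, reference))
```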

## Running your own benchmarks

You can benchmark the performance of surya on your machine.
@@ -482,6 +529,15 @@ python benchmark/table_recognition.py --max_rows 1024 --tatr
- `--results_dir` will let you specify a directory to save results to instead of the default one
- `--tatr` specifies whether to also run table transformer

**LaTeX OCR**

```shell
python benchmark/texify.py --max_rows 128
```

- `--max_rows` controls how many images to process for the benchmark
- `--results_dir` will let you specify a directory to save results to instead of the default one

# Training

Text detection was trained on 4x A6000s for 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified efficientvit architecture for semantic segmentation.
95 changes: 95 additions & 0 deletions benchmark/texify.py
@@ -0,0 +1,95 @@
import argparse
import os.path
import random
import re
import time
from functools import partial
from pathlib import Path
from typing import List

import click
import datasets
from tabulate import tabulate
from bs4 import BeautifulSoup

from surya.settings import settings
from surya.texify import TexifyPredictor, TexifyResult
import json
import io
from rapidfuzz.distance import Levenshtein

def normalize_text(text):
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()
    text = re.sub(r"\n", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def score_text(predictions, references):
    lev_dist = []
    for p, r in zip(predictions, references):
        p = normalize_text(p)
        r = normalize_text(r)
        lev_dist.append(Levenshtein.normalized_distance(p, r))

    return sum(lev_dist) / len(lev_dist)


def inference_texify(source_data, predictor):
    texify_predictions: List[TexifyResult] = predictor([sd["image"] for sd in source_data])
    out_data = [
        {"text": texify_predictions[i].text, "equation": source_data[i]["equation"]}
        for i in range(len(texify_predictions))
    ]

    return out_data


def image_to_bmp(image):
    img_out = io.BytesIO()
    image.save(img_out, format="BMP")
    return img_out

@click.command(help="Benchmark the performance of texify.")
@click.option("--ds_name", type=str, help="Path to dataset file with source images/equations.", default=settings.TEXIFY_BENCHMARK_DATASET)
@click.option("--results_dir", type=str, help="Directory to save benchmark results to.", default=os.path.join(settings.RESULT_DIR, "benchmark"))
@click.option("--max_rows", type=int, help="Maximum number of images to benchmark.", default=None)
def main(ds_name: str, results_dir: str, max_rows: int):
    predictor = TexifyPredictor()
    ds = datasets.load_dataset(ds_name, split="train")

    if max_rows:
        ds = ds.filter(lambda x, idx: idx < max_rows, with_indices=True)

    start = time.time()
    predictions = inference_texify(ds, predictor)
    time_taken = time.time() - start

    text = [p["text"] for p in predictions]
    references = [p["equation"] for p in predictions]
    scores = score_text(text, references)

    write_data = {
        "scores": scores,
        "text": [{"prediction": p, "reference": r} for p, r in zip(text, references)]
    }

    score_table = [
        ["texify", write_data["scores"], time_taken]
    ]
    score_headers = ["edit", "time taken (s)"]
    score_dirs = ["⬇", "⬇"]

    score_headers = [f"{h} {d}" for h, d in zip(score_headers, score_dirs)]
    print()
    print(tabulate(score_table, headers=["Method", *score_headers]))

    result_path = Path(results_dir) / "texify_bench"
    result_path.mkdir(parents=True, exist_ok=True)
    with open(result_path / "results.json", "w") as f:
        json.dump(write_data, f, indent=4)


if __name__ == "__main__":
    main()
7 changes: 7 additions & 0 deletions benchmark/utils/verify_benchmark_scores.py
@@ -37,6 +37,11 @@ def verify_table_rec(data):
    if row_score < 0.75 or col_score < 0.75:
        raise ValueError("Scores do not meet the required threshold")

def verify_texify(data):
    edit_dist = data["scores"]
    if edit_dist > .2:
        raise ValueError("Scores do not meet the required threshold")


@click.command(help="Verify benchmark scores")
@click.argument("file_path", type=str)
@@ -55,6 +60,8 @@ def main(file_path, bench_type):
        verify_order(data)
    elif bench_type == "table_recognition":
        verify_table_rec(data)
    elif bench_type == "texify":
        verify_texify(data)
    else:
        raise ValueError("Invalid benchmark type")

2 changes: 1 addition & 1 deletion detect_layout.py
@@ -1,4 +1,4 @@
from surya.scripts import detect_layout_cli
from surya.scripts.detect_layout import detect_layout_cli

if __name__ == "__main__":
    detect_layout_cli()
2 changes: 1 addition & 1 deletion detect_text.py
@@ -1,4 +1,4 @@
from surya.scripts import detect_text_cli
from surya.scripts.detect_text import detect_text_cli

if __name__ == "__main__":
    detect_text_cli()
2 changes: 1 addition & 1 deletion ocr_app.py
@@ -1,4 +1,4 @@
from surya.scripts import streamlit_app_cli
from surya.scripts.run_streamlit_app import streamlit_app_cli

if __name__ == "__main__":
    streamlit_app_cli()
4 changes: 4 additions & 0 deletions ocr_latex.py
@@ -0,0 +1,4 @@
from surya.scripts.ocr_latex import ocr_latex_cli

if __name__ == "__main__":
    ocr_latex_cli()
2 changes: 1 addition & 1 deletion ocr_text.py
@@ -1,4 +1,4 @@
from surya.scripts import ocr_text_cli
from surya.scripts.ocr_text import ocr_text_cli

if __name__ == "__main__":
    ocr_text_cli()