A FiftyOne plugin for loading PDFs as images.
If you haven't already, install FiftyOne:
pip install fiftyone
Then install the plugin and its dependencies:
fiftyone plugins download https://github.com/brimoor/pdf-loader
brew install poppler
pip install pdf2image
- Launch the App:
import fiftyone as fo
dataset = fo.Dataset()
session = fo.launch_app(dataset)
- Press the ` key or click the Browse operations icon above the grid
- Run the pdf_loader operator
You can use the plugin programmatically from Python:
import fiftyone as fo
import fiftyone.operators as foo
import requests
import os
# Download a PDF from a URL (optional - you can use any local PDF)
url = "https://arxiv.org/pdf/2309.11419"
filename = url.split('/')[-1] + ".pdf" # Add .pdf extension
response = requests.get(url)
if response.status_code == 200:
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"Downloaded {filename}")
else:
    print(f"Failed to download {filename}. Status code: {response.status_code}")
# Load the PDF loader operator
pdf_loader = foo.get_operator("@brimoor/pdf-loader/pdf_loader")
# Create a dataset for the PDF pages
pdf_dataset = fo.Dataset("pdf_dataset")
# Convert PDF to images and add to dataset
pdf_loader(
    pdf_dataset,
    input_path="./2309.11419.pdf",  # Path to your PDF file
    output_dir="./pdf_images",  # Directory to save the images
    dpi=200,  # Image quality in DPI
    fmt="png",  # Image format (png or jpg)
    tags=None,  # Optional tags for samples
    delegate=False,  # Set to True for async execution
)
Install the PyTesseract OCR and Semantic Document Search plugins to make your documents searchable!
- Install the plugins and their dependencies:
fiftyone plugins download https://github.com/jacobmarks/pytesseract-ocr-plugin
pip install pytesseract
fiftyone plugins download https://github.com/jacobmarks/semantic-document-search-plugin
pip install qdrant_client
pip install sentence_transformers
- Launch a Qdrant server:
docker run -p "6333:6333" -p "6334:6334" -d qdrant/qdrant
- Run the run_ocr_engine operator to detect text blocks
- Run the create_semantic_document_index operator to generate a semantic index for the text blocks
- Run the semantically_search_documents operator to perform arbitrary searches against the index!
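The indexing and search steps above need the Qdrant server from the `docker run` command to be running first. As a quick sanity check, here is a minimal sketch that tests whether anything is listening on port 6333 (the port published in that command). The helper name `port_open` and the one-second timeout are illustrative assumptions, not part of either plugin. An open port only shows that some server is listening, not that it is a healthy Qdrant instance.

```python
import socket


def port_open(host="localhost", port=6333, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper for illustration; 6333 matches the port
    published by the `docker run` command in this README.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Qdrant port reachable:", port_open())
```

If this prints `False`, check that the container started (for example with `docker ps`) before running the indexing operator.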
This plugin is basically a wrapper around the following code:
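import os
import fiftyone as fo
from pdf2image import convert_from_path
INPUT_PATH = "/path/to/your.pdf"
OUTPUT_DIR = "/path/for/page/images"
os.makedirs(OUTPUT_DIR, exist_ok=True)
# Write one image per PDF page into OUTPUT_DIR
convert_from_path(INPUT_PATH, output_folder=OUTPUT_DIR, fmt="jpg")
# Add the page images to a dataset (any existing dataset works too)
dataset = fo.Dataset()
dataset.add_images_dir(OUTPUT_DIR)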
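pdf2image relies on Poppler's command-line tools, which is why `brew install poppler` is part of the setup. The sketch below wraps the conversion in a function that fails early with a readable message when `pdftoppm` is not on the PATH. The names `poppler_available` and `pdf_to_image_dir` are hypothetical helpers, not part of the plugin. The PATH check assumes a default installation; pdf2image also accepts a `poppler_path` argument, which this sketch does not handle.

```python
import os
import shutil


def poppler_available(which=shutil.which):
    """Return True if Poppler's pdftoppm binary is found on the PATH.

    `which` is injectable only to make the check easy to test.
    """
    return which("pdftoppm") is not None


def pdf_to_image_dir(input_path, output_dir, dpi=200, fmt="png"):
    """Convert each page of a PDF into an image file in output_dir."""
    if not poppler_available():
        raise RuntimeError(
            "pdftoppm not found on PATH; install Poppler first "
            "(for example `brew install poppler` on macOS)"
        )
    # Imported lazily so the Poppler check above runs even if
    # pdf2image itself is not installed yet
    from pdf2image import convert_from_path

    os.makedirs(output_dir, exist_ok=True)
    convert_from_path(input_path, dpi=dpi, output_folder=output_dir, fmt=fmt)
    return output_dir
```

The returned directory can then be passed to `dataset.add_images_dir(...)` on a FiftyOne dataset, as in the snippet above.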