PDF Loader Plugin

A FiftyOne plugin for loading PDFs as images.

Installation

If you haven't already, install FiftyOne:

pip install fiftyone

Then install the plugin and its dependencies:

fiftyone plugins download https://github.com/brimoor/pdf-loader

brew install poppler
pip install pdf2image
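
To confirm that the dependencies above are visible to Python before running the plugin, a small check like the following may help. This is only a sketch: the function name `check_pdf_dependencies` is hypothetical, and it assumes that pdf2image finds Poppler through the `pdftoppm` executable on your `PATH`.

```python
import importlib.util
import shutil


def check_pdf_dependencies() -> dict:
    """Report whether the tools that PDF conversion relies on can be found."""
    return {
        # pdf2image shells out to Poppler's pdftoppm binary, so it should be on PATH
        "pdftoppm": shutil.which("pdftoppm") is not None,
        # find_spec locates the package without importing it
        "pdf2image": importlib.util.find_spec("pdf2image") is not None,
    }


for name, found in check_pdf_dependencies().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If `pdftoppm` is reported missing on a non-macOS system, Poppler is usually available from the system package manager instead of `brew` (for example, as `poppler-utils` on Debian or Ubuntu).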

Usage

Using the App UI

Launch the App:

import fiftyone as fo

dataset = fo.Dataset()
session = fo.launch_app(dataset)

Press ` or click the Browse operations icon above the grid
Run the pdf_loader operator

Using the SDK

You can use the plugin programmatically from Python:

import fiftyone as fo
import fiftyone.operators as foo

import requests

# Download a PDF from a URL (optional - you can use any local PDF)
url = "https://arxiv.org/pdf/2309.11419"
filename = url.split('/')[-1] + ".pdf"  # Add .pdf extension

response = requests.get(url)

if response.status_code == 200:
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"Downloaded {filename}")
else:
    print(f"Failed to download {filename}. Status code: {response.status_code}")

# Load the PDF loader operator
pdf_loader = foo.get_operator("@brimoor/pdf-loader/pdf_loader")

# Create a dataset for the PDF pages
pdf_dataset = fo.Dataset("pdf_dataset")

# Convert PDF to images and add to dataset
pdf_loader(
    pdf_dataset,
    input_path="./2309.11419.pdf",  # Path to your PDF file
    output_dir="./pdf_images",     # Directory to save the images
    dpi=200,                        # Image quality in DPI
    fmt="png",                      # Image format (png or jpg)
    tags=None,                      # Optional tags for samples
    delegate=False                  # Set to True for async execution
)
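
The download step above appends `.pdf` to the last URL segment unconditionally, which would produce `report.pdf.pdf` for a URL that already ends in `.pdf`. A slightly more defensive way to derive the local filename might look like this; `pdf_filename_from_url` is a hypothetical helper, not part of the plugin.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def pdf_filename_from_url(url: str) -> str:
    """Derive a local filename ending in .pdf from a URL (hypothetical helper)."""
    # Use only the path component so query strings and fragments are ignored
    name = PurePosixPath(urlparse(url).path).name or "download"
    # Append the extension only when it is missing
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return name


print(pdf_filename_from_url("https://arxiv.org/pdf/2309.11419"))
```

The returned name can then be used both when writing the downloaded file and as the `input_path` passed to the operator, so the two stay in sync.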

What next?

Install the PyTesseract OCR and Semantic Document Search plugins to make your documents searchable!

Install the plugins and their dependencies:

fiftyone plugins download https://github.com/jacobmarks/pytesseract-ocr-plugin
pip install pytesseract

fiftyone plugins download https://github.com/jacobmarks/semantic-document-search-plugin
pip install qdrant_client
pip install sentence_transformers

Launch a Qdrant server:

docker run -p "6333:6333" -p "6334:6334" -d qdrant/qdrant
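
Before building the index, it can be useful to verify that the Qdrant container is accepting connections. The sketch below only tests that a TCP connection to port 6333 (the first port published by the `docker run` command above) succeeds; it does not confirm that the service behind the port is Qdrant. The helper name `port_is_open` is an assumption made for illustration.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 6333, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("Qdrant port reachable:", port_is_open())
```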

Run the run_ocr_engine operator to detect text blocks
Run the create_semantic_document_index operator to generate a semantic index for the text blocks
Run the semantically_search_documents operator to perform arbitrary searches against the index!

Implementation

This plugin is basically a wrapper around the following code:

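import os

import fiftyone as fo
from pdf2image import convert_from_path

INPUT_PATH = "/path/to/your.pdf"
OUTPUT_DIR = "/path/for/page/images"

# Render each page of the PDF to an image file in OUTPUT_DIR
os.makedirs(OUTPUT_DIR, exist_ok=True)
convert_from_path(INPUT_PATH, output_folder=OUTPUT_DIR, fmt="jpg")

# Add the page images to a FiftyOne dataset
dataset = fo.Dataset()
dataset.add_images_dir(OUTPUT_DIR)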
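
As a rough illustration of how the operator's parameters from the SDK example (`dpi`, `fmt`, `tags`) could map onto the wrapper code above, here is a minimal sketch. It is not the plugin's actual implementation: the function name `pdf_pages_to_dataset` and the parameter mapping are assumptions, and the imports are deferred so the function can be defined even when pdf2image is not installed.

```python
import os


def pdf_pages_to_dataset(dataset, input_path, output_dir, dpi=200, fmt="png", tags=None):
    """Render each PDF page to an image and add the images to `dataset`.

    Hypothetical helper; `dataset` is expected to be a FiftyOne Dataset.
    """
    # Validate the format first, before touching the filesystem
    if fmt not in ("png", "jpg"):
        raise ValueError(f"Unsupported format {fmt!r}; expected 'png' or 'jpg'")

    # Imported lazily so this module loads even if pdf2image is missing
    from pdf2image import convert_from_path

    # Write one image file per page into output_dir
    os.makedirs(output_dir, exist_ok=True)
    convert_from_path(input_path, dpi=dpi, output_folder=output_dir, fmt=fmt)

    # Add every image in output_dir as a sample, applying the optional tags
    dataset.add_images_dir(output_dir, tags=tags)
    return dataset
```

Note that `add_images_dir` adds every image found in `output_dir`, so a fresh directory per PDF avoids mixing pages from different documents.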
