brimoor/pdf-loader

PDF Loader Plugin

A FiftyOne plugin for loading PDFs as images.

Installation

If you haven't already, install FiftyOne:

pip install fiftyone

Then install the plugin and its dependencies:

fiftyone plugins download https://github.com/brimoor/pdf-loader

brew install poppler
pip install pdf2image
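
If the conversion later fails, a missing Poppler install is a common cause. The sketch below is not part of the plugin; it only checks whether the pieces installed above can be found. It assumes that `pdf2image` relies on Poppler's `pdftoppm` and `pdfinfo` command-line tools being on your `PATH`, and the helper name `check_pdf_dependencies` is hypothetical.

```python
import importlib.util
import shutil


def check_pdf_dependencies():
    """Report whether the tools needed for PDF conversion can be found."""
    return {
        # Poppler command-line tools (assumed to be what pdf2image calls)
        "pdftoppm": shutil.which("pdftoppm") is not None,
        "pdfinfo": shutil.which("pdfinfo") is not None,
        # Python packages installed via pip above
        "pdf2image": importlib.util.find_spec("pdf2image") is not None,
        "fiftyone": importlib.util.find_spec("fiftyone") is not None,
    }


if __name__ == "__main__":
    for name, found in check_pdf_dependencies().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Note that `brew` is the Homebrew package manager; on other systems Poppler is typically installed through the platform's own package manager.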

Usage

Using the App UI

  1. Launch the App:
import fiftyone as fo

dataset = fo.Dataset()
session = fo.launch_app(dataset)
  2. Press ` or click the Browse operations icon above the grid

  3. Run the pdf_loader operator

Using the SDK

You can use the plugin programmatically from Python:

import fiftyone as fo
import fiftyone.operators as foo

import requests

# Download a PDF from a URL (optional - you can use any local PDF)
url = "https://arxiv.org/pdf/2309.11419"
filename = url.split('/')[-1] + ".pdf"  # Add .pdf extension

response = requests.get(url)

if response.status_code == 200:
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"Downloaded {filename}")
else:
    print(f"Failed to download {filename}. Status code: {response.status_code}")

# Load the PDF loader operator
pdf_loader = foo.get_operator("@brimoor/pdf-loader/pdf_loader")

# Create a dataset for the PDF pages
pdf_dataset = fo.Dataset("pdf_dataset")

# Convert PDF to images and add to dataset
pdf_loader(
    pdf_dataset,
    input_path=filename,            # Path to your PDF file
    output_dir="./pdf_images",     # Directory to save the images
    dpi=200,                        # Image quality in DPI
    fmt="png",                      # Image format (png or jpg)
    tags=None,                      # Optional tags for samples
    delegate=False                  # Set to True for async execution
)
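
The download step above builds the filename by hand and trusts that the response really is a PDF. As a minimal sketch (not part of the plugin), the two hypothetical helpers below derive a `.pdf` filename from a URL and apply a quick heuristic check that the downloaded bytes start with the usual `%PDF-` header before they are handed to the operator.

```python
import posixpath
from urllib.parse import urlparse


def pdf_filename_from_url(url):
    """Derive a local filename ending in .pdf from a URL."""
    # Use only the URL path so query strings do not end up in the name
    name = posixpath.basename(urlparse(url).path.rstrip("/")) or "download"
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return name


def looks_like_pdf(data):
    """Heuristic: PDF files normally begin with the bytes %PDF-."""
    return data[:5] == b"%PDF-"
```

For example, `pdf_filename_from_url(url)` could replace the manual string handling, and `looks_like_pdf(response.content)` could be checked before writing the file, so that an HTML error page is not saved under a `.pdf` name.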

What next?

Install the PyTesseract OCR and Semantic Document Search plugins to make your documents searchable!

  1. Install the plugins and their dependencies:
fiftyone plugins download https://github.com/jacobmarks/pytesseract-ocr-plugin
pip install pytesseract

fiftyone plugins download https://github.com/jacobmarks/semantic-document-search-plugin
pip install qdrant_client
pip install sentence_transformers
  2. Launch a Qdrant server:
docker run -p "6333:6333" -p "6334:6334" -d qdrant/qdrant
  3. Run the run_ocr_engine operator to detect text blocks

  4. Run the create_semantic_document_index operator to generate a semantic index for the text blocks

  5. Run the semantically_search_documents operator to perform arbitrary searches against the index!
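
The indexing and search operators presumably need to reach the Qdrant server, so it can help to confirm that the container started by the `docker run` command is accepting connections first. The sketch below is a generic TCP check using only the standard library; the helper name `is_port_open` is hypothetical, and the default port 6333 is taken from the `docker run` command above.

```python
import socket


def is_port_open(host="localhost", port=6333, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts
        return False


if __name__ == "__main__":
    status = "reachable" if is_port_open() else "not reachable"
    print(f"Qdrant on localhost:6333 is {status}")
```

A successful connection only shows that something is listening on the port; it does not verify that the service is Qdrant or that it is healthy.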

Implementation

This plugin is basically a wrapper around the following code:

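import os

import fiftyone as fo
from pdf2image import convert_from_path

INPUT_PATH = "/path/to/your.pdf"
OUTPUT_DIR = "/path/for/page/images"

# Render each page of the PDF to an image file in OUTPUT_DIR
os.makedirs(OUTPUT_DIR, exist_ok=True)
convert_from_path(INPUT_PATH, output_folder=OUTPUT_DIR, fmt="jpg")

# Add the page images to a FiftyOne dataset
dataset = fo.Dataset()
dataset.add_images_dir(OUTPUT_DIR)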
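
To show how the operator's parameters from the SDK example might map onto this code, here is a minimal sketch of a reusable function. It is not the plugin's actual source: the function name `load_pdf_pages` and the optional `convert` argument are hypothetical, and the way `dpi`, `fmt`, and `tags` are forwarded is an assumption based on the `convert_from_path` and `add_images_dir` signatures.

```python
import os


def load_pdf_pages(dataset, input_path, output_dir,
                   dpi=200, fmt="png", tags=None, convert=None):
    """Render a PDF to page images and add them to a FiftyOne dataset."""
    if convert is None:
        # Imported lazily so a stand-in converter can be passed for testing
        from pdf2image import convert_from_path as convert

    # Write one image per page into output_dir
    os.makedirs(output_dir, exist_ok=True)
    convert(input_path, dpi=dpi, output_folder=output_dir, fmt=fmt)

    # Add every image in output_dir as a sample, with optional tags
    dataset.add_images_dir(output_dir, tags=tags)
    return dataset
```

With FiftyOne and pdf2image installed, this could be called as `load_pdf_pages(fo.Dataset(), "/path/to/your.pdf", "/path/for/page/images")`.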
