Skip to content

klncgty/pdfXtractor

Repository files navigation

Complex Table Extraction from PDFs to CSV-JSON - AI Web Application

Automatically extract complex tables from your PDF documents and convert them into clean, ready-to-use CSV or JSON formats. Powered by advanced AI, this application streamlines the data extraction process and allows you to ask questions about the tables via an agent.

🚀 Key Features

  • AI-Powered Extraction: Processes complex PDF tables with high accuracy.
  • User-Friendly Interface: Upload a PDF, select pages, and instantly download as CSV or JSON.
  • Smart Error Handling: Delivers clean and reliable outputs.
  • Powerful Integrations: Hugging Face models, gmft library, agentic AI, and vector database support.

🔧 Technology Stack

  • Backend: Python (FastAPI), Hugging Face Transformers
  • Frontend: React.js
  • PDF Processing: gmft, PyPDF2
  • Database: PostgreSQL
  • AI Features: Custom NER and summarization models

🛠 Use Cases

  • Complex Table Extraction

    Accurately handles nested headers, merged cells, and similar structures. Suitable for financial reports, academic studies, and official documents.

  • RAG (Retrieval-Augmented Generation) Integration

Installation

1️⃣ Clone the Repository

Run the following command in your terminal or command prompt to clone the repository to your local machine:

git clone https://github.com/klncgty/pdfXtractor.git

2️⃣ Install Python Dependencies

To install the required Python packages for the API, use the requirements.txt file located in the project’s root directory:

pip install -r requirements.txt

3️⃣ Run the API

The API code is located in the api folder. Navigate to the api folder in your terminal and start the FastAPI application:

cd ../api
uvicorn main:app --reload

4️⃣ Run the Frontend

The frontend code is located in the src folder. Navigate to the src folder in your terminal and run the following commands:

cd src
npm install
npm run dev

After running the command, you will see an output similar to this in your terminal:

➜  Local:   http://localhost:port/

click on terminal to visualize app http://localhost:port/

⚠️ Important Notes

  • ** PDF uploading:**
    Uploaded PDF files will be saved in the uploads folder in your local directory.

  • ** Outputs:**
    Outputs generated from processed PDF files are saved in the outputs folder. Ensure this folder exists and has write permissions.

  • ** CORS error:**
    If you encounter an error in the browser like this:

    Access to XMLHttpRequest at 'http://localhost:8000/upload' from origin 'http://localhost:5173' has been blocked by CORS policy
    

    You can solve erorr adding this allow_origins=["*"] to api/main.py

    from fastapi.middleware.cors import CORSMiddleware
    
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],  # Tüm domainlere izin verir
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )

📌 Add.


If you encounter any issues, please report them via GitHub Issues.

License and Usage

PDFXtractor is for personal use only. It cannot be used for commercial purposes, redistributed, or offered as a service to others.

Licensed under the Creative Commons BY-NC 4.0 License.
More info: https://creativecommons.org/licenses/by-nc/4.0/

About

table extarctor web app ai powered.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages