Automatically extract complex tables from your PDF documents and convert them into clean, ready-to-use CSV or JSON formats. Powered by advanced AI, this application streamlines the data extraction process and allows you to ask questions about the tables via an agent.
- AI-Powered Extraction: Processes complex PDF tables with high accuracy.
- User-Friendly Interface: Upload a PDF, select pages, and instantly download as CSV or JSON.
- Smart Error Handling: Delivers clean and reliable outputs.
- Powerful Integrations: Hugging Face models, gmft library, agentic AI, and vector database support.
- Backend: Python (FastAPI), Hugging Face Transformers
- Frontend: React.js
- PDF Processing: gmft, PyPDF2
- Database: PostgreSQL
- AI Features: Custom NER and summarization models
- Complex Table Extraction: Accurately handles nested headers, merged cells, and similar structures. Suitable for financial reports, academic studies, and official documents.
- RAG (Retrieval-Augmented Generation) Integration
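To make the nested-header and CSV/JSON points above more concrete, here is a minimal sketch of how a table with a two-level header could be flattened into single column names and serialized. It is not the project's actual extraction code: `flatten_headers`, `table_to_csv`, and `table_to_json` are hypothetical helpers, and the sketch assumes that a horizontally merged header cell arrives as a value followed by empty strings.

```python
import csv
import io
import json


def flatten_headers(header_rows, sep=" / "):
    """Collapse multi-row headers into one name per column.

    Assumption: in every header row except the last, an empty cell is the
    continuation of a merged cell and inherits the value to its left.
    """
    filled = []
    last = len(header_rows) - 1
    for idx, row in enumerate(header_rows):
        cells = [cell.strip() for cell in row]
        if idx < last:
            current = ""
            for i, cell in enumerate(cells):
                current = cell or current
                cells[i] = current
        filled.append(cells)

    names = []
    for parts in zip(*filled):
        unique = []
        for part in parts:
            # Skip blanks and levels that merely repeat the level above.
            if part and (not unique or unique[-1] != part):
                unique.append(part)
        names.append(sep.join(unique))
    return names


def table_to_csv(header_rows, data_rows):
    """Serialize the table as CSV text with flattened column names."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(flatten_headers(header_rows))
    writer.writerows(data_rows)
    return buf.getvalue()


def table_to_json(header_rows, data_rows):
    """Serialize the table as a JSON array of row objects."""
    names = flatten_headers(header_rows)
    records = [dict(zip(names, row)) for row in data_rows]
    return json.dumps(records, ensure_ascii=False)
```

For example, a header made of the rows `["Region", "2023", ""]` and `["", "Q1", "Q2"]` would yield the column names `Region`, `2023 / Q1`, and `2023 / Q2`. Joining levels with a separator keeps the CSV flat while preserving the grouping; other conventions are equally possible.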
Run the following command in your terminal or command prompt to clone the repository to your local machine:
git clone https://github.com/klncgty/pdfXtractor.git
To install the required Python packages for the API, use the requirements.txt file located in the project’s root directory:
pip install -r requirements.txt
The API code is located in the api folder. Navigate to the api folder in your terminal and start the FastAPI application:
cd ../api
uvicorn main:app --reload
The frontend code is located in the src folder. Navigate to the src folder in your terminal and run the following commands:
cd src
npm install
npm run dev
After running the command, you will see output similar to this in your terminal:
➜ Local: http://localhost:port/
Click the link shown in the terminal (http://localhost:port/) to open the app in your browser.
- **PDF uploading:** Uploaded PDF files are saved in the uploads folder in your local directory.
- **Outputs:** Outputs generated from processed PDF files are saved in the outputs folder. Ensure this folder exists and has write permissions.
- **CORS error:** If you encounter an error like this in the browser:
  Access to XMLHttpRequest at 'http://localhost:8000/upload' from origin 'http://localhost:5173' has been blocked by CORS policy
  you can resolve it by adding allow_origins=["*"] to api/main.py:

      from fastapi.middleware.cors import CORSMiddleware

      app.add_middleware(
          CORSMiddleware,
          allow_origins=["*"],  # Allows all domains
          allow_credentials=True,
          allow_methods=["*"],
          allow_headers=["*"],
      )

- Frontend and API Communication: The frontend interacts with the API to upload and process PDF files. Ensure both are running simultaneously.
- Development: ....
- Models: gmft: https://github.com/conjuncts/gmft and pandasai base model: https://arxiv.org/abs/2110.00061
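The notes above ask that the uploads and outputs folders exist and be writable, and that the API and frontend run at the same time. A small pre-flight check along these lines may help when something does not work. It is only a sketch: `ensure_writable_dir`, `is_port_open`, and `preflight` are hypothetical helpers that are not part of the repository, the folder locations are assumed to be relative to a base directory you pass in, and ports 8000 and 5173 are taken from the CORS error message above and may differ on your machine.

```python
import os
import socket
from pathlib import Path


def ensure_writable_dir(path):
    """Create the folder if it is missing and report whether it is writable."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    return os.access(folder, os.W_OK)


def is_port_open(host, port, timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(base_dir=".", api_port=8000, frontend_port=5173):
    """Collect the checks in one dict; the default ports are assumptions."""
    base = Path(base_dir)
    return {
        "uploads_writable": ensure_writable_dir(base / "uploads"),
        "outputs_writable": ensure_writable_dir(base / "outputs"),
        "api_reachable": is_port_open("localhost", api_port),
        "frontend_reachable": is_port_open("localhost", frontend_port),
    }
```

Calling `preflight()` from the directory where the API is started returns a dictionary of booleans. A `False` value for `api_reachable` or `frontend_reachable` suggests that the corresponding server is not running or is listening on a different port.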
If you encounter any issues, please report them via GitHub Issues.
PDFXtractor is for personal use only. It cannot be used for commercial purposes, redistributed, or offered as a service to others.
Licensed under the Creative Commons BY-NC 4.0 License.
More info: https://creativecommons.org/licenses/by-nc/4.0/
