Complex Table Extraction from PDFs to CSV-JSON - AI Web Application

Automatically extract complex tables from your PDF documents and convert them into clean, ready-to-use CSV or JSON formats. Powered by advanced AI, this application streamlines the data extraction process and allows you to ask questions about the tables via an agent.

🚀 Key Features

AI-Powered Extraction: Processes complex PDF tables with high accuracy.
User-Friendly Interface: Upload a PDF, select pages, and instantly download as CSV or JSON.
Smart Error Handling: Delivers clean and reliable outputs.
Powerful Integrations: Hugging Face models, gmft library, agentic AI, and vector database support.
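
As a rough illustration of the CSV and JSON outputs mentioned above, the sketch below serializes one small table both ways using only the Python standard library. The table contents and the function names table_to_csv and table_to_json are hypothetical and are not taken from the project's code.

```python
import csv
import io
import json


def table_to_csv(header, rows):
    """Serialize a header and its rows into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def table_to_json(header, rows):
    """Serialize rows as a JSON array of objects keyed by column name."""
    records = [dict(zip(header, row)) for row in rows]
    return json.dumps(records, ensure_ascii=False, indent=2)


# Hypothetical table, standing in for one extracted from a PDF page.
header = ["Quarter", "Revenue"]
rows = [["Q1", "1,200"], ["Q2", "1,350"]]

csv_text = table_to_csv(header, rows)
json_text = table_to_json(header, rows)
print(csv_text)
print(json_text)
```

Cells that contain commas are quoted by the csv module, so values such as "1,200" survive a round trip through a CSV reader.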

🔧 Technology Stack

Backend: Python (FastAPI), Hugging Face Transformers
Frontend: React.js
PDF Processing: gmft, PyPDF2
Database: PostgreSQL
AI Features: Custom NER and summarization models

🛠 Use Cases

Complex Table Extraction

Accurately handles nested headers, merged cells, and similar structures. Suitable for financial reports, academic studies, and official documents.
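
One common way to make nested headers usable in CSV or JSON is to flatten them into a single name per column. The sketch below is only an illustration of that idea, not the method used by this application or by gmft. It assumes a merged header cell shows up as a label followed by empty strings, and the function name flatten_headers is hypothetical.

```python
def flatten_headers(header_rows, sep=" / "):
    """Combine stacked header rows into one name per column.

    An empty cell is treated as the continuation of a merged cell to its
    left, so the last non-empty label in that row is carried forward.
    """
    filled_rows = []
    for row in header_rows:
        filled, last = [], ""
        for cell in row:
            last = cell.strip() or last
            filled.append(last)
        filled_rows.append(filled)

    names = []
    for column in zip(*filled_rows):
        parts = []
        for label in column:
            # Skip blanks and labels repeated from the level above.
            if label and (not parts or parts[-1] != label):
                parts.append(label)
        names.append(sep.join(parts))
    return names


# Hypothetical two-level header with "2023" and "2024" spanning two columns.
header_rows = [
    ["Region", "2023", "", "2024", ""],
    ["", "Revenue", "Cost", "Revenue", "Cost"],
]
flat = flatten_headers(header_rows)
print(flat)
```

With this input the columns become "Region", "2023 / Revenue", "2023 / Cost", "2024 / Revenue", and "2024 / Cost".
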
RAG (Retrieval-Augmented Generation) Integration

Installation

1️⃣ Clone the Repository

Run the following commands in your terminal or command prompt to clone the repository to your local machine and move into the project directory:

git clone https://github.com/klncgty/pdfXtractor.git
cd pdfXtractor

2️⃣ Install Python Dependencies

To install the required Python packages for the API, use the requirements.txt file located in the project’s root directory:

pip install -r requirements.txt

3️⃣ Run the API

The API code is located in the api folder. From the project root, navigate to the api folder and start the FastAPI application:

cd api
uvicorn main:app --reload

4️⃣ Run the Frontend

The frontend code is located in the src folder. In a separate terminal, navigate to the src folder from the project root and run the following commands:

cd src
npm install
npm run dev

After running these commands, you will see output similar to this in your terminal:

➜  Local:   http://localhost:port/

Open that URL in your browser to view the app.

⚠️ Important Notes

PDF uploading:
Uploaded PDF files are saved in the uploads folder in your local directory.

Outputs:
Outputs generated from processed PDF files are saved in the outputs folder. Ensure this folder exists and has write permissions.
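
The folder check described above can be scripted. The sketch below creates the uploads and outputs folders if they are missing and reports whether each is writable. It assumes both folders live under one base directory. The function name ensure_writable_dirs is hypothetical, and the demo runs against a temporary directory so it has no side effects.

```python
import os
import tempfile
from pathlib import Path


def ensure_writable_dirs(base_dir, names=("uploads", "outputs")):
    """Create each folder under base_dir if missing and check write access."""
    ready = {}
    for name in names:
        folder = Path(base_dir) / name
        folder.mkdir(parents=True, exist_ok=True)
        ready[name] = os.access(folder, os.W_OK)
    return ready


# Demo against a temporary directory; in practice, pass the directory
# from which the API is started (an assumption about the project layout).
with tempfile.TemporaryDirectory() as base:
    for name, writable in ensure_writable_dirs(base).items():
        print(f"{name}: {'writable' if writable else 'NOT writable'}")
```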

CORS error:
If you encounter an error in the browser like this:

Access to XMLHttpRequest at 'http://localhost:8000/upload' from origin 'http://localhost:5173' has been blocked by CORS policy

You can resolve this error by adding the following middleware, with allow_origins=["*"], to api/main.py:

from fastapi.middleware.cors import CORSMiddleware

# `app` is the existing FastAPI instance defined in api/main.py.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Allows all domains
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
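
Allowing every origin is convenient during local development. If you would rather allow only the frontend's origin, one option is to build the list from an environment variable. The sketch below is an assumption-laden example: ALLOWED_ORIGINS is a hypothetical variable name, allowed_origins is a hypothetical helper, and the default of http://localhost:5173 is taken from the error message above.

```python
import os


def allowed_origins(default="http://localhost:5173"):
    """Read a comma-separated origin list from the environment.

    ALLOWED_ORIGINS is a hypothetical variable name used for illustration.
    Trailing slashes are dropped because browsers send origins without them.
    """
    raw = os.environ.get("ALLOWED_ORIGINS", default)
    return [item.strip().rstrip("/") for item in raw.split(",") if item.strip()]


origins = allowed_origins()
print(origins)

# In api/main.py the list would then replace the wildcard, for example:
#   app.add_middleware(CORSMiddleware, allow_origins=origins, ...)
```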

📌 Add.

Frontend and API Communication:
The frontend interacts with the API to upload and process PDF files. Ensure both are running simultaneously.
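
A quick way to confirm that both processes are running is to test whether their ports accept connections. The sketch below assumes the ports that appear in the CORS example (8000 for the API and 5173 for the frontend). Your setup may use different ports, and the function name is_listening is hypothetical.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports assumed from the CORS example: 8000 for the API, 5173 for the frontend.
services = {"API": 8000, "frontend": 5173}
for name, port in services.items():
    state = "up" if is_listening("localhost", port) else "not reachable"
    print(f"{name} (port {port}): {state}")
```

This only checks that something is listening on each port. It does not verify that the listener is this application.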
Development:
....
Models:
gmft: https://github.com/conjuncts/gmft
pandasai base model: https://arxiv.org/abs/2110.00061

If you encounter any issues, please report them via GitHub Issues.

License and Usage

PDFXtractor is for personal use only. It cannot be used for commercial purposes, redistributed, or offered as a service to others.

Licensed under the Creative Commons BY-NC 4.0 License.
More info: https://creativecommons.org/licenses/by-nc/4.0/
