Skip to content

klncgty/Project_BEAM

Repository files navigation

Project BEAM

In this project, i guide you through the process of converting PDF files to JSON format, performing data retrieval using Ragie.ai, and implementing similar operations locally. Below are the key steps and processes involved.

📝 PDF to JSON Conversion

The first step involves converting our PDF files into JSON format.Because json format more suitable to reach all the priceses of materials correctly because of its dict format. When we search in the descr. retriver can get its price together like below. image

We used the camelot_convert_to_json.ipynb notebook located in the advanced_file_ext folder to achieve this. This notebook provides all the necessary code to extract tables from PDF files and convert them into JSON format.

In camelot.ipynb, you can see and apply converting process pdf to csv. you can rearrange the whole process.

Steps:

  1. Extract Tables: We extracted tables from the PDF using the Camelot library.
  2. Save as JSON: Each table was saved as a temporary JSON file.
  3. Combine JSONs: All JSON files were combined into a single all_tables.json file, ensuring that paragraph descriptions were kept intact without splitting sentences. !! However, some paragraphs may fail to put it together. If the resource is more configured, this problem may not occur. Or maybe, some manupilations on csv tables, got from camelot , can be useful for not facing this issue. But ı can state that this is not a big problem. !!

🚀 Operations in Ragie.ai

Once the PDF data is converted to JSON, we uploaded this JSON file to Ragie.ai for advanced data retrieval operations. Below are the files and scripts used for these operations.

JSON Uploading

  • Script: raige_uploading.py
  • Purpose: This script uploads the JSON data to Ragie.ai, making it accessible for subsequent queries.

Accessing Data

  • Script: reach_document.py
  • Purpose: Allows you to access and retrieve data from the uploaded JSON file on Ragie.ai.
  • You will have an output as below.
      {  "id": "64ca0a7b-2604-42dc-b22e-e54204ebb96d",
          "created_at": "2024-08-01T14:04:41.746519Z",
          "updated_at": "2024-08-01T14:17:07.294238Z",
          "status": "ready",
          "name": "EXAMPLE.pdf",
          "metadata": { "title": "YOUR_TITLE", "environment": "YOUR_ENV_NAME"},
          "chunk_count": 1942,
          "external_id": null}
    

Query Execution

  • Script: query.py
  • Purpose: This script is used to execute queries against the JSON data on Ragie.ai, retrieving the most relevant results based on the search scores.

Evaluating Results

  • Tutorial: Ragie.ai Tutorial
  • Note: The retrieved results are ranked based on relevance. You can enhance accuracy through trial and error by adjusting the top_k parameter.

🔧 Local Operations

In addition to the operations performed on Ragie.ai, similar processes can be executed locally using open-source tools and libraries. Below are the key scripts and methods used for local retrieval.

Local Retriever Method

  • Folder: searchin_locally
  • Notebook: vector_search.ipynb
  • Purpose: This notebook replicates the Ragie.ai operations locally. It uses open-source tools to perform vector searches, allowing for more customizable and experimental retrieval processes.

Customization

Note

Flexibility: The local approach allows for more parameter tweaks and the use of alternative technologies to achieve the best possible results. This method is ideal for testing and optimizing retrieval strategies.

Results based on the type of source file.

When working on PDFs, search results are not efficient in terms of price.
Because the unit price of the target material cannot be reached.
However, when working on json as a source, thanks to the dictionary structure,it is possible to bring the relevant material in the form of a dict and easily access the price information of that related material. 

🔍 More Vector Stores and Embedding Models Alternatives ((For Local))

When working with retrieval-augmented generation (RAG) or other similar tasks, choosing the right vector store and embedding model is crucial. Below is a list of popular vector stores and embedding models, along with information about their availability (open-source vs. paid).

📂 Vector Stores

  1. Faiss

    • Description: Developed by Facebook AI, Faiss is a library for efficient similarity search and clustering of dense vectors.
    • License: Open Source
    • Link: Faiss GitHub Repository
  2. Annoy

    • Description: Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings that supports fast approximate nearest neighbor search.
    • License: Open Source
    • Link: Annoy GitHub Repository
  3. Milvus

    • Description: Milvus is an open-source vector database built to power embedding similarity search and AI applications.
    • License: Open Source
    • Link: Milvus GitHub Repository
  4. Weaviate

    • Description: Weaviate is an open-source vector search engine that stores both objects and vectors, allowing for semantic search.
    • License: Open Source (Community Edition) / Paid (Enterprise Edition)
    • Link: Weaviate Documentation
  5. Pinecone

    • Description: Pinecone is a managed vector database service for high-performance similarity search.
    • License: Paid
    • Link: Pinecone Website
  6. Vectara

    • Description: Vectara is a neural search-as-a-service platform that offers vector search capabilities.
    • License: Paid
    • Link: Vectara Website
  7. Qdrant

    • Description: Qdrant is a vector search engine that provides fast and scalable vector similarity search.
    • License: Open Source (Community Edition) / Paid (Cloud Service)
    • Link: Qdrant GitHub Repository

🧠 Embedding Models

  1. BERT (Bidirectional Encoder Representations from Transformers)

    • Description: BERT is a transformer-based model developed by Google, designed for natural language understanding tasks.
    • License: Open Source
    • Link: BERT GitHub Repository
  2. GPT-3

    • Description: GPT-3 is a powerful language model developed by OpenAI, known for its capabilities in text generation and understanding.
    • License: Paid (via API access)
    • Link: OpenAI GPT-3
  3. Sentence-BERT (SBERT)

    • Description: Sentence-BERT is a modification of the BERT network that uses siamese and triplet networks to derive semantically meaningful sentence embeddings.
    • License: Open Source
    • Link: SBERT GitHub Repository
  4. Universal Sentence Encoder

    • Description: Developed by Google, this model encodes text into high-dimensional vectors for tasks such as text classification, clustering, and semantic search.
    • License: Open Source
    • Link: Universal Sentence Encoder
  5. CLIP (Contrastive Language-Image Pretraining)

    • Description: Developed by OpenAI, CLIP is a model that can understand images and text jointly by using a multimodal approach.
    • License: Open Source
    • Link: CLIP GitHub Repository
  6. GloVe (Global Vectors for Word Representation)

    • Description: GloVe is an unsupervised learning algorithm for obtaining vector representations for words.
    • License: Open Source
    • Link: GloVe GitHub Repository

Tip

Huggingface-embeddings and many other integration tools like llamaindex - embeddings or langchain-embeddings has open source embedding models also.

Outcome

Observed during the project :

  • Ragie.ai contains file parsing, embedding and vector store operations in itself.
  • Ragie.ai useful for vector search in uncomplicated sources.
  • Ragie.ai Not an experimental environment.
  • Vector store search has become more efficient when the pdf file is converted to a json file.
  • JSON format is more efficient for complex data when using Ragie.ai.
  • If we search the vector store locally, with code we write from scratch, we can be more free.
  • VertexAI could be an alternative to Ragie.ai. Must try!

This README provides a comprehensive overview of the steps involved in the Project BEAM, from PDF to JSON conversion to advanced data retrieval operations, both on Ragie.ai and locally.


About

a rag project with pdf including complex tables

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published