Merged
Changes from all commits
26 commits
5ac57a5
Dataprep Multimodal Redis README fixes (#1330)
dmsuehir Feb 26, 2025
a153f55
Refine synchronized I/O in asynchronous functions (#1300)
XinyaoWa Feb 27, 2025
67c673b
fix NLTK_DATA path issue due to llama_index version update (#1340)
chensuyue Feb 27, 2025
7a49c96
fix NLTK_DATA path issue due to llama_index version update (#1343)
chensuyue Feb 27, 2025
c5bd553
Changes to checkin text2graph microservice
intelsharath Mar 4, 2025
3f0ff1f
cleaning up auto checking
intelsharath Mar 5, 2025
781b097
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 4, 2025
8b9f041
doc: spell check update (#1342)
madvimer Mar 1, 2025
b92433a
doc: spell check update (#1341)
madvimer Mar 1, 2025
c3c2bc6
enriched input parameters of text2image and image2image. (#1339)
XinyuYe-Intel Mar 1, 2025
7e9c6c8
Revert wrong handling of stream (#1354)
Spycsh Mar 3, 2025
3e190d1
Build and upstream latest base image on push event (#1355)
chensuyue Mar 3, 2025
42a4bba
Add timeout param for DocSum and FaqGen to deal with long context (#1…
XinyaoWa Mar 4, 2025
fa3232f
Megaservice / orchestrator metric testing + fixes (#1348)
eero-t Mar 4, 2025
268a2fe
update image push machine (#1361)
chensuyue Mar 5, 2025
9d55775
cleaning up some security issues flagged
intelsharath Mar 5, 2025
db16e6d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 5, 2025
14f1a77
Merge branch 'main' into text2graph_checkin
intelsharath Mar 5, 2025
77f4777
fix for security replace os.system to osmakedirs
intelsharath Mar 6, 2025
052d5b5
fix for security replace os.system to osmakedirs
intelsharath Mar 6, 2025
7d9349f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 6, 2025
532a725
security fix one more attempt
intelsharath Mar 6, 2025
9d1f7ae
Merge branch 'main' into text2graph_checkin
intelsharath Mar 7, 2025
08e2f1b
changes include deleted commented code and MODEL_NAME change to use v…
intelsharath Mar 7, 2025
39f86e8
changes for merge
intelsharath Mar 7, 2025
365b659
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 7, 2025
9 changes: 9 additions & 0 deletions .github/workflows/docker/compose/text2graph-compose.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# this file should be run in the root of the repo
services:
text2graph:
build:
dockerfile: comps/text2graph/src/Dockerfile
image: ${REGISTRY:-opea}/text2graph:${TAG:-latest}
1 change: 1 addition & 0 deletions comps/cores/mega/constants.py
@@ -34,6 +34,7 @@ class ServiceType(Enum):
ANIMATION = 17
IMAGE2IMAGE = 18
TEXT2SQL = 19
TEXT2GRAPH = 20


class MegaServiceEndpoint(Enum):
Empty file.
29 changes: 29 additions & 0 deletions comps/text2graph/deployment/docker_compose/compose.yaml
@@ -0,0 +1,29 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

services:
text2graph:
image: opea/text2graph:latest
container_name: text2graph
ports:
- ${TEXT2GRAPH_PORT:-8090}:8090
environment:
- no_proxy=${no_proxy}
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- LLM_MODEL_ID=${LLM_MODEL_ID:-"Babelscape/rebel-large"}
- HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN}
ipc: host
restart: always

text2graph-gaudi:
image: opea/text2graph:${TAG:-latest}
container_name: text2graph-gaudi-server
ports:
- ${TEXT2GRAPH_PORT:-9090}:8080
environment:
- TGI_LLM_ENDPOINT=${TGI_LLM_ENDPOINT:-http://localhost:8080}

networks:
default:
driver: bridge
47 changes: 47 additions & 0 deletions comps/text2graph/src/Dockerfile
@@ -0,0 +1,47 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim
ENV LANG=C.UTF-8
ARG ARCH=cpu

RUN apt-get update -y && apt-get install vim -y && apt-get install -y --no-install-recommends --fix-missing \
build-essential

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
if [ ${ARCH} = "cpu" ]; then \
pip install --no-cache-dir --extra-index-url https://download.pytorch.org/whl/cpu -r /home/user/comps/text2graph/src/requirements.txt; \
else \
pip install --no-cache-dir -r /home/user/comps/text2graph/src/requirements.txt; \
fi

ENV https_proxy=${https_proxy}
ENV http_proxy=${http_proxy}
ENV no_proxy=${no_proxy}
ENV LLM_ID=${LLM_ID:-"Babelscape/rebel-large"}
ENV SPAN_LENGTH=${SPAN_LENGTH:-"1024"}
ENV OVERLAP=${OVERLAP:-"100"}
ENV MAX_LENGTH=${MAX_NEW_TOKENS:-"256"}
ENV HUGGINGFACEHUB_API_TOKEN=${HF_TOKEN}
ENV HF_TOKEN=${HF_TOKEN}
ENV LLM_MODEL_ID=${LLM_ID}
ENV TGI_PORT=8008
ENV PYTHONPATH="/home/user/":$PYTHONPATH

USER user

WORKDIR /home/user/comps/text2graph/src/

RUN bash -c 'source /home/user/comps/text2graph/src/setup_service_env.sh'

ENTRYPOINT ["python", "opea_text2graph_microservice.py"]
118 changes: 118 additions & 0 deletions comps/text2graph/src/README.md
@@ -0,0 +1,118 @@
# Text to graph triplet extractor

Creating graphs from text means converting unstructured text into structured data, which is challenging.
The task has gained significant traction with the advent of Large Language Models (LLMs), which have brought it into the mainstream. There are two main approaches to extracting graph triplets, depending on the LLM architecture: decoder-only and encoder-decoder models.

## Decoder Models

Decoder-only models are faster during inference because they skip the encoding step. They are well suited to
tasks where the input-output mapping is simple, where multitasking is required, or where computational
efficiency is a priority. However, decoder-only models can struggle with tasks that require deep contextual
understanding, or where input and output structures are highly heterogeneous.

## Encoder-decoder models

This microservice takes an encoder-decoder approach to graph triplet extraction. Models such as REBEL are based on BART-like architectures and fine-tuned for relation extraction and classification. This approach handles complex relations and varied data sources better. Encoder-decoder models often achieve high performance on benchmarks because they encode contextual information effectively, which makes them suitable for tasks that require detailed parsing of text into structured formats, such as knowledge graph construction from unstructured data.
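REBEL does not emit triplets directly: it generates a linearized string that uses the special tokens `<triplet>`, `<subj>`, and `<obj>`, which must be parsed back into (head, relation, tail) tuples. Below is a minimal parser sketch, assuming the standard REBEL linearization from the model card (head after `<triplet>`, tail after `<subj>`, relation after `<obj>`); the microservice's actual parsing code may differ.

```python
def extract_triplets(text: str):
    """Parse REBEL's linearized output into (head, relation, tail) triplets.

    Assumed format (per the Babelscape/rebel-large model card):
      "<triplet> head <subj> tail <obj> relation [<subj> tail2 <obj> relation2 ...]"
    One head may be followed by several <subj>/<obj> pairs.
    """
    triplets = []
    for chunk in text.split("<triplet>")[1:]:
        # Everything before the first <subj> is the head entity.
        head, _, rest = chunk.partition("<subj>")
        # Each remaining "<subj> tail <obj> relation" pair shares this head.
        for pair in ("<subj>" + rest).split("<subj>")[1:]:
            tail, _, relation = pair.partition("<obj>")
            if tail.strip() and relation.strip():
                triplets.append((head.strip(), relation.strip(), tail.strip()))
    return triplets
```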

# Features

The service takes input text from a document or from one or more strings and identifies the graph triplets
and nodes. Subsequent processing, such as entity disambiguation to merge duplicate entities, is still needed
before generating Cypher code.
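The post-processing step can be sketched as follows. This is a naive illustration, not part of the service: `disambiguate` merges entities whose lower-cased names match, and `to_cypher` emits `MERGE` statements so repeated entities map onto a single node; the function names and the `Entity` label are hypothetical.

```python
def disambiguate(triplets):
    """Merge entities whose normalized (lower-cased) names match.

    A real pipeline would use embeddings or a knowledge base; matching on
    the normalized surface form is the simplest possible heuristic.
    """
    canonical = {}

    def norm(name):
        # First spelling seen for a key becomes the canonical form.
        return canonical.setdefault(name.lower().strip(), name.strip())

    return [(norm(h), r.strip(), norm(t)) for h, r, t in triplets]


def to_cypher(triplets):
    """Emit Cypher MERGE statements; MERGE deduplicates nodes on insert."""
    lines = []
    for head, rel, tail in triplets:
        rel_type = rel.upper().replace(" ", "_")
        lines.append(
            f'MERGE (a:Entity {{name: "{head}"}}) '
            f'MERGE (b:Entity {{name: "{tail}"}}) '
            f"MERGE (a)-[:{rel_type}]->(b)"
        )
    return lines
```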

## Implementation

The text-to-graph microservice extracts triplets from unstructured text supplied as a document, text file, or
string. The service runs in a Docker container. Triplet extraction requires both the extraction logic and a
hosted LLM: the LLM is served with TGI on Gaudi, or runs natively on CPU for CPU-only deployments.

# 🚀1. Start Microservice with Docker

Option 1: running on CPUs

## Install Requirements

```bash
pip install -r requirements.txt
```

## Environment variables: configure LLM parameters for the selected model

```bash
export LLM_ID=${LLM_ID:-"Babelscape/rebel-large"}
export SPAN_LENGTH=${SPAN_LENGTH:-"1024"}
export OVERLAP=${OVERLAP:-"100"}
export MAX_LENGTH=${MAX_NEW_TOKENS:-"256"}
export HUGGINGFACEHUB_API_TOKEN=""
export LLM_MODEL_ID=${LLM_ID}
export TGI_PORT=8008
```
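The `SPAN_LENGTH` and `OVERLAP` settings suggest that long inputs are split into overlapping spans before extraction, so that relations near a chunk boundary are not lost. The following is a hypothetical sketch of such chunking, assuming defaults matching the exports above; the service's actual chunking logic may differ.

```python
def split_spans(tokens, span_length=1024, overlap=100):
    """Split a token list into overlapping spans of at most span_length.

    Consecutive spans share `overlap` tokens, so a triplet that straddles
    one boundary still appears whole in at least one span.
    """
    step = span_length - overlap
    return [
        tokens[i:i + span_length]
        for i in range(0, max(len(tokens) - overlap, 1), step)
    ]
```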

## Echo env variables

```bash
echo "Extractor details"
echo LLM_ID=${LLM_ID}
echo SPAN_LENGTH=${SPAN_LENGTH}
echo OVERLAP=${OVERLAP}
echo MAX_LENGTH=${MAX_LENGTH}
```

### Start TGI Service

```bash
export HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN}
export LLM_MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
export TGI_PORT=8008

docker run -d --name="text2graph-tgi-endpoint" --ipc=host -p $TGI_PORT:80 -v ./data:/data --shm-size 1g -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -e model=${LLM_MODEL_ID} ghcr.io/huggingface/text-generation-inference:2.1.0 --model-id $LLM_MODEL_ID
```

### Verify the TGI Service

```bash
export your_ip=$(hostname -I | awk '{print $1}')
curl http://${your_ip}:${TGI_PORT}/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
-H 'Content-Type: application/json'
```

### Set the TGI endpoint environment variable

```bash
export TGI_LLM_ENDPOINT="http://${your_ip}:${TGI_PORT}"
```

### Start Text2Graph Microservice with Docker

Build the text2graph microservice image:

```bash
docker build -f Dockerfile -t user_name:graph_extractor ../../../
```

Launch the text2graph microservice:

```bash
docker run -it --net=host --ipc=host user_name:graph_extractor
```

This launches the text2graph microservice in interactive mode. (With `--net=host`, the container shares the host network, so no port mapping is needed.)

# Validation and testing

## Text to triplets

The test directory is GenAIComps/tests/text2graph/, which contains two files:

- example_from_file.py: an example Python script that downloads a text file and extracts triplets

- test_text2graph_opea.sh: the main script that builds the Docker image, checks service health, and extracts and generates triplets.

## Check if services are up

### Setup validation process

Once the service is up, use http://localhost:8090/docs for the Swagger documentation, the list of endpoints, and an interactive GUI.
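As a starting point for scripted validation, the request below shows how a client call to the service might be constructed. The route `/v1/text2graph` and the `input_text` field are placeholders, not confirmed by this PR; check the Swagger UI at `/docs` for the actual endpoint and payload schema.

```python
import json
from urllib import request


def build_text2graph_request(text, host="localhost", port=8090,
                             route="/v1/text2graph"):
    """Construct (but do not send) a POST request for the microservice.

    The route and payload field are assumptions -- consult the Swagger
    documentation at http://localhost:8090/docs for the real schema.
    """
    payload = json.dumps({"input_text": text}).encode("utf-8")
    return request.Request(
        f"http://{host}:{port}{route}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To actually send the request, pass the returned object to `urllib.request.urlopen` once the service is running.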