
Commit 5116c9c

mhbuehler authored and smguggen committed
MultimodalQnA image query, pdf, and dynamic ports (opea-project#1134)
According to the RFC's Phase 2 plan, this PR adds image query support, PDF ingestion support, and dynamic ports to the microservices used by MultimodalQnA. It accompanies a corresponding PR in GenAIExamples.

Signed-off-by: dmsuehir <[email protected]>
Signed-off-by: Melanie Buehler <[email protected]>
1 parent b38b6aa commit 5116c9c

16 files changed: 486 additions & 67 deletions


comps/cores/proto/docarray.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -278,7 +278,7 @@ class GraphDoc(BaseDoc):


 class LVMDoc(BaseDoc):
-    image: str
+    image: Union[str, List[str]]
     prompt: str
     max_new_tokens: conint(ge=0, le=1024) = 512
     top_k: int = 10
```
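
With this change, `LVMDoc.image` accepts either one base64-encoded image or a list of them. A minimal sketch of both request shapes (import path and placeholder image strings assumed):

```python
# Sketch of the two shapes LVMDoc.image now accepts; the base64 strings are placeholders.
from comps.cores.proto.docarray import LVMDoc  # assumed import path

single_image = LVMDoc(image="iVBORw0KGgo...", prompt="What is this?")
image_list = LVMDoc(image=["iVBORw0KGgo...", "iVBORw0KGgo..."], prompt="What is in these images?")
```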

comps/dataprep/multimodal/redis/langchain/README.md

Lines changed: 5 additions & 2 deletions
```diff
@@ -5,6 +5,7 @@ This `dataprep` microservice accepts the following from the user and ingests the
 - Videos (mp4 files) and their transcripts (optional)
 - Images (gif, jpg, jpeg, and png files) and their captions (optional)
 - Audio (wav files)
+- PDFs (with text and images)

 ## 🚀1. Start Microservice with Python(Option 1)
```

```diff
@@ -111,18 +112,19 @@ docker container logs -f dataprep-multimodal-redis

 ## 🚀4. Consume Microservice

-Once this dataprep microservice is started, user can use the below commands to invoke the microservice to convert images and videos and their transcripts (optional) to embeddings and save to the Redis vector store.
+Once this dataprep microservice is started, user can use the below commands to invoke the microservice to convert images, videos, text, and PDF files to embeddings and save to the Redis vector store.

 This microservice provides 3 different ways for users to ingest files into Redis vector store corresponding to the 3 use cases.

 ### 4.1 Consume _ingest_with_text_ API

-**Use case:** This API is used when videos are accompanied by transcript files (`.vtt` format) or images are accompanied by text caption files (`.txt` format).
+**Use case:** This API is used for videos accompanied by transcript files (`.vtt` format), images accompanied by text caption files (`.txt` format), and PDF files containing a mix of text and images.

 **Important notes:**

 - Make sure the file paths after `files=@` are correct.
 - Every transcript or caption file's name must be identical to its corresponding video or image file's name (except their extension - .vtt goes with .mp4 and .txt goes with .jpg, .jpeg, .png, or .gif). For example, `video1.mp4` and `video1.vtt`. Otherwise, if `video1.vtt` is not included correctly in the API call, the microservice will return an error `No captions file video1.vtt found for video1.mp4`.
+- It is assumed that PDFs will contain at least one image. Each image in the file will be embedded along with the text that appears on the same page as the image.

 #### Single video-transcript pair upload
```

````diff
@@ -157,6 +159,7 @@ curl -X POST \
   -F "files=@./image1.txt" \
   -F "files=@./image2.jpg" \
   -F "files=@./image2.txt" \
+  -F "files=@./example.pdf" \
   http://localhost:6007/v1/ingest_with_text
 ```
````
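
A Python equivalent of the curl upload above, sketched with `requests` (file names are placeholders; PDFs need no companion caption file):

```python
import requests

# Placeholder file names; images/videos are paired with caption/transcript files,
# while the PDF is ingested on its own.
files = [
    ("files", ("image1.jpg", open("./image1.jpg", "rb"))),
    ("files", ("image1.txt", open("./image1.txt", "rb"))),
    ("files", ("example.pdf", open("./example.pdf", "rb"))),
]
response = requests.post("http://localhost:6007/v1/ingest_with_text", files=files)
print(response.status_code, response.text)
```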

comps/dataprep/multimodal/redis/langchain/prepare_videodoc_redis.py

Lines changed: 143 additions & 30 deletions
```diff
@@ -1,13 +1,16 @@
 # Copyright (C) 2024 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0

+import base64
+import json
 import os
 import shutil
 import time
 import uuid
 from pathlib import Path
 from typing import Any, Dict, Iterable, List, Optional, Type, Union

+import pymupdf
 from config import EMBED_MODEL, INDEX_NAME, INDEX_SCHEMA, LVM_ENDPOINT, REDIS_URL, WHISPER_MODEL
 from fastapi import File, HTTPException, UploadFile
 from langchain_community.utilities.redis import _array_to_buffer
```

```diff
@@ -301,7 +304,53 @@ def prepare_data_and_metadata_from_annotation(
     return text_list, image_list, metadatas


-def ingest_multimodal(videoname, data_folder, embeddings):
+def prepare_pdf_data_from_annotation(annotation, path_to_files, title):
+    """PDF data processing has some key differences from videos and images.
+
+    1. Neighboring transcripts are not currently considered relevant.
+       We are only taking the text located on the same page as the image.
+    2. The images within PDFs are indexed by page and image-within-page
+       indices, as opposed to a single frame index.
+    3. Instead of time of frame in ms, we return the PDF page index through
+       the pre-existing time_of_frame_ms metadata key to maintain compatibility.
+    """
+    text_list = []
+    image_list = []
+    metadatas = []
+    for item in annotation:
+        page_index = item["frame_no"]
+        image_index = item["sub_video_id"]
+        path_to_image = os.path.join(path_to_files, f"page{page_index}_image{image_index}.png")
+        caption_for_ingesting = item["caption"]
+        caption_for_inference = item["caption"]
+
+        pdf_id = item["video_id"]
+        b64_img_str = item["b64_img_str"]
+        embedding_type = "pair" if b64_img_str else "text"
+        source = item["video_name"]
+
+        text_list.append(caption_for_ingesting)
+
+        if b64_img_str:
+            image_list.append(path_to_image)
+
+        metadatas.append(
+            {
+                "content": caption_for_ingesting,
+                "b64_img_str": b64_img_str,
+                "video_id": pdf_id,
+                "source_video": source,
+                "time_of_frame_ms": page_index,  # For PDFs save the page number
+                "embedding_type": embedding_type,
+                "title": title,
+                "transcript_for_inference": caption_for_inference,
+            }
+        )
+
+    return text_list, image_list, metadatas
+
+
+def ingest_multimodal(filename, data_folder, embeddings, is_pdf=False):
     """Ingest text image pairs to Redis from the data/ directory that consists of frames and annotations."""
     data_folder = os.path.abspath(data_folder)
     annotation_file_path = os.path.join(data_folder, "annotations.json")
```

```diff
@@ -310,10 +359,15 @@ def ingest_multimodal(videoname, data_folder, embeddings):
     annotation = load_json_file(annotation_file_path)

     # prepare data to ingest
-    text_list, image_list, metadatas = prepare_data_and_metadata_from_annotation(annotation, path_to_frames, videoname)
+    if is_pdf:
+        text_list, image_list, metadatas = prepare_pdf_data_from_annotation(annotation, path_to_frames, filename)
+    else:
+        text_list, image_list, metadatas = prepare_data_and_metadata_from_annotation(
+            annotation, path_to_frames, filename
+        )

     MultimodalRedis.from_text_image_pairs_return_keys(
-        texts=[f"From {videoname}. " + text for text in text_list],
+        texts=[f"From {filename}. " + text for text in text_list],
         images=image_list,
         embedding=embeddings,
         metadatas=metadatas,
```
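
For orientation, an annotation entry consumed by `prepare_pdf_data_from_annotation` looks roughly like the sketch below; the keys match the writer loop later in this diff, and the values are illustrative:

```python
# Illustrative annotation for one image found on page 2 of an ingested PDF;
# the keys reuse the video metadata schema, as the docstring above explains.
annotation = [
    {
        "video_id": "abc123",                # unique id assigned at upload time
        "video_name": "example_abc123.pdf",  # source PDF file name
        "b64_img_str": "iVBORw0KGgo...",     # base64-encoded PNG extracted from the page
        "caption": "Text located on the same page as the image.",
        "time": 0.0,
        "frame_no": 2,       # page index
        "sub_video_id": 1,   # image-within-page index
    }
]

texts, images, metadatas = prepare_pdf_data_from_annotation(annotation, "uploads/example_abc123/frames", "example")
```
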
```diff
@@ -335,7 +389,10 @@ def drop_index(index_name, redis_url=REDIS_URL):


 @register_microservice(
-    name="opea_service@prepare_videodoc_redis", endpoint="/v1/generate_transcripts", host="0.0.0.0", port=6007
+    name="opea_service@prepare_videodoc_redis",
+    endpoint="/v1/generate_transcripts",
+    host="0.0.0.0",
+    port=int(os.getenv("DATAPREP_MMR_PORT", 6007)),
 )
 async def ingest_generate_transcripts(files: List[UploadFile] = File(None)):
     """Upload videos or audio files with speech, generate transcripts using whisper and ingest into redis."""
```
```diff
@@ -444,7 +501,10 @@ async def ingest_generate_transcripts(files: List[UploadFile] = File(None)):


 @register_microservice(
-    name="opea_service@prepare_videodoc_redis", endpoint="/v1/generate_captions", host="0.0.0.0", port=6007
+    name="opea_service@prepare_videodoc_redis",
+    endpoint="/v1/generate_captions",
+    host="0.0.0.0",
+    port=int(os.getenv("DATAPREP_MMR_PORT", 6007)),
 )
 async def ingest_generate_caption(files: List[UploadFile] = File(None)):
     """Upload images and videos without speech (only background music or no audio), generate captions using lvm microservice and ingest into redis."""
```

```diff
@@ -506,11 +566,11 @@ async def ingest_generate_caption(files: List[UploadFile] = File(None)):
     name="opea_service@prepare_videodoc_redis",
     endpoint="/v1/ingest_with_text",
     host="0.0.0.0",
-    port=6007,
+    port=int(os.getenv("DATAPREP_MMR_PORT", 6007)),
 )
 async def ingest_with_text(files: List[UploadFile] = File(None)):
     if files:
-        accepted_media_formats = [".mp4", ".png", ".jpg", ".jpeg", ".gif"]
+        accepted_media_formats = [".mp4", ".png", ".jpg", ".jpeg", ".gif", ".pdf"]
         # Create a lookup dictionary containing all media files
         matched_files = {f.filename: [f] for f in files if os.path.splitext(f.filename)[1] in accepted_media_formats}
         uploaded_files_map = {}
```

```diff
@@ -537,25 +597,25 @@ async def ingest_with_text(files: List[UploadFile] = File(None)):
            elif file_extension not in accepted_media_formats:
                print(f"Skipping file {file.filename} because of unsupported format.")

-        # Check if every media file has a caption file
-        for media_file_name, file_pair in matched_files.items():
-            if len(file_pair) != 2:
+        # Check that every media file that is not a pdf has a caption file
+        for media_file_name, file_list in matched_files.items():
+            if len(file_list) != 2 and os.path.splitext(media_file_name)[1] != ".pdf":
                 raise HTTPException(status_code=400, detail=f"No caption file found for {media_file_name}")

         if len(matched_files.keys()) == 0:
             return HTTPException(
                 status_code=400,
-                detail="The uploaded files have unsupported formats. Please upload at least one video file (.mp4) with captions (.vtt) or one image (.png, .jpg, .jpeg, or .gif) with caption (.txt)",
+                detail="The uploaded files have unsupported formats. Please upload at least one video file (.mp4) with captions (.vtt) or one image (.png, .jpg, .jpeg, or .gif) with caption (.txt) or one .pdf file",
             )

         for media_file in matched_files:
             print(f"Processing file {media_file}")
+            file_name, file_extension = os.path.splitext(media_file)

             # Assign unique identifier to file
             file_id = generate_id()

             # Create file name by appending identifier
-            file_name, file_extension = os.path.splitext(media_file)
             media_file_name = f"{file_name}_{file_id}{file_extension}"
             media_dir_name = os.path.splitext(media_file_name)[0]

```

```diff
@@ -564,25 +624,72 @@ async def ingest_with_text(files: List[UploadFile] = File(None)):
                 shutil.copyfileobj(matched_files[media_file][0].file, f)
             uploaded_files_map[file_name] = media_file_name

-            # Save caption file in upload directory
-            caption_file_extension = os.path.splitext(matched_files[media_file][1].filename)[1]
-            caption_file = f"{media_dir_name}{caption_file_extension}"
-            with open(os.path.join(upload_folder, caption_file), "wb") as f:
-                shutil.copyfileobj(matched_files[media_file][1].file, f)
+            if file_extension == ".pdf":
+                # Set up location to store pdf images and text, reusing "frames" and "annotations" from video
+                output_dir = os.path.join(upload_folder, media_dir_name)
+                os.makedirs(output_dir, exist_ok=True)
+                os.makedirs(os.path.join(output_dir, "frames"), exist_ok=True)
+                doc = pymupdf.open(os.path.join(upload_folder, media_file_name))
+                annotations = []
+                for page_idx, page in enumerate(doc, start=1):
+                    text = page.get_text()
+                    images = page.get_images()
+                    for image_idx, image in enumerate(images, start=1):
+                        # Write image and caption file for each image found in pdf
+                        img_fname = f"page{page_idx}_image{image_idx}"
+                        img_fpath = os.path.join(output_dir, "frames", img_fname + ".png")
+                        pix = pymupdf.Pixmap(doc, image[0])  # create pixmap
+
+                        if pix.n - pix.alpha > 3:  # if CMYK, convert to RGB first
+                            pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
+
+                        pix.save(img_fpath)  # pixmap to png
+                        pix = None
+
+                        # Convert image to base64 encoded string
+                        with open(img_fpath, "rb") as image2str:
+                            encoded_string = base64.b64encode(image2str.read())  # png to bytes
+
+                        decoded_string = encoded_string.decode()  # bytes to string
+
+                        # Create annotations file, reusing metadata keys from video
+                        annotations.append(
+                            {
+                                "video_id": file_id,
+                                "video_name": os.path.basename(os.path.join(upload_folder, media_file_name)),
+                                "b64_img_str": decoded_string,
+                                "caption": text,
+                                "time": 0.0,
+                                "frame_no": page_idx,
+                                "sub_video_id": image_idx,
+                            }
+                        )
+
+                with open(os.path.join(output_dir, "annotations.json"), "w") as f:
+                    json.dump(annotations, f)
+
+                # Ingest multimodal data into redis
+                ingest_multimodal(file_name, os.path.join(upload_folder, media_dir_name), embeddings, is_pdf=True)
+            else:
+                # Save caption file in upload directory
+                caption_file_extension = os.path.splitext(matched_files[media_file][1].filename)[1]
+                caption_file = f"{media_dir_name}{caption_file_extension}"
+                with open(os.path.join(upload_folder, caption_file), "wb") as f:
+                    shutil.copyfileobj(matched_files[media_file][1].file, f)

-            # Store frames and caption annotations in a new directory
-            extract_frames_and_annotations_from_transcripts(
-                file_id,
-                os.path.join(upload_folder, media_file_name),
-                os.path.join(upload_folder, caption_file),
-                os.path.join(upload_folder, media_dir_name),
-            )
+                # Store frames and caption annotations in a new directory
+                extract_frames_and_annotations_from_transcripts(
+                    file_id,
+                    os.path.join(upload_folder, media_file_name),
+                    os.path.join(upload_folder, caption_file),
+                    os.path.join(upload_folder, media_dir_name),
+                )

-            # Delete temporary caption file
-            os.remove(os.path.join(upload_folder, caption_file))
+                # Delete temporary caption file
+                os.remove(os.path.join(upload_folder, caption_file))

-            # Ingest multimodal data into redis
-            ingest_multimodal(file_name, os.path.join(upload_folder, media_dir_name), embeddings)
+                # Ingest multimodal data into redis
+                ingest_multimodal(file_name, os.path.join(upload_folder, media_dir_name), embeddings)

             # Delete temporary media directory containing frames and annotations
             shutil.rmtree(os.path.join(upload_folder, media_dir_name))
```
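
The pymupdf extraction above, in isolation: a minimal sketch that walks a PDF's pages and base64-encodes each embedded image, assuming an `example.pdf` exists (`tobytes` is used here in place of the temp-file round trip in the service code):

```python
import base64

import pymupdf

doc = pymupdf.open("example.pdf")  # assumes this file exists
for page_idx, page in enumerate(doc, start=1):
    text = page.get_text()  # all text on the page, kept as the image's caption
    for image_idx, image in enumerate(page.get_images(), start=1):
        pix = pymupdf.Pixmap(doc, image[0])  # image[0] is the xref of the embedded image
        if pix.n - pix.alpha > 3:  # CMYK must be converted to RGB before writing PNG
            pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
        b64_img_str = base64.b64encode(pix.tobytes("png")).decode()
        print(f"page {page_idx} image {image_idx}: {len(text)} chars of co-located text")
```
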
```diff
@@ -602,7 +709,10 @@ async def ingest_with_text(files: List[UploadFile] = File(None)):


 @register_microservice(
-    name="opea_service@prepare_videodoc_redis", endpoint="/v1/dataprep/get_files", host="0.0.0.0", port=6007
+    name="opea_service@prepare_videodoc_redis",
+    endpoint="/v1/dataprep/get_files",
+    host="0.0.0.0",
+    port=int(os.getenv("DATAPREP_MMR_PORT", 6007)),
 )
 async def rag_get_file_structure():
     """Returns list of names of uploaded videos saved on the server."""
```

```diff
@@ -616,7 +726,10 @@ async def rag_get_file_structure():


 @register_microservice(
-    name="opea_service@prepare_videodoc_redis", endpoint="/v1/dataprep/delete_files", host="0.0.0.0", port=6007
+    name="opea_service@prepare_videodoc_redis",
+    endpoint="/v1/dataprep/delete_files",
+    host="0.0.0.0",
+    port=int(os.getenv("DATAPREP_MMR_PORT", 6007)),
 )
 async def delete_files():
     """Delete all uploaded files along with redis index."""
```

comps/dataprep/multimodal/redis/langchain/requirements.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -11,6 +11,7 @@ opentelemetry-sdk
 Pillow
 prometheus-fastapi-instrumentator
 pydantic
+pymupdf
 python-multipart
 redis
 shortuuid
```

comps/embeddings/src/integrations/multimodal_bridgetower.py

Lines changed: 10 additions & 3 deletions
```diff
@@ -51,9 +51,13 @@ async def invoke(self, input: MultimodalDoc) -> EmbedMultimodalDoc:
             json["text"] = input.text
         elif isinstance(input, TextImageDoc):
             json["text"] = input.text.text
-            img_bytes = input.image.url.load_bytes()
-            base64_img = base64.b64encode(img_bytes).decode("utf-8")
-            json["img_b64_str"] = base64_img
+            if input.image.url:
+                img_bytes = input.image.url.load_bytes()
+                base64_img = base64.b64encode(img_bytes).decode("utf-8")
+            elif input.image.base64_image:
+                base64_img = input.image.base64_image
+            if base64_img:
+                json["img_b64_str"] = base64_img
         else:
             raise TypeError(
                 f"Unsupported input type: {type(input)}. "
```

```diff
@@ -71,6 +75,9 @@ async def invoke(self, input: MultimodalDoc) -> EmbedMultimodalDoc:
         elif isinstance(input, TextImageDoc):
             res = EmbedMultimodalDoc(text=input.text.text, url=input.image.url, embedding=embed_vector)

+        if base64_img:
+            res.base64_image = base64_img
+
         return res

     def check_health(self) -> bool:
```
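
With this change the embedding integration accepts either image source on a `TextImageDoc`; a sketch of both shapes (import path assumed from this repo's proto module, values are placeholders):

```python
from comps.cores.proto.docarray import ImageDoc, TextDoc, TextImageDoc  # assumed import path

# URL-backed image: the pre-existing path, where bytes are fetched and base64-encoded.
url_doc = TextImageDoc(text=TextDoc(text="a city skyline"), image=ImageDoc(url="https://example.com/img.png"))

# Base64-backed image: the new path, e.g. for an image already extracted from a PDF.
b64_doc = TextImageDoc(text=TextDoc(text="a chart from page 3"), image=ImageDoc(base64_image="iVBORw0KGgo..."))
```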

comps/lvms/src/integrations/dependency/llava/README.md

Lines changed: 17 additions & 1 deletion
```diff
@@ -1,6 +1,6 @@
 # LVM Microservice

-Visual Question and Answering is one of the multimodal tasks empowered by LVMs (Large Visual Models). This microservice supports visual Q&A by using LLaVA as the base large visual model. It accepts two inputs: a prompt and an image. It outputs the answer to the prompt about the image.
+Visual Question and Answering is one of the multimodal tasks empowered by LVMs (Large Visual Models). This microservice supports visual Q&A by using LLaVA as the base large visual model. It accepts two inputs: a prompt and images. It outputs the answer to the prompt about the images.

 ## 🚀1. Start Microservice with Python (Option 1)

```

````diff
@@ -73,7 +73,23 @@ docker run -p 8399:8399 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_M

 #### 2.2.2 Test

+> Note: The `MAX_IMAGES` environment variable is used to specify the maximum number of images that will be sent from the LVM service to the LLaVA server.
+> If an image list longer than `MAX_IMAGES` is sent to the LVM server, a shortened image list will be sent to the LLaVA service. If the image list
+> needs to be shortened, the most recent images (the ones at the end of the list) are prioritized to send to the LLaVA service. Some LLaVA models have not
+> been trained with multiple images and may lead to inaccurate results. If `MAX_IMAGES` is not set, it will default to `1`.
+
 ```bash
+# Use curl/python
+
+# curl with an image and a prompt
+http_proxy="" curl http://localhost:9399/v1/lvm -XPOST -d '{"image": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "prompt":"What is this?"}' -H 'Content-Type: application/json'
+
+# curl with multiple images and a prompt (Note that depending on your MAX_IMAGES value, both images may not be sent to the LLaVA model)
+http_proxy="" curl http://localhost:9399/v1/lvm -XPOST -d '{"image": ["iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mNkYPhfz0AEYBxVSF+FAP5FDvcfRYWgAAAAAElFTkSuQmCC", "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mNk+M9Qz0AEYBxVSF+FAAhKDveksOjmAAAAAElFTkSuQmCC"], "prompt":"What is in these images?"}' -H 'Content-Type: application/json'
+
+# curl with a prompt only (no image)
+http_proxy="" curl http://localhost:9399/v1/lvm -XPOST -d '{"image": "", "prompt":"What is deep learning?"}' -H 'Content-Type: application/json'
+
 # Test
 python check_llava_server.py
 ```
````
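
The truncation described in the note amounts to keeping the tail of the image list; a minimal sketch of that behavior (illustrative only, not the service's actual code):

```python
import os

MAX_IMAGES = int(os.getenv("MAX_IMAGES", 1))

def shorten_image_list(images: list) -> list:
    # Keep the most recent images (the end of the list) when over the limit.
    return images[-MAX_IMAGES:] if len(images) > MAX_IMAGES else images

print(shorten_image_list(["img1_b64", "img2_b64", "img3_b64"]))  # MAX_IMAGES=1 -> ['img3_b64']
```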
