Fix Multimodal Gemini Batch Request Creation #690

alexisdrakopoulos · 2025-06-13T09:51:24Z

What does this PR do?

This pull request adds support for multimodal inputs (specifically text and images) to the GeminiBatchRequestProcessor. The existing implementation only handled text-based requests, causing errors when a curator.types.Image was included in the prompt.

This change allows users to leverage Gemini's powerful vision capabilities, such as performing OCR on a large batch of images, through the curator library's batch processing interface.

Why is this change needed?

Previously, attempting to run a batch job with an image and text resulted in a 400 INVALID_ARGUMENT error from the Gemini API. The root cause was that the request processor was not correctly formatting the multimodal payload. It tried to serialize the entire content dictionary into a {"text": ...} field, which is invalid.

The Gemini Batch API requires multimodal inputs to be structured as a list of parts within a single contents object, where each part is either text ({"text": ...}) or file data ({"fileData": ...}). This PR implements that required structure.

How was this change implemented?

The core of the change is within the create_api_specific_request_batch method:

Inspects Content Type: The method now checks if message["content"] is the dictionary structure that curator creates for multimodal prompts (e.g., {'texts': [...], 'images': [...]}).
Parses Multimodal Dictionary: It iterates through the texts and images lists within the content dictionary.
Builds Correct Payload:
- For each text string, it creates a {"text": "..."} part.
- For each image, it creates a {"fileData": {"fileUri": "...", "mimeType": "..."}} part. The mimetype is automatically detected from the image URI.
Assembles Parts: All generated parts are collected into a single parts array, creating a valid request payload for the Gemini Batch API.

A fallback for legacy text-only requests is maintained for backward compatibility.

How to test this change?

A reviewer can test this by running a batch job with a curator.LLM class that returns both text and an image from its prompt method.

Example:

import curator
from datasets import Dataset

# Prerequisite: An image must be uploaded to a GCS bucket.
# For example: gs://my-bucket/images/page-1.png

class PageOCR(curator.LLM):
    """A page OCR that extracts text from an image of a page."""
    def prompt(self, input: dict) -> list:
        prompt_text = "Please perform OCR on this image."
        image_url: str = input["image_path"] # e.g., "gs://my-bucket/images/page-1.png"
        return prompt_text, curator.types.Image(url=image_url)

    def parse(self, input: dict, response: str) -> dict:
        return {"ocr_text": response}

# Configure with Gemini backend and batch=True
llm = PageOCR(
    model_name="gemini-1.5-flash-latest", # Or another vision-capable model
    backend="gemini",
    batch=True
)

# Create a dataset with GCS image paths
dataset = Dataset.from_dict({
    "image_path": ["gs://my-bucket/images/page-1.png", "gs://my-bucket/images/page-2.png"]
})

# Run the batch job
result_dataset = llm(dataset)

# The batch job should now be submitted successfully to Vertex AI without errors.

kartik4949 · 2025-06-13T11:16:55Z

@alexisdrakopoulos Thanks for the contribution,
Could you add a test for this?
or let us know if you want us to write the tests for this

please extend the existing tests at tests/integrations/* with vcr cassettes.
Thanks

kartik4949

Looking good, lets do the required changes and add an integration test for this
Thanks

kartik4949 · 2025-06-13T11:38:46Z

src/bespokelabs/curator/request_processor/batch/gemini_batch_request_processor.py

+            content = message["content"]
+
+            # Primary Path: Handle the multimodal dictionary from curator
+            if isinstance(content, dict) and ("texts" in content or "images" in content):


do you need the "texts" or "images" check?

kartik4949 · 2025-06-13T11:38:59Z

src/bespokelabs/curator/request_processor/batch/gemini_batch_request_processor.py

+                        parts.append({"text": text_item})
+
+                # Add all image parts from the 'images' list
+                for image_item in content.get("images", []):


there are "files" as well

kartik4949 · 2025-06-13T11:39:59Z

src/bespokelabs/curator/request_processor/batch/gemini_batch_request_processor.py

+                        # Fallback to guessing the mime type if not provided
+                        mime_type, _ = mimetypes.guess_type(url)
+                        if not mime_type:
+                            logger.warning(f"Could not determine MIME type for {url}. Defaulting to 'image/png'.")


logger.debug

I don't think it's a good idea to only log debug-level messages when the mime_type is missing, as this makes it hard for people to notice the problem—even though other API implementations might not log anything at all. I believe that at the very least a warning should be issued when mime_type is missing, or even an exception should be thrown.

kartik4949 · 2025-06-13T11:41:54Z

src/bespokelabs/curator/request_processor/batch/gemini_batch_request_processor.py

        if generic_request.response_format:
-            request_object.update(
+            # Ensure generationConfig exists before trying to update it
+            if "generationConfig" not in request_object:


we are creating request_object with "contents" key only
how will this if statement hit?

fix

058bd96

kartik4949 requested changes Jun 13, 2025

View reviewed changes

This was referenced Jun 13, 2025

Gemini Batch Images not working, working in normal mode #687

Open

Serialization error with Gemini Batch #664

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix Multimodal Gemini Batch Request Creation #690

Fix Multimodal Gemini Batch Request Creation #690

Uh oh!

alexisdrakopoulos commented Jun 13, 2025

Uh oh!

kartik4949 commented Jun 13, 2025 •

edited

Loading

Uh oh!

kartik4949 left a comment

Uh oh!

kartik4949 Jun 13, 2025

Uh oh!

kartik4949 Jun 13, 2025

Uh oh!

kartik4949 Jun 13, 2025

Uh oh!

xingyi-Starry Jul 13, 2025

Uh oh!

kartik4949 Jun 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix Multimodal Gemini Batch Request Creation #690

Are you sure you want to change the base?

Fix Multimodal Gemini Batch Request Creation #690

Uh oh!

Conversation

alexisdrakopoulos commented Jun 13, 2025

What does this PR do?

Why is this change needed?

How was this change implemented?

How to test this change?

Uh oh!

kartik4949 commented Jun 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kartik4949 left a comment

Choose a reason for hiding this comment

Uh oh!

kartik4949 Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

kartik4949 Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

kartik4949 Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

xingyi-Starry Jul 13, 2025

Choose a reason for hiding this comment

Uh oh!

kartik4949 Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kartik4949 commented Jun 13, 2025 •

edited

Loading