-
Notifications
You must be signed in to change notification settings - Fork 124
Fix Multimodal Gemini Batch Request Creation #690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix Multimodal Gemini Batch Request Creation #690
Conversation
|
@alexisdrakopoulos Thanks for the contribution, please extend the existing tests at |
kartik4949
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, lets do the required changes and add an integration test for this
Thanks
| content = message["content"] | ||
|
|
||
| # Primary Path: Handle the multimodal dictionary from curator | ||
| if isinstance(content, dict) and ("texts" in content or "images" in content): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you need the "texts" or "images" check?
| parts.append({"text": text_item}) | ||
|
|
||
| # Add all image parts from the 'images' list | ||
| for image_item in content.get("images", []): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are "files" as well
| # Fallback to guessing the mime type if not provided | ||
| mime_type, _ = mimetypes.guess_type(url) | ||
| if not mime_type: | ||
| logger.warning(f"Could not determine MIME type for {url}. Defaulting to 'image/png'.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.debug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's a good idea to only log debug-level messages when the mime_type is missing, as this makes it hard for people to notice the problem—even though other API implementations might not log anything at all. I believe that at the very least a warning should be issued when mime_type is missing, or even an exception should be thrown.
| if generic_request.response_format: | ||
| request_object.update( | ||
| # Ensure generationConfig exists before trying to update it | ||
| if "generationConfig" not in request_object: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are creating request_object with "contents" key only
how will this if statement hit?
What does this PR do?
This pull request adds support for multimodal inputs (specifically text and images) to the
GeminiBatchRequestProcessor. The existing implementation only handled text-based requests, causing errors when acurator.types.Imagewas included in the prompt.This change allows users to leverage Gemini's powerful vision capabilities, such as performing OCR on a large batch of images, through the
curatorlibrary's batch processing interface.Why is this change needed?
Previously, attempting to run a batch job with an image and text resulted in a
400 INVALID_ARGUMENTerror from the Gemini API. The root cause was that the request processor was not correctly formatting the multimodal payload. It tried to serialize the entire content dictionary into a{"text": ...}field, which is invalid.The Gemini Batch API requires multimodal inputs to be structured as a list of
partswithin a singlecontentsobject, where each part is either text ({"text": ...}) or file data ({"fileData": ...}). This PR implements that required structure.How was this change implemented?
The core of the change is within the
create_api_specific_request_batchmethod:message["content"]is the dictionary structure thatcuratorcreates for multimodal prompts (e.g.,{'texts': [...], 'images': [...]}).textsandimageslists within the content dictionary.{"text": "..."}part.{"fileData": {"fileUri": "...", "mimeType": "..."}}part. Themimetypeis automatically detected from the image URI.partsarray, creating a valid request payload for the Gemini Batch API.A fallback for legacy text-only requests is maintained for backward compatibility.
How to test this change?
A reviewer can test this by running a batch job with a
curator.LLMclass that returns both text and an image from itspromptmethod.Example: