@princepride
Contributor

@princepride princepride commented Aug 20, 2025

Purpose

FIX #18850.
This PR adds support for Donut-like models.

It implements the Donut model and the structurally similar Dolphin model. Since the Donut model uses a Swin Transformer as its vision backbone, this PR also includes the implementation for the Swin model.
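
For readers unfamiliar with Swin, its central idea is to restrict self-attention to non-overlapping local windows of the feature map. The snippet below is a minimal, dependency-free sketch of that window-partitioning step only. It is not the code from this PR: the function name `window_partition` and the nested-list representation are assumptions chosen for illustration, and real implementations operate on tensors and pad the feature map when needed.

```python
from typing import List, TypeVar

T = TypeVar("T")


def window_partition(grid: List[List[T]], window_size: int) -> List[List[T]]:
    """Split an H x W grid into non-overlapping window_size x window_size windows.

    Each window is returned flattened in row-major order, and windows are
    ordered left to right, top to bottom.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    height = len(grid)
    width = len(grid[0]) if height else 0
    if height % window_size or width % window_size:
        # Real models pad the feature map first; this sketch simply rejects it.
        raise ValueError("grid dimensions must be divisible by window_size")

    windows = []
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            window = [
                grid[row][col]
                for row in range(top, top + window_size)
                for col in range(left, left + window_size)
            ]
            windows.append(window)
    return windows


# Hypothetical 4 x 4 feature map whose entries are just position indices.
feature_map = [[row * 4 + col for col in range(4)] for row in range(4)]
windows = window_partition(feature_map, window_size=2)
```

With a 4 x 4 map and a window size of 2, this yields four windows of four positions each. Attention would then be computed within each window independently.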

Test Plan

Donut Model Test

Script

python examples/offline_inference/encoder_decoder_multimodal.py -m donut

Result

[Screenshot of the Donut output; image not included]

Dolphin Model Tests

Script

python examples/offline_inference/dolphin.py

Result

[Screenshot of the Dolphin output, 2025-08-22 11:03:30 AM; image not included]

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@mergify mergify bot added the documentation, multi-modality, new-model, and v1 labels Aug 20, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for Donut-like models, including Donut, Dolphin, and the Swin Transformer backbone. The changes are extensive, adding new model implementations and example inference scripts. My review focuses on correctness and potential bugs. I've identified a critical bug in the Donut model implementation that would cause a TypeError during image validation, and a high-severity issue in the Dolphin example script that could lead to a ZeroDivisionError. I've provided code suggestions to fix both issues.

@princepride
Contributor Author

@DarkLight1337 @Isotr0py Could you review it? I messed up the previous PR (#23187) during a rebase, so I've created a new one. Thank you.

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Member

Let's add examples in examples/offline_inference/encoder_decoder_multimodal.py and examples/offline_inference/vision_language.py instead of adding new files in examples/offline_inference/.

Contributor Author

I looked at the images in the S3 bucket, and none of them seem suitable for OCR tasks. This matters most for the Dolphin model, whose OCR task works like a workflow: the prompt determines whether the document is segmented or parsed, and the parsing tags then determine whether text or icons are parsed. That's why I've added two example files.
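
The prompt-driven workflow described in this comment can be sketched roughly as follows. This is only an illustration of the dispatch idea, not the code in the example script: the function `select_prompt`, the stage names, the tag names, and every prompt string are hypothetical placeholders rather than Dolphin's actual prompts or tags.

```python
from typing import Dict, Optional

# Placeholder prompts: these strings are NOT Dolphin's real prompts.
SEGMENT_PROMPT = "<placeholder: segment the page into tagged elements>"
PARSE_PROMPTS: Dict[str, str] = {
    "text": "<placeholder: read the text in this element>",
    "icon": "<placeholder: describe the icon in this element>",
}


def select_prompt(stage: str, tag: Optional[str] = None) -> str:
    """Pick the prompt for one step of a hypothetical two-stage OCR workflow."""
    if stage == "segment":
        # Stage 1: a single prompt asks the model to split the page into elements.
        return SEGMENT_PROMPT
    if stage == "parse":
        # Stage 2: the tag attached to each element decides how it is parsed.
        if tag not in PARSE_PROMPTS:
            raise ValueError(f"unknown element tag: {tag!r}")
        return PARSE_PROMPTS[tag]
    raise ValueError(f"unknown stage: {stage!r}")


# Stage 1 uses one prompt for the whole page; stage 2 uses one prompt per element.
page_prompt = select_prompt("segment")
element_prompts = [select_prompt("parse", tag) for tag in ("text", "icon")]
```

Because the second stage depends on tags produced by the first, a single fixed prompt and image pair cannot exercise the whole flow, which is the constraint the comment is pointing at.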

Contributor Author

I now use fetch_image to load the image, and I moved the Donut example into encoder_decoder_multimodal.py.

Contributor Author

I also removed the --task argument from the Dolphin example.

Member

It looks like the standalone Dolphin example still got merged. We don't really want model-specific examples, as they clutter the examples directory and make it harder for new users to find what they need.

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@princepride
Contributor Author

@Isotr0py @DarkLight1337 I adjusted the example code and updated the test plan. Please review it. Thank you.

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@DarkLight1337
Member

Thanks!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) August 24, 2025 10:38
@DarkLight1337 DarkLight1337 merged commit 416f059 into vllm-project:main Aug 24, 2025
42 checks passed
Labels

documentation, multi-modality, new-model, ready, v1

Development

Successfully merging this pull request may close these issues.

[New Model]: ByteDance/Dolphin

4 participants