Conversation

@mru4913 mru4913 commented Feb 18, 2025

This is a follow-up to the Whisper transcription API endpoint, contributed by @DarkLight1337 and @NickLucche ([#12909], [#13301]).

Previously, the endpoint could not handle audio in an unknown language: such audio was always forced to be transcribed as English. This update introduces language detection, adapted from the transformers library. It is a temporary fix to support audio with unknown languages; a more robust and general solution will be implemented in the future.
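The detection step can be sketched as follows: run one decoding step from `<|startoftranscript|>`, restrict the logits to the language tokens, and take the argmax, as in the transformers approach. This is a minimal illustration only; the token ids and the helper below are illustrative, not vLLM's actual code.

```python
# Hedged sketch of Whisper-style language detection: given the logits of one
# decoder step started from "<|startoftranscript|>", pick the most probable
# language token. Token ids here are illustrative stand-ins.

LANG_TOKEN_IDS = {50259: "<|en|>", 50260: "<|zh|>", 50261: "<|de|>"}

def detect_language(logits: list[float]) -> str:
    """Return the language token with the highest logit."""
    best_id = max(LANG_TOKEN_IDS, key=lambda tok_id: logits[tok_id])
    return LANG_TOKEN_IDS[best_id]
```

The detected token can then be spliced into the decoder prompt for the actual transcription pass.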

Thanks to @DarkLight1337, @NickLucche, and others.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Feb 18, 2025
Collaborator

@NickLucche NickLucche left a comment


Thanks for your contribution! :)
I think we should take the chance to do some bookkeeping and organize the code around these constants a bit better; I've left some comments.

Also, we will need appropriate tests for this feature.

Comment on lines +312 to +318
if (
    "v3" in self.model_config.model.lower()
    and self.model_config.hf_config.model_type.lower() == "whisper"
):
    id2token = LANG_ID_TO_LANG_TOKEN["whisper-v3"]
else:
    return default_lang_token
Collaborator

@NickLucche NickLucche Feb 18, 2025


We can't have an approximate check here; I'd rather build the mapping dynamically so it covers all Whisper models that share the same tokenizer content. See the comment below for a cleaner alternative.
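The dynamic approach could look roughly like this: derive the language-token mapping from the tokenizer vocabulary itself rather than matching "v3" in the model name, so any Whisper checkpoint shipping those tokens is covered. The vocabulary below is a toy stand-in for something like `tokenizer.get_vocab()`; the helper name is illustrative.

```python
import re

# Hedged sketch: build id -> language-token mapping from the vocabulary.
# Whisper language tokens look like "<|en|>" or "<|yue|>" (2-3 letter codes),
# which distinguishes them from control tokens like "<|transcribe|>".
LANG_TOKEN_RE = re.compile(r"^<\|([a-z]{2,3})\|>$")

def build_lang_token_map(vocab: dict[str, int]) -> dict[int, str]:
    """Map token id -> language token for every <|xx|> entry in the vocab."""
    return {
        tok_id: token
        for token, tok_id in vocab.items()
        if LANG_TOKEN_RE.match(token)
    }
```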

Author


Agreed. It is better to wait for an officially supported language config and id-to-token mapping.

"bo": "Tibetan",
}

LANG_ID_TO_LANG_TOKEN = {
Collaborator


I think we need some bookkeeping with this extra dict here, @DarkLight1337. How about we use the SupportsTranscription interface to query the model class for a mapping of supported languages?
Otherwise this will quickly become a mess as soon as another model gets added.
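The interface-based alternative could be sketched like this: each transcription-capable model class reports its own language map, and the server queries the class instead of a module-level dict keyed by model name. All names below are illustrative; vLLM's real SupportsTranscription interface may differ.

```python
# Hedged sketch, assuming a Protocol-style interface. Not vLLM's actual API.
from typing import Protocol

class SupportsTranscription(Protocol):
    @classmethod
    def supported_languages(cls) -> dict[str, str]:
        """Return ISO language code -> language token for this model family."""
        ...

class WhisperForConditionalGeneration:
    # Toy subset of Whisper's language tokens for illustration.
    @classmethod
    def supported_languages(cls) -> dict[str, str]:
        return {"en": "<|en|>", "zh": "<|zh|>", "bo": "<|bo|>"}

def get_lang_token(model_cls, lang: str, default: str) -> str:
    """Ask the model class for its language token, falling back to a default."""
    return model_cls.supported_languages().get(lang, default)
```

This keeps per-model knowledge inside the model class, so adding another ASR model means implementing the interface rather than extending a shared constant.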

Author

@mru4913 mru4913 Feb 18, 2025


If we want to support more ASR models, we need a better way to structure these mappings.

Member


How about we use the SupportsTranscription interface to query the model class for a mapping of supported languages?

Sounds good to me!

Comment on lines +322 to +332
PromptType,
{
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {
            "audio": audio_data,
        },
    },
    "decoder_prompt": "<|startoftranscript|>",
},
)
Collaborator


On second thought, it would be nice to avoid running the encoder twice. Ideally, one could start generation with the same decoder prompt as here, have the model predict a language token, and then continue by forcing the subsequent tokens to be <|lang|><|transcribe|><|notimestamps|>{prompt}.
But I see how this would be invasive to implement.

We could re-use the encoder KV cache, which is already populated here, but how do we make sure it's not evicted? Would re-using the same request id here clash with the scheduler policy? @DarkLight1337
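The forced-token sequence described above can be sketched as a small helper that plans the decoder tokens once the language token has been predicted. The control-token strings come from the Whisper vocabulary; the helper and its name are illustrative, not vLLM's API.

```python
# Hedged sketch of the single-pass flow: after the model predicts a language
# token from "<|startoftranscript|>", the remaining control tokens are forced
# rather than re-running the encoder for a second request.

FORCED_AFTER_LANG = ["<|transcribe|>", "<|notimestamps|>"]

def plan_decoder_tokens(predicted_lang_token: str) -> list[str]:
    """Full forced decoder prefix once the language token is known."""
    return ["<|startoftranscript|>", predicted_lang_token, *FORCED_AFTER_LANG]
```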

Author

@mru4913 mru4913 Feb 18, 2025


That's why it is a temporary fix. The current method does not take full advantage of vLLM's features.

Member


I suggest not relying on the internals of KV cache for now as it may change significantly in V1.

Collaborator


I assume this would target V0, though; encoder-decoder models are still far from being supported in V1, AFAIK (#12761).

Member


Oh, I forgot about that. I guess it's fine then.

We could re-use the encoder kv cache which is already populated here, but how do we make sure it's not evicted, would re-using the same request id here clash with scheduler policy?

@robertgshaw2-redhat @ywang96 might have a better idea regarding this.

Author

mru4913 commented Feb 19, 2025

Thanks for your contribution! :) I think we should take the chance to do some bookkeeping and organize the code around constants a bit better, left some comments.

Also, we will need appropriate tests for this feature.


I've added a simple test case.


mergify bot commented May 19, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @mru4913.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 19, 2025

francis2tm commented May 19, 2025

Hello,
Is this PR stalled? I'm willing to open a new PR if required.

I also need this feature so that transcription works across multiple languages :)

@esceptico

Hey!
Are there any updates?


github-actions bot commented Sep 2, 2025

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Sep 2, 2025

github-actions bot commented Oct 3, 2025

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

@github-actions github-actions bot closed this Oct 3, 2025
