-
Notifications
You must be signed in to change notification settings - Fork 2.4k
[model] fix: stuck issue with mixed text-image data #3670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
On top of #3496 Possibly resolve #3665 This PR continue addressing the hanging issue as mentioned here: #3315 (comment) It should be pretty safe and optional to do `model.training` checks even the model is in inference mode, as the value of `inputs_embeds` won't get modified anyway. Signed-off-by: Hollow Man <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix a hanging issue with mixed text-image data by ensuring the vision part of the model is always exercised, even for text-only batches. This is achieved by removing the model.training condition, which triggers a dummy vision forward pass during inference as well. While this is a valid strategy to prevent deadlocks in distributed environments, it introduces a performance overhead for non-distributed inference. My review suggests refining this condition to apply the workaround only when in a distributed context, thus preserving performance for single-device inference scenarios.
|
I encountered this problem as well: the model hangs in |
What does this PR do?
On top of #3496
Possibly resolve #3665
This PR continue addressing the hanging issue as mentioned here: #3315 (comment)
It should be pretty safe and optional to do
model.trainingchecks even the model is in inference mode, as the value ofinputs_embedswon't get modified anyway.Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)