-
-
Notifications
You must be signed in to change notification settings - Fork 11.7k
[TPU][Bugfix] fix OOM issue in CI test #21550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Chengji Yao <[email protected]>
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses an OOM error in a TPU CI test by increasing the gpu_memory_utilization from 0.7 to 0.95. While this change may resolve the immediate issue, setting the memory utilization to such a high value introduces a risk of test flakiness and future maintenance problems. My review highlights this concern and suggests considering more robust solutions.
This reverts commit 40d86ee.
|
This change broke the CI test again. on main branch, one PR before this https://buildkite.com/vllm/ci/builds/24936 passed all tests. this change caused OOM issue. I will prepare a change to rollback this. |
Signed-off-by: Chengji Yao <[email protected]>
Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: x22x22 <[email protected]>
Signed-off-by: Chengji Yao <[email protected]>
Signed-off-by: Chengji Yao <[email protected]>
Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: Paul Pak <[email protected]>
Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: Diego-Castan <[email protected]>
Signed-off-by: Chengji Yao <[email protected]>
Purpose
Fix the tpu ci test: tests/v1/tpu/test_basic.py. In #21340, it got merged without running the TPU CI test.
Test Plan
pytest -s -v tests/v1/tpu/test_basic.py
Test Result
Passed
(Optional) Documentation Update