[TPU][Bugfix] fix OOM issue in CI test #21550

yaochengji · 2025-07-24T18:37:17Z

Purpose

Fix the TPU CI test tests/v1/tpu/test_basic.py. #21340 was merged without running the TPU CI test.

Test Plan

pytest -s -v tests/v1/tpu/test_basic.py

Test Result

Passed

(Optional) Documentation Update

Signed-off-by: Chengji Yao <[email protected]>

github-actions · 2025-07-24T18:37:25Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This pull request addresses an OOM error in a TPU CI test by increasing the gpu_memory_utilization from 0.7 to 0.95. While this change may resolve the immediate issue, setting the memory utilization to such a high value introduces a risk of test flakiness and future maintenance problems. My review highlights this concern and suggests considering more robust solutions.
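To make the headroom concern concrete, here is a minimal arithmetic sketch. It assumes the setting is treated as the fraction of device memory the engine may claim. The 16 GiB device size and the helper name `memory_headroom_gib` are illustrative assumptions, not values from this PR; only the 0.7 and 0.95 fractions come from the change under review.

```python
def memory_headroom_gib(total_gib: float, utilization: float) -> float:
    """Return the memory left outside the budget for a given utilization fraction."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return total_gib * (1.0 - utilization)


# Hypothetical device size, chosen only for illustration.
TOTAL_GIB = 16.0

for fraction in (0.7, 0.95):
    headroom = memory_headroom_gib(TOTAL_GIB, fraction)
    print(f"gpu_memory_utilization={fraction}: {headroom:.2f} GiB left unbudgeted")
```

Under these assumptions, moving from 0.7 to 0.95 shrinks the unbudgeted memory by a factor of six, which is why anything allocated outside the budget has much less room.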

tests/v1/tpu/test_basic.py

QiliangCui · 2025-07-25T14:47:52Z

This change broke the CI test again.

On the main branch, the build one PR before this one, https://buildkite.com/vllm/ci/builds/24936, passed all tests.

This change caused an OOM issue.

ERROR 07-25 07:24:51 [core.py:634] RuntimeError: Bad StatusOr access: RESOURCE_EXHAUSTED: Error loading program: Attempting to reserve 1.20G at the bottom of memory. That was not possible. There are 772.53M free, 0B reserved, and 772.53M reservable.

I will prepare a change to roll this back.
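The size of the gap in the RESOURCE_EXHAUSTED message quoted above can be worked out with a short sketch. The helper names `to_bytes` and `shortfall_mib` are hypothetical, and the sketch assumes the `G`/`M` suffixes denote binary units (GiB/MiB), which the log does not state.

```python
import re

# Multipliers for the size suffixes in the quoted message (binary units assumed).
_UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def to_bytes(size: str) -> float:
    """Convert a string such as '1.20G' or '0B' to a byte count."""
    match = re.fullmatch(r"([\d.]+)([BKMG])", size)
    if match is None:
        raise ValueError(f"unrecognized size: {size!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def shortfall_mib(message: str) -> float:
    """Return requested minus free memory, in MiB, from the error text."""
    requested = re.search(r"reserve ([\d.]+[BKMG])", message)
    free = re.search(r"There are ([\d.]+[BKMG]) free", message)
    if requested is None or free is None:
        raise ValueError("message does not match the expected format")
    gap = to_bytes(requested.group(1)) - to_bytes(free.group(1))
    return gap / 1024**2


# Error text copied from the CI log quoted in this comment.
ERROR = (
    "RESOURCE_EXHAUSTED: Error loading program: Attempting to reserve 1.20G "
    "at the bottom of memory. That was not possible. There are 772.53M free, "
    "0B reserved, and 772.53M reservable."
)

print(f"{shortfall_mib(ERROR):.2f} MiB short")
```

Under the binary-unit assumption, the program load needed roughly 456 MiB more than was free.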

[TPU][Bugfix] fix OOM issue in CI test

62562ad

Signed-off-by: Chengji Yao <[email protected]>

mergify bot added the v1 and tpu (Related to Google TPUs) labels Jul 24, 2025

gemini-code-assist bot reviewed Jul 24, 2025

yaochengji added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jul 24, 2025

DarkLight1337 approved these changes Jul 25, 2025

vllm-bot merged commit 40d86ee into vllm-project:main Jul 25, 2025
55 of 57 checks passed

yaochengji added a commit that referenced this pull request Jul 25, 2025

Revert "[TPU][Bugfix] fix OOM issue in CI test (#21550)"

c0a8db4

This reverts commit 40d86ee.

yaochengji mentioned this pull request Jul 25, 2025

Revert "[TPU][Bugfix] fix OOM issue in CI test" #21589

Closed

QiliangCui mentioned this pull request Jul 25, 2025

[TPU][Test] Rollback PR-21550. #21619

Merged

liuyumoye pushed a commit to liuyumoye/vllm that referenced this pull request Jul 31, 2025

[TPU][Bugfix] fix OOM issue in CI test (vllm-project#21550)

b3ef52b

Signed-off-by: Chengji Yao <[email protected]>

x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025

[TPU][Bugfix] fix OOM issue in CI test (vllm-project#21550)

83ae2b9

Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: x22x22 <[email protected]>

Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025

[TPU][Bugfix] fix OOM issue in CI test (vllm-project#21550)

1031999

Signed-off-by: Chengji Yao <[email protected]>

npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025

[TPU][Bugfix] fix OOM issue in CI test (vllm-project#21550)

a43fe56

Signed-off-by: Chengji Yao <[email protected]>

jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025

[TPU][Bugfix] fix OOM issue in CI test (vllm-project#21550)

10b5a35

Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>

paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025

[TPU][Bugfix] fix OOM issue in CI test (vllm-project#21550)

99f62c3

Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: Paul Pak <[email protected]>

diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025

[TPU][Bugfix] fix OOM issue in CI test (vllm-project#21550)

c13c4ff

Signed-off-by: Chengji Yao <[email protected]> Signed-off-by: Diego-Castan <[email protected]>

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025

[TPU][Bugfix] fix OOM issue in CI test (vllm-project#21550)

08b5cd9

Signed-off-by: Chengji Yao <[email protected]>
