Commit ac9c9c8

Turned back on the Marlin tests (vllm-project#121)
SUMMARY: Re-enables the Marlin tests, which had been skipped for running out of memory. The issue was that vllm was not tearing itself down properly. Calling the garbage collector explicitly seems to have resolved this in the short term. In general, we should get to the bottom of why vllm does not shut down cleanly.
TEST PLAN: Automation
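
The commit does not say why an explicit collection helps. One plausible explanation, an assumption rather than something the commit confirms, is that the engine objects sit in reference cycles, which CPython's reference counting cannot free on `del` alone. The sketch below shows that general behaviour using only the standard library; `Worker` and `is_alive_after_del` are hypothetical names and are not part of vllm.

```python
import gc
import weakref


class Worker:
    """Hypothetical stand-in for an object that holds a large resource."""

    def __init__(self):
        # The object refers to itself, which forms a reference cycle.
        self.owner = self


def is_alive_after_del(run_gc: bool) -> bool:
    """Drop the only external reference and report whether the object survives."""
    worker = Worker()
    probe = weakref.ref(worker)  # observes the object without keeping it alive
    del worker
    if run_gc:
        gc.collect()  # the cycle collector can reclaim the unreachable cycle
    return probe() is not None


gc.disable()  # keep the automatic collector from running in the middle of the demo
try:
    survives_plain_del = is_alive_after_del(run_gc=False)
    survives_with_gc = is_alive_after_del(run_gc=True)
finally:
    gc.enable()

print(survives_plain_del, survives_with_gc)
```

On CPython the object survives a plain `del` and is gone once `gc.collect()` has run. Whether this is what actually keeps the GPU memory alive in vllm is exactly the open question the summary raises.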
1 parent 66863b4 commit ac9c9c8

1 file changed

Lines changed: 10 additions & 7 deletions

tests/models/test_marlin.py

@@ -17,6 +17,7 @@
 
 import pytest
 import torch
+import gc
 from compare_utils import check_logprobs_close
 from dataclasses import dataclass
 from vllm.model_executor.layers.quantization import _QUANTIZATION_CONFIG_REGISTRY
@@ -45,7 +46,6 @@ class ModelPair:
 ]
 
 
-@pytest.mark.skip(reason="out of memory")
 @pytest.mark.flaky(reruns=2)
 @pytest.mark.skipif(marlin_not_supported,
                     reason="Marlin is not supported on this GPU type.")
@@ -67,24 +67,27 @@ def test_models(
     marlin_outputs = marlin_model.generate_greedy_logprobs(
         example_prompts, max_tokens, num_logprobs)
 
-    # Note: not sure why, but deleting just the model on Ada Lovelace
-    # does not free the GPU memory. On Ampere, deleting the just model
-    # frees the memory.
+    # vllm memory cleanup is poor. This seems to fix things.
+    # NOTE: upstream sync should use downstream version.
     del marlin_model.model.llm_engine.driver_worker
     del marlin_model
 
+    gc.collect()
+    torch.cuda.empty_cache()
+
     gptq_model = vllm_runner_nm(model_pair.model_gptq,
                                 dtype=dtype,
                                 max_model_len=MAX_MODEL_LEN)
     gptq_outputs = gptq_model.generate_greedy_logprobs(example_prompts,
                                                        max_tokens,
                                                        num_logprobs)
 
-    # Note: not sure why, but deleting just the model on Ada Lovelace
-    # does not free the GPU memory. On Ampere, deleting the just model
-    # frees the memory.
+    # vllm memory cleanup is poor. This seems to fix things.
+    # NOTE: upstream sync should use downstream version.
     del gptq_model.model.llm_engine.driver_worker
     del gptq_model
+    gc.collect()
+    torch.cuda.empty_cache()
 
     # loop through the prompts
     # use logprobs or else this will consistently run out of memory
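
The diff repeats the same two cleanup calls after each model is deleted. As a minimal sketch, and not something this commit does, they could be factored into one helper. `release_gpu_memory` is a hypothetical name, and the guarded `torch` import is an assumption added only so the sketch also runs on a machine without PyTorch or without a GPU.

```python
import gc

try:
    import torch  # optional here so the sketch runs without PyTorch installed
except ImportError:
    torch = None


def release_gpu_memory() -> int:
    """Run the cleanup the diff performs after each `del`.

    Returns the number of unreachable objects reported by gc.collect().
    """
    # Collect first, so tensors held only by unreachable objects are released.
    unreachable = gc.collect()
    # Then ask PyTorch to return its cached, now-unused CUDA blocks to the driver.
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    return unreachable


# Intended use inside the test, mirroring the diff:
#     del marlin_model.model.llm_engine.driver_worker
#     del marlin_model
#     release_gpu_memory()
```

The `del` statements have to stay at the call site, because a helper cannot remove names from its caller's scope. The order matters as well: `torch.cuda.empty_cache()` only releases cached memory that no live tensor is using, so it is called after the collection.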
