Commit 892c090

Change profile Run batch based on max_seq_len (vllm-project#415)
Changes the profile_run batch shape so that it is derived from the maximum sequence length. This avoids padding during **prepare_prompt**, and so avoids breaking the constraint batch_size * seq_len <= max_num_batched_tokens.

In the current profile_run logic, max_batch_size takes precedence. For example, with max_batch_size = 256, max_num_batched_tokens = 2048, block_size = 128, and max_seq_len = 1024:

- With the current logic, max_seq_len is reduced to 8 (2048 // 256). However, **prepare_prompt** pads seq_len to 128, so batch_size * seq_len becomes 256 * 128 = 32768 > max_num_batched_tokens, violating the constraint above.
- With the updated logic, max_seq_len stays at 1024 and max_batch_size is computed as 2 (2048 // 1024). No padding is needed in **prepare_prompt**, so the constraint holds.

Fixes: HabanaAI#405
1 parent 3af4b6c commit 892c090

1 file changed

Lines changed: 2 additions & 4 deletions

vllm/worker/hpu_model_runner.py

@@ -1306,10 +1306,8 @@ def create_dummy_seq_group_metadata(self,
     def profile_run(self) -> None:
         num_layers = self.model_config.get_num_layers(self.parallel_config)
         kv_caches = [None] * num_layers
-        max_batch_size = self.bucketing_global_state.prompt_bs_bucket_cfg[-1]
-        max_seq_len = min(
-            self.bucketing_global_state.prompt_seq_bucket_cfg[-1],
-            self.max_num_batched_tokens // max_batch_size)
+        max_seq_len = self.bucketing_global_state.prompt_seq_bucket_cfg[-1]
+        max_batch_size = self.max_num_batched_tokens // max_seq_len
 
         self.warmup_scenario(max_batch_size, max_seq_len, True, kv_caches,
                              False, True)
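
The arithmetic in the commit message can be checked with a small standalone sketch. It is not part of vLLM: the helper names (`round_up`, `old_profile_shape`, `new_profile_shape`, `padded_tokens`) are hypothetical, and padding is modelled as rounding seq_len up to a multiple of block_size, which is an assumption based on the "padded to 128" remark in the commit message.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple (assumed model of prompt padding)."""
    return -(-value // multiple) * multiple


def old_profile_shape(max_bs_bucket: int, max_seq_bucket: int,
                      max_num_batched_tokens: int) -> tuple[int, int]:
    """Pre-change logic: batch size takes precedence, seq_len is shrunk to fit."""
    max_batch_size = max_bs_bucket
    max_seq_len = min(max_seq_bucket, max_num_batched_tokens // max_batch_size)
    return max_batch_size, max_seq_len


def new_profile_shape(max_seq_bucket: int,
                      max_num_batched_tokens: int) -> tuple[int, int]:
    """Post-change logic: seq_len takes precedence, batch size is derived."""
    max_seq_len = max_seq_bucket
    max_batch_size = max_num_batched_tokens // max_seq_len
    return max_batch_size, max_seq_len


def padded_tokens(batch_size: int, seq_len: int, block_size: int) -> int:
    """Token count after seq_len is padded up to a multiple of block_size."""
    return batch_size * round_up(seq_len, block_size)


# Values taken from the example in the commit message.
MAX_BS, MAX_SEQ, MAX_TOKENS, BLOCK = 256, 1024, 2048, 128

old_bs, old_seq = old_profile_shape(MAX_BS, MAX_SEQ, MAX_TOKENS)
new_bs, new_seq = new_profile_shape(MAX_SEQ, MAX_TOKENS)

old_total = padded_tokens(old_bs, old_seq, BLOCK)
new_total = padded_tokens(new_bs, new_seq, BLOCK)

print(f"old: bs={old_bs}, seq={old_seq}, padded tokens={old_total}")
print(f"new: bs={new_bs}, seq={new_seq}, padded tokens={new_total}")
print(f"old within budget: {old_total <= MAX_TOKENS}")
print(f"new within budget: {new_total <= MAX_TOKENS}")
```

Under these assumptions the old shape pads out to 32768 tokens, well past the 2048-token budget, while the new shape lands on the budget exactly because 1024 is already a multiple of the block size.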
