@tjohnson31415
Cherry-pick of fix commit 6100f4b from ODH:
opendatahub-io/vllm#17


@njhill merged commit 06d9876 into main May 8, 2024
@njhill deleted the fix-stats branch May 8, 2024 23:18
tdoublep pushed a commit that referenced this pull request Jan 20, 2025
This PR implements support for `batch size > 1` and tracks the progress
of warming up multiple different
`prompt-length/max-decode/batch-size` shapes.

### Contributions:

- Introduce an env var and interpret `BATCH_SIZE` as a list of values
(similar to `MIN_PAD_LENGTH` and `MAX_NEW_TOKENS`); see the first sketch
after this list
- Adapt the warmup loop to iterate over the zipped lists of **pad length**,
**max new tokens**, and **batch size**
- Support a batch dimension for the input arguments (tokens, positions, masks)
in the warmup algorithm
- Add batch-dimension support to the attention-mask update function
(`update_mask()` in sendnn.py); see the second sketch after this list
- Alter the test scripts to work with `batch size > 1`
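
A minimal sketch of the env-var handling and the zipped warmup loop. The
helper `_parse_int_list`, the stub `warmup_shape`, and the default values are
hypothetical; only the env var names and the zip-based iteration over the
three lists come from this PR.

```python
import os

def _parse_int_list(name: str, default: str) -> list[int]:
    # Hypothetical helper: read a comma-separated env var as a list of ints,
    # so BATCH_SIZE is parsed the same way as MIN_PAD_LENGTH / MAX_NEW_TOKENS.
    return [int(v) for v in os.getenv(name, default).split(",")]

def warmup_shape(pad_len: int, max_decode: int, batch_size: int) -> None:
    # Stand-in for the real warmup: compile and run the model once
    # at this (pad length, max decode, batch size) shape.
    print(f"warming up: pad={pad_len} decode={max_decode} batch={batch_size}")

pad_lengths = _parse_int_list("MIN_PAD_LENGTH", "64")
max_new_tokens = _parse_int_list("MAX_NEW_TOKENS", "20")
batch_sizes = _parse_int_list("BATCH_SIZE", "1")

# One compiled shape per zipped (pad length, max decode, batch size) triple.
for pad_len, max_decode, batch_size in zip(pad_lengths, max_new_tokens, batch_sizes):
    warmup_shape(pad_len, max_decode, batch_size)
```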
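
And a sketch of what batch-dimension support in the mask update could look
like, assuming an additive attention mask of shape `[batch, tgt_len, src_len]`
where 0 means "attend"; the actual tensor layout and semantics of
`update_mask()` in sendnn.py may differ.

```python
import torch

def update_mask(mask: torch.Tensor) -> torch.Tensor:
    # Append one unmasked (zero) column for the newly generated token,
    # preserving the leading batch dimension:
    # [batch, tgt_len, src_len] -> [batch, tgt_len, src_len + 1]
    batch, tgt_len, _ = mask.shape
    new_col = torch.zeros(batch, tgt_len, 1, dtype=mask.dtype, device=mask.device)
    return torch.cat([mask, new_col], dim=-1)
```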

#### The code has been tested in the following settings:

- On **CPU**: `batch size = 4` and `batch size = 8` with
`torch.compile(backend="inductor")`
- On **AIU**: `batch size = 1`

in both **offline** and **online** mode.


### Open questions (including unaddressed questions from
[PR23](https://github.ibm.com/ai-foundation/vllm/pull/23)):
- [x] verify code functionality for `batch size = 4` and `batch size =
8` on **AIU**
- [ ] ideally, the `SENDNNWorker` would check how many compiled shapes fit in
AIU memory before warming all of them up. It is unclear how to decide this,
and the implementation is missing.
- [ ] How should requests that are too long be handled? Right now they are
simply cut to the maximum padding length (we should probably fail the request
and inform the client; see the sketch after this list)
- [ ] verify output of example prompts
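
A minimal sketch of the proposed handling of over-long requests, failing fast
instead of truncating. The function name, signature, and call site are
hypothetical; the PR only raises this as an open question.

```python
def validate_prompt_length(prompt_len: int, max_pad_length: int) -> None:
    # Hypothetical check: reject prompts longer than the largest warmed-up
    # padding length instead of silently cutting them to fit.
    if prompt_len > max_pad_length:
        raise ValueError(
            f"prompt length {prompt_len} exceeds the maximum supported "
            f"padded length {max_pad_length}"
        )
```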
