Description
I'm trying to serve a Llama-2-70b-chat-hf model using Triton inference server with a TRT-LLM engine. The script I used is tools/inflight_batcher_llm/end_to_end_streaming_client.py:
python3 tools/inflight_batcher_llm/end_to_end_streaming_client.py -p "What is deep learning?" -S -o 64
This script streams the generated tokens as bytes. I changed the callback function so that it prints strings:
print(output[0].decode(), flush=True, end="")
However, the output becomes:
Deeplearningisasubsetofmachinelearningthatinvolvestheuseofartificialneuralnetworkstomodelandsolvecomplexproblems.Inadeeplearningsystem,therearetypicallymultiplelayersofneuralnetworksthatprocessandtransformthedatainahierarchicalmanner.Eachlayerbuildsonthepreviousone,allowingthesystem
As you can see, the spaces are gone. This happens because the postprocess model in the ensemble decodes tokens one by one. To get correct spacing, we should call tokenizer.decode(accumulated_tokens) instead of tokenizer.decode(this_token) and emit only the delta text from the postprocess model (see the sketch below). However, I have no idea how to maintain this state in the postprocess model, since all models in the ensemble form a single forward, stateless function.
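To illustrate what I mean by emitting delta text, here is a minimal sketch of the decoding logic I have in mind (plain Python outside of Triton; accumulated_ids and prev_text are just illustrative names):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

accumulated_ids = []   # all token ids generated so far for this request
prev_text = ""         # text already emitted so far

def emit_delta(new_token_id):
    """Decode the full token history and return only the newly added text."""
    global prev_text
    accumulated_ids.append(new_token_id)
    text = tokenizer.decode(accumulated_ids, skip_special_tokens=True)
    delta = text[len(prev_text):]
    prev_text = text
    return delta

Concatenating the emit_delta outputs reproduces tokenizer.decode(accumulated_ids), spaces included; the problem is that accumulated_ids and prev_text would have to live somewhere across calls to the postprocess model.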
One solution I could think of is removing the postprocess model from the ensemble and letting the client decode the tokens with the tokenizer. However, this doesn't make much sense, because it requires the client to know and load the tokenizer of the model it talks to.
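For completeness, that client-side workaround would look something like this (a sketch, assuming the raw token ids come back in an "output_ids" tensor; the actual tensor name depends on the model configuration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

accumulated_ids = []
printed = ""

def callback(result, error):
    """Stream callback that detokenizes on the client side."""
    global printed
    if error is not None:
        print(error)
        return
    # Assumed tensor name; the tensorrt_llm model would return raw ids here.
    accumulated_ids.extend(result.as_numpy("output_ids").flatten().tolist())
    text = tokenizer.decode(accumulated_ids, skip_special_tokens=True)
    print(text[len(printed):], end="", flush=True)
    printed = text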