
Accumulated decoding when streaming #34

Description

@comaniac

I'm trying to serve a Llama-2-70b-chat-hf model using the Triton inference server with a TRT-LLM engine. The script I used is tools/inflight_batcher_llm/end_to_end_streaming_client.py:

python3 tools/inflight_batcher_llm/end_to_end_streaming_client.py -p "What is deep learning?" -S -o 64

This script streams the generated tokens as bytes. I changed the callback function so that it prints strings:

print(output[0].decode(), flush=True, end="")
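For context, here is a rough sketch of where that print sits in the client's streaming callback; the "text_output" tensor name and the error handling are my assumptions, not the exact contents of the script:

# Hypothetical streaming callback; "text_output" is assumed to be the
# ensemble's detokenized output tensor.
def callback(user_data, result, error):
    # Called once per streamed response from the server.
    if error is not None:
        print(error)
        return
    output = result.as_numpy("text_output")
    print(output[0].decode(), flush=True, end="")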

However, the output becomes:

Deeplearningisasubsetofmachinelearningthatinvolvestheuseofartificialneuralnetworkstomodelandsolvecomplexproblems.Inadeeplearningsystem,therearetypicallymultiplelayersofneuralnetworksthatprocessandtransformthedatainahierarchicalmanner.Eachlayerbuildsonthepreviousone,allowingthesystem

We can see that the spaces are gone. This is because the postprocess model in the ensemble decodes tokens one by one. To get correct spacing, we should call tokenizer.decode(accumulated_tokens) instead of tokenizer.decode(this_token), and have the postprocess model output only the delta text (see the sketch below). However, I have no idea how to maintain this state in the postprocess model, since all models in the ensemble form a single, stateless forward function.
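A minimal sketch of the delta decoding I have in mind, assuming the postprocess model could somehow keep per-request state (the tokenizer path and variable names are placeholders, not the actual postprocess implementation):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

accumulated_ids = []  # would have to persist across postprocess invocations
emitted_text = ""     # text already returned to the client

def decode_delta(new_token_id):
    # Decode everything generated so far, then return only the new suffix.
    global emitted_text
    accumulated_ids.append(new_token_id)
    full_text = tokenizer.decode(accumulated_ids, skip_special_tokens=True)
    delta = full_text[len(emitted_text):]
    emitted_text = full_text
    return delta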

One solution I could think of is removing the postprocess model from the ensemble and letting the client decode the tokens with the tokenizer itself (roughly as sketched below). However, this doesn't make sense because it requires the client to know and load the tokenizer of the model it talks to.
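For completeness, that client-side workaround would look roughly like this, with the client accumulating token IDs and printing only the newly decoded suffix; the "output_ids" tensor name is an assumption:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")
token_ids = []
printed = ""

def on_stream_result(result):
    # Accumulate token IDs from each streamed response and print the delta.
    global printed
    token_ids.extend(result.as_numpy("output_ids").flatten().tolist())
    text = tokenizer.decode(token_ids, skip_special_tokens=True)
    print(text[len(printed):], flush=True, end="")
    printed = text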
