Description
I'm trying to serve a Llama-2-70b-chat-hf model using Triton inference server with a TRT-LLM engine. The script I used is tools/inflight_batcher_llm/end_to_end_streaming_client.py:
python3 tools/inflight_batcher_llm/end_to_end_streaming_client.py -p "What is deep learning?" -S -o 64
This script streams the generated tokens as bytes. I changed the callback function so that it prints strings:
print(output[0].decode(), flush=True, end="")
However, the output becomes:
Deeplearningisasubsetofmachinelearningthatinvolvestheuseofartificialneuralnetworkstomodelandsolvecomplexproblems.Inadeeplearningsystem,therearetypicallymultiplelayersofneuralnetworksthatprocessandtransformthedatainahierarchicalmanner.Eachlayerbuildsonthepreviousone,allowingthesystem
As you can see, the spaces are gone. This happens because the postprocess model in the ensemble decodes tokens one by one. To get correct spacing, we should call tokenizer.decode(accumulated_tokens) instead of tokenizer.decode(this_token) and emit only the delta text from the postprocess model (see the sketch below). However, I have no idea how to maintain this state in the postprocess model, since all models in the ensemble form a single forward, stateless function.
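To illustrate what I mean by emitting delta text, here is a minimal sketch of the decoding logic I have in mind (plain Python outside of Triton; accumulated_ids and prev_text are just illustrative names):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

accumulated_ids = []   # all token ids generated so far for this request
prev_text = ""         # text already emitted so far

def emit_delta(new_token_id):
    """Decode the full token history and return only the newly added text."""
    global prev_text
    accumulated_ids.append(new_token_id)
    text = tokenizer.decode(accumulated_ids, skip_special_tokens=True)
    delta = text[len(prev_text):]
    prev_text = text
    return delta

Concatenating the emit_delta outputs reproduces tokenizer.decode(accumulated_ids), spaces included; the problem is that accumulated_ids and prev_text would have to live somewhere across calls to the postprocess model.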
One solution I could think of is removing the postprocess model from the ensemble and letting the client decode the tokens with the tokenizer. However, this doesn't make much sense, because it requires the client to know and load the tokenizer of the model it talks to.
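For completeness, that client-side workaround would look something like this (a sketch, assuming the raw token ids come back in an "output_ids" tensor; the actual tensor name depends on the model configuration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

accumulated_ids = []
printed = ""

def callback(result, error):
    """Stream callback that detokenizes on the client side."""
    global printed
    if error is not None:
        print(error)
        return
    # Assumed tensor name; the tensorrt_llm model would return raw ids here.
    accumulated_ids.extend(result.as_numpy("output_ids").flatten().tolist())
    text = tokenizer.decode(accumulated_ids, skip_special_tokens=True)
    print(text[len(printed):], end="", flush=True)
    printed = text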