Description
vLLM versions newer than 0.6.3.post1 no longer return usage stats in each streamed response, and this breaks llm-load-test. They also appear to have moved to cumulative usage metrics reporting (if I'm understanding it correctly, we might be able to simplify the logic in our streaming HTTP request function). A small repro sketch follows the two example responses below.
Response in 0.6.3.post1:
{'id': 'cmpl-ccb120ae389e410ebfd5505f78b90b15', 'object': 'text_completion', 'created': 1737139401, 'model': 'opt-125m', 'choices': [{'index': 0, 'text': '.', 'logprobs': None, 'finish_reason': 'length', 'stop_reason': None}], 'usage': {'prompt_tokens': 214, 'total_tokens': 265, 'completion_tokens': 51}}
Response in 0.6.4.post1:
{'id': 'cmpl-8111903b974a4d5f8c02f9687fc3c282', 'object': 'text_completion', 'created': 1737140221, 'model': 'opt-125m', 'choices': [{'index': 0, 'text': 'Yes', 'logprobs': None, 'finish_reason': None, 'stop_reason': None}], 'usage': None}
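If the newer vLLM only emits cumulative usage on a final chunk when it is explicitly requested, the behaviour can be checked with a plain streaming request. The sketch below is an assumption based on the OpenAI-style `stream_options` parameter, not confirmed against the vLLM change itself; the endpoint URL and model name are placeholders.

import json
import requests

# Hypothetical endpoint/model; adjust to the server under test.
URL = "http://localhost:8000/v1/completions"

payload = {
    "model": "opt-125m",
    "prompt": "Say yes or no:",
    "max_tokens": 16,
    "stream": True,
    # Assumption: newer vLLM follows the OpenAI spec and only reports
    # usage when explicitly requested; the stats on the final chunk
    # would then be cumulative for the whole completion.
    "stream_options": {"include_usage": True},
}

with requests.post(URL, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        # On >0.6.3.post1, 'usage' is expected to be None on ordinary
        # text chunks and populated only on the final usage chunk.
        print(chunk.get("usage"))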
Error seen:
sh-5.1# python3 load_test.py -c /mnt/results/config.yaml
2025-01-17 18:57:00,751 INFO root MainProcess dataset config: {'file': 'datasets/openorca_large_subset_011.jsonl', 'max_queries': 1000, 'min_input_tokens': 0, 'max_input_tokens': 1024, 'min_output_tokens': 0, 'max_output_tokens': 1024, 'max_sequence_tokens': 2048}
2025-01-17 18:57:00,751 INFO root MainProcess load_options config: {'type': 'constant', 'concurrency': 1, 'duration': 20}
2025-01-17 18:57:00,751 INFO root MainProcess Initializing dataset with {'self': <dataset.Dataset object at 0x7ff8c3d16130>, 'file': 'datasets/openorca_large_subset_011.jsonl', 'max_queries': 1000, 'min_input_tokens': 0, 'max_input_tokens': 1024, 'min_output_tokens': 0, 'max_output_tokens': 1024, 'max_sequence_tokens': 2048, 'custom_prompt_format': None}
2025-01-17 18:57:00,838 INFO root MainProcess Starting <SpawnProcess name='SpawnProcess-1' parent=403 initial>
2025-01-17 18:57:00,842 INFO root MainProcess Test from main process
2025-01-17 18:57:01,334 INFO user SpawnProcess-1 User 0 making request
{'id': 'cmpl-8111903b974a4d5f8c02f9687fc3c282', 'object': 'text_completion', 'created': 1737140221, 'model': 'opt-125m', 'choices': [{'index': 0, 'text': 'Yes', 'logprobs': None, 'finish_reason': None, 'stop_reason': None}], 'usage': None}
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib64/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/src/llm-load-test/user.py", line 68, in run_user_process
result = self.make_request(test_end_time)
File "/src/llm-load-test/user.py", line 48, in make_request
result = self.plugin.request_func(query, self.user_id, test_end_time)
File "/src/llm-load-test/plugins/openai_plugin.py", line 321, in streaming_request_http
current_usage = deepget(message, "usage", "completion_tokens")
File "/src/llm-load-test/plugins/openai_plugin.py", line 44, in deepget
current = current[pos]
TypeError: 'NoneType' object is not subscriptable
2025-01-17 18:57:20,871 INFO root MainProcess Timer ended, stopping processes
2025-01-17 18:57:20,871 ERROR root MainProcess Unexpected exception in main process
Traceback (most recent call last):
File "/src/llm-load-test/load_test.py", line 155, in main
results_list = gather_results(results_pipes)
File "/src/llm-load-test/load_test.py", line 55, in gather_results
user_results = results_pipe.recv()
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 254, in recv
buf = self._recv_bytes()
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 418, in _recv_bytes
buf = self._recv(4)
File "/usr/lib64/python3.9/multiprocessing/connection.py", line 387, in _recv
raise EOFError
EOFError
2025-01-17 18:57:20,872 INFO root MainProcess User processes terminated succesfully
The vLLM change that seems to have broken llm-load-test: vllm-project/vllm#9357
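Until the new behaviour is handled properly, a minimal sketch of a None-tolerant lookup (a defensive replacement for the `deepget` helper shown in the traceback, with illustrative argument names) would avoid the crash by returning a default instead of indexing into `None`. Whether llm-load-test should then fall back to tokenizer-based token counting is a separate question.

def deepget(mapping, *keys, default=None):
    """Walk nested dicts, returning `default` if any level is missing or not a dict.

    Sketch of a defensive replacement for the helper in
    plugins/openai_plugin.py; argument names are illustrative.
    """
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# With a 0.6.4.post1-style message where 'usage' is None, this returns
# None instead of raising TypeError:
#   deepget(message, "usage", "completion_tokens")  ->  None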