feat(benchmark): Implement Mock LLM streaming #1564
Conversation
Greptile Summary

This PR adds streaming support to the mock LLM server with proper TTFT and ITL latency simulation. The implementation correctly handles the OpenAI streaming format with proper SSE formatting, chunk sequencing, and finish tokens. However, there is a timing issue in benchmark/mock_llm_server/api.py: the first role chunk is sent without the TTFT delay (see the file table below).
| Filename | Overview |
|---|---|
| benchmark/mock_llm_server/api.py | Added streaming support for the chat/completions endpoints; TTFT timing issue: the first role chunk is sent without the TTFT delay |
| benchmark/mock_llm_server/config.py | Renamed latency config fields from LATENCY_* to E2E_LATENCY_* and added new TTFT and ITL streaming config fields |
| benchmark/mock_llm_server/models.py | Added streaming response models: DeltaMessage, ChatCompletionStreamChoice, ChatCompletionStreamResponse, etc. |
| benchmark/mock_llm_server/response_data.py | Added split_response_into_chunks() and generate_chunk_latencies() functions for streaming, updated field names |
| benchmark/tests/test_mock_api.py | Added comprehensive streaming tests, updated config field names, improved test fixtures |
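
For reference, here is a minimal sketch of what the new streaming models in models.py might look like, assuming Pydantic models that mirror the OpenAI chat.completion.chunk schema; the field names and defaults below are illustrative, not the PR's exact definitions:

```python
# Hypothetical sketch of the streaming models named in the table above.
from typing import Literal, Optional

from pydantic import BaseModel


class DeltaMessage(BaseModel):
    """Incremental payload carried by each streamed chunk."""
    role: Optional[str] = None
    content: Optional[str] = None


class ChatCompletionStreamChoice(BaseModel):
    """Single choice inside a streamed chunk."""
    index: int = 0
    delta: DeltaMessage
    finish_reason: Optional[Literal["stop", "length"]] = None


class ChatCompletionStreamResponse(BaseModel):
    """One chat.completion.chunk object, serialized as an SSE data line."""
    id: str
    object: str = "chat.completion.chunk"
    created: int
    model: str
    choices: list[ChatCompletionStreamChoice]
```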
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client
    participant FastAPI
    participant StreamHandler as stream_chat_completion
    participant ResponseData as response_data
    participant Config as ModelSettings
    Client->>FastAPI: POST /v1/chat/completions (stream=true)
    FastAPI->>Config: get_settings()
    Config-->>FastAPI: ModelSettings
    FastAPI->>ResponseData: get_response(config)
    ResponseData-->>FastAPI: response_content
    FastAPI->>ResponseData: split_response_into_chunks(response_content)
    ResponseData-->>FastAPI: chunks[]
    FastAPI->>ResponseData: generate_chunk_latencies(config, len(chunks))
    ResponseData-->>FastAPI: latencies[] (TTFT + ITL)
    FastAPI->>StreamHandler: stream_chat_completion()
    Note over StreamHandler: First chunk (role)
    StreamHandler->>Client: data: {role: "assistant", content: ""}
    loop For each content chunk
        StreamHandler->>StreamHandler: await asyncio.sleep(latencies[idx])
        StreamHandler->>Client: data: {content: chunk}
    end
    Note over StreamHandler: Final chunk
    StreamHandler->>Client: data: {finish_reason: "stop"}
    StreamHandler->>Client: data: [DONE]
```
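
For concreteness, here is a minimal sketch of a stream_chat_completion() generator that matches the flow in the diagram above; the helper wiring and serialization details are assumptions, not the PR's verbatim code in api.py:

```python
# Minimal sketch of the streaming flow shown in the sequence diagram.
import asyncio
import json
import time
import uuid
from typing import AsyncIterator


async def stream_chat_completion(
    model: str, chunks: list[str], latencies: list[float]
) -> AsyncIterator[str]:
    completion_id = f"chatcmpl-{uuid.uuid4().hex}"
    created = int(time.time())

    def sse(delta: dict, finish_reason: str | None = None) -> str:
        # Serialize one chat.completion.chunk object as an SSE "data:" line.
        payload = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": model,
            "choices": [
                {"index": 0, "delta": delta, "finish_reason": finish_reason}
            ],
        }
        return f"data: {json.dumps(payload)}\n\n"

    # First chunk carries the role. As flagged in the review, the diagram shows
    # it going out with no delay; moving the TTFT sample (latencies[0]) ahead
    # of this yield would make the time-to-first-token delay observable.
    yield sse({"role": "assistant", "content": ""})

    # Each content chunk is preceded by its sampled latency
    # (latencies[0] = TTFT, latencies[1:] = ITL).
    for chunk, delay in zip(chunks, latencies):
        await asyncio.sleep(delay)
        yield sse({"content": chunk})

    # Final chunk signals completion, followed by the SSE terminator.
    yield sse({}, finish_reason="stop")
    yield "data: [DONE]\n\n"
```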
10 files reviewed, 1 comment
Codecov Report

✅ All modified and coverable lines are covered by tests.
Thank you @tgasser-nv, I've reviewed this PR and have several comments inline below. There are some questions to address, but it looks good to merge.
- q: TTFT timing question: is the first role chunk a metadata-only chunk?
- q: mutating global random state could cause issues under concurrent load if a seed is passed in production, but that does not appear to be the case here
- nit: inconsistent naming (chunk_latency vs. ITL)
I'll add inline comments for details.
Feel free to merge, but please review the commit description before merging and remove any unnecessary commit messages.
Please see the individual file comments below.
Description
Prior to this PR, the Mock LLM server only supported non-streaming responses. This PR adds streaming support by splitting the safe and unsafe responses on whitespace. The Time To First Token (TTFT) and Inter-Token Latency (ITL) are drawn from a truncated normal distribution, parameterized by a mean and standard deviation, with samples truncated to configurable minimum and maximum latencies.
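
As a rough illustration of this sampling scheme (not the PR's actual generate_chunk_latencies() signature or config field names), one way to draw TTFT and ITL values from a truncated normal is:

```python
# Illustrative sketch of the latency sampling described above. Parameter names
# are hypothetical; the actual config fields may differ, and the PR may
# truncate by resampling rather than clamping.
import random


def sample_truncated_normal(
    mean: float, std: float, low: float, high: float, rng: random.Random
) -> float:
    """Draw from N(mean, std) and truncate the sample into [low, high]."""
    return min(max(rng.gauss(mean, std), low), high)


def generate_chunk_latencies(
    num_chunks: int,
    *,
    ttft_mean: float,
    ttft_std: float,
    itl_mean: float,
    itl_std: float,
    min_latency: float,
    max_latency: float,
    seed: int | None = None,
) -> list[float]:
    """Return one TTFT sample followed by ITL samples, in seconds."""
    # A per-call RNG avoids mutating global random state under concurrent load.
    rng = random.Random(seed)
    ttft = sample_truncated_normal(ttft_mean, ttft_std, min_latency, max_latency, rng)
    itls = [
        sample_truncated_normal(itl_mean, itl_std, min_latency, max_latency, rng)
        for _ in range(max(num_chunks - 1, 0))
    ]
    return [ttft, *itls]
```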
This feature can be used to isolate the latency due to Guardrails internal processing, as all LLM responses can be tightly controlled.
Functional testing
Patch coverage
Pre-commit
Unit tests
Non-streaming integration test
Server terminal
Client terminal
Streaming integration test
Server terminal
Client terminal
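
For illustration only, a client call along these lines can drive the streaming integration test; the base URL, port, and model name below are assumptions for the example, not values taken from the PR:

```python
# Hypothetical streaming client for the mock server; URL, port, and model name
# are placeholders.
import json

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mock-llm",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
    },
    stream=True,
    timeout=30,
)
for line in resp.iter_lines():
    if not line:
        continue
    data = line.decode().removeprefix("data: ")
    if data == "[DONE]":
        break
    chunk = json.loads(data)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
print()
```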
Checklist