feat(benchmark): Implement Mock LLM streaming #1564
Conversation
Greptile Summary

This PR adds streaming support to the mock LLM server with proper TTFT and ITL latency simulation. The implementation correctly handles the OpenAI streaming format with proper SSE formatting, chunk sequencing, and finish tokens. However, there is a timing issue in benchmark/mock_llm_server/api.py: the first role chunk is sent without the TTFT delay (see the file table below).
| Filename | Overview |
|---|---|
| benchmark/mock_llm_server/api.py | Added streaming support for the chat/completions endpoints; TTFT timing issue: the first role chunk is sent without the TTFT delay |
| benchmark/mock_llm_server/config.py | Renamed latency config fields from LATENCY_* to E2E_LATENCY_* and added new TTFT and ITL streaming config fields |
| benchmark/mock_llm_server/models.py | Added streaming response models: DeltaMessage, ChatCompletionStreamChoice, ChatCompletionStreamResponse, etc. |
| benchmark/mock_llm_server/response_data.py | Added split_response_into_chunks() and generate_chunk_latencies() functions for streaming, updated field names |
| benchmark/tests/test_mock_api.py | Added comprehensive streaming tests, updated config field names, improved test fixtures |
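
For reference, here is a minimal sketch of what the new streaming models in models.py might look like, assuming Pydantic models that mirror the OpenAI chat.completion.chunk schema; the field names and defaults below are illustrative, not the PR's exact definitions:

```python
# Hypothetical sketch of the streaming models named in the table above.
from typing import Literal, Optional

from pydantic import BaseModel


class DeltaMessage(BaseModel):
    """Incremental payload carried by each streamed chunk."""
    role: Optional[str] = None
    content: Optional[str] = None


class ChatCompletionStreamChoice(BaseModel):
    """Single choice inside a streamed chunk."""
    index: int = 0
    delta: DeltaMessage
    finish_reason: Optional[Literal["stop", "length"]] = None


class ChatCompletionStreamResponse(BaseModel):
    """One chat.completion.chunk object, serialized as an SSE data line."""
    id: str
    object: str = "chat.completion.chunk"
    created: int
    model: str
    choices: list[ChatCompletionStreamChoice]
```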
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client
    participant FastAPI
    participant StreamHandler as stream_chat_completion
    participant ResponseData as response_data
    participant Config as ModelSettings
    Client->>FastAPI: POST /v1/chat/completions (stream=true)
    FastAPI->>Config: get_settings()
    Config-->>FastAPI: ModelSettings
    FastAPI->>ResponseData: get_response(config)
    ResponseData-->>FastAPI: response_content
    FastAPI->>ResponseData: split_response_into_chunks(response_content)
    ResponseData-->>FastAPI: chunks[]
    FastAPI->>ResponseData: generate_chunk_latencies(config, len(chunks))
    ResponseData-->>FastAPI: latencies[] (TTFT + ITL)
    FastAPI->>StreamHandler: stream_chat_completion()
    Note over StreamHandler: First chunk (role)
    StreamHandler->>Client: data: {role: "assistant", content: ""}
    loop For each content chunk
        StreamHandler->>StreamHandler: await asyncio.sleep(latencies[idx])
        StreamHandler->>Client: data: {content: chunk}
    end
    Note over StreamHandler: Final chunk
    StreamHandler->>Client: data: {finish_reason: "stop"}
    StreamHandler->>Client: data: [DONE]
```
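
For concreteness, here is a minimal sketch of a stream_chat_completion() generator that matches the flow in the diagram above; the helper wiring and serialization details are assumptions, not the PR's verbatim code in api.py:

```python
# Minimal sketch of the streaming flow shown in the sequence diagram.
import asyncio
import json
import time
import uuid
from typing import AsyncIterator


async def stream_chat_completion(
    model: str, chunks: list[str], latencies: list[float]
) -> AsyncIterator[str]:
    completion_id = f"chatcmpl-{uuid.uuid4().hex}"
    created = int(time.time())

    def sse(delta: dict, finish_reason: str | None = None) -> str:
        # Serialize one chat.completion.chunk object as an SSE "data:" line.
        payload = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": created,
            "model": model,
            "choices": [
                {"index": 0, "delta": delta, "finish_reason": finish_reason}
            ],
        }
        return f"data: {json.dumps(payload)}\n\n"

    # First chunk carries the role. As flagged in the review, the diagram shows
    # it going out with no delay; moving the TTFT sample (latencies[0]) ahead
    # of this yield would make the time-to-first-token delay observable.
    yield sse({"role": "assistant", "content": ""})

    # Each content chunk is preceded by its sampled latency
    # (latencies[0] = TTFT, latencies[1:] = ITL).
    for chunk, delay in zip(chunks, latencies):
        await asyncio.sleep(delay)
        yield sse({"content": chunk})

    # Final chunk signals completion, followed by the SSE terminator.
    yield sse({}, finish_reason="stop")
    yield "data: [DONE]\n\n"
```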
10 files reviewed, 1 comment
Codecov Report

✅ All modified and coverable lines are covered by tests.
Thank you @tgasser-nv, I've reviewed this PR and have several comments inline below. There are some questions to address, but it looks good to merge.
- q: TTFT timing question: is the first role chunk a metadata-only chunk?
- q: mutating global random state could cause issues under concurrent load if a seed is passed in production, but that does not appear to be the case here
- nit: inconsistent naming (chunk_latency vs. ITL)
I'll add inline comments for details.
Feel free to merge, but please review the commit description before merging and remove any unnecessary commit messages.
Please see the individual file comments below.
Description
Prior to this PR, the Mock LLM server only supported non-streaming responses. This PR adds streaming support by splitting the safe and unsafe responses on whitespace. The Time To First Token (TTFT) and Inter-Token Latency (ITL) are drawn from a truncated normal distribution, parameterized by a mean and standard deviation, with samples truncated to configurable minimum and maximum latencies.
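
As a rough illustration of this sampling scheme (not the PR's actual generate_chunk_latencies() signature or config field names), one way to draw TTFT and ITL values from a truncated normal is:

```python
# Illustrative sketch of the latency sampling described above. Parameter names
# are hypothetical; the actual config fields may differ, and the PR may
# truncate by resampling rather than clamping.
import random


def sample_truncated_normal(
    mean: float, std: float, low: float, high: float, rng: random.Random
) -> float:
    """Draw from N(mean, std) and truncate the sample into [low, high]."""
    return min(max(rng.gauss(mean, std), low), high)


def generate_chunk_latencies(
    num_chunks: int,
    *,
    ttft_mean: float,
    ttft_std: float,
    itl_mean: float,
    itl_std: float,
    min_latency: float,
    max_latency: float,
    seed: int | None = None,
) -> list[float]:
    """Return one TTFT sample followed by ITL samples, in seconds."""
    # A per-call RNG avoids mutating global random state under concurrent load.
    rng = random.Random(seed)
    ttft = sample_truncated_normal(ttft_mean, ttft_std, min_latency, max_latency, rng)
    itls = [
        sample_truncated_normal(itl_mean, itl_std, min_latency, max_latency, rng)
        for _ in range(max(num_chunks - 1, 0))
    ]
    return [ttft, *itls]
```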
This feature can be used to isolate the latency due to Guardrails internal processing, as all LLM responses can be tightly controlled.
Functional testing
Patch coverage
Pre-commit
Unit tests
Non-streaming integration test
Server terminal
Client terminal
Streaming integration test
Server terminal
Client terminal
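
For illustration only, a client call along these lines can drive the streaming integration test; the base URL, port, and model name below are assumptions for the example, not values taken from the PR:

```python
# Hypothetical streaming client for the mock server; URL, port, and model name
# are placeholders.
import json

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mock-llm",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
    },
    stream=True,
    timeout=30,
)
for line in resp.iter_lines():
    if not line:
        continue
    data = line.decode().removeprefix("data: ")
    if data == "[DONE]":
        break
    chunk = json.loads(data)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
print()
```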
Checklist