@tgasser-nv tgasser-nv commented Jan 5, 2026

Description

Prior to this PR, the Mock LLM server only supported non-streaming responses. This PR adds streaming support by splitting the safe and unsafe responses on whitespace. Time To First Token (TTFT) and Inter-Token Latency (ITL) are each modeled with a truncated normal distribution, parameterized by a mean and standard deviation, with samples truncated to a configured minimum and maximum latency.

This feature can be used to isolate the latency due to Guardrails internal processing, as all LLM responses can be tightly controlled.
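
As a rough illustration of the latency model (a sketch with assumed parameter values, not the PR's actual implementation), each latency sample can be drawn from a normal distribution and clamped to the configured bounds:

import random

def sample_truncated_latency(mean: float, std_dev: float, min_s: float, max_s: float) -> float:
    # Draw one latency sample (seconds) from N(mean, std_dev) and clamp it to
    # [min_s, max_s]. Clamping is shown for simplicity; a rejection-sampling
    # scheme would also satisfy the truncation described above.
    return min(max(random.gauss(mean, std_dev), min_s), max_s)

# Hypothetical values: TTFT with mean 0.25 s, std dev 0.05 s, truncated to [0.1 s, 0.5 s]
ttft = sample_truncated_latency(mean=0.25, std_dev=0.05, min_s=0.1, max_s=0.5)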

Functional testing

Patch coverage

-------------
Diff Coverage
Diff: develop...HEAD, staged and unstaged changes
-------------
benchmark/mock_llm_server/api.py (100%)
benchmark/mock_llm_server/config.py (100%)
benchmark/mock_llm_server/models.py (100%)
benchmark/mock_llm_server/response_data.py (100%)
benchmark/tests/test_mock_api.py (100%)
benchmark/tests/test_mock_config.py (100%)
benchmark/tests/test_mock_response_data.py (100%)
-------------
Total:   395 lines
Missing: 0 lines
Coverage: 100%
-------------

Pre-commit

$ poetry run pre-commit run --all-files
check yaml...............................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
ruff (legacy alias)......................................................Passed
ruff format..............................................................Passed
Insert license in comments...............................................Passed
pyright..................................................................Passed

Unit-tests

$ poetry run pytest -q

.......................ssss.........................................................................................s... [  5%]
........................................................................................................................ [ 10%]
........................ss....ss....................sssssss............................................................. [ 15%]
...........................................................................................................ss.......s... [ 20%]
.........s................................................................................................ss........ss.. [ 25%]
.ss................................ss................s...................................................s............s. [ 30%]
........................................................................................................................ [ 35%]
........................................................................................................................ [ 40%]
...............................sssss......ssssssssssssssssss.........sssss.............................................. [ 45%]
...................................s...........ss.................................ssssssss.ssssssssss................... [ 50%]
..............................................................s....s.....................................ssssssss....... [ 55%]
.......sss...ss...ss.....ssssssssssssss............................................
/Users/tgasser/env/code_quality/lib/python3.13/site-packages/_pytest/stash.py:108: RuntimeWarning: coroutine 'AsyncMockMixin._execute_mock_call' was never awaited
  del self._storage[key]

RuntimeWarning: Enable tracemalloc to get the object allocation traceback
................s.................... [ 60%]
..........................................................................................sssssssss.........ss.......... [ 65%]
........................................................................................................sssssss......... [ 70%]
.......................................................................................s................................ [ 75%]
........................................................................ss.............................................. [ 80%]
........................................................................................................................ [ 85%]
........................................................................................................................ [ 90%]
...s.................................................................................................................... [ 95%]
...........................................................................................................              [100%]
2253 passed, 135 skipped in 189.35s (0:03:09)

Non-Streaming integration test

Server terminal

$ source ~/env/benchmark_env/bin/activate
$ cd benchmark
$ PYTHONPATH=.. python mock_llm_server/run_server.py --workers 1 --port 8000 --config-file mock_llm_server/configs/meta-llama-3.3-70b-instruct.env
2026-01-05 17:07:36 INFO: Using config file: mock_llm_server/configs/meta-llama-3.3-70b-instruct.env
2026-01-05 17:07:36 INFO: Starting Mock LLM Server on 0.0.0.0:8000
2026-01-05 17:07:36 INFO: OpenAPI docs available at: http://0.0.0.0:8000/docs
2026-01-05 17:07:36 INFO: Health check at: http://0.0.0.0:8000/health
2026-01-05 17:07:36 INFO: Serving model with config mock_llm_server/configs/meta-llama-3.3-70b-instruct.env
2026-01-05 17:07:36 INFO: Press Ctrl+C to stop the server
INFO:     Loading environment from 'mock_llm_server/configs/meta-llama-3.3-70b-instruct.env'
INFO:     Started server process [59959]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2026-01-05 17:07:47 INFO: Request finished: 200, took 4.107 seconds
INFO:     127.0.0.1:57946 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Client terminal

$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
   -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [{"role":"user","content":"What can you do?"}],
    "temperature": 0.5,
    "top_p": 1,
    "max_tokens": 1024
  }'

{"id":"chatcmpl-36b9a8fb","object":"chat.completion","created":1767654481,"model":"meta/llama-3.3-70b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"I can provide information and help with a wide range of topics, from science and history to entertainment and culture. I can also help with language-related tasks, such as translation and text summarization. However, I can't assist with requests that involve harm or illegal activities."},"finish_reason":"stop"}],"usage":{"prompt_tokens":4,"completion_tokens":71,"total_tokens":75}}%```

Streaming integration test

Server terminal

PYTHONPATH=.. python mock_llm_server/run_server.py --workers 1 --port 8000 --config-file mock_llm_server/configs/meta-llama-3.3-70b-instruct.env
2026-01-05 17:09:04 INFO: Using config file: mock_llm_server/configs/meta-llama-3.3-70b-instruct.env
2026-01-05 17:09:04 INFO: Starting Mock LLM Server on 0.0.0.0:8000
2026-01-05 17:09:04 INFO: OpenAPI docs available at: http://0.0.0.0:8000/docs
2026-01-05 17:09:04 INFO: Health check at: http://0.0.0.0:8000/health
2026-01-05 17:09:04 INFO: Serving model with config mock_llm_server/configs/meta-llama-3.3-70b-instruct.env
2026-01-05 17:09:04 INFO: Press Ctrl+C to stop the server
INFO:     Loading environment from 'mock_llm_server/configs/meta-llama-3.3-70b-instruct.env'
INFO:     Started server process [60031]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2026-01-05 17:09:31 INFO: Request finished: 200, took 0.019 seconds
INFO:     127.0.0.1:57995 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Client terminal

$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
   -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [{"role":"user","content":"What can you do?"}],
    "temperature": 0.5,
    "top_p": 1,
    "max_tokens": 1024,
    "stream": true
  }' 

data: {"id":"chatcmpl-1b5a7bb5","object":"chat.completion.chunk","created":1767654571,"model":"meta/llama-3.3-70b-instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}

data: {"id":"chatcmpl-1b5a7bb5","object":"chat.completion.chunk","created":1767654571,"model":"meta/llama-3.3-70b-instruct","choices":[{"index":0,"delta":{"content":"I "}}]}

data: {"id":"chatcmpl-1b5a7bb5","object":"chat.completion.chunk","created":1767654571,"model":"meta/llama-3.3-70b-instruct","choices":[{"index":0,"delta":{"content":"can "}}]}

data: {"id":"chatcmpl-1b5a7bb5","object":"chat.completion.chunk","created":1767654571,"model":"meta/llama-3.3-70b-instruct","choices":[{"index":0,"delta":{"content":"provide "}}]}

...
...

data: {"id":"chatcmpl-1b5a7bb5","object":"chat.completion.chunk","created":1767654571,"model":"meta/llama-3.3-70b-instruct","choices":[{"index":0,"delta":{"content":"illegal "}}]}

data: {"id":"chatcmpl-1b5a7bb5","object":"chat.completion.chunk","created":1767654571,"model":"meta/llama-3.3-70b-instruct","choices":[{"index":0,"delta":{"content":"activities."}}]}

data: {"id":"chatcmpl-1b5a7bb5","object":"chat.completion.chunk","created":1767654571,"model":"meta/llama-3.3-70b-instruct","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
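
For reference, a minimal client-side sketch (not part of the PR) that consumes the SSE stream above and prints the reassembled response; it assumes only the requests library and the endpoint shown in the curl example:

import json
import requests

# Stream a chat completion from the mock server and print content deltas as they arrive.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta/llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": "What can you do?"}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue  # skip blank keep-alive lines
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)
print()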

Checklist

  • I've read the CONTRIBUTING guidelines.
  • I've updated the documentation if applicable.
  • I've added tests if applicable.
  • @mentions of the person or team responsible for reviewing proposed changes.


github-actions bot commented Jan 5, 2026

Documentation preview

https://nvidia-nemo.github.io/Guardrails/review/pr-1564


greptile-apps bot commented Jan 5, 2026

Greptile Summary

This PR adds streaming support to the mock LLM server with proper TTFT and ITL latency simulation. The changes include:

  • Renamed LATENCY_* config fields to E2E_LATENCY_* to distinguish from streaming metrics
  • Added TTFT (Time to First Token) and ITL (Inter-Token Latency) configuration parameters
  • Implemented Server-Sent Events (SSE) streaming for both /v1/chat/completions and /v1/completions endpoints
  • Added helper functions split_response_into_chunks() and generate_chunk_latencies() for streaming support
  • Created streaming response models (ChatCompletionStreamResponse, CompletionStreamResponse, etc.)
  • Added comprehensive test coverage for streaming functionality
  • Updated config files with new TTFT/ITL values
  • Minor README formatting improvements

The implementation correctly handles the OpenAI streaming format with proper SSE formatting, chunk sequencing, and finish tokens. However, there's a timing issue in stream_chat_completion where TTFT is applied to the first content chunk instead of the first overall chunk (the role message).
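
For illustration, a minimal async-generator sketch of the ordering the summary describes; the function and latency names echo the summary above, but the body is an assumption rather than the PR's actual code:

import asyncio
import json

async def stream_chat_completion_sketch(chunks, latencies, base_chunk):
    # Assumes len(latencies) == len(chunks) + 1: latencies[0] is the TTFT
    # sample and the remainder are ITL samples. Sleeping before the very
    # first (role-only) chunk applies TTFT to the first overall chunk,
    # which is the ordering the summary recommends.
    deltas = [{"role": "assistant", "content": ""}] + [{"content": c} for c in chunks]
    for delta, delay in zip(deltas, latencies):
        await asyncio.sleep(delay)
        payload = {**base_chunk, "choices": [{"index": 0, "delta": delta}]}
        yield f"data: {json.dumps(payload)}\n\n"
    final = {**base_chunk, "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
    yield f"data: {json.dumps(final)}\n\n"
    yield "data: [DONE]\n\n"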

Confidence Score: 4/5

  • Safe to merge with one timing issue in chat completions streaming that should be addressed
  • The PR adds well-tested streaming functionality with comprehensive test coverage. The code follows OpenAI API conventions and includes proper error handling. However, there's a logical issue where TTFT delay is applied incorrectly in stream_chat_completion (applied before first content chunk rather than first overall chunk). The stream_completion function implements it correctly. This affects benchmark accuracy but won't cause runtime errors.
  • Pay close attention to benchmark/mock_llm_server/api.py - the TTFT timing issue needs correction

Important Files Changed

Filename                                   | Overview
benchmark/mock_llm_server/api.py           | Added streaming support for the chat/completions endpoints; contains the TTFT timing issue where the first role chunk is sent without the TTFT delay
benchmark/mock_llm_server/config.py        | Renamed latency config fields from LATENCY_* to E2E_LATENCY_* and added new TTFT and ITL streaming config fields
benchmark/mock_llm_server/models.py        | Added streaming response models: DeltaMessage, ChatCompletionStreamChoice, ChatCompletionStreamResponse, etc.
benchmark/mock_llm_server/response_data.py | Added split_response_into_chunks() and generate_chunk_latencies() functions for streaming; updated field names
benchmark/tests/test_mock_api.py           | Added comprehensive streaming tests, updated config field names, improved test fixtures

Sequence Diagram

sequenceDiagram
    participant Client
    participant FastAPI
    participant StreamHandler as stream_chat_completion
    participant ResponseData as response_data
    participant Config as ModelSettings

    Client->>FastAPI: POST /v1/chat/completions (stream=true)
    FastAPI->>Config: get_settings()
    Config-->>FastAPI: ModelSettings
    FastAPI->>ResponseData: get_response(config)
    ResponseData-->>FastAPI: response_content
    FastAPI->>ResponseData: split_response_into_chunks(response_content)
    ResponseData-->>FastAPI: chunks[]
    FastAPI->>ResponseData: generate_chunk_latencies(config, len(chunks))
    ResponseData-->>FastAPI: latencies[] (TTFT + ITL)
    FastAPI->>StreamHandler: stream_chat_completion()
    
    Note over StreamHandler: First chunk (role)
    StreamHandler->>Client: data: {role: "assistant", content: ""}
    
    loop For each content chunk
        StreamHandler->>StreamHandler: await asyncio.sleep(latencies[idx])
        StreamHandler->>Client: data: {content: chunk}
    end
    
    Note over StreamHandler: Final chunk
    StreamHandler->>Client: data: {finish_reason: "stop"}
    StreamHandler->>Client: data: [DONE]

@greptile-apps greptile-apps bot left a comment

10 files reviewed, 1 comment


codecov bot commented Jan 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@tgasser-nv tgasser-nv changed the title feat(mock llm): Implement streaming feat(mock llm): Implement Mock LLM streaming Jan 5, 2026
@tgasser-nv tgasser-nv self-assigned this Jan 5, 2026
@Pouyanpi Pouyanpi changed the title feat(mock llm): Implement Mock LLM streaming feat(benchmark): Implement Mock LLM streaming Jan 6, 2026

@Pouyanpi Pouyanpi left a comment

Thank you @tgasser-nv, I've reviewed this PR and left several comments inline below. There are some questions to address, but it looks good to merge.

  • q: TTFT timing question: is the first role chunk a metadata chunk?
  • q: mutating global random state could cause issues under concurrent load if a seed is passed in production, but that does not appear to be the case here
  • nit: inconsistent naming (chunk_latency vs ITL)

I'll add inline comments for details.

Feel free to merge, but please review the commit description before merging and remove unnecessary commit messages.


@Pouyanpi Pouyanpi left a comment

Please see the individual file comments below.

@Pouyanpi Pouyanpi merged commit 79101d2 into develop Jan 8, 2026
10 checks passed
@Pouyanpi Pouyanpi deleted the feat/mock-llm-streaming branch January 8, 2026 18:44