
Conversation

@castlenthesky

Motivation and Context

  1. Why is this change required? - It adds observability into latency from the LLM provider.
  2. What problem does it solve? - Current telemetry and observability provide no insight into where latency sits during response generation. This PR adds a degree of insight into that question.
  3. What scenario does it contribute to? - It contributes to scenarios where teams are trying to optimize for speed.
  4. If it fixes an open issue, please link to the issue here. - Open issue link

Description

Adds spans and metrics to track streaming latency. The following metrics are added to the OTel exports (see the sketch after this list):

  • gen_ai.client.operation.time_to_first_chunk
  • gen_ai.client.operation.time_per_output_chunk
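For orientation, creating client-side histograms like these with the OpenTelemetry Python SDK generally looks like the sketch below. The constant and helper names mirror the commit summary further down; the boundary values and the meter wiring are illustrative assumptions, not the values in this PR, and the explicit_bucket_boundaries_advisory keyword requires opentelemetry >= 1.23.

from opentelemetry.metrics import Histogram, Meter

# Illustrative boundaries in seconds; the PR defines its own tuned values.
TIME_TO_FIRST_CHUNK_BUCKET_BOUNDARIES = [0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0]
TIME_PER_OUTPUT_CHUNK_BUCKET_BOUNDARIES = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]

def _get_time_to_first_chunk_histogram(meter: Meter) -> Histogram:
    # Latency from request start until the first streamed chunk arrives.
    return meter.create_histogram(
        name="gen_ai.client.operation.time_to_first_chunk",
        unit="s",
        description="Time from request start to first streamed chunk.",
        explicit_bucket_boundaries_advisory=TIME_TO_FIRST_CHUNK_BUCKET_BOUNDARIES,
    )

def _get_time_per_output_chunk_histogram(meter: Meter) -> Histogram:
    # Average interval between consecutive streamed chunks.
    return meter.create_histogram(
        name="gen_ai.client.operation.time_per_output_chunk",
        unit="s",
        description="Average time between consecutive streamed chunks.",
        explicit_bucket_boundaries_advisory=TIME_PER_OUTPUT_CHUNK_BUCKET_BOUNDARIES,
    )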

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

… measurements

* Added TIME_TO_FIRST_CHUNK_BUCKET_BOUNDARIES and TIME_PER_OUTPUT_CHUNK_BUCKET_BOUNDARIES for improved metric tracking.
* Implemented _get_time_to_first_chunk_histogram and _get_time_per_output_chunk_histogram functions to create new histograms.
* Updated _trace_get_streaming_response to record metrics for time to first chunk and time per output chunk.
* Introduced _record_streaming_metrics function to handle the recording of streaming-specific metrics.
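
Taken together, these commits imply per-chunk timing bookkeeping roughly like the following. This is a minimal sketch using the variable names visible in the diff hunks quoted in the review below; the real method also manages the span and the surrounding tracing logic, which is elided here.

from collections.abc import AsyncIterator
from time import perf_counter

async def _timed_stream(stream: AsyncIterator):
    """Sketch: wrap a chunk stream and accumulate the timing inputs for the metrics."""
    start_time_stamp = perf_counter()
    first_chunk_time: float | None = None
    previous_chunk_time: float | None = None
    chunk_count = 0
    total_inter_chunk_time = 0.0

    async for update in stream:
        now = perf_counter()
        if first_chunk_time is None:
            # time_to_first_chunk = first_chunk_time - start_time_stamp
            first_chunk_time = now
        else:
            # chunk_count counts inter-chunk intervals, not chunks (see the review note below)
            total_inter_chunk_time += now - previous_chunk_time
            chunk_count += 1
        previous_chunk_time = now
        yield update

    # time_per_output_chunk = total_inter_chunk_time / chunk_count (when chunk_count > 0)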
Copilot AI review requested due to automatic review settings February 3, 2026 00:28
@github-actions github-actions bot changed the title from "Feature: telemetry for tracking provider latency" to "Python: Feature: telemetry for tracking provider latency" on Feb 3, 2026

Copilot AI left a comment


Pull request overview

This PR adds telemetry metrics for tracking streaming provider latency in the agent framework's observability module. The implementation introduces three new OpenTelemetry metrics to measure streaming operation performance from the client's perspective.

Changes:

  • Added three new histogram metrics for streaming latency: gen_ai.client.operation.time_to_first_chunk, gen_ai.client.operation.time_per_output_chunk, and gen_ai.client.operation.duration
  • Modified trace_get_streaming_response to track timing information for chunks and record streaming-specific metrics
  • Added bucket boundaries configurations optimized for streaming latency measurements
  • Created tests to verify streaming metrics are recorded during both successful and error scenarios

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

Files reviewed:

  • python/packages/core/agent_framework/observability.py — Added new histogram creation functions, bucket boundaries for streaming metrics, modified streaming-response tracing to track chunk timing, and implemented the _record_streaming_metrics helper function
  • python/packages/core/tests/core/test_observability.py — Added test fixtures and test cases for streaming-metrics recording in both success and error scenarios

Comment on lines +864 to +880
async def test_streaming_metrics_recorded(mock_timed_streaming_chat_client, span_exporter: InMemorySpanExporter):
    """Test that streaming specific metrics are recorded correctly."""
    client = use_instrumentation(mock_timed_streaming_chat_client)()
    messages = [ChatMessage(role=Role.USER, text="Test")]
    span_exporter.clear()

    updates = []
    async for update in client.get_streaming_response(messages=messages, model_id="TestStreaming"):
        updates.append(update)

    assert len(updates) == 3
    spans = span_exporter.get_finished_spans()
    assert len(spans) == 1
    span = spans[0]
    # Check that execution completed successfully and span was created
    assert span.name == "chat TestStreaming"
    assert span.attributes[OtelAttr.OPERATION.value] == OtelAttr.CHAT_COMPLETION_OPERATION

Copilot AI Feb 3, 2026


The test does not verify that the new streaming metrics are actually recorded. It only checks that the span was created and has the correct operation attribute. Consider adding assertions to verify that the time_to_first_chunk, time_per_output_chunk, and client_operation_duration metrics were recorded with expected values or at least recorded at all. This would ensure the feature is working as intended.
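One way to add such assertions, assuming the test suite exposes an SDK InMemoryMetricReader (the metric_reader fixture here is hypothetical; InMemoryMetricReader and its get_metrics_data() are standard SDK API):

from opentelemetry.sdk.metrics.export import InMemoryMetricReader

def _recorded_metric_names(reader: InMemoryMetricReader) -> set[str]:
    # Flatten the reader's exported data into the set of recorded metric names.
    data = reader.get_metrics_data()
    if data is None:
        return set()
    return {
        metric.name
        for resource_metrics in data.resource_metrics
        for scope_metrics in resource_metrics.scope_metrics
        for metric in scope_metrics.metrics
    }

# At the end of the test, after the stream has been consumed:
# names = _recorded_metric_names(metric_reader)
# assert "gen_ai.client.operation.time_to_first_chunk" in names
# assert "gen_ai.client.operation.time_per_output_chunk" in names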

Comment on lines +903 to +916
async def test_streaming_metrics_with_error(mock_error_streaming_chat_client, span_exporter: InMemorySpanExporter):
    """Test that metrics are recorded even if the stream fails after the first chunk."""
    client = use_instrumentation(mock_error_streaming_chat_client)()
    messages = [ChatMessage(role=Role.USER, text="Test")]
    span_exporter.clear()

    with pytest.raises(ValueError, match="Stream interrupted"):
        async for _ in client.get_streaming_response(messages=messages, model_id="TestError"):
            pass

    spans = span_exporter.get_finished_spans()
    assert len(spans) == 1
    span = spans[0]
    assert span.attributes[OtelAttr.OPERATION.value] == OtelAttr.CHAT_COMPLETION_OPERATION

Copilot AI Feb 3, 2026


The test does not verify that the streaming metrics are recorded when an error occurs. While it checks that the span contains the correct operation attribute, it should also verify that the time_to_first_chunk and client_operation_duration metrics were recorded (since at least one chunk was received before the error). This would ensure that the error handling path correctly records partial metrics.
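With the same hypothetical _recorded_metric_names helper and metric_reader fixture from the sketch above, the error path could assert that partial metrics survived the exception:

# After the pytest.raises block has consumed the failing stream:
names = _recorded_metric_names(metric_reader)
assert "gen_ai.client.operation.time_to_first_chunk" in names  # the first chunk arrived before the error
assert "gen_ai.client.operation.duration" in names  # duration is recorded even on failure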

start_time_stamp = perf_counter()
first_chunk_time: float | None = None
previous_chunk_time: float | None = None
chunk_count = 0

Copilot AI Feb 3, 2026


The variable name 'chunk_count' is misleading because it actually represents the number of inter-chunk intervals (which is one less than the total number of chunks). For example, if 3 chunks are received, chunk_count will be 2. Consider renaming this to 'inter_chunk_count' or 'chunk_intervals_count' to make the intent clearer and avoid confusion.

start_time: The start time of the streaming operation.
first_chunk_time: The time when the first chunk arrived.
end_time: The end time of the streaming operation.
chunk_count: The number of chunks after the first chunk.

Copilot AI Feb 3, 2026


The docstring states that 'chunk_count' is "The number of chunks after the first chunk", but this is imprecise. It's actually the number of inter-chunk intervals (i.e., transitions between chunks after the first one). For 3 total chunks, this value would be 2. Consider updating the docstring to say "The number of inter-chunk intervals (chunks received after the first chunk)" to be more precise.

Suggested change:
-    chunk_count: The number of chunks after the first chunk.
+    chunk_count: The number of inter-chunk intervals (chunks received after the first chunk).



def _record_streaming_metrics(
    span: trace.Span,

Copilot AI Feb 3, 2026


The 'span' parameter is not used in the function body. It's passed in but never referenced. Consider removing this parameter if it's not needed, or add a comment explaining why it's included for future extensibility.

Suggested change:
-    span: trace.Span,
+    span: trace.Span,  # Span parameter is currently unused; retained for future span-based metrics and API consistency.

Comment on lines +1318 to +1335
# Record streaming-specific metrics
if first_chunk_time is not None:
    _record_streaming_metrics(
        span=span,
        attributes=attributes,
        start_time=start_time_stamp,
        first_chunk_time=first_chunk_time,
        end_time=end_time_stamp,
        chunk_count=chunk_count,
        total_inter_chunk_time=total_inter_chunk_time,
        time_to_first_chunk_histogram=self.additional_properties["time_to_first_chunk_histogram"],
        time_per_output_chunk_histogram=self.additional_properties["time_per_output_chunk_histogram"],
        client_operation_duration_histogram=self.additional_properties["client_operation_duration_histogram"],
    )

Copilot AI Feb 3, 2026


Two separate duration metrics are being recorded for streaming operations: 'gen_ai.operation.duration' (via _capture_response) and 'gen_ai.client.operation.duration' (via _record_streaming_metrics). While this might be intentional to provide both a general operation duration and a client-specific streaming duration, consider documenting the distinction between these metrics or consolidating them if they measure the same thing. Both are calculated as end_time - start_time over the same time period.

Comment on lines +1286 to +1303
# Record metrics even if exception occurred (if we got at least one chunk)
if first_chunk_time is not None:
    _record_streaming_metrics(
        span=span,
        attributes=attributes,
        start_time=start_time_stamp,
        first_chunk_time=first_chunk_time,
        end_time=end_time_stamp,
        chunk_count=chunk_count,
        total_inter_chunk_time=total_inter_chunk_time,
        time_to_first_chunk_histogram=self.additional_properties["time_to_first_chunk_histogram"],
        time_per_output_chunk_histogram=self.additional_properties["time_per_output_chunk_histogram"],
        client_operation_duration_histogram=self.additional_properties["client_operation_duration_histogram"],
    )

Copilot AI Feb 3, 2026


When an exception occurs during streaming, the error type is recorded on the span via 'capture_exception' (line 1285) but is not added to the 'attributes' dictionary before calling '_record_streaming_metrics'. This means the streaming metrics recorded in the error path won't include the error type in their attributes. Consider updating the attributes dictionary with the error type before calling _record_streaming_metrics, similar to how it's done in the success path via _get_response_attributes.
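A minimal way to act on this suggestion: enrich the attributes dict with the error type before recording, using the OTel error.type semantic-convention key (the helper name here is hypothetical, and the framework may use its own attribute constant):

def _with_error_type(attributes: dict, exc: BaseException) -> dict:
    # Copy the attributes and tag them with the failure type so the error-path
    # histograms can be filtered by failure mode, mirroring the success path.
    return {**attributes, "error.type": type(exc).__qualname__}

# In the exception handler, before calling _record_streaming_metrics:
# attributes = _with_error_type(attributes, exc)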

@castlenthesky
Author

@microsoft-github-policy-service agree

