Python: Feature: telemetry for tracking provider latency #3631
Conversation
… measurements

* Added TIME_TO_FIRST_CHUNK_BUCKET_BOUNDARIES and TIME_PER_OUTPUT_CHUNK_BUCKET_BOUNDARIES for improved metric tracking.
* Implemented _get_time_to_first_chunk_histogram and _get_time_per_output_chunk_histogram functions to create new histograms.
* Updated _trace_get_streaming_response to record metrics for time to first chunk and time per output chunk.
* Introduced _record_streaming_metrics function to handle the recording of streaming-specific metrics.
Pull request overview
This PR adds telemetry metrics for tracking streaming provider latency in the agent framework's observability module. The implementation introduces three new OpenTelemetry metrics to measure streaming operation performance from the client's perspective.
Changes:
- Added three new histogram metrics for streaming latency: `gen_ai.client.operation.time_to_first_chunk`, `gen_ai.client.operation.time_per_output_chunk`, and `gen_ai.client.operation.duration`
- Modified `_trace_get_streaming_response` to track timing information for chunks and record streaming-specific metrics
- Added bucket boundary configurations optimized for streaming latency measurements
- Created tests to verify streaming metrics are recorded during both successful and error scenarios
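The per-chunk timing bookkeeping these changes describe can be sketched as a thin wrapper around the chunk stream. This is a minimal stand-alone sketch, not the PR's actual implementation: `timed_stream` and `record_histogram` are hypothetical names, and real code would record to OpenTelemetry histograms instead of a callback.

```python
# Sketch of the streaming-latency bookkeeping described above.
# `record_histogram` is a hypothetical stand-in for OpenTelemetry histogram.record().
from time import perf_counter


async def timed_stream(stream, record_histogram):
    """Wrap an async chunk stream, measuring time-to-first-chunk and inter-chunk gaps."""
    start = perf_counter()
    first_chunk_time = None      # perf_counter() value when the first chunk arrived
    previous_chunk_time = None   # perf_counter() value of the most recent chunk
    chunk_count = 0              # number of inter-chunk intervals (chunks after the first)
    total_inter_chunk_time = 0.0
    async for chunk in stream:
        now = perf_counter()
        if first_chunk_time is None:
            first_chunk_time = now
            record_histogram("gen_ai.client.operation.time_to_first_chunk", now - start)
        else:
            chunk_count += 1
            total_inter_chunk_time += now - previous_chunk_time
        previous_chunk_time = now
        yield chunk
    end = perf_counter()
    if chunk_count:
        # Average gap between consecutive chunks after the first one.
        record_histogram("gen_ai.client.operation.time_per_output_chunk",
                         total_inter_chunk_time / chunk_count)
    record_histogram("gen_ai.client.operation.duration", end - start)
```

Note that with three chunks there are only two inter-chunk intervals, which is the naming subtlety the review comments below pick up on.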
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| python/packages/core/agent_framework/observability.py | Added new histogram creation functions, bucket boundaries for streaming metrics, modified streaming response tracing to track chunk timing, and implemented _record_streaming_metrics helper function |
| python/packages/core/tests/core/test_observability.py | Added test fixtures and test cases for streaming metrics recording in both success and error scenarios |
```python
async def test_streaming_metrics_recorded(mock_timed_streaming_chat_client, span_exporter: InMemorySpanExporter):
    """Test that streaming specific metrics are recorded correctly."""
    client = use_instrumentation(mock_timed_streaming_chat_client)()
    messages = [ChatMessage(role=Role.USER, text="Test")]
    span_exporter.clear()

    updates = []
    async for update in client.get_streaming_response(messages=messages, model_id="TestStreaming"):
        updates.append(update)

    assert len(updates) == 3
    spans = span_exporter.get_finished_spans()
    assert len(spans) == 1
    span = spans[0]
    # Check that execution completed successfully and span was created
    assert span.name == "chat TestStreaming"
    assert span.attributes[OtelAttr.OPERATION.value] == OtelAttr.CHAT_COMPLETION_OPERATION
```
Copilot AI (Feb 3, 2026):
The test does not verify that the new streaming metrics are actually recorded. It only checks that the span was created and has the correct operation attribute. Consider adding assertions to verify that the time_to_first_chunk, time_per_output_chunk, and client_operation_duration metrics were recorded with expected values or at least recorded at all. This would ensure the feature is working as intended.
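One way the reviewer's suggestion could be realized is with a small assertion helper. The fixture shape here is hypothetical: it assumes the test harness can expose recorded histogram values as a mapping from metric name to recorded values (as an in-memory metric reader fixture might provide).

```python
# Hypothetical helper sketch: assert that the streaming metrics introduced in
# this PR were actually recorded. `recorded` maps metric name -> recorded values.
def assert_streaming_metrics_recorded(recorded: dict, expect_intervals: bool = True) -> None:
    assert recorded.get("gen_ai.client.operation.time_to_first_chunk"), \
        "time_to_first_chunk was not recorded"
    assert recorded.get("gen_ai.client.operation.duration"), \
        "client operation duration was not recorded"
    if expect_intervals:
        # Only meaningful when more than one chunk was received.
        values = recorded.get("gen_ai.client.operation.time_per_output_chunk")
        assert values and all(v >= 0 for v in values), \
            "time_per_output_chunk was not recorded or contained invalid values"
```

A test could call this after consuming the stream, with `expect_intervals=False` for single-chunk or early-error scenarios.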
```python
async def test_streaming_metrics_with_error(mock_error_streaming_chat_client, span_exporter: InMemorySpanExporter):
    """Test that metrics are recorded even if the stream fails after the first chunk."""
    client = use_instrumentation(mock_error_streaming_chat_client)()
    messages = [ChatMessage(role=Role.USER, text="Test")]
    span_exporter.clear()

    with pytest.raises(ValueError, match="Stream interrupted"):
        async for _ in client.get_streaming_response(messages=messages, model_id="TestError"):
            pass

    spans = span_exporter.get_finished_spans()
    assert len(spans) == 1
    span = spans[0]
    assert span.attributes[OtelAttr.OPERATION.value] == OtelAttr.CHAT_COMPLETION_OPERATION
```
Copilot AI (Feb 3, 2026):
The test does not verify that the streaming metrics are recorded when an error occurs. While it checks that the span contains the correct operation attribute, it should also verify that the time_to_first_chunk and client_operation_duration metrics were recorded (since at least one chunk was received before the error). This would ensure that the error handling path correctly records partial metrics.
```python
start_time_stamp = perf_counter()
first_chunk_time: float | None = None
previous_chunk_time: float | None = None
chunk_count = 0
```
Copilot AI (Feb 3, 2026):
The variable name 'chunk_count' is misleading because it actually represents the number of inter-chunk intervals (which is one less than the total number of chunks). For example, if 3 chunks are received, chunk_count will be 2. Consider renaming this to 'inter_chunk_count' or 'chunk_intervals_count' to make the intent clearer and avoid confusion.
```python
start_time: The start time of the streaming operation.
first_chunk_time: The time when the first chunk arrived.
end_time: The end time of the streaming operation.
chunk_count: The number of chunks after the first chunk.
```
Copilot AI (Feb 3, 2026):
The docstring states that 'chunk_count' is "The number of chunks after the first chunk", but this is imprecise. It's actually the number of inter-chunk intervals (i.e., transitions between chunks after the first one). For 3 total chunks, this value would be 2. Consider updating the docstring to say "The number of inter-chunk intervals (chunks received after the first chunk)" to be more precise.
Suggested change:

```diff
-chunk_count: The number of chunks after the first chunk.
+chunk_count: The number of inter-chunk intervals (chunks received after the first chunk).
```
```python
def _record_streaming_metrics(
    span: trace.Span,
```
Copilot AI (Feb 3, 2026):
The 'span' parameter is not used in the function body. It's passed in but never referenced. Consider removing this parameter if it's not needed, or add a comment explaining why it's included for future extensibility.
Suggested change:

```diff
-    span: trace.Span,
+    span: trace.Span,  # Currently unused; retained for future span-based metrics and API consistency.
```
```python
# Record streaming-specific metrics
if first_chunk_time is not None:
    _record_streaming_metrics(
        span=span,
        attributes=attributes,
        start_time=start_time_stamp,
        first_chunk_time=first_chunk_time,
        end_time=end_time_stamp,
        chunk_count=chunk_count,
        total_inter_chunk_time=total_inter_chunk_time,
        time_to_first_chunk_histogram=self.additional_properties["time_to_first_chunk_histogram"],
        time_per_output_chunk_histogram=self.additional_properties["time_per_output_chunk_histogram"],
        client_operation_duration_histogram=self.additional_properties["client_operation_duration_histogram"],
    )
```
Copilot AI (Feb 3, 2026):
Two separate duration metrics are being recorded for streaming operations: 'gen_ai.operation.duration' (via _capture_response) and 'gen_ai.client.operation.duration' (via _record_streaming_metrics). While this might be intentional to provide both a general operation duration and a client-specific streaming duration, consider documenting the distinction between these metrics or consolidating them if they measure the same thing. Both are calculated as end_time - start_time over the same time period.
```python
# Record metrics even if exception occurred (if we got at least one chunk)
if first_chunk_time is not None:
    _record_streaming_metrics(
        span=span,
        attributes=attributes,
        start_time=start_time_stamp,
        first_chunk_time=first_chunk_time,
        end_time=end_time_stamp,
        chunk_count=chunk_count,
        total_inter_chunk_time=total_inter_chunk_time,
        time_to_first_chunk_histogram=self.additional_properties["time_to_first_chunk_histogram"],
        time_per_output_chunk_histogram=self.additional_properties["time_per_output_chunk_histogram"],
        client_operation_duration_histogram=self.additional_properties["client_operation_duration_histogram"],
    )
```
Copilot AI (Feb 3, 2026):
When an exception occurs during streaming, the error type is recorded on the span via 'capture_exception' (line 1285) but is not added to the 'attributes' dictionary before calling '_record_streaming_metrics'. This means the streaming metrics recorded in the error path won't include the error type in their attributes. Consider updating the attributes dictionary with the error type before calling _record_streaming_metrics, similar to how it's done in the success path via _get_response_attributes.
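The reviewer's suggestion could be sketched as a small helper that merges the error type into the metric attributes before the error-path call to `_record_streaming_metrics`. This is an illustrative sketch only: `attributes_with_error` is a hypothetical helper name, and the attribute key follows the OpenTelemetry `error.type` convention.

```python
# Sketch of the suggested fix: carry the error type into the metric attributes
# recorded on the error path. The key "error.type" follows OpenTelemetry
# semantic conventions; the helper name itself is hypothetical.
OTEL_ERROR_TYPE_KEY = "error.type"


def attributes_with_error(attributes: dict, exception: BaseException) -> dict:
    """Return a copy of the metric attributes that also carries the error type."""
    updated = dict(attributes)  # copy so the original span attributes stay untouched
    updated[OTEL_ERROR_TYPE_KEY] = type(exception).__qualname__
    return updated
```

The error path would then pass `attributes=attributes_with_error(attributes, exc)` when recording the partial streaming metrics, mirroring how the success path enriches attributes via `_get_response_attributes`.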
@microsoft-github-policy-service agree
Motivation and Context
Description
Adding spans/metrics to track streaming latency. Specific metrics added to the OTel exports:

- `gen_ai.client.operation.time_to_first_chunk`
- `gen_ai.client.operation.time_per_output_chunk`

Contribution Checklist