
Conversation

@castlenthesky

Motivation and Context

  1. Why is this change required? - It adds observability into latency from the LLM provider.
  2. What problem does it solve? - Current telemetry and observability provide no insight into where latency sits during response generation. This PR adds a degree of insight into that question.
  3. What scenario does it contribute to? - It contributes to scenarios where teams are trying to optimize for speed.
  4. If it fixes an open issue, please link to the issue here. - Open issue link

Description

Adds spans and metrics to track streaming latency. The following metrics are added to the OTel exports (see the sketch after this list):

  • gen_ai.client.operation.time_to_first_chunk
  • gen_ai.client.operation.time_per_output_chunk
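For orientation, creating client-side histograms like these with the OpenTelemetry Python SDK generally looks like the sketch below. The constant and helper names mirror the commit summary further down; the boundary values and the meter wiring are illustrative assumptions, not the values in this PR, and the explicit_bucket_boundaries_advisory keyword requires opentelemetry >= 1.23.

from opentelemetry.metrics import Histogram, Meter

# Illustrative boundaries in seconds; the PR defines its own tuned values.
TIME_TO_FIRST_CHUNK_BUCKET_BOUNDARIES = [0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0]
TIME_PER_OUTPUT_CHUNK_BUCKET_BOUNDARIES = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]

def _get_time_to_first_chunk_histogram(meter: Meter) -> Histogram:
    # Latency from request start until the first streamed chunk arrives.
    return meter.create_histogram(
        name="gen_ai.client.operation.time_to_first_chunk",
        unit="s",
        description="Time from request start to first streamed chunk.",
        explicit_bucket_boundaries_advisory=TIME_TO_FIRST_CHUNK_BUCKET_BOUNDARIES,
    )

def _get_time_per_output_chunk_histogram(meter: Meter) -> Histogram:
    # Average interval between consecutive streamed chunks.
    return meter.create_histogram(
        name="gen_ai.client.operation.time_per_output_chunk",
        unit="s",
        description="Average time between consecutive streamed chunks.",
        explicit_bucket_boundaries_advisory=TIME_PER_OUTPUT_CHUNK_BUCKET_BOUNDARIES,
    )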

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

… measurements

* Added TIME_TO_FIRST_CHUNK_BUCKET_BOUNDARIES and TIME_PER_OUTPUT_CHUNK_BUCKET_BOUNDARIES for improved metric tracking.
* Implemented _get_time_to_first_chunk_histogram and _get_time_per_output_chunk_histogram functions to create new histograms.
* Updated _trace_get_streaming_response to record metrics for time to first chunk and time per output chunk.
* Introduced _record_streaming_metrics function to handle the recording of streaming-specific metrics.
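
Taken together, these commits imply per-chunk timing bookkeeping roughly like the following. This is a minimal sketch using the variable names visible in the diff hunks quoted in the review below; the real method also manages the span and the surrounding tracing logic, which is elided here.

from collections.abc import AsyncIterator
from time import perf_counter

async def _timed_stream(stream: AsyncIterator):
    """Sketch: wrap a chunk stream and accumulate the timing inputs for the metrics."""
    start_time_stamp = perf_counter()
    first_chunk_time: float | None = None
    previous_chunk_time: float | None = None
    chunk_count = 0
    total_inter_chunk_time = 0.0

    async for update in stream:
        now = perf_counter()
        if first_chunk_time is None:
            # time_to_first_chunk = first_chunk_time - start_time_stamp
            first_chunk_time = now
        else:
            # chunk_count counts inter-chunk intervals, not chunks (see the review note below)
            total_inter_chunk_time += now - previous_chunk_time
            chunk_count += 1
        previous_chunk_time = now
        yield update

    # time_per_output_chunk = total_inter_chunk_time / chunk_count (when chunk_count > 0)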
Copilot AI review requested due to automatic review settings February 3, 2026 00:28
@github-actions github-actions bot changed the title from "Feature: telemetry for tracking provider latency" to "Python: Feature: telemetry for tracking provider latency" on Feb 3, 2026

Copilot AI left a comment


Pull request overview

This PR adds telemetry metrics for tracking streaming provider latency in the agent framework's observability module. The implementation introduces three new OpenTelemetry metrics to measure streaming operation performance from the client's perspective.

Changes:

  • Added three new histogram metrics for streaming latency: gen_ai.client.operation.time_to_first_chunk, gen_ai.client.operation.time_per_output_chunk, and gen_ai.client.operation.duration
  • Modified trace_get_streaming_response to track timing information for chunks and record streaming-specific metrics
  • Added bucket boundaries configurations optimized for streaming latency measurements
  • Created tests to verify streaming metrics are recorded during both successful and error scenarios

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

Files reviewed:

  • python/packages/core/agent_framework/observability.py — Added new histogram creation functions, bucket boundaries for streaming metrics, modified streaming-response tracing to track chunk timing, and implemented the _record_streaming_metrics helper function
  • python/packages/core/tests/core/test_observability.py — Added test fixtures and test cases for streaming-metrics recording in both success and error scenarios

Comment on lines +864 to +880
async def test_streaming_metrics_recorded(mock_timed_streaming_chat_client, span_exporter: InMemorySpanExporter):
    """Test that streaming specific metrics are recorded correctly."""
    client = use_instrumentation(mock_timed_streaming_chat_client)()
    messages = [ChatMessage(role=Role.USER, text="Test")]
    span_exporter.clear()

    updates = []
    async for update in client.get_streaming_response(messages=messages, model_id="TestStreaming"):
        updates.append(update)

    assert len(updates) == 3
    spans = span_exporter.get_finished_spans()
    assert len(spans) == 1
    span = spans[0]
    # Check that execution completed successfully and span was created
    assert span.name == "chat TestStreaming"
    assert span.attributes[OtelAttr.OPERATION.value] == OtelAttr.CHAT_COMPLETION_OPERATION

Copilot AI Feb 3, 2026


The test does not verify that the new streaming metrics are actually recorded. It only checks that the span was created and has the correct operation attribute. Consider adding assertions to verify that the time_to_first_chunk, time_per_output_chunk, and client_operation_duration metrics were recorded with expected values or at least recorded at all. This would ensure the feature is working as intended.
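One way to add such assertions, assuming the test suite exposes an SDK InMemoryMetricReader (the metric_reader fixture here is hypothetical; InMemoryMetricReader and its get_metrics_data() are standard SDK API):

from opentelemetry.sdk.metrics.export import InMemoryMetricReader

def _recorded_metric_names(reader: InMemoryMetricReader) -> set[str]:
    # Flatten the reader's exported data into the set of recorded metric names.
    data = reader.get_metrics_data()
    if data is None:
        return set()
    return {
        metric.name
        for resource_metrics in data.resource_metrics
        for scope_metrics in resource_metrics.scope_metrics
        for metric in scope_metrics.metrics
    }

# At the end of the test, after the stream has been consumed:
# names = _recorded_metric_names(metric_reader)
# assert "gen_ai.client.operation.time_to_first_chunk" in names
# assert "gen_ai.client.operation.time_per_output_chunk" in names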

Comment on lines +903 to +916
async def test_streaming_metrics_with_error(mock_error_streaming_chat_client, span_exporter: InMemorySpanExporter):
    """Test that metrics are recorded even if the stream fails after the first chunk."""
    client = use_instrumentation(mock_error_streaming_chat_client)()
    messages = [ChatMessage(role=Role.USER, text="Test")]
    span_exporter.clear()

    with pytest.raises(ValueError, match="Stream interrupted"):
        async for _ in client.get_streaming_response(messages=messages, model_id="TestError"):
            pass

    spans = span_exporter.get_finished_spans()
    assert len(spans) == 1
    span = spans[0]
    assert span.attributes[OtelAttr.OPERATION.value] == OtelAttr.CHAT_COMPLETION_OPERATION

Copilot AI Feb 3, 2026


The test does not verify that the streaming metrics are recorded when an error occurs. While it checks that the span contains the correct operation attribute, it should also verify that the time_to_first_chunk and client_operation_duration metrics were recorded (since at least one chunk was received before the error). This would ensure that the error handling path correctly records partial metrics.
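With the same hypothetical _recorded_metric_names helper and metric_reader fixture from the sketch above, the error path could assert that partial metrics survived the exception:

# After the pytest.raises block has consumed the failing stream:
names = _recorded_metric_names(metric_reader)
assert "gen_ai.client.operation.time_to_first_chunk" in names  # the first chunk arrived before the error
assert "gen_ai.client.operation.duration" in names  # duration is recorded even on failure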

start_time_stamp = perf_counter()
first_chunk_time: float | None = None
previous_chunk_time: float | None = None
chunk_count = 0

Copilot AI Feb 3, 2026


The variable name 'chunk_count' is misleading because it actually represents the number of inter-chunk intervals (which is one less than the total number of chunks). For example, if 3 chunks are received, chunk_count will be 2. Consider renaming this to 'inter_chunk_count' or 'chunk_intervals_count' to make the intent clearer and avoid confusion.

start_time: The start time of the streaming operation.
first_chunk_time: The time when the first chunk arrived.
end_time: The end time of the streaming operation.
chunk_count: The number of chunks after the first chunk.

Copilot AI Feb 3, 2026


The docstring states that 'chunk_count' is "The number of chunks after the first chunk", but this is imprecise. It's actually the number of inter-chunk intervals (i.e., transitions between chunks after the first one). For 3 total chunks, this value would be 2. Consider updating the docstring to say "The number of inter-chunk intervals (chunks received after the first chunk)" to be more precise.

Suggested change:
-    chunk_count: The number of chunks after the first chunk.
+    chunk_count: The number of inter-chunk intervals (chunks received after the first chunk).



def _record_streaming_metrics(
    span: trace.Span,

Copilot AI Feb 3, 2026


The 'span' parameter is not used in the function body. It's passed in but never referenced. Consider removing this parameter if it's not needed, or add a comment explaining why it's included for future extensibility.

Suggested change:
-    span: trace.Span,
+    span: trace.Span,  # Span parameter is currently unused; retained for future span-based metrics and API consistency.

Comment on lines +1318 to +1335
# Record streaming-specific metrics
if first_chunk_time is not None:
    _record_streaming_metrics(
        span=span,
        attributes=attributes,
        start_time=start_time_stamp,
        first_chunk_time=first_chunk_time,
        end_time=end_time_stamp,
        chunk_count=chunk_count,
        total_inter_chunk_time=total_inter_chunk_time,
        time_to_first_chunk_histogram=self.additional_properties["time_to_first_chunk_histogram"],
        time_per_output_chunk_histogram=self.additional_properties["time_per_output_chunk_histogram"],
        client_operation_duration_histogram=self.additional_properties["client_operation_duration_histogram"],
    )

Copilot AI Feb 3, 2026


Two separate duration metrics are being recorded for streaming operations: 'gen_ai.operation.duration' (via _capture_response) and 'gen_ai.client.operation.duration' (via _record_streaming_metrics). While this might be intentional to provide both a general operation duration and a client-specific streaming duration, consider documenting the distinction between these metrics or consolidating them if they measure the same thing. Both are calculated as end_time - start_time over the same time period.

Comment on lines +1286 to +1303
# Record metrics even if exception occurred (if we got at least one chunk)
if first_chunk_time is not None:
    _record_streaming_metrics(
        span=span,
        attributes=attributes,
        start_time=start_time_stamp,
        first_chunk_time=first_chunk_time,
        end_time=end_time_stamp,
        chunk_count=chunk_count,
        total_inter_chunk_time=total_inter_chunk_time,
        time_to_first_chunk_histogram=self.additional_properties["time_to_first_chunk_histogram"],
        time_per_output_chunk_histogram=self.additional_properties["time_per_output_chunk_histogram"],
        client_operation_duration_histogram=self.additional_properties["client_operation_duration_histogram"],
    )

Copilot AI Feb 3, 2026


When an exception occurs during streaming, the error type is recorded on the span via 'capture_exception' (line 1285) but is not added to the 'attributes' dictionary before calling '_record_streaming_metrics'. This means the streaming metrics recorded in the error path won't include the error type in their attributes. Consider updating the attributes dictionary with the error type before calling _record_streaming_metrics, similar to how it's done in the success path via _get_response_attributes.
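A minimal way to act on this suggestion: enrich the attributes dict with the error type before recording, using the OTel error.type semantic-convention key (the helper name here is hypothetical, and the framework may use its own attribute constant):

def _with_error_type(attributes: dict, exc: BaseException) -> dict:
    # Copy the attributes and tag them with the failure type so the error-path
    # histograms can be filtered by failure mode, mirroring the success path.
    return {**attributes, "error.type": type(exc).__qualname__}

# In the exception handler, before calling _record_streaming_metrics:
# attributes = _with_error_type(attributes, exc)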

@castlenthesky
Author

@microsoft-github-policy-service agree

