[VEC-76] Instrument observability for api#200
Conversation
📝 WalkthroughWalkthroughThis pull request integrates OpenTelemetry tracing throughout the application. It introduces configuration settings for enabling telemetry and specifying the OTLP endpoint, implements core telemetry setup and shutdown utilities, integrates instrumentation into the FastAPI app lifecycle, enhances middleware to track request IDs in spans, adds required dependencies, and provides comprehensive test coverage. Changes
Sequence Diagram(s)sequenceDiagram
participant App as FastAPI App
participant Telemetry as Telemetry Module
participant OTLPExp as OTLP Exporter
participant Provider as TracerProvider
participant Middleware as RequestIdMiddleware
participant SQLAlchemy as SQLAlchemy
App->>Telemetry: setup_telemetry(app, settings)
alt otel_enabled = true
Telemetry->>Provider: Create TracerProvider
Telemetry->>OTLPExp: Create OTLPSpanExporter
Telemetry->>Provider: Add BatchSpanProcessor
Telemetry->>Telemetry: Instrument FastAPI app
Telemetry->>Telemetry: Instrument SQLAlchemy<br/>(if provider active)
Telemetry-->>App: Return TracerProvider
else otel_enabled = false
Telemetry-->>App: Return None
end
App->>SQLAlchemy: Engine instrumentation<br/>(if provider available)
rect rgba(100, 150, 200, 0.5)
Note over Middleware,Provider: Request Handling
App->>Middleware: Receive HTTP request
Middleware->>Provider: get_current_span()
Middleware->>Provider: Set span attribute<br/>request_id
Middleware-->>App: Continue processing
end
App->>Telemetry: shutdown_telemetry(provider)
alt provider is not None
Telemetry->>Provider: provider.shutdown()
Provider-->>OTLPExp: Flush pending spans
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
9dc807e to
4f5e12f
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces basic OpenTelemetry tracing support for the Python FastAPI service, wiring telemetry setup/shutdown into the app lifecycle and enriching spans with request IDs.
Changes:
- Add OpenTelemetry SDK + OTLP exporter + FastAPI/SQLAlchemy instrumentation dependencies.
- Add
app.telemetrymodule to configure tracing and instrument FastAPI + SQLAlchemy. - Extend Request ID middleware to attach the request id to the active span and add unit tests for telemetry behavior.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
stl-verify/python/pyproject.toml |
Adds OTel SDK/exporter/instrumentation dependencies. |
stl-verify/python/uv.lock |
Locks newly added OTel dependency set and transitive deps. |
stl-verify/python/app/telemetry.py |
New telemetry setup/shutdown + SQLAlchemy engine instrumentation helper. |
stl-verify/python/app/main.py |
Hooks telemetry setup into app creation and shutdown into lifespan cleanup; instruments SQLAlchemy engine. |
stl-verify/python/app/config.py |
Adds OTel-related settings (enable flag, OTLP endpoint, service name). |
stl-verify/python/app/middleware/request_id.py |
Adds request-id attribute onto the active OTel span. |
stl-verify/python/tests/unit/test_telemetry.py |
New unit tests covering telemetry setup/shutdown and request_id propagation to spans. |
| rid = request.headers.get(_REQUEST_ID_HEADER) or str(uuid.uuid4()) | ||
| token = request_id_var.set(rid) | ||
| span = trace.get_current_span() | ||
| span.set_attribute("request_id", rid) |
There was a problem hiding this comment.
X-Request-ID is accepted verbatim from the incoming request and added as an OpenTelemetry span attribute. Because this value is fully client-controlled, it can lead to high-cardinality / oversized attributes (increased telemetry cost, exporter payload bloat, and potential DoS if very large headers are sent).
Consider validating/sanitizing the inbound request id before storing it on the span (e.g., enforce a max length and allowed characters, or only accept UUIDs and otherwise generate a new id).
| @pytest.fixture(autouse=True) | ||
| def _reset_tracer_provider(): | ||
| """Ensure OTel global state is clean between tests.""" | ||
| original_provider = trace._TRACER_PROVIDER | ||
| original_done = trace._TRACER_PROVIDER_SET_ONCE._done | ||
| yield | ||
| trace._TRACER_PROVIDER = original_provider | ||
| trace._TRACER_PROVIDER_SET_ONCE._done = original_done |
There was a problem hiding this comment.
This fixture mutates OpenTelemetry private globals (trace._TRACER_PROVIDER and trace._TRACER_PROVIDER_SET_ONCE._done) to reset global state between tests. These internal attributes are not part of the public API and can change across OTel releases, making the test suite brittle.
Consider avoiding the global provider entirely in tests by refactoring the production code to pass an explicit tracer_provider into the instrumentors (FastAPI/SQLAlchemy) instead of calling trace.set_tracer_provider, or use a supported reset helper if OpenTelemetry provides one for testing.
| resource = Resource.create({"service.name": settings.otel_service_name}) | ||
| provider = TracerProvider(resource=resource) | ||
| exporter = OTLPSpanExporter(endpoint=settings.otel_exporter_otlp_endpoint) | ||
| provider.add_span_processor(BatchSpanProcessor(exporter)) | ||
| trace.set_tracer_provider(provider) | ||
|
|
||
| FastAPIInstrumentor.instrument_app(app) | ||
| return provider |
There was a problem hiding this comment.
setup_telemetry sets the process-global tracer provider via trace.set_tracer_provider. OpenTelemetry treats this as a set-once global; it makes multi-app usage (and unit tests) harder and is the reason the tests need to reach into private OTel globals to reset state.
Consider keeping the provider owned by this app instance and passing it explicitly into FastAPIInstrumentor.instrument_app(..., tracer_provider=provider) (and similarly into SQLAlchemy instrumentation) rather than setting the global provider. This avoids global side effects and improves test isolation.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
stl-verify/python/app/main.py (1)
20-32:⚠️ Potential issue | 🟠 MajorEnsure telemetry cleanup also runs on failed startup.
If startup raises before
yield, Lines 29-32 never execute, so the tracer provider created on Line 36 is left running. Wrap the full lifespan body intry/finallyand dispose the engine conditionally.Proposed fix
`@asynccontextmanager` async def lifespan(app: FastAPI) -> AsyncIterator[None]: - engine = create_async_engine(settings.async_database_url, pool_pre_ping=True) - async with engine.connect() as conn: - await conn.execute(text("SELECT 1")) - app.state.engine = engine - instrument_sqlalchemy_engine(engine) - async with httpx.AsyncClient() as http_client: - app.state.http_client = http_client - yield + engine = None try: - await engine.dispose() + engine = create_async_engine(settings.async_database_url, pool_pre_ping=True) + async with engine.connect() as conn: + await conn.execute(text("SELECT 1")) + app.state.engine = engine + instrument_sqlalchemy_engine(engine) + async with httpx.AsyncClient() as http_client: + app.state.http_client = http_client + yield finally: - shutdown_telemetry(app.state.tracer_provider) + try: + if engine is not None: + await engine.dispose() + finally: + shutdown_telemetry(app.state.tracer_provider)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@stl-verify/python/app/main.py` around lines 20 - 32, The lifespan coroutine (lifespan) must ensure telemetry shutdown runs even if startup fails: wrap the entire body of async def lifespan(app: FastAPI) in a try/finally so that shutdown_telemetry(app.state.tracer_provider) is always called; create the engine and set app.state.engine and instrument_sqlalchemy_engine(engine) inside the try, and in the finally conditionally await engine.dispose() only if engine was created/assigned (and similarly ensure app.state.http_client is closed if created), so startup exceptions before the yield still trigger telemetry cleanup and resource disposal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@stl-verify/python/app/telemetry.py`:
- Around line 20-26: setup_telemetry currently creates a new TracerProvider and
calls trace.set_tracer_provider(provider) but may return and later shut down a
provider that was never installed if a global provider already existed; change
setup_telemetry (and the provider variable usage) so after creating provider you
check the current global with trace.get_tracer_provider(), only call
trace.set_tracer_provider(provider) if the current global is the default/unset
one, and if set_tracer_provider was not applied return the actual global
provider (and avoid closing the unused local provider later); ensure
FastAPIInstrumentor.instrument_app(app) is still called and reference the
provider/global tracer appropriately when returning so the app uses the real
active provider.
In `@stl-verify/python/tests/unit/test_telemetry.py`:
- Around line 28-35: The _reset_tracer_provider fixture currently only restores
OpenTelemetry globals after a test, causing order-dependent leaks; change it to
call reset_trace_globals() from opentelemetry.test.globals_test both before and
after the yield so the global tracer state is pristine at test start and
restored at test end (update the fixture _reset_tracer_provider to import
reset_trace_globals and invoke it before yield and again after yield); reference
the fixture name _reset_tracer_provider and the OpenTelemetry globals
(trace.set_tracer_provider, trace._TRACER_PROVIDER) to locate where to make this
change and to verify tests such as test_setup_telemetry_noop_when_disabled and
the test that calls trace.set_tracer_provider behave deterministically.
---
Outside diff comments:
In `@stl-verify/python/app/main.py`:
- Around line 20-32: The lifespan coroutine (lifespan) must ensure telemetry
shutdown runs even if startup fails: wrap the entire body of async def
lifespan(app: FastAPI) in a try/finally so that
shutdown_telemetry(app.state.tracer_provider) is always called; create the
engine and set app.state.engine and instrument_sqlalchemy_engine(engine) inside
the try, and in the finally conditionally await engine.dispose() only if engine
was created/assigned (and similarly ensure app.state.http_client is closed if
created), so startup exceptions before the yield still trigger telemetry cleanup
and resource disposal.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: aaed5343-be74-4cdb-bcd8-1fdca548e585
⛔ Files ignored due to path filters (1)
stl-verify/python/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (6)
stl-verify/python/app/config.pystl-verify/python/app/main.pystl-verify/python/app/middleware/request_id.pystl-verify/python/app/telemetry.pystl-verify/python/pyproject.tomlstl-verify/python/tests/unit/test_telemetry.py
| provider = TracerProvider(resource=resource) | ||
| exporter = OTLPSpanExporter(endpoint=settings.otel_exporter_otlp_endpoint) | ||
| provider.add_span_processor(BatchSpanProcessor(exporter)) | ||
| trace.set_tracer_provider(provider) | ||
|
|
||
| FastAPIInstrumentor.instrument_app(app) | ||
| return provider |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In OpenTelemetry Python, can trace.set_tracer_provider() replace an existing global tracer provider, or is it effectively set only once per process?
💡 Result:
In OpenTelemetry Python, trace.set_tracer_provider cannot replace an existing global tracer provider. It is effectively set only once per process. Subsequent calls log a warning "Overriding of current TracerProvider is not allowed" and do not change the provider, enforced by the _TRACER_PROVIDER_SET_ONCE mechanism using a Once object in the API source code. This design prevents accidental overrides, as discussed in GitHub issues and PRs where attempts to allow overrides were rejected or modified to warn/block. The first call (explicit or implicit via get_tracer_provider) sets it permanently. For multiple providers, create them separately without using the global trace.get_tracer; use provider.get_tracer directly.
Citations:
- 1: https://stackoverflow.com/questions/76076082/overriding-of-current-tracerprovider-is-not-allowed
- 2: https://github.com/open-telemetry/opentelemetry-python/blob/eed100c3/opentelemetry-api/src/opentelemetry/trace/__init__.py
- 3: DefaultTracerProvider set permanently in global var, does not look up user-configurable provider open-telemetry/opentelemetry-python#1159
- 4: Added a warning log when the existing TracerProvider and MeterProvider are overridden open-telemetry/opentelemetry-python#856
- 5: Stop TracerProvider from being overridden open-telemetry/opentelemetry-python#959
- 6: https://www.tulletts.co.uk
🏁 Script executed:
# Find and examine the telemetry.py file
find . -name "telemetry.py" -type fRepository: archon-research/stl
Length of output: 99
🏁 Script executed:
# Once we find it, let's look at the setup_telemetry and shutdown_telemetry functions
fd telemetry.py -type f --exec cat -n {} \;Repository: archon-research/stl
Length of output: 233
🏁 Script executed:
# Also search for main.py to see how setup_telemetry is called
find . -name "main.py" -type f | head -5Repository: archon-research/stl
Length of output: 94
🏁 Script executed:
cat -n ./stl-verify/python/app/telemetry.pyRepository: archon-research/stl
Length of output: 2137
🏁 Script executed:
cat -n ./stl-verify/python/app/main.pyRepository: archon-research/stl
Length of output: 2014
Prevent mismatched provider lifecycle when a global provider is already registered.
trace.set_tracer_provider() enforces single-assignment per process. If a global provider was previously registered, Line 23 leaves it unchanged and logs a warning, yet setup_telemetry() still returns the newly allocated local provider. The app then stores and later shuts down this unused provider while the actual tracing continues through the existing global provider.
Proposed fix
def setup_telemetry(app, settings: Settings) -> TracerProvider | None:
"""Configure OTel tracing if enabled. Returns the provider for shutdown."""
if not settings.otel_enabled:
return None
+
+ existing_provider = trace.get_tracer_provider()
+ if isinstance(existing_provider, TracerProvider):
+ FastAPIInstrumentor.instrument_app(app)
+ return existing_provider
resource = Resource.create({"service.name": settings.otel_service_name})
provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint=settings.otel_exporter_otlp_endpoint)
provider.add_span_processor(BatchSpanProcessor(exporter))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@stl-verify/python/app/telemetry.py` around lines 20 - 26, setup_telemetry
currently creates a new TracerProvider and calls
trace.set_tracer_provider(provider) but may return and later shut down a
provider that was never installed if a global provider already existed; change
setup_telemetry (and the provider variable usage) so after creating provider you
check the current global with trace.get_tracer_provider(), only call
trace.set_tracer_provider(provider) if the current global is the default/unset
one, and if set_tracer_provider was not applied return the actual global
provider (and avoid closing the unused local provider later); ensure
FastAPIInstrumentor.instrument_app(app) is still called and reference the
provider/global tracer appropriately when returning so the app uses the real
active provider.
| @pytest.fixture(autouse=True) | ||
| def _reset_tracer_provider(): | ||
| """Ensure OTel global state is clean between tests.""" | ||
| original_provider = trace._TRACER_PROVIDER | ||
| original_done = trace._TRACER_PROVIDER_SET_ONCE._done | ||
| yield | ||
| trace._TRACER_PROVIDER = original_provider | ||
| trace._TRACER_PROVIDER_SET_ONCE._done = original_done |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Does OpenTelemetry Python expose a public way to reset the global tracer provider for test isolation, or is mutating trace._TRACER_PROVIDER still relying on private internals?
💡 Result:
No, OpenTelemetry Python does not expose a public way to reset the global tracer provider for test isolation. Mutating trace._TRACER_PROVIDER relies on private internals, as the provider is designed to be set only once via set_tracer_provider using an internal Once mechanism that prevents overrides and logs a warning on attempts. For testing, the official approach uses the test utilities package (opentelemetry-test-utils), which provides reset_trace_globals from opentelemetry.test.globals_test. This function resets the global state (including _TRACER_PROVIDER) specifically for tests, as seen in the test base class and various test files. Users must install opentelemetry-test (part of opentelemetry-test-utils) and import it, but this is not part of the public production API/SDK. Official docs and examples (e.g., instrumentation guides) show only set_tracer_provider without reset capability. Open issues like #4557 (Apr 2025, still open) request public reset functionality, and spec issue #3302 was closed as wontfix, recommending test support but no general reset. Production code should avoid globals or manage a single provider; tests use the private reset util.
Citations:
- 1: Allow resetting global providers open-telemetry/opentelemetry-python#4557
- 2: Allow resetting the global tracer provider open-telemetry/opentelemetry-specification#3302
- 3: https://github.com/open-telemetry/opentelemetry-python/blob/main/tests/opentelemetry-test-utils/src/opentelemetry/test/test_base.py
- 4: https://github.com/open-telemetry/opentelemetry-python/blob/eed100c3/opentelemetry-api/src/opentelemetry/trace/__init__.py
- 5: https://github.com/open-telemetry/opentelemetry-python/blob/eed100c3/opentelemetry-api/tests/trace/test_globals.py
- 6: https://opentelemetry.io/docs/languages/python/instrumentation
- 7: https://stackoverflow.com/questions/76076082/overriding-of-current-tracerprovider-is-not-allowed
- 8: https://stackoverflow.com/questions/67826617/open-telemetry-inmemoryspanexporter-not-reseting-after-test-case-is-completed
- 9: https://github.com/open-telemetry/opentelemetry-python/blob/eed100c3/opentelemetry-api/tests/metrics/test_meter_provider.py
- 10: https://opentelemetry.io/docs/instrumentation/python/getting-started/
- 11: https://opentelemetry.io/docs/languages/python
🏁 Script executed:
# Find project dependency files
find . -maxdepth 3 -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" -o -name "setup.cfg" -o -name "Pipfile" \) | head -20Repository: archon-research/stl
Length of output: 97
🏁 Script executed:
# Check the test file in question
cat -n stl-verify/python/tests/unit/test_telemetry.py | head -50Repository: archon-research/stl
Length of output: 1834
🏁 Script executed:
# Search for existing reset_trace_globals usage in the codebase
rg "reset_trace_globals" --type pyRepository: archon-research/stl
Length of output: 45
🏁 Script executed:
# Search for opentelemetry.test imports
rg "from opentelemetry.test" --type pyRepository: archon-research/stl
Length of output: 45
🏁 Script executed:
# Check project dependencies in pyproject.toml
cat -n stl-verify/python/pyproject.tomlRepository: archon-research/stl
Length of output: 1102
🏁 Script executed:
# Get more context on the test logic to understand why ordering matters
cat -n stl-verify/python/tests/unit/test_telemetry.py | sed -n '50,100p'Repository: archon-research/stl
Length of output: 2091
Use the official OpenTelemetry test utility to reset global tracer state between tests.
This fixture restores the global tracer provider after each test but does not reset it before the test starts. Tests like test_setup_telemetry_noop_when_disabled() (line 54–55) expect the global provider to be in a pristine state. However, if a previous test calls trace.set_tracer_provider() (as in line 77), the global state persists, causing test isolation failures and order-dependent test behavior.
The recommended approach is to use reset_trace_globals() from opentelemetry.test.globals_test (part of the opentelemetry-test-utils package). This is the official utility for resetting OpenTelemetry's global state in tests:
from opentelemetry.test.globals_test import reset_trace_globals
`@pytest.fixture`(autouse=True)
def _reset_tracer_provider():
"""Ensure OTel global state is clean between tests."""
reset_trace_globals()
yield
reset_trace_globals()If you prefer to avoid adding opentelemetry-test-utils as a dependency, the manual approach from the original review (resetting trace._TRACER_PROVIDER and trace._TRACER_PROVIDER_SET_ONCE._done before yield) is a valid fallback, though it relies on private internals.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@stl-verify/python/tests/unit/test_telemetry.py` around lines 28 - 35, The
_reset_tracer_provider fixture currently only restores OpenTelemetry globals
after a test, causing order-dependent leaks; change it to call
reset_trace_globals() from opentelemetry.test.globals_test both before and after
the yield so the global tracer state is pristine at test start and restored at
test end (update the fixture _reset_tracer_provider to import
reset_trace_globals and invoke it before yield and again after yield); reference
the fixture name _reset_tracer_provider and the OpenTelemetry globals
(trace.set_tracer_provider, trace._TRACER_PROVIDER) to locate where to make this
change and to verify tests such as test_setup_telemetry_noop_when_disabled and
the test that calls trace.set_tracer_provider behave deterministically.
This PR adds some basic telemetry setup to the python part of the project
Summary by CodeRabbit
New Features
Tests