Skip to content

[VEC-76] Instrument observability for api#200

Merged
toreluntang merged 2 commits into
mainfrom
vec-76-setup-otel-tracing
Apr 14, 2026
Merged

[VEC-76] Instrument observability for api#200
toreluntang merged 2 commits into
mainfrom
vec-76-setup-otel-tracing

Conversation

@toreluntang

@toreluntang toreluntang commented Apr 14, 2026

Copy link
Copy Markdown
Contributor

This PR adds some basic telemetry setup to the python part of the project

Summary by CodeRabbit

  • New Features

    • Added OpenTelemetry tracing and observability capabilities for improved system monitoring.
    • Enabled request ID tracking and propagation across distributed traces.
    • Integrated database query instrumentation for enhanced operational visibility.
    • Introduced configurable telemetry settings for service endpoints and names.
  • Tests

    • Added comprehensive test coverage for telemetry functionality and integration scenarios.

@coderabbitai

coderabbitai Bot commented Apr 14, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This pull request integrates OpenTelemetry tracing throughout the application. It introduces configuration settings for enabling telemetry and specifying the OTLP endpoint, implements core telemetry setup and shutdown utilities, integrates instrumentation into the FastAPI app lifecycle, enhances middleware to track request IDs in spans, adds required dependencies, and provides comprehensive test coverage.

Changes

Cohort / File(s) Summary
Configuration & Dependencies
stl-verify/python/app/config.py, stl-verify/python/pyproject.toml
Added three new Settings fields for OpenTelemetry configuration (otel_enabled, otel_exporter_otlp_endpoint, otel_service_name) and four OpenTelemetry-related dependencies (SDK, FastAPI instrumentor, SQLAlchemy instrumentor, OTLP exporter).
Telemetry Module
stl-verify/python/app/telemetry.py
New module providing setup_telemetry() to conditionally initialize TracerProvider with OTLP export, shutdown_telemetry() for graceful provider shutdown, and instrument_sqlalchemy_engine() for database tracing when a provider is active.
Application Integration
stl-verify/python/app/main.py, stl-verify/python/app/middleware/request_id.py
Integrated telemetry initialization during app startup and shutdown; instrumented SQLAlchemy engine; enhanced RequestIdMiddleware to set request ID as a span attribute for distributed tracing correlation.
Testing
stl-verify/python/tests/unit/test_telemetry.py
Comprehensive test suite covering telemetry setup behavior (enabled/disabled), provider lifecycle management, engine instrumentation safety, and end-to-end request ID propagation through request handling with span verification.

Sequence Diagram(s)

sequenceDiagram
    participant App as FastAPI App
    participant Telemetry as Telemetry Module
    participant OTLPExp as OTLP Exporter
    participant Provider as TracerProvider
    participant Middleware as RequestIdMiddleware
    participant SQLAlchemy as SQLAlchemy

    App->>Telemetry: setup_telemetry(app, settings)
    alt otel_enabled = true
        Telemetry->>Provider: Create TracerProvider
        Telemetry->>OTLPExp: Create OTLPSpanExporter
        Telemetry->>Provider: Add BatchSpanProcessor
        Telemetry->>Telemetry: Instrument FastAPI app
        Telemetry->>Telemetry: Instrument SQLAlchemy<br/>(if provider active)
        Telemetry-->>App: Return TracerProvider
    else otel_enabled = false
        Telemetry-->>App: Return None
    end

    App->>SQLAlchemy: Engine instrumentation<br/>(if provider available)
    
    rect rgba(100, 150, 200, 0.5)
        Note over Middleware,Provider: Request Handling
        App->>Middleware: Receive HTTP request
        Middleware->>Provider: get_current_span()
        Middleware->>Provider: Set span attribute<br/>request_id
        Middleware-->>App: Continue processing
    end

    App->>Telemetry: shutdown_telemetry(provider)
    alt provider is not None
        Telemetry->>Provider: provider.shutdown()
        Provider-->>OTLPExp: Flush pending spans
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 42.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[VEC-76] Instrument observability for api' accurately describes the primary objective of the changeset, which adds OpenTelemetry telemetry instrumentation to the Python application.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch vec-76-setup-otel-tracing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@toreluntang toreluntang force-pushed the vec-76-setup-otel-tracing branch from 9dc807e to 4f5e12f Compare April 14, 2026 12:29
@toreluntang toreluntang marked this pull request as ready for review April 14, 2026 12:30
@toreluntang toreluntang requested review from a team and Copilot April 14, 2026 12:30
@toreluntang toreluntang requested a review from a team as a code owner April 14, 2026 12:30

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces basic OpenTelemetry tracing support for the Python FastAPI service, wiring telemetry setup/shutdown into the app lifecycle and enriching spans with request IDs.

Changes:

  • Add OpenTelemetry SDK + OTLP exporter + FastAPI/SQLAlchemy instrumentation dependencies.
  • Add app.telemetry module to configure tracing and instrument FastAPI + SQLAlchemy.
  • Extend Request ID middleware to attach the request id to the active span and add unit tests for telemetry behavior.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
stl-verify/python/pyproject.toml Adds OTel SDK/exporter/instrumentation dependencies.
stl-verify/python/uv.lock Locks newly added OTel dependency set and transitive deps.
stl-verify/python/app/telemetry.py New telemetry setup/shutdown + SQLAlchemy engine instrumentation helper.
stl-verify/python/app/main.py Hooks telemetry setup into app creation and shutdown into lifespan cleanup; instruments SQLAlchemy engine.
stl-verify/python/app/config.py Adds OTel-related settings (enable flag, OTLP endpoint, service name).
stl-verify/python/app/middleware/request_id.py Adds request-id attribute onto the active OTel span.
stl-verify/python/tests/unit/test_telemetry.py New unit tests covering telemetry setup/shutdown and request_id propagation to spans.

Comment on lines 21 to +24
rid = request.headers.get(_REQUEST_ID_HEADER) or str(uuid.uuid4())
token = request_id_var.set(rid)
span = trace.get_current_span()
span.set_attribute("request_id", rid)

Copilot AI Apr 14, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X-Request-ID is accepted verbatim from the incoming request and added as an OpenTelemetry span attribute. Because this value is fully client-controlled, it can lead to high-cardinality / oversized attributes (increased telemetry cost, exporter payload bloat, and potential DoS if very large headers are sent).

Consider validating/sanitizing the inbound request id before storing it on the span (e.g., enforce a max length and allowed characters, or only accept UUIDs and otherwise generate a new id).

Copilot uses AI. Check for mistakes.
Comment on lines +28 to +35
@pytest.fixture(autouse=True)
def _reset_tracer_provider():
"""Ensure OTel global state is clean between tests."""
original_provider = trace._TRACER_PROVIDER
original_done = trace._TRACER_PROVIDER_SET_ONCE._done
yield
trace._TRACER_PROVIDER = original_provider
trace._TRACER_PROVIDER_SET_ONCE._done = original_done

Copilot AI Apr 14, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixture mutates OpenTelemetry private globals (trace._TRACER_PROVIDER and trace._TRACER_PROVIDER_SET_ONCE._done) to reset global state between tests. These internal attributes are not part of the public API and can change across OTel releases, making the test suite brittle.

Consider avoiding the global provider entirely in tests by refactoring the production code to pass an explicit tracer_provider into the instrumentors (FastAPI/SQLAlchemy) instead of calling trace.set_tracer_provider, or use a supported reset helper if OpenTelemetry provides one for testing.

Copilot uses AI. Check for mistakes.
Comment on lines +19 to +26
resource = Resource.create({"service.name": settings.otel_service_name})
provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint=settings.otel_exporter_otlp_endpoint)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

FastAPIInstrumentor.instrument_app(app)
return provider

Copilot AI Apr 14, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setup_telemetry sets the process-global tracer provider via trace.set_tracer_provider. OpenTelemetry treats this as a set-once global; it makes multi-app usage (and unit tests) harder and is the reason the tests need to reach into private OTel globals to reset state.

Consider keeping the provider owned by this app instance and passing it explicitly into FastAPIInstrumentor.instrument_app(..., tracer_provider=provider) (and similarly into SQLAlchemy instrumentation) rather than setting the global provider. This avoids global side effects and improves test isolation.

Copilot uses AI. Check for mistakes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
stl-verify/python/app/main.py (1)

20-32: ⚠️ Potential issue | 🟠 Major

Ensure telemetry cleanup also runs on failed startup.

If startup raises before yield, Lines 29-32 never execute, so the tracer provider created on Line 36 is left running. Wrap the full lifespan body in try/finally and dispose the engine conditionally.

Proposed fix
     `@asynccontextmanager`
     async def lifespan(app: FastAPI) -> AsyncIterator[None]:
-        engine = create_async_engine(settings.async_database_url, pool_pre_ping=True)
-        async with engine.connect() as conn:
-            await conn.execute(text("SELECT 1"))
-        app.state.engine = engine
-        instrument_sqlalchemy_engine(engine)
-        async with httpx.AsyncClient() as http_client:
-            app.state.http_client = http_client
-            yield
+        engine = None
         try:
-            await engine.dispose()
+            engine = create_async_engine(settings.async_database_url, pool_pre_ping=True)
+            async with engine.connect() as conn:
+                await conn.execute(text("SELECT 1"))
+            app.state.engine = engine
+            instrument_sqlalchemy_engine(engine)
+            async with httpx.AsyncClient() as http_client:
+                app.state.http_client = http_client
+                yield
         finally:
-            shutdown_telemetry(app.state.tracer_provider)
+            try:
+                if engine is not None:
+                    await engine.dispose()
+            finally:
+                shutdown_telemetry(app.state.tracer_provider)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@stl-verify/python/app/main.py` around lines 20 - 32, The lifespan coroutine
(lifespan) must ensure telemetry shutdown runs even if startup fails: wrap the
entire body of async def lifespan(app: FastAPI) in a try/finally so that
shutdown_telemetry(app.state.tracer_provider) is always called; create the
engine and set app.state.engine and instrument_sqlalchemy_engine(engine) inside
the try, and in the finally conditionally await engine.dispose() only if engine
was created/assigned (and similarly ensure app.state.http_client is closed if
created), so startup exceptions before the yield still trigger telemetry cleanup
and resource disposal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@stl-verify/python/app/telemetry.py`:
- Around line 20-26: setup_telemetry currently creates a new TracerProvider and
calls trace.set_tracer_provider(provider) but may return and later shut down a
provider that was never installed if a global provider already existed; change
setup_telemetry (and the provider variable usage) so after creating provider you
check the current global with trace.get_tracer_provider(), only call
trace.set_tracer_provider(provider) if the current global is the default/unset
one, and if set_tracer_provider was not applied return the actual global
provider (and avoid closing the unused local provider later); ensure
FastAPIInstrumentor.instrument_app(app) is still called and reference the
provider/global tracer appropriately when returning so the app uses the real
active provider.

In `@stl-verify/python/tests/unit/test_telemetry.py`:
- Around line 28-35: The _reset_tracer_provider fixture currently only restores
OpenTelemetry globals after a test, causing order-dependent leaks; change it to
call reset_trace_globals() from opentelemetry.test.globals_test both before and
after the yield so the global tracer state is pristine at test start and
restored at test end (update the fixture _reset_tracer_provider to import
reset_trace_globals and invoke it before yield and again after yield); reference
the fixture name _reset_tracer_provider and the OpenTelemetry globals
(trace.set_tracer_provider, trace._TRACER_PROVIDER) to locate where to make this
change and to verify tests such as test_setup_telemetry_noop_when_disabled and
the test that calls trace.set_tracer_provider behave deterministically.

---

Outside diff comments:
In `@stl-verify/python/app/main.py`:
- Around line 20-32: The lifespan coroutine (lifespan) must ensure telemetry
shutdown runs even if startup fails: wrap the entire body of async def
lifespan(app: FastAPI) in a try/finally so that
shutdown_telemetry(app.state.tracer_provider) is always called; create the
engine and set app.state.engine and instrument_sqlalchemy_engine(engine) inside
the try, and in the finally conditionally await engine.dispose() only if engine
was created/assigned (and similarly ensure app.state.http_client is closed if
created), so startup exceptions before the yield still trigger telemetry cleanup
and resource disposal.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aaed5343-be74-4cdb-bcd8-1fdca548e585

📥 Commits

Reviewing files that changed from the base of the PR and between 81d89ca and 4f5e12f.

⛔ Files ignored due to path filters (1)
  • stl-verify/python/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • stl-verify/python/app/config.py
  • stl-verify/python/app/main.py
  • stl-verify/python/app/middleware/request_id.py
  • stl-verify/python/app/telemetry.py
  • stl-verify/python/pyproject.toml
  • stl-verify/python/tests/unit/test_telemetry.py

Comment on lines +20 to +26
provider = TracerProvider(resource=resource)
exporter = OTLPSpanExporter(endpoint=settings.otel_exporter_otlp_endpoint)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

FastAPIInstrumentor.instrument_app(app)
return provider

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In OpenTelemetry Python, can trace.set_tracer_provider() replace an existing global tracer provider, or is it effectively set only once per process?

💡 Result:

In OpenTelemetry Python, trace.set_tracer_provider cannot replace an existing global tracer provider. It is effectively set only once per process. Subsequent calls log a warning "Overriding of current TracerProvider is not allowed" and do not change the provider, enforced by the _TRACER_PROVIDER_SET_ONCE mechanism using a Once object in the API source code. This design prevents accidental overrides, as discussed in GitHub issues and PRs where attempts to allow overrides were rejected or modified to warn/block. The first call (explicit or implicit via get_tracer_provider) sets it permanently. For multiple providers, create them separately without using the global trace.get_tracer; use provider.get_tracer directly.

Citations:


🏁 Script executed:

# Find and examine the telemetry.py file
find . -name "telemetry.py" -type f

Repository: archon-research/stl

Length of output: 99


🏁 Script executed:

# Once we find it, let's look at the setup_telemetry and shutdown_telemetry functions
fd telemetry.py -type f --exec cat -n {} \;

Repository: archon-research/stl

Length of output: 233


🏁 Script executed:

# Also search for main.py to see how setup_telemetry is called
find . -name "main.py" -type f | head -5

Repository: archon-research/stl

Length of output: 94


🏁 Script executed:

cat -n ./stl-verify/python/app/telemetry.py

Repository: archon-research/stl

Length of output: 2137


🏁 Script executed:

cat -n ./stl-verify/python/app/main.py

Repository: archon-research/stl

Length of output: 2014


Prevent mismatched provider lifecycle when a global provider is already registered.

trace.set_tracer_provider() enforces single-assignment per process. If a global provider was previously registered, Line 23 leaves it unchanged and logs a warning, yet setup_telemetry() still returns the newly allocated local provider. The app then stores and later shuts down this unused provider while the actual tracing continues through the existing global provider.

Proposed fix
 def setup_telemetry(app, settings: Settings) -> TracerProvider | None:
     """Configure OTel tracing if enabled. Returns the provider for shutdown."""
     if not settings.otel_enabled:
         return None
+
+    existing_provider = trace.get_tracer_provider()
+    if isinstance(existing_provider, TracerProvider):
+        FastAPIInstrumentor.instrument_app(app)
+        return existing_provider
 
     resource = Resource.create({"service.name": settings.otel_service_name})
     provider = TracerProvider(resource=resource)
     exporter = OTLPSpanExporter(endpoint=settings.otel_exporter_otlp_endpoint)
     provider.add_span_processor(BatchSpanProcessor(exporter))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@stl-verify/python/app/telemetry.py` around lines 20 - 26, setup_telemetry
currently creates a new TracerProvider and calls
trace.set_tracer_provider(provider) but may return and later shut down a
provider that was never installed if a global provider already existed; change
setup_telemetry (and the provider variable usage) so after creating provider you
check the current global with trace.get_tracer_provider(), only call
trace.set_tracer_provider(provider) if the current global is the default/unset
one, and if set_tracer_provider was not applied return the actual global
provider (and avoid closing the unused local provider later); ensure
FastAPIInstrumentor.instrument_app(app) is still called and reference the
provider/global tracer appropriately when returning so the app uses the real
active provider.

Comment on lines +28 to +35
@pytest.fixture(autouse=True)
def _reset_tracer_provider():
"""Ensure OTel global state is clean between tests."""
original_provider = trace._TRACER_PROVIDER
original_done = trace._TRACER_PROVIDER_SET_ONCE._done
yield
trace._TRACER_PROVIDER = original_provider
trace._TRACER_PROVIDER_SET_ONCE._done = original_done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does OpenTelemetry Python expose a public way to reset the global tracer provider for test isolation, or is mutating trace._TRACER_PROVIDER still relying on private internals?

💡 Result:

No, OpenTelemetry Python does not expose a public way to reset the global tracer provider for test isolation. Mutating trace._TRACER_PROVIDER relies on private internals, as the provider is designed to be set only once via set_tracer_provider using an internal Once mechanism that prevents overrides and logs a warning on attempts. For testing, the official approach uses the test utilities package (opentelemetry-test-utils), which provides reset_trace_globals from opentelemetry.test.globals_test. This function resets the global state (including _TRACER_PROVIDER) specifically for tests, as seen in the test base class and various test files. Users must install opentelemetry-test (part of opentelemetry-test-utils) and import it, but this is not part of the public production API/SDK. Official docs and examples (e.g., instrumentation guides) show only set_tracer_provider without reset capability. Open issues like #4557 (Apr 2025, still open) request public reset functionality, and spec issue #3302 was closed as wontfix, recommending test support but no general reset. Production code should avoid globals or manage a single provider; tests use the private reset util.

Citations:


🏁 Script executed:

# Find project dependency files
find . -maxdepth 3 -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" -o -name "setup.cfg" -o -name "Pipfile" \) | head -20

Repository: archon-research/stl

Length of output: 97


🏁 Script executed:

# Check the test file in question
cat -n stl-verify/python/tests/unit/test_telemetry.py | head -50

Repository: archon-research/stl

Length of output: 1834


🏁 Script executed:

# Search for existing reset_trace_globals usage in the codebase
rg "reset_trace_globals" --type py

Repository: archon-research/stl

Length of output: 45


🏁 Script executed:

# Search for opentelemetry.test imports
rg "from opentelemetry.test" --type py

Repository: archon-research/stl

Length of output: 45


🏁 Script executed:

# Check project dependencies in pyproject.toml
cat -n stl-verify/python/pyproject.toml

Repository: archon-research/stl

Length of output: 1102


🏁 Script executed:

# Get more context on the test logic to understand why ordering matters
cat -n stl-verify/python/tests/unit/test_telemetry.py | sed -n '50,100p'

Repository: archon-research/stl

Length of output: 2091


Use the official OpenTelemetry test utility to reset global tracer state between tests.

This fixture restores the global tracer provider after each test but does not reset it before the test starts. Tests like test_setup_telemetry_noop_when_disabled() (line 54–55) expect the global provider to be in a pristine state. However, if a previous test calls trace.set_tracer_provider() (as in line 77), the global state persists, causing test isolation failures and order-dependent test behavior.

The recommended approach is to use reset_trace_globals() from opentelemetry.test.globals_test (part of the opentelemetry-test-utils package). This is the official utility for resetting OpenTelemetry's global state in tests:

from opentelemetry.test.globals_test import reset_trace_globals

`@pytest.fixture`(autouse=True)
def _reset_tracer_provider():
    """Ensure OTel global state is clean between tests."""
    reset_trace_globals()
    yield
    reset_trace_globals()

If you prefer to avoid adding opentelemetry-test-utils as a dependency, the manual approach from the original review (resetting trace._TRACER_PROVIDER and trace._TRACER_PROVIDER_SET_ONCE._done before yield) is a valid fallback, though it relies on private internals.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@stl-verify/python/tests/unit/test_telemetry.py` around lines 28 - 35, The
_reset_tracer_provider fixture currently only restores OpenTelemetry globals
after a test, causing order-dependent leaks; change it to call
reset_trace_globals() from opentelemetry.test.globals_test both before and after
the yield so the global tracer state is pristine at test start and restored at
test end (update the fixture _reset_tracer_provider to import
reset_trace_globals and invoke it before yield and again after yield); reference
the fixture name _reset_tracer_provider and the OpenTelemetry globals
(trace.set_tracer_provider, trace._TRACER_PROVIDER) to locate where to make this
change and to verify tests such as test_setup_telemetry_noop_when_disabled and
the test that calls trace.set_tracer_provider behave deterministically.

@sboutzen sboutzen left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're the man

@toreluntang toreluntang merged commit bd6fcdf into main Apr 14, 2026
12 checks passed
@toreluntang toreluntang deleted the vec-76-setup-otel-tracing branch April 14, 2026 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants