@juniemariam juniemariam commented Nov 25, 2025

What this PR does / why we need it

This PR adds optional support for exporting Numaflow metrics through the OpenTelemetry Protocol (OTLP), alongside the existing Prometheus scraping. The goal is to make the metrics pipeline more vendor-neutral, similar to Argo Workflows’ approach, where metrics can be scraped by Prometheus or pushed via OTLP. Prometheus behavior remains unchanged. OTLP export is enabled only when the OTEL_EXPORTER_OTLP_ENDPOINT environment variable is set.

Related issues

Fixes #3087
Part of #3037
Discussed in #3035

Testing

  • Started an OpenTelemetry Collector locally (debug exporter).
  • Ran both daemon-server and mvtx-daemon-server with:
    export OTEL_EXPORTER_OTLP_ENDPOINT=127.0.0.1:4317
    export OTEL_EXPORTER_OTLP_INSECURE=true

Verified:

  • OTLP exporter initialized successfully.
  • Metrics appeared in the OTEL collector output.
  • Prometheus /metrics endpoint still returned all metrics correctly.
  • Confirmed that when the OTEL env vars are not set, behavior falls back to Prometheus-only.

Special notes for reviewers

  • No changes to existing Prometheus scraping.
  • OTLP path is optional and does not impact users unless configured.
  • This PR intentionally touches only the daemon and MonoVertex daemon (controller-manager is not updated due to controller-runtime limitations).
  • Added only required OTEL dependencies.

codecov bot commented Nov 25, 2025

Codecov Report

❌ Patch coverage is 0% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.74%. Comparing base (21f645a) to head (8f28513).

Files with missing lines                  Patch %   Lines
pkg/shared/telemetry/otel.go              0.00%     39 Missing ⚠️
pkg/daemon/server/daemon_server.go        0.00%     7 Missing ⚠️
pkg/mvtxdaemon/server/daemon_server.go    0.00%     7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3086      +/-   ##
==========================================
+ Coverage   79.72%   79.74%   +0.01%     
==========================================
  Files         288      289       +1     
  Lines       64971    65024      +53     
==========================================
+ Hits        51800    51852      +52     
+ Misses      12623    12621       -2     
- Partials      548      551       +3     

Member

vigith commented Nov 25, 2025

Please run "make codegen". Also, this does not fix the parent issue; can you open a sub-issue under the parent issue?

// OTEL_EXPORTER_OTLP_ENDPOINT environment variable is set. This allows metrics
// to be exported to an OTLP collector while keeping Prometheus scraping unchanged.
func InitOTLPExporter(ctx context.Context, componentName, componentInstance string, gatherer prometheus.Gatherer) (func(context.Context) error, error) {
	endpoint := os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
Member

Use pkg/shared/util.LookupEnvStringOr

res, err := resource.New(ctx,
	resource.WithAttributes(
		attribute.String("service.name", componentName),
		attribute.String("service.namespace", "numaflow"),
Member

Use pkg/apis.Project to replace "numaflow".

	log.Warnw("Failed to initialize OTLP exporter, continuing without OTLP", zap.Error(err))
} else {
	defer func() {
		_ = shutdown(context.Background())
Member

Suggested change
- _ = shutdown(context.Background())
+ _ = shutdown(ctx)

Any reason why we can't do this?

Author

I used context.Background() here because the server’s ctx is cancelled during shutdown, which would prevent the OTLP exporter from flushing its final metrics.

This follows the OTEL Go examples, which also use context.Background() for graceful shutdown: https://pkg.go.dev/go.opentelemetry.io/otel/sdk/metric#section-readme

That said, I’m happy to adjust based on what we prefer in Numaflow:

  • Should we wrap it in a timed context (e.g., a context with a timeout) to avoid hanging on shutdown?
  • Should we skip calling Shutdown entirely and just rely on process exit?

Please let me know if we need a change here.

Development

Successfully merging this pull request may close these issues.

Enable OTLP metrics export for Daemon services

3 participants