@christinaexyou

This PR configures NeMo to allow OpenTelemetry (OTel) to export traces to Jaeger. It adds scripts/setup_otel.py to configure OTel and updates scripts/entrypoint.sh to run this script and set up OTel before starting the NeMo server.

Screenshot 2025-10-23 at 2 43 00 PM

@@ -1,4 +1,4 @@
-# This file is automatically @generated by Poetry 1.8.2 and should not be changed by hand.
+# This file is automatically @generated by Poetry 1.8.4 and should not be changed by hand.
Collaborator

what is the benefit of moving from Poetry 1.8.2 to Poetry 1.8.4?

api.app.disable_chat_ui = True
api.set_default_config_id('${CONFIG_ID}')
# Start the server using uvicorn
Collaborator

why replace the CLI path with a direct uvicorn call? The CLI was used previously because composition over replacement was deemed preferable

Author

@christinaexyou Nov 17, 2025

Starting uvicorn directly lets us set up OpenTelemetry before the server starts, which is required, whereas the CLI starts the server immediately (line 155 in cli/__init__.py)
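The ordering constraint in this reply can be sketched as a tiny launcher that runs the tracing setup first and only then hands control to the blocking server call. This is an illustration under stated assumptions, not the PR's actual entrypoint: `launch`, `configure_tracing`, and `serve` are hypothetical names, and the two callables stand in for `setup_opentelemetry()` and the uvicorn call.

```python
from typing import Callable, List


def launch(configure_tracing: Callable[[], None], serve: Callable[[], None]) -> None:
    """Run tracing setup first, then start the (blocking) server call."""
    configure_tracing()
    serve()


# Stand-ins that only record the order in which they were called.
calls: List[str] = []
launch(lambda: calls.append("otel"), lambda: calls.append("server"))
print(calls)
```

In a real entrypoint the second callable would wrap the server start, so nothing begins serving until the setup step has returned.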

service_version = os.getenv("OTEL_SERVICE_VERSION", "1.0.0")
environment = os.getenv("OTEL_ENVIRONMENT", "production")
otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
enable_console = os.getenv("OTEL_ENABLE_CONSOLE", "true").lower() == "true"
Collaborator

@m-misiura Oct 30, 2025

does the default value set on line 17 conflict with entrypoint.sh?

Author

Fixed
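One way to avoid this kind of mismatch is to read the flag through a single helper whose default matches the one exported by entrypoint.sh (`false` for `OTEL_ENABLE_CONSOLE`). A minimal sketch; `env_flag` is a hypothetical helper and not something taken from the PR:

```python
import os
from typing import Mapping, Optional


def env_flag(name: str, default: bool = False,
             env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean flag from the environment.

    Unset or empty values fall back to `default`; "1", "true", "yes" and "on"
    (case-insensitive) count as true, anything else as false.
    """
    source = os.environ if env is None else env
    raw = source.get(name, "").strip().lower()
    if not raw:
        return default
    return raw in {"1", "true", "yes", "on"}


# Default mirrors entrypoint.sh: console export stays off unless requested.
enable_console = env_flag("OTEL_ENABLE_CONSOLE", default=False)
```

Keeping the default in one place means the Python script and the shell script cannot silently disagree about what an unset variable means.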

export OTEL_ENVIRONMENT="${OTEL_ENVIRONMENT:-production}"
export OTEL_ENABLE_CONSOLE="${OTEL_ENABLE_CONSOLE:-false}"
export OTEL_EXPORTER_OTLP_INSECURE="${OTEL_EXPORTER_OTLP_INSECURE:-true}"
export OTEL_EXPORTER_OTLP_ENDPOINT="${OTEL_EXPORTER_OTLP_ENDPOINT:-http://jaeger:4317}"
Collaborator

this setup seems to be tightly coupled to http://jaeger:4317 and presumably assumes a same-namespace deployment; is this always the case? if it isn't, should the user have some kind of control over this configuration?

Author

Yes, generally servers and tracing/metrics services are deployed in the same namespace
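The `${OTEL_EXPORTER_OTLP_ENDPOINT:-http://jaeger:4317}` form in entrypoint.sh only supplies a default, so a value set by the user still wins. The same rule can be sketched in Python with a basic sanity check on the result. `resolve_otlp_endpoint` and the cross-namespace hostname below are hypothetical, used only for illustration:

```python
from typing import Mapping
from urllib.parse import urlsplit

# Same-namespace default taken from entrypoint.sh.
DEFAULT_OTLP_ENDPOINT = "http://jaeger:4317"


def resolve_otlp_endpoint(env: Mapping[str, str]) -> str:
    """Return the OTLP endpoint, preferring a user-supplied value."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip() or DEFAULT_OTLP_ENDPOINT
    parts = urlsplit(endpoint)
    if parts.scheme not in {"http", "https"} or not parts.hostname:
        raise ValueError(f"Invalid OTLP endpoint: {endpoint!r}")
    return endpoint


# A collector in another namespace, addressed by a (hypothetical) cluster DNS name.
print(resolve_otlp_endpoint(
    {"OTEL_EXPORTER_OTLP_ENDPOINT": "http://jaeger-collector.observability.svc:4317"}
))
# Nothing set: fall back to the same-namespace default.
print(resolve_otlp_endpoint({}))
```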

print(f'❌ OpenTelemetry setup failed: {e}')
import traceback
traceback.print_exc()
sys.exit(1)
Collaborator

@m-misiura Oct 30, 2025

so if any error occurs during OTel setup, the server refuses to start? I'd say that observability should not break core functionality

Author

You're right - fixed this and added messages to verify setup

from setup_otel import setup_opentelemetry
setup_opentelemetry()
print('✅ OpenTelemetry configured in server process')
except Exception as e:
Collaborator

perhaps this exception handling is too broad?

Author

Added separate exception handlers for the different error types
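Both review points, that a tracing failure should not stop the server and that a bare `except Exception` is too coarse, can be combined in one wrapper. The sketch below is an assumption about what such handling could look like, not the code that was pushed: `try_setup_tracing` is a hypothetical name, and the choice of `ImportError` and `ValueError`/`KeyError` as the specific cases is illustrative.

```python
import logging
from typing import Callable

log = logging.getLogger("otel_setup")


def try_setup_tracing(setup: Callable[[], None]) -> bool:
    """Run `setup`; report failure instead of exiting so the server can still start."""
    try:
        setup()
    except ImportError as exc:
        # Optional tracing packages are not installed.
        log.warning("OpenTelemetry packages missing, tracing disabled: %s", exc)
    except (ValueError, KeyError) as exc:
        # Bad or missing configuration values.
        log.warning("Invalid OpenTelemetry configuration, tracing disabled: %s", exc)
    except Exception:
        # Last-resort guard: log the traceback but keep the server running.
        log.exception("Unexpected error during OpenTelemetry setup, tracing disabled")
    else:
        log.info("OpenTelemetry configured")
        return True
    return False


def missing_packages() -> None:
    raise ImportError("No module named 'opentelemetry'")


# Setup fails, a warning is logged, and the caller can carry on starting the server.
print(try_setup_tracing(missing_packages))
```

The final broad handler is kept deliberately, as a safety net rather than the primary path, so that an unexpected error is logged with its traceback instead of aborting startup.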


RUN poetry install --no-ansi --extras="sdd jailbreak openai nvidia tracing" && \
poetry run pip install "spacy>=3.4.4,<4.0.0" && \
poetry run pip install \
Collaborator

@m-misiura Oct 30, 2025

we should probably move all these deps inside pyproject.toml

I did not initially want to touch pyproject.toml, but if we are adding more and more deps, they should be managed via pyproject.toml

also I think some of the opentelemetry packages are already inside pyproject.toml

I think it might be best to place the opentelemetry packages inside the tracing extra

Author

I initially had them inside pyproject.toml but the NeMo team wants the server to be independent of opentelemetry API packages
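If the tracing packages stay outside pyproject.toml, the code that uses them has to tolerate their absence. One common way to do that is to probe for the package before importing it. A minimal sketch, assuming the top-level package name `opentelemetry`; `tracing_available` is a hypothetical helper:

```python
import importlib.util


def tracing_available(module: str = "opentelemetry") -> bool:
    """Return True if the given top-level package can be found on the import path."""
    return importlib.util.find_spec(module) is not None


if tracing_available():
    print("opentelemetry found: tracing can be configured")
else:
    print("opentelemetry not installed: starting without tracing")
```

The check only looks the package up without importing it, so it is cheap to run at startup.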

fi

# Setup OpenTelemetry
ENABLE_OTEL="${ENABLE_OTEL:-${AUTO_SETUP_OTEL:-false}}"
Collaborator

what is the advantage of having the dual variable fallback here?

Author

100% a typo - removed AUTO_SETUP_OTEL

export OTEL_SERVICE_VERSION="${OTEL_SERVICE_VERSION:-1.0.0}"
export OTEL_ENVIRONMENT="${OTEL_ENVIRONMENT:-production}"
export OTEL_ENABLE_CONSOLE="${OTEL_ENABLE_CONSOLE:-false}"
export OTEL_EXPORTER_OTLP_INSECURE="${OTEL_EXPORTER_OTLP_INSECURE:-true}"
Collaborator

are there any scenarios when this can be secure?

Author

Yes, if we configure OpenTelemetry/Jaeger to use secure connections, but that configuration is independent of the NeMo server (a user would have to specify it in their OTelCollector CR)

Collaborator

@m-misiura left a comment

Thanks for the contribution! I think it could be beneficial to consider the following:

  • tweak how the server starts up in case there are OTel errors
  • add more informative logging so that debugging becomes easier
  • add some documentation / example deployment manifests to the PR

I left some additional comments, highlighting specific lines that could be worth revisiting :)

@m-misiura force-pushed the develop branch 2 times, most recently from b7d6750 to 603bf72 on November 10, 2025 16:54
@christinaexyou
Author

@m-misiura thanks for your review and comments! I updated my PR with the following:

  • Better error handling in case the OpenTelemetry configuration fails
  • Logging throughout
  • Documentation on how to locally test OTel integration
