-
Notifications
You must be signed in to change notification settings - Fork 3
📈 Add OTel configuration #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
📈 Add OTel configuration #10
Conversation
| @@ -1,4 +1,4 @@ | |||
| # This file is automatically @generated by Poetry 1.8.2 and should not be changed by hand. | |||
| # This file is automatically @generated by Poetry 1.8.4 and should not be changed by hand. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the benefit of moving from Poetry 1.8.2 to Poetry 1.8.4?
| api.app.disable_chat_ui = True | ||
| api.set_default_config_id('${CONFIG_ID}') | ||
| # Start the server using uvicorn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why replace the cli path with a direct uvicorn? Cli was used previously as composition over replacement was deemed preferable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Direct uvicorn setups OpenTelemetry before starting the server, which is required, whereas the CLI starts the server immediately (line 155 in cli/init.py)
scripts/setup_otel.py
Outdated
| service_version = os.getenv("OTEL_SERVICE_VERSION", "1.0.0") | ||
| environment = os.getenv("OTEL_ENVIRONMENT", "production") | ||
| otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT") | ||
| enable_console = os.getenv("OTEL_ENABLE_CONSOLE", "true").lower() == "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the default value set on line 17 conflict with entrypoint.sh?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
| export OTEL_ENVIRONMENT="${OTEL_ENVIRONMENT:-production}" | ||
| export OTEL_ENABLE_CONSOLE="${OTEL_ENABLE_CONSOLE:-false}" | ||
| export OTEL_EXPORTER_OTLP_INSECURE="${OTEL_EXPORTER_OTLP_INSECURE:-true}" | ||
| export OTEL_EXPORTER_OTLP_ENDPOINT="${OTEL_EXPORTER_OTLP_ENDPOINT:-http://jaeger:4317}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this setup seems to be tightly coupled to http://jaeger:4317 and presumably assumed same-namespace deployment; is this always the case? if it isn't, should the user have some kind of control over this configuration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, generally servers and tracing/metrics services are deployed in the same namespace
scripts/entrypoint.sh
Outdated
| print(f'❌ OpenTelemetry setup failed: {e}') | ||
| import traceback | ||
| traceback.print_exc() | ||
| sys.exit(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if any error occurs during OTel setup, the server refuses to start? I'd say that observability should not break core functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right - fixed this and added messages to verify setup
| from setup_otel import setup_opentelemetry | ||
| setup_opentelemetry() | ||
| print('✅ OpenTelemetry configured in server process') | ||
| except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps this is a too broad exception handling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added different exception handling according to the error
|
|
||
| RUN poetry install --no-ansi --extras="sdd jailbreak openai nvidia tracing" && \ | ||
| poetry run pip install "spacy>=3.4.4,<4.0.0" && \ | ||
| poetry run pip install \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably move all these deps inside pyproject.toml
I did not want to initially touch pyproject.toml, but if we are adding more or more deps, they should be managed via pyproject.toml
also I think some of the opentelemetry packages are already inside .toml
I think it might be best to place opentelemetry packages inside tracing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially had them inside pyproject.toml but the NeMo team wants the server to be independent of opentelemetry API packages
scripts/entrypoint.sh
Outdated
| fi | ||
|
|
||
| # Setup OpenTelemetry | ||
| ENABLE_OTEL="${ENABLE_OTEL:-${AUTO_SETUP_OTEL:-false}}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the advantage of having the dual variable fallback here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% a typo - removed AUTO_SETUP_OTEL
| export OTEL_SERVICE_VERSION="${OTEL_SERVICE_VERSION:-1.0.0}" | ||
| export OTEL_ENVIRONMENT="${OTEL_ENVIRONMENT:-production}" | ||
| export OTEL_ENABLE_CONSOLE="${OTEL_ENABLE_CONSOLE:-false}" | ||
| export OTEL_EXPORTER_OTLP_INSECURE="${OTEL_EXPORTER_OTLP_INSECURE:-true}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there any scenarios when this can be secure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if we configure OpenTelemetry/Jaegar to use secure connections but that configuration is independent from the NeMo server (a user would have to specify it on their OTelCollector CR)
m-misiura
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution! I think it could be beneficial to consider the following:
- tweak how the server starts up in case there are OTel errors
- add more informative logging so that debugging becomes easier
- adding some documentation / example deployment manifests to the PR
I left some additional comments, highlighting specific lines that could be worth revisiting :)
b7d6750 to
603bf72
Compare
0d1efdb to
3bd76e9
Compare
|
@m-misiura thanks for your review and comments ! i updated my PR with the following:
|
This PR configures NeMO to allow OpenTelemetry (OTel) to export traces to Jaegar. It adds
scripts/setup_otel.pyto configure OTel and updatesscripts/entrypoint.shto access this script and setup OTel before starting NeMo server.