[SDK] Use shared_ptr internally for AttributesProcessor to prevent use-after-free #3457
Conversation
Thanks for raising the PR. Could you add a dedicated test to validate the seg fault?
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #3457      +/-   ##
==========================================
+ Coverage   89.88%   89.89%   +0.02%
==========================================
  Files         212      212
  Lines        6941     6942       +1
==========================================
+ Hits         6238     6240       +2
+ Misses        703      702       -1
```
done
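For reference, a regression test along these lines exercises the crash path. This is a sketch only: the test and instrument names are assumptions, and the dedicated test added in this PR may differ.

```cpp
#include <gtest/gtest.h>

#include <memory>

#include "opentelemetry/sdk/metrics/meter_provider.h"

namespace metrics_sdk = opentelemetry::sdk::metrics;

// The test passes if recording after provider destruction does not crash.
// Before this fix, the storage held a raw pointer to the View-owned
// AttributesProcessor, so the final Add() dereferenced freed memory.
TEST(MeterProviderSdkTest, RecordAfterProviderDestroyed)
{
  auto provider = std::make_shared<metrics_sdk::MeterProvider>();
  auto meter    = provider->GetMeter("uaf_test_meter");
  auto counter  = meter->CreateUInt64Counter("uaf_test_counter");

  counter->Add(1);   // normal recording while the provider is alive

  provider->Shutdown();
  provider.reset();  // destroys the provider, its views, and their processors

  counter->Add(1);   // must not touch freed memory (use-after-free pre-fix)
}
```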
marcalff left a comment:
LGTM
…c callback during shutdown (#60048)

## Description

When a Ray worker process shuts down (e.g., during `ray.shutdown()` or node termination), the OpenTelemetry `PeriodicExportingMetricReader`'s background thread may still be invoking the gauge callback (`_DoubleGaugeCallback`), which then accesses already-destroyed member data, resulting in a use-after-free crash. The error message:

```
(bundle_reservation_check_func pid=1543823) pure virtual method called
(bundle_reservation_check_func pid=1543823) __cxa_deleted_virtual
```

I looked further into this, and ideally, shutdown should be handled correctly at the OpenTelemetry code level. [PeriodicExportingMetricReader's shutdown](https://github.com/open-telemetry/opentelemetry-cpp/blob/f33dcc07c56c7e3b18fd18e13986f0eda965d116/sdk/src/metrics/export/periodic_exporting_metric_reader.cc#L292-L299) waits for `worker_thread_` to finish:

```cpp
bool PeriodicExportingMetricReader::OnShutDown(std::chrono::microseconds timeout) noexcept
{
  if (worker_thread_.joinable())
  {
    cv_.notify_all();
    worker_thread_.join();
  }
  return exporter_->Shutdown(timeout);
}
```

And the callback thread (`worker_thread_`) runs in a [while (IsShutdown() != true)](https://github.com/open-telemetry/opentelemetry-cpp/blob/main/sdk/src/metrics/export/periodic_exporting_metric_reader.cc#L147) loop. Therefore, there should be no use-after-free race condition at the OpenTelemetry code level, and it should be safe to call `meter_provider_->Shutdown()`.

However, the issue is that the last callback appears to access member data that has already been destroyed during ForceFlush, which is called before Shutdown. This member data belongs to the OpenTelemetry SDK itself. The more I looked into it, the more it appeared to be a bug in the OpenTelemetry SDK. Digging further, I found [[SDK] Use shared_ptr internally for AttributesProcessor to prevent use-after-free](open-telemetry/opentelemetry-cpp#3457), which is exactly the issue I encountered. This PR upgrades the OpenTelemetry C++ SDK version to include that fix.

## Related issues

> Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

## Additional information

It is quite easy to reproduce. For example, manually run `test_placement_group_reschedule_node_dead` in `python/ray/autoscaler/v2/tests/test_e2e.py`:

```
(docs) ubuntu@devbox:~/ray$ pkill -9 -f raylet 2>/dev/null || true; pkill -9 -f gcs_server 2>/dev/null || true; ray stop --force 2>/dev/null || true; sleep 2
Did not find any active Ray processes.
(docs) ubuntu@devbox:~/ray$ timeout 180 python -m pytest python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead -xvs 2>&1 | tee /tmp/test_otel.txt; echo "EXIT CODE: $?"
............
__cxa_deleted_virtual
opentelemetry::v1::sdk::metrics::FilteredOrderedAttributeMap::FilteredOrderedAttributeMap()::{lambda()#1}::operator()()
opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
opentelemetry::v1::sdk::metrics::ObserverResultT<>::Observe()
opentelemetry::v1::metrics::ObserverResultT<>::Observe<>()
ray::observability::OpenTelemetryMetricRecorder::CollectGaugeMetricValues()
(anonymous namespace)::_DoubleGaugeCallback()
opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
opentelemetry::v1::sdk::metrics::Meter::Collect()
opentelemetry::v1::sdk::metrics::MetricCollector::Produce()
opentelemetry::v1::sdk::metrics::MetricReader::Collect()
opentelemetry::v1::sdk::metrics::PeriodicExportingMetricReader::CollectAndExportOnce()
std::thread::_State_impl<>::_M_run()
............
```

After this PR, no such error message appears:

```
(docs) ubuntu@devbox:~/ray$ timeout 180 python -m pytest python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead -xvs 2>&1 | tee /tmp/test_otel.txt; echo "EXIT CODE: $?"
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /home/ubuntu/.conda/envs/docs/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/ray
configfile: pytest.ini
plugins: asyncio-1.3.0, anyio-4.11.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collecting ... collected 2 items
python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead[v1]
............
PASSED
python/ray/autoscaler/v2/tests/test_e2e.py::test_placement_group_reschedule_node_dead[v2]
............
PASSED
========================= 2 passed in 80.90s (0:01:20) =========================
EXIT CODE: 0
(docs) ubuntu@devbox:~/ray$
```

Signed-off-by: yicheng <[email protected]>
Co-authored-by: yicheng <[email protected]>
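To make the failure mode above concrete, here is the hazard reduced to a standalone, intentionally buggy sketch. The types are simplified stand-ins, not the real SDK classes: the point is that storage capturing a raw pointer into a View-owned object dangles once the View is destroyed during teardown.

```cpp
// Intentionally buggy miniature of the pre-fix ownership model: storage
// keeps a raw pointer into an object the View uniquely owns, so destroying
// the View leaves the pointer dangling and the next virtual call is
// undefined behavior ("pure virtual method called" / __cxa_deleted_virtual
// in practice).
#include <memory>

struct AttributesProcessor {
  virtual ~AttributesProcessor() = default;
  virtual void Process() const = 0;
};

struct NoopProcessor final : AttributesProcessor {
  void Process() const override {}
};

struct View {
  std::unique_ptr<AttributesProcessor> processor =
      std::make_unique<NoopProcessor>();
};

struct MetricStorage {
  const AttributesProcessor *processor;  // raw pointer: no lifetime guarantee
  void Record() { processor->Process(); }
};

int main() {
  auto view = std::make_unique<View>();
  MetricStorage storage{view->processor.get()};
  view.reset();      // provider/view teardown frees the processor...
  storage.Record();  // ...and this late callback dereferences freed memory (UB)
}
```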
Fixes #3456
Changes
This PR changes the metric storage and View classes to use `std::shared_ptr<const AttributesProcessor>` internally, while the public View API continues to accept a `std::unique_ptr`. The unique pointer is promoted to a shared pointer inside the View constructor. The change prevents use-after-free when metrics are recorded after `MeterProvider` shutdown or destruction.

Each metric recording operation now accesses the `AttributesProcessor` via a `std::shared_ptr` instead of a raw pointer as before. This adds minimal overhead from shared-pointer reference counting on the recording hot path, but ensures memory safety, which should be acceptable. A simplified sketch of the ownership model follows.
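A minimal sketch of that ownership change, using simplified stand-in types rather than the actual SDK classes: the public API still accepts a `std::unique_ptr`, the constructor promotes it to `std::shared_ptr<const ...>`, and each storage keeps its own reference so recording stays safe after the View is gone.

```cpp
// Simplified sketch of the new ownership model (stand-in types, not the
// actual SDK classes).
#include <memory>
#include <utility>

struct AttributesProcessor {
  virtual ~AttributesProcessor() = default;
  virtual void Process() const = 0;
};

struct NoopProcessor final : AttributesProcessor {
  void Process() const override {}
};

class View {
 public:
  // Public API unchanged: callers still pass unique ownership.
  explicit View(std::unique_ptr<AttributesProcessor> processor)
      : processor_(std::move(processor)) {}  // promoted to shared ownership

  std::shared_ptr<const AttributesProcessor> GetAttributesProcessor() const {
    return processor_;
  }

 private:
  std::shared_ptr<const AttributesProcessor> processor_;
};

class MetricStorage {
 public:
  explicit MetricStorage(std::shared_ptr<const AttributesProcessor> processor)
      : processor_(std::move(processor)) {}

  // Holds its own reference, so this is safe even after the View died.
  void Record() { processor_->Process(); }

 private:
  std::shared_ptr<const AttributesProcessor> processor_;
};

int main() {
  auto view = std::make_unique<View>(std::make_unique<NoopProcessor>());
  MetricStorage storage(view->GetAttributesProcessor());
  view.reset();      // View destroyed during provider shutdown...
  storage.Record();  // ...but the processor is kept alive by storage's reference
}
```

The only added cost per recording is the reference-count traffic noted above; the `const` qualification also keeps the processor immutable on the hot path.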
For significant contributions please make sure you have completed the following items:

- `CHANGELOG.md` updated for non-trivial changes