
fix: avoid deadlock when entering a span #251

Open

mladedav wants to merge 1 commit into v0.1.x from dm/deadlock-fix

Conversation

@mladedav (Collaborator) commented Jan 28, 2026

Motivation

There is a potential deadlock, described in #250: we lock ExtensionsMut when entering a span and then let opentelemetry activate the context. If opentelemetry logs through tracing during that operation, the event reaches this layer's on_event, which tries to lock ExtensionsMut again and deadlocks.
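For context, a minimal sketch of the re-entrant locking pattern (the Layer callbacks and extensions_mut are the real tracing-subscriber API; DeadlockingLayer and the control flow are illustrative):

```rust
use tracing::{span::Id, Event, Subscriber};
use tracing_subscriber::layer::{Context, Layer};
use tracing_subscriber::registry::LookupSpan;

struct DeadlockingLayer;

impl<S> Layer<S> for DeadlockingLayer
where
    S: Subscriber + for<'a> LookupSpan<'a>,
{
    fn on_enter(&self, id: &Id, ctx: Context<'_, S>) {
        let span = ctx.span(id).expect("span must exist");
        // First lock: ExtensionsMut holds the span's extensions lock.
        let _extensions = span.extensions_mut();
        // opentelemetry context activation happens while the lock is held;
        // if it logs through tracing, on_event below runs re-entrantly.
    }

    fn on_event(&self, event: &Event<'_>, ctx: Context<'_, S>) {
        if let Some(span) = ctx.event_span(event) {
            // Second extensions_mut() on the same span while on_enter still
            // holds the lock: this is the deadlock from #250.
            let _extensions = span.extensions_mut();
        }
    }
}
```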

Solution

Instead of accessing the span's internal state through ExtensionsMut, we store an Arc<Mutex<OtelData>>, clone the Arc, and then lock it to access the data. This allows us to hold onto the state without deadlocking when other layers lock the extensions in re-entrant calls.
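A sketch of that access pattern, assuming the per-span state lives in the extensions as an Arc<Mutex<OtelData>>; the with_otel_data helper and the OtelData body are illustrative, not the PR's actual code:

```rust
use std::sync::{Arc, Mutex};
use tracing::Subscriber;
use tracing_subscriber::registry::{LookupSpan, SpanRef};

struct OtelData {/* span builder, parent context, ... */}

fn with_otel_data<S>(span: &SpanRef<'_, S>, f: impl FnOnce(&mut OtelData))
where
    S: Subscriber + for<'a> LookupSpan<'a>,
{
    // Clone the Arc out of the extensions, then release the extensions lock
    // before doing anything that could re-enter tracing.
    let data = {
        let extensions = span.extensions();
        extensions.get::<Arc<Mutex<OtelData>>>().cloned()
    }; // extensions lock dropped here

    if let Some(data) = data {
        // Re-entrant callbacks can now lock the extensions freely; only this
        // dedicated mutex guards OtelData.
        let mut guard = data.lock().expect("OtelData mutex poisoned");
        f(&mut guard);
    }
}
```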

There's a test to check for this, but it needs to be run with RUSTFLAGS='--cfg reentrant_tracing_test' set in the environment. I did not put that into CI because changing RUSTFLAGS forces recompilation of every dependency. So I'm still looking into whether I can do this a bit better, or whether I'll just drop the test.
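For reference, a test gated that way only compiles when the cfg flag is passed; the test name and body below are illustrative:

```rust
// Compiled only when built with RUSTFLAGS='--cfg reentrant_tracing_test'.
#[cfg(reentrant_tracing_test)]
#[test]
fn entering_a_span_survives_reentrant_events() {
    // Set up a layer whose callbacks emit tracing events, then enter a span;
    // the test passes if it completes instead of deadlocking.
}
```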

@mladedav (Collaborator, Author) commented:

I'll have to take a closer look at what's happening with the RefCell when setting the dispatch. At least it's panicking consistently.

@mladedav (Collaborator, Author) commented:

Changing the Dispatch will not work. If we're not using the global dispatcher, the RefCell holding the thread-local dispatcher is already borrowed because someone is logging something, so when we try to override it from within these methods we attempt a second mutable borrow of an already-borrowed RefCell, which panics.
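An illustration of that failure mode, assuming a scoped (thread-local) dispatcher set via with_default; the function is hypothetical and only marks where the double borrow occurs:

```rust
use tracing::dispatcher::{self, Dispatch};

// Imagine this is called from inside a layer callback, i.e. while an event
// is being dispatched through the scoped dispatcher.
fn suppress_tracing_during_otel_call() {
    let noop = Dispatch::none();
    // The thread-local dispatcher cell is already borrowed for the event
    // currently being dispatched; with_default has to borrow it mutably
    // again to swap dispatchers -> RefCell double-borrow panic.
    dispatcher::with_default(&noop, || {
        // ...call into opentelemetry without re-entering this layer...
    });
}
```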

Now I'm trying to figure out if I can use event_enabled to filter out events emitted inside the critical sections, but that could still leave us open to tracing span usage inside opentelemetry or one of its dependencies.
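A sketch of the event_enabled idea using a thread-local flag (the flag matches the one discussed in the review below; the layer wiring is illustrative). Note this only suppresses events, not span creation or entering, which is exactly the gap mentioned above:

```rust
use std::cell::Cell;
use tracing::{Event, Subscriber};
use tracing_subscriber::layer::{Context, Layer};
use tracing_subscriber::registry::LookupSpan;

std::thread_local! {
    static INSIDE_TRACING: Cell<bool> = const { Cell::new(false) };
}

struct FilteringLayer; // illustrative stand-in for the real layer

impl<S> Layer<S> for FilteringLayer
where
    S: Subscriber + for<'a> LookupSpan<'a>,
{
    // Drop re-entrant events emitted while we're inside a critical section.
    fn event_enabled(&self, _event: &Event<'_>, _ctx: Context<'_, S>) -> bool {
        !INSIDE_TRACING.with(Cell::get)
    }
}
```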

Maybe we'll have to just rework this to not hold onto the locks while calling out to opentelemetry.

@mladedav force-pushed the dm/deadlock-fix branch 3 times, most recently from 99e90ab to b08339b on February 16, 2026 at 16:07
@mladedav (Collaborator, Author) commented:

@bantonsson if you have some time, would you mind taking a look? This reworks the way you added the method for getting the OpenTelemetry context from a SpanRef, so I just want to make sure I'm not missing something.

The current solution has a pretty bad performance regression (one benchmark went up by 23%). I hope to get that under control, but in the worst case, bad performance still sounds better than a deadlock.

@bantonsson (Contributor) commented:

Hey @mladedav. I completely missed this notification. I'll try to look at it today or tomorrow.

@bantonsson (Contributor) left a comment:


Finally got around to this. I'll continue to look at it, but one thing immediately caught my eye.

Comment thread: src/layer.rs, lines +41 to +43

```rust
std::thread_local! {
    static INSIDE_TRACING: AtomicBool = const { AtomicBool::new(false) };
}
```
@bantonsson (Contributor):

Since this is a thread local, it will only be accessed by the current thread, so it doesn't need atomic ordering.

Suggested change:

```diff
 std::thread_local! {
-    static INSIDE_TRACING: AtomicBool = const { AtomicBool::new(false) };
+    static INSIDE_TRACING: Cell<bool> = const { Cell::new(false) };
 }
```
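For context, a sketch of how such a flag might be set and cleared around a critical section, assuming the Cell form; the helper is illustrative and, as written, not panic-safe (a real implementation would restore the flag in a drop guard):

```rust
use std::cell::Cell;

std::thread_local! {
    static INSIDE_TRACING: Cell<bool> = const { Cell::new(false) };
}

// Run `f` with the flag set, restoring the previous value afterwards so
// nested critical sections behave correctly.
fn with_tracing_suppressed<T>(f: impl FnOnce() -> T) -> T {
    let prev = INSIDE_TRACING.with(|flag| flag.replace(true));
    let result = f();
    INSIDE_TRACING.with(|flag| flag.set(prev));
    result
}
```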

