Skip to content

fix: potential deadlock in baton#3754

Merged
DrJosh9000 merged 1 commit intomainfrom
fix-baton-deadlock
Mar 11, 2026
Merged

fix: potential deadlock in baton#3754
DrJosh9000 merged 1 commit intomainfrom
fix-baton-deadlock

Conversation

@DrJosh9000
Copy link
Contributor

Description

Fixes a concurrency design problem in baton.go - revert to using something much closer to @zhming0 's design (#3722).

Currently, the following program deadlocks:

func main() {
	bat := newBaton()

	actor := func(n string) func() {
		return func() {
			time.Sleep(rand.N(time.Microsecond))
			<-bat.Acquire(n)
			time.Sleep(rand.N(time.Microsecond))
			bat.Release(n)
		}
	}

	for {
		var wg sync.WaitGroup
		wg.Go(actor("a"))
		wg.Go(actor("b"))
		wg.Wait()
	}
}

The reason is that Release can occur in between Acquire returning the channel, and the channel receive. Release sees no other actor waiting for the channel, so drops the baton.

Context

It came to me in a dream.

Changes

Revert the design to be closer to #3722

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go tool gofumpt -extra -w .)

Disclosures / Credits

I did not use AI tools at all

@DrJosh9000 DrJosh9000 requested review from a team as code owners March 11, 2026 03:36
@DrJosh9000 DrJosh9000 force-pushed the fix-baton-deadlock branch 2 times, most recently from a8af420 to 01aa45d Compare March 11, 2026 03:42
@@ -5,17 +5,19 @@ import "sync"
// baton is a channel-based mutex. This allows for using it as part of a select
// statement.
type baton struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have a test suite which picked this up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To answer your question as written: no, there's no existing test suite that picked it up. I don't mind adding a test though.

Copy link
Contributor

@zhming0 zhming0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new implementation looks sound so ✅ .

But I have a question about how the deadlock would occur.

// Something holds the baton, so record that this actor is
// waiting for the baton.
b.acquire[by] = ch
return ch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this deadlock? or race condition instead.

If release happened in between return ch and <-ch, then I think the worst case is that multiple parties consider themselves as holder at the same time.

But I don't see how deadlock is happening.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume two actors "a" and "b", one runs <-b.Acquire("a") and the other b.Release("b"). Release scans all entries in the b.acquire map to find any that are waiting on the channel. If the Release in "b" happens after Acquire returns (it must at least wait for the mutex), but before the channel is being waited on by "a", then the scan finds no channel being waited on, so assigns the holder to "" and has cleared the map as it goes.

@DrJosh9000 DrJosh9000 enabled auto-merge March 11, 2026 05:06
@DrJosh9000 DrJosh9000 merged commit c36458a into main Mar 11, 2026
2 checks passed
@DrJosh9000 DrJosh9000 deleted the fix-baton-deadlock branch March 11, 2026 05:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants