Skip to content

Conversation

@erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Dec 7, 2021

handleRaftReadyRaftMuLocked is not prepared to handle context
cancellation. It is typically called via the Raft scheduler, which uses
a background context, but can be called via other paths as well (e.g.
snapshot application).

This patch adds an assertion that the given context is not cancellable,
and creates a new background context for the main scheduler code path
instead of using the CLI's cancellable context.

Release note: None


Split off from #73484, see previous discussion there.

Turns out that this fails because the Raft scheduler context is in fact cancellable. It's rooted at the CLI context:

cockroach/pkg/cli/start.go

Lines 407 to 408 in c3e8d85

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

There's a few different options here, including:

  1. Fixing handleRaftReady to to handle context cancellation safely.
  2. Using a new background context for the Raft scheduler and populating it with necessary data from the passed context.
  3. Getting @andreimatei's contextutil.WithoutCancel() from util/stop: a better async task interface #66387 merged, and use it.

@erikgrinaker erikgrinaker requested review from nvb and tbg December 7, 2021 13:37
@erikgrinaker erikgrinaker self-assigned this Dec 7, 2021
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👎 on option 3, I think it's an unwise/unsound way to derive a context (see discussion there).
Also not a fan of option 1, since it's high-risk-low-reward.
There's also option 4 which is living with the sub-par assertion.

Option 2 sounds the most reasonable. The scheduler doesn't use the context cancellation. It listens to the stopper:

func (s *raftScheduler) Start(ctx context.Context, stopper *stop.Stopper) {
waitQuiesce := func(context.Context) {
<-stopper.ShouldQuiesce()
s.mu.Lock()
s.mu.stopped = true
s.mu.Unlock()
s.mu.cond.Broadcast()
}

We can pass s.cfg.AmbientContext into newRaftScheduler here instead of ctx:

s.scheduler = newRaftScheduler(s.metrics, s, storeSchedulerConcurrency)
and use that to derive a ctx and I think that should be it?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👎 on option 3, I think it's an unwise/unsound way to derive a context (see discussion there).

Let's agree on what exactly is unsound. The problems with uncanceledCtx from #66387 I think are not around the context that terminates the cancelation itself, but about the InheritsCancelation(child, parent context.Context) bool utility method, which is not particularly reliable. So having that utility might be a bad idea, but I don't see a problem with uncanceledCtx. I think it is indistinguishable from a hypothetical context that copies everything from the parent but the cancelation. Right?
So I think there's option 5 here too, which is to not assert anything, but to actively terminate the cancelation by deriving an uncanceledCtx.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@erikgrinaker erikgrinaker changed the title kvserver: assert non-cancellable context in handleRaftReadyRaftMuLocked kvserver: use and assert non-cancellable Raft scheduler context Dec 8, 2021
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option 2 sounds the most reasonable.

Done, for now.

I don't see a problem with uncanceledCtx. I think it is indistinguishable from a hypothetical context that copies everything from the parent but the cancelation. Right?

I tend to agree with this.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@erikgrinaker erikgrinaker marked this pull request as ready for review December 8, 2021 09:42
@erikgrinaker erikgrinaker requested a review from a team as a code owner December 8, 2021 09:42
@tbg
Copy link
Member

tbg commented Dec 8, 2021

I don't see a problem with uncanceledCtx. I think it is indistinguishable from a hypothetical context that copies everything from the parent but the cancelation. Right?

I tend to agree with this.

I think there are unspoken rules about implementing context.Context and one of them is that you can never widen the cancellation. Quoth https://pkg.go.dev/context#pkg-overview (emphasis mine):

Incoming requests to a server should create a Context, and outgoing calls to servers should accept a Context. The chain of function calls between them must propagate the Context, optionally replacing it with a derived Context created using WithCancel, WithDeadline, WithTimeout, or WithValue. When a Context is canceled, all Contexts derived from it are also canceled.

Besides, the purist in me just doesn't want that to be a thing because there is already an idiomatic way, which is deriving from a context that has the wider cancellation you need. I worry that playing fast and loose with context.Context will make it more difficult to extend it in useful ways such as #66884.

Not relevant here, but another unspoken rule is that you can never wholesale discard the values stored in a context, i.e. I don't think the inventors of context.Context ever intended for there to be a throwawayAllValuesContext wrapper, i.e. you can only unset values that you know the key for. This is what enables the chain-walking up to parents that context implementations rely on internally.

`handleRaftReadyRaftMuLocked` is not prepared to handle context
cancellation. It is typically called via the Raft scheduler, which uses
a background context, but can be called via other paths as well (e.g.
snapshot application).

This patch adds an assertion that the given context is not cancellable,
and creates a new background context for the main scheduler code path
instead of using the CLI's cancellable context.

Release note: None
@erikgrinaker
Copy link
Contributor Author

I get your point @tbg, and to a large extent I think you're right. I considered uncanceledCtx as a simple way to copy the context values without the cancellation, but even with values, it's likely better to be explicit about what we copy rather than to grab everything wholesale (however inconvenient). In principle, there could be all sorts of values in the context from third-party packages (e.g. gRPC) that we probably wouldn't want to include without good reason.

TFTR!

bors r=tbg

@craig
Copy link
Contributor

craig bot commented Dec 9, 2021

Build succeeded:

@craig craig bot merged commit d848eff into cockroachdb:master Dec 9, 2021
@erikgrinaker erikgrinaker deleted the raft-ctx-nocancel branch December 10, 2021 20:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants