storage: prevent recording events into closed tracing spans #39203
Conversation
Writing a test to exercise the fourth commit now.
nvb
left a comment
Reviewed 3 of 3 files at r1, 1 of 1 files at r2, 3 of 3 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/storage/replica_application.go, line 173 at r3 (raw file):
// tryReproposeWithNewLeaseIndex). In that case, there's a newer
// version of the proposal in the pipeline, so don't consider this
// entry to have been proposed locally. We entry must necessarily be
s/We/The/
pkg/storage/replica_application.go, line 132 at r4 (raw file):
for ok := it.init(&b.cmdBuf); ok; ok = it.next() {
    cmd := it.cur()
    // Reset the context for already applied commands to ensure that
I see why the other case is necessary, but why here? Nothing in this batch has applied yet.
pkg/storage/replica_application.go, line 881 at r4 (raw file):
for ok := it.init(&b.cmdBuf); ok; ok = it.next() {
    cmd := it.cur()
    // Reset the context for already applied commands to ensure that
Nice job spotting this issue! It seems like it would also cause problems. I'm fine with this going in without a test for now.
Interestingly, the refactor I'm working on splits applying the side effects of each command in a batch from acknowledging the commands in a batch, so it wouldn't run into this kind of issue. The need for this logic further justifies performing the steps in that order.
pkg/storage/replica_test.go, line 11561 at r3 (raw file):
TestReproposalAtLaterIndexDoesNotShareContextWithEarlierReproposal
Looks like the name changed.
pkg/storage/replica_test.go, line 11581 at r3 (raw file):
tc.manualClock = hlc.NewManualClock(123)
cfg := TestStoreConfig(hlc.NewClock(tc.manualClock.UnixNano, time.Nanosecond))
// Below we saet txnID to the value of the transaction we're going to force to
set
pkg/storage/replica_test.go, line 11604 at r3 (raw file):
// Detect the application of the
TestingApplyFilter: func(args storagebase.ApplyFilterArgs) (retry int, pErr *roachpb.Error) {
    if !seen && args.CmdID != cmdID {
Should this be if seen || args.CmdID != cmdID?
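For illustration, here is a minimal standalone sketch of the guard the comment is proposing; `applyFilterArgs` is a hypothetical stand-in for `storagebase.ApplyFilterArgs`, and the point is simply that `seen || args.CmdID != cmdID` makes the filter fire at most once for the targeted command:

```go
package main

import "fmt"

// applyFilterArgs is a hypothetical stand-in for storagebase.ApplyFilterArgs;
// only the command ID matters for this sketch.
type applyFilterArgs struct {
	CmdID string
}

func main() {
	const cmdID = "target-cmd"
	seen := false

	// filter mimics the shape of a TestingApplyFilter closure: it should
	// intercept the targeted command exactly once and ignore everything else.
	filter := func(args applyFilterArgs) bool {
		if seen || args.CmdID != cmdID {
			return false // already intercepted once, or not the command we care about
		}
		seen = true
		return true
	}

	fmt.Println(filter(applyFilterArgs{CmdID: "other"}))      // false: different command
	fmt.Println(filter(applyFilterArgs{CmdID: "target-cmd"})) // true: first application
	fmt.Println(filter(applyFilterArgs{CmdID: "target-cmd"})) // false: fires at most once
}
```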
pkg/storage/replica_test.go, line 11617 at r3 (raw file):
tc.repl.mu.proposalBuf.flushLocked()
// Increase the lease sequence so that future reproposals will fail with
// NotLeaseHolderError.
"This mimics the outcome of a leaseholder change slipping in between the application of the first proposal and the reproposals."
pkg/storage/replica_test.go, line 11621 at r3 (raw file):
// This return value will force another retry which will carry a yet
// higher MaxLeaseIndex and will trigger our invariant violation.
return 1, roachpb.NewErrorf("forced error that can be reproposed at a higher index")
s/1/proposalIllegalLeaseIndex/?
ajwerner
left a comment
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/replica_application.go, line 132 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I see why the other case is necessary, but why here? Nothing in this batch has applied yet.
Totally isn't.
pkg/storage/replica_application.go, line 881 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Nice job spotting this issue! It seems like it would also cause problems. I'm fine with this going in without a test for now.
Interestingly, the refactor I'm working on splits applying the side effects of each command in a batch from acknowledging the commands in a batch, so it wouldn't run into this kind of issue. The need for this logic further justifies performing the steps in that order.
Added a test.
This is somehow failing but I don't think this PR touches any of the methods where protos are tracked.
nvb
left a comment
Reviewed 5 of 5 files at r5, 1 of 1 files at r6, 4 of 4 files at r7, 2 of 2 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/replica_raft.go, line 759 at r6 (raw file):
due to errors
which are considered fatal.
pkg/storage/replica_raft.go, line 267 at r7 (raw file):
// If an error occurs reset the command's MaxLeaseIndex to its initial value.
// Failure to propose will propagate to the client. An invariant of this
// package is that proposals which are finished carry raft command with a
commands
The deferring of updateProposalQuota led to a confusing invariant violation. The reason for the defer was to ensure that updateProposalQuota was called both in the normal case and in the one non-error early return case. This commit explicitly calls updateProposalQuota in the two non-error cases and moves the explanation for calling it to the call sites.

Release note: None
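As a rough sketch of the shape of that change (the function names here are placeholders, not the real replica code): a deferred call runs on every path, including the error return that is treated as fatal, whereas explicit calls cover only the two non-error exits:

```go
package main

import (
	"errors"
	"fmt"
)

// updateProposalQuota is a placeholder for the real method; the commit's
// point is only about where it gets called from.
func updateProposalQuota() { fmt.Println("updateProposalQuota called") }

// Before: the deferred call runs on every path, including the error return
// that is treated as fatal, which made the later invariant violation confusing.
func handleBefore(fail bool) error {
	defer updateProposalQuota()
	if fail {
		return errors.New("fatal")
	}
	return nil
}

// After: updateProposalQuota is invoked explicitly at each non-error return,
// so error paths never reach it and the reason is visible at the call sites.
func handleAfter(fail, earlyReturn bool) error {
	if fail {
		return errors.New("fatal")
	}
	if earlyReturn {
		updateProposalQuota() // the one non-error early return case
		return nil
	}
	updateProposalQuota() // the normal case
	return nil
}

func main() {
	_ = handleBefore(true)        // still calls updateProposalQuota despite the error
	_ = handleAfter(false, false) // normal case
}
```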
This commit ensures that when a command has been reproposed more than once, the reproposal with the highest MaxLeaseIndex ultimately finishes the command. This commit and its associated test eliminate panics we've seen due to later reproposals recording events into spans which have already been finished. The next commit will deal with preventing recording into a context from a later reproposal at the same MaxLeaseIndex.

Release note (bug fix): prevent panic due to recording into finished tracing spans caused by acknowledging an earlier failed reproposal when a later reproposal exists.
… used

This commit deals with the case where more than one reproposal of a command exists at the same MaxLeaseIndex. Code that already exists prevents the command from being finished more than once but leaves open the possibility of writing events into a potentially closed span.

Release note: None
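The invariant described by these two commits can be sketched in isolation; the `proposal` and `registry` types below are hypothetical stand-ins for the replica's proposal map, not the actual implementation. A reproposal may only finish the command (and thus its tracing span) if it carries the highest MaxLeaseIndex currently registered, and any later duplicate must not write events into the span once it is finished:

```go
package main

import "fmt"

// proposal is a hypothetical stand-in for a pending Raft proposal.
type proposal struct {
	id            string
	maxLeaseIndex int
	finished      bool
}

// record only logs into the proposal's "span" if it hasn't been finished.
func (p *proposal) record(event string) {
	if p.finished {
		fmt.Printf("dropping %q: span for %s already finished\n", event, p.id)
		return
	}
	fmt.Printf("%s@%d: %s\n", p.id, p.maxLeaseIndex, event)
}

// registry maps command IDs to the latest registered reproposal.
type registry map[string]*proposal

// maybeFinish finishes the command only if this reproposal carries the highest
// MaxLeaseIndex currently registered; earlier reproposals are ignored.
func (r registry) maybeFinish(p *proposal) {
	latest, ok := r[p.id]
	if !ok || p.maxLeaseIndex < latest.maxLeaseIndex {
		return // a newer reproposal owns the command now
	}
	latest.record("finishing command")
	latest.finished = true
	delete(r, p.id)
}

func main() {
	r := registry{}
	first := &proposal{id: "cmd1", maxLeaseIndex: 10}
	r[first.id] = first

	// A reproposal at a higher MaxLeaseIndex supersedes the first attempt.
	repro := &proposal{id: "cmd1", maxLeaseIndex: 11}
	r[repro.id] = repro

	r.maybeFinish(first) // no-op: 10 < 11
	r.maybeFinish(repro) // finishes at the highest index

	// A stale duplicate at the same index must not record into the span.
	repro.record("late event")
}
```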
TFTR! bors r=nvanbenschoten
39203: storage: prevent recording events into closed tracing spans r=nvanbenschoten a=ajwerner

In #39188 we saw a surprising assertion failure due to a panic not being captured in a deferred function; the second commit deals with this obfuscation. The first commit adds a useful testing knob utilized by the third commit. The third commit ensures that if a command is reproposed more than once at different MaxLeaseIndex values in such a way that the reproposals fail, the later failure, not the earlier one, is reported to the client. This prevents the panics seen in #39188. The fourth commit deals with cases where the same command is reproposed more than once at the same MaxLeaseIndex.

The motivation for all of this is the logging of events into a closed recording trace span. In order to test such reuse, the tests enable [`trace.DebugUseAfterFinish`](https://github.com/golang/net/blob/ca1201d0de80cfde86cb01aea620983605dfe99b/trace/trace.go#L85-L87) and then read the logs to determine whether the debugging code detected a use after finish. The code also adds an assertion that commands are finished at the highest MaxLeaseIndex at which they have been proposed.

Fixes #39188.
Fixes #39190.

Co-authored-by: Andrew Werner <[email protected]>
Build succeeded
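For reference, a minimal sketch of the detection mechanism the PR description relies on, using `golang.org/x/net/trace` directly rather than the PR's test harness; it assumes the use-after-finish warning is emitted through the standard `log` package, which is why the test reads the logs:

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"golang.org/x/net/trace"
)

func main() {
	// Must be set before the trace is finished so the package records the
	// finishing stack and can complain about later use.
	trace.DebugUseAfterFinish = true

	// Capture the standard logger's output, which (assumption) is where
	// x/net/trace reports use-after-finish; the PR's test reads logs similarly.
	var buf bytes.Buffer
	log.SetOutput(&buf)

	tr := trace.New("example", "use-after-finish")
	tr.LazyPrintf("event before finish") // fine
	tr.Finish()
	tr.LazyPrintf("event after finish") // should trigger the debug warning

	if buf.Len() > 0 {
		fmt.Println("detected use after finish:")
		fmt.Print(buf.String())
	} else {
		fmt.Println("no use-after-finish detected")
	}
}
```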
39254: storage/apply: create apply package for raft entry application r=nvanbenschoten a=nvanbenschoten

The new package provides abstractions and routines associated with the application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:

- We've been having a number of discussions on the Core team about making storage abstractions more clear and easier to understand in isolation. One commonly discussed proposal is introducing a `storage/replicate` package that would encapsulate the concerns of raft replication (e.g. log manipulation, snapshots, leader election, heartbeats, etc.). This `storage/apply` package will fit in nicely alongside a replication abstraction.
- Initial discussion on #38954 concluded that adding an optimization to acknowledge clients after their raft entries have committed but before they had been applied with the current code structure was moving in the opposite direction and making things even harder to understand due to the introduction of more complex state management.
- Recent instability in this area (#38976, #39064, #39135, #39203) has revealed that there exists a high degree of difficulty involved in testing any of the logic in the area of raft entry application. This has naturally led to testing at a distance using tools like testing hooks, which is frustrating and delicate. As a result, we're missing tests for things like old migrations that we still need to support. We also have trouble writing regression tests when bugs do pop up.
- The proposed optimization in #17500 (comment) to apply committed raft entries to the Replica storage engine asynchronously in a separate thread from the raft processing thread will make entry application significantly more complex. For instance, we'll likely need to introduce a separate scheduler to coordinate entry application passes across Ranges on a node. The scheduler will likely want to prioritize leaders over followers and embed other policies to optimize for total system throughput. There's a strong desire to isolate this new complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of applying committed raft entries. To start, this makes the process easier to understand both in terms of the macro-level steps that are taken during application of a batch of entries and in terms of the impact that an individual command has on the replicated state machine. For instance, the PR helps provide answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit to write targeted unit tests. Not only can the `storage/apply` package be tested with a mock state machine (done in this PR), but we can test Replica's implementation of the state machine interface in isolation without needing to touch raft at all.

Finally, the refactor paves the way for making the proposed change in #38954 in a much cleaner way. This is demonstrated in the second commit, which is being included here to show why certain things were designed the way they were but will not be merged with this PR.

Co-authored-by: Nathan VanBenschoten <[email protected]>
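To make the proposed abstraction concrete, here is a hypothetical sketch of what a state-machine interface in the spirit of `storage/apply` might look like; the interface and method names, and the mock, are illustrative, not the package's actual API:

```go
package main

import "fmt"

// Command is one replicated command decoded from a committed raft entry.
type Command interface {
	Index() uint64
	// Rejected reports whether a successfully replicated command must still
	// be rejected during application (e.g. because of a stale lease index).
	Rejected() bool
}

// Batch stages the effects of a sequence of commands so they can be applied
// to the storage engine atomically.
type Batch interface {
	Stage(Command) error
	ApplyToStateMachine() error
}

// StateMachine is the replicated state machine driven by committed entries.
type StateMachine interface {
	NewBatch() Batch
	// AckOutcome acknowledges a proposal's outcome to the waiting client.
	AckOutcome(Command) error
}

// applyCommitted shows the macro-level steps: stage each command into a
// batch, apply the batch to the state machine, then acknowledge outcomes.
func applyCommitted(sm StateMachine, cmds []Command) error {
	b := sm.NewBatch()
	for _, c := range cmds {
		if err := b.Stage(c); err != nil {
			return err
		}
	}
	if err := b.ApplyToStateMachine(); err != nil {
		return err
	}
	for _, c := range cmds {
		if err := sm.AckOutcome(c); err != nil {
			return err
		}
	}
	return nil
}

// The mock below stands in for Replica's implementation of the interface.
type mockCmd uint64

func (c mockCmd) Index() uint64  { return uint64(c) }
func (c mockCmd) Rejected() bool { return false }

type mockSM struct{ applied []uint64 }

type mockBatch struct {
	sm     *mockSM
	staged []uint64
}

func (m *mockSM) NewBatch() Batch { return &mockBatch{sm: m} }
func (m *mockSM) AckOutcome(c Command) error {
	fmt.Println("acked command at index", c.Index())
	return nil
}
func (b *mockBatch) Stage(c Command) error {
	b.staged = append(b.staged, c.Index())
	return nil
}
func (b *mockBatch) ApplyToStateMachine() error {
	b.sm.applied = append(b.sm.applied, b.staged...)
	return nil
}

func main() {
	sm := &mockSM{}
	if err := applyCommitted(sm, []Command{mockCmd(1), mockCmd(2)}); err != nil {
		panic(err)
	}
	fmt.Println("applied indexes:", sm.applied)
}
```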
39425: storage: create new tracing span for each replicatedCmd during application r=nvanbenschoten a=nvanbenschoten

This change adjusts our handling of contexts and replicatedCmds. We no longer assign a local proposal's context to its replicatedCmds directly. Instead, we always create a new tracing span that follows from this context. This eliminates a whole class of bugs that we have fought to fix in changes like #39203. In fact, we are still seeing issues after the recent refactor when stressing #39390.

The change also paves the way for tracing command application on follower replicas.

Co-authored-by: Nathan VanBenschoten <[email protected]>
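An illustrative sketch of the "new span that follows from the proposal's context" idea, written against the generic opentracing-go API rather than CockroachDB's tracing wrappers; `spanForCommand` and the operation name are hypothetical:

```go
package main

import (
	"context"
	"fmt"

	opentracing "github.com/opentracing/opentracing-go"
)

// spanForCommand derives a fresh tracing span for one application of a
// replicated command. Instead of handing the proposal's own span to the
// command (which a later reproposal might try to log into after it is
// finished), each application gets a span that merely follows from it.
func spanForCommand(proposalCtx context.Context, op string) (context.Context, opentracing.Span) {
	var opts []opentracing.StartSpanOption
	if parent := opentracing.SpanFromContext(proposalCtx); parent != nil {
		opts = append(opts, opentracing.FollowsFrom(parent.Context()))
	}
	sp := opentracing.StartSpan(op, opts...)
	return opentracing.ContextWithSpan(proposalCtx, sp), sp
}

func main() {
	// With the default no-op tracer this just demonstrates the wiring.
	ctx, sp := spanForCommand(context.Background(), "apply command")
	defer sp.Finish()
	sp.LogKV("event", "applying write batch")
	fmt.Println(ctx != nil)
}
```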