go.mod: bump raft #89632
Conversation
erikgrinaker left a comment:
For fellow readers, v3.6.0-alpha.0 is the etcd version that we're importing Raft from, not the library version. There is no library version (which presents its own set of issues, but here we are). At least it's early in the release cycle, so we have time to catch issues.
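For fellow readers as well: the bump lands in go.mod as a Go pseudo-version, which encodes the v3.6.0-alpha.0 base tag plus the commit timestamp and hash. Module path and version here are taken verbatim from the `go get` output quoted later in the thread:

```
require go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
```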
We're picking up the following here:
- etcd-io/etcd#14413
- etcd-io/etcd#14538
That all looks fine at a cursory glance.

I just reviewed them in-depth and checked them all off.

Yeah, I clicked through as well; all good.
(force-pushed 7f90e89 to e7250df)
Ugh, […]

Ok, looks like I have another odyssey to go through: https://cockroachlabs.slack.com/archives/CJ0H8Q97C/p1665564865598529
(force-pushed 6a0e377 to a8b45c9)
Ok, it now builds, but there are a bunch of test failures. Going to look into them.
(force-pushed 3f7cbbd to 4237b66)
Tests are fixed, at least the ones I saw flaking. PTAL.
knz left a comment:
LGTM, with a question. @erikgrinaker you probably want a second look too.
Reviewed 1 of 1 files at r1, 28 of 28 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_application_state_machine_test.go line 260 at r2 (raw file):
k := tc.repl.Desc().EndKey.AsRawKey().Prevish(10)
What is this new bit of test code helping with?
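This wasn't answered in-thread, but for fellow readers, a self-contained sketch of the idea behind `Key.Prevish` as I understand it: it returns some key sorting strictly before its input, capped at the given length, rather than the exact immediate predecessor (which is unbounded in length). The implementation below is illustrative, not the actual `roachpb` code.

```
package main

import (
	"bytes"
	"fmt"
)

// prevish sketches the idea behind roachpb.Key.Prevish: return some key that
// sorts strictly before k and is at most `length` bytes long. The exact
// immediate predecessor of a key is unbounded in length (it ends in
// infinitely many 0xff bytes), so a cheap approximation is used instead.
func prevish(k []byte, length int) []byte {
	if len(k) == 0 {
		return nil // nothing sorts before the empty key
	}
	out := append([]byte(nil), k...)
	if out[len(out)-1] == 0 {
		out = out[:len(out)-1] // "ab\x00" -> "ab": the exact predecessor
	} else {
		out[len(out)-1]-- // "aby" -> "abx", then pad back toward the predecessor
		for len(out) < length {
			out = append(out, 0xff)
		}
	}
	if len(out) > length {
		out = out[:length] // truncating keeps out < k: a strict prefix sorts first
	}
	return out
}

func main() {
	end := []byte("rangeEndKey")
	k := prevish(end, 10)
	// The test presumably wants a key near, but strictly below, the range's
	// EndKey, without paying for an exact predecessor computation.
	fmt.Printf("%q sorts before %q: %v\n", k, end, bytes.Compare(k, end) < 0)
}
```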
Patched up a bunch of tests but there's more. Going to take this for a spin for a few hours on a gceworker: […]
re: Nathan's comment about benchmarking with early ack disabled, I'm running with this diff

```
diff --git a/pkg/kv/kvserver/replica_application_cmd.go b/pkg/kv/kvserver/replica_application_cmd.go
index e6cbbc0424..67879c3329 100644
--- a/pkg/kv/kvserver/replica_application_cmd.go
+++ b/pkg/kv/kvserver/replica_application_cmd.go
@@ -153,7 +153,7 @@ func (c *replicatedCmd) CanAckBeforeApplication() bool {
 	// We don't try to ack async consensus writes before application because we
 	// know that there isn't a client waiting for the result.
 	req := c.proposal.Request
-	return req.IsIntentWrite() && !req.AsyncConsensus
+	return req.IsIntentWrite() && !req.AsyncConsensus && false
 }

 // AckSuccess implements the apply.CheckedCommand interface.
```

and getting similar latencies. So it's likely that the benchmark isn't capturing what I think it is. Thanks for the callout! Will need to look into this more. For reference, the PR that introduced early acks is #38954.
Hmm, I think it's more that I botched the initial experiment. I just re-ran it, and I need to revisit this properly. From the looks of it now, it seems like we're regressing to way worse than "never ack anything early". In fact, it doesn't seem to matter for this test whether we ack early, so maybe it's not even hitting that path; not sure.
Ran […]. @nvanbenschoten, clearly we're regressing here in the single-node case. I did check (using […]) that […], but I'm not sure why the regression happens. We only early-ack the previous raft handling loop's index, so previously it would be

[…]

Now it's

[…]

so naively one could hope for the new version to be faster (since we don't wait for application, i.e. we're actually using early acks now), though in a saturated pipeline we will be waiting for someone else's entries instead of our own, so it should all sort of cancel out. Of course we're also doing more work, because we're pulling the entries into a Ready twice, and we need to wait for a raft scheduler round-trip. It would be nice to confirm why exactly we're slowing down, but I'm finding it somewhat difficult to pick the right tool for the job. Would you look at CPU profiles? The difference in performance is apparent with […]. I'm also curious whether you know what kind of regression is acceptable for the single-node case. I'm running multi-node experiments now to make sure these stay comparable, as we expect them to.
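On the CPU-profile question, for fellow readers: a minimal, generic sketch of capturing a profile with the standard `runtime/pprof` API, so the two builds can be diffed with `go tool pprof`. `runWorkload` is a stand-in, not the actual benchmark harness used here.

```
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// runWorkload stands in for driving the kv workload under test.
func runWorkload() {
	for i := 0; i < 1e7; i++ {
		_ = i * i
	}
}

func main() {
	// Write the profile to a file; run once per build, then compare the two
	// profiles (e.g. with a `go tool pprof -base old.prof` style workflow).
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()
	runWorkload()
}
```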
No regression on multi-node.
Hmm, now I don't know what to believe any more. Was trying to repro the difference in […]. Going to give the kv0 roachtest another run for the money.
Okay, maybe just noise. Going to 5x these tomorrow.

Worse tail latency, as expected (since we now need another round of goroutine scheduling to ack the command), but otherwise it looks comparable enough.
Early-acking splits, merges, etc. isn't useful. At best it is inconsequential, and at worst it causes flakes, because the split hasn't actually completed by the time it is acked and so isn't yet reflected in the descriptor, whose in-memory copy is frequently accessed during allocation decisions.

Release note: None
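A rough sketch of what such an exclusion could look like, building on the `CanAckBeforeApplication` snippet quoted earlier; `hasSplitOrMergeTrigger` is a hypothetical helper, not necessarily how the actual commit gates this:

```
// Sketch only: keep range-changing commands (splits, merges) out of the
// pre-application ack path. hasSplitOrMergeTrigger is a hypothetical helper.
func (c *replicatedCmd) CanAckBeforeApplication() bool {
	req := c.proposal.Request
	if hasSplitOrMergeTrigger(req) {
		// The updated descriptor only exists once the command applies, so an
		// early ack could let callers observe a "completed" split that isn't
		// reflected in the in-memory descriptor yet.
		return false
	}
	// As before: only early-ack intent writes with a waiting client.
	return req.IsIntentWrite() && !req.AsyncConsensus
}
```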
```
go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f
go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead.
go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0
go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
```

This picks up
- etcd-io/etcd#14413
- etcd-io/etcd#14538

Compared single-node performance on gceworker via […].

Closes cockroachdb#87264.

Release note: None
nvb left a comment:
LGTM, mod our discussion on Slack about one more round of benchmarking with early-ack disabled.
Reviewed 4 of 38 files at r15, 22 of 23 files at r17, 10 of 10 files at r18, 2 of 2 files at r19, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @tbg)
We were reusing a `*BatchRequest` across two `tc.Sender().Send` calls, which is now racy: due to early command acks, raft may still be applying the first command (and thus accessing the batch) while we're reusing it.
```
==================
WARNING: DATA RACE
Read at 0x00c009afb910 by goroutine 905193:
github.com/cockroachdb/cockroach/pkg/roachpb.(*BatchRequest).hasFlag()
github.com/cockroachdb/cockroach/pkg/roachpb/pkg/roachpb/batch.go:402 +0x34
github.com/cockroachdb/cockroach/pkg/roachpb.(*BatchRequest).AppliesTimestampCache()
github.com/cockroachdb/cockroach/pkg/roachpb/pkg/roachpb/batch.go:160 +0x7c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaAppBatch).assertNoWriteBelowClosedTimestamp()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:1094 +0x4c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaAppBatch).Stage()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:513 +0x24c
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.Batch.Stage-fm()
<autogenerated>:1 +0x68
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCmdIter()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:184 +0x140
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:274 +0x110
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:241 +0xa0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1043 +0x1818
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:664 +0xec
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:641 +0xfc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:333 +0x27c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker-fm()
<autogenerated>:1 +0x44
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:489 +0x14c
Previous write at 0x00c009afb910 by goroutine 905151:
github.com/cockroachdb/cockroach/pkg/roachpb.(*BatchRequest).Add()
github.com/cockroachdb/cockroach/pkg/roachpb/pkg/roachpb/batch.go:634 +0x128
github.com/cockroachdb/cockroach/pkg/kv/kvserver.TestErrorsDontCarryWriteTooOldFlag()
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_test.go:4660 +0x84c
testing.tRunner()
GOROOT/src/testing/testing.go:1446 +0x188
testing.(*T).Run.func1()
GOROOT/src/testing/testing.go:1493 +0x40
```
Release note: None
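To make the hazard concrete, here is a self-contained toy model of the racy vs. safe pattern; `Batch` stands in for `roachpb.BatchRequest`, and the async goroutine stands in for raft applying an already-acked command:

```
package main

import (
	"fmt"
	"sync"
)

// Batch stands in for roachpb.BatchRequest: a mutable request container that
// an asynchronous applier may still be reading after Send returns.
type Batch struct{ reqs []string }

func (b *Batch) Add(r string) { b.reqs = append(b.reqs, r) }

// send hands the batch to an async "apply" goroutine, mimicking a raft
// pipeline that acks the command before applying it.
func send(wg *sync.WaitGroup, b *Batch) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		_ = len(b.reqs) // reads the batch after the caller has moved on
	}()
}

func main() {
	var wg sync.WaitGroup

	// Racy pattern (what the test used to do): reuse one batch across sends.
	//   shared := &Batch{}
	//   shared.Add("put1"); send(&wg, shared)
	//   shared.Add("put2"); send(&wg, shared) // may race with the first apply

	// Safe pattern (the fix): a fresh batch per send, so the async applier
	// never observes a later mutation.
	b1 := &Batch{}
	b1.Add("put1")
	send(&wg, b1)

	b2 := &Batch{}
	b2.Add("put2")
	send(&wg, b2)

	wg.Wait()
	fmt.Println("done")
}
```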
bors r=nvanbenschoten

This PR was included in a batch that timed out; it will be automatically retried.

Build succeeded.
Fixes cockroachdb#96266.

The test became flaky after cockroachdb#89632, which made it possible for single-replica tests to see the effects of pre-application raft proposal acks. This was tripping up the MVCC GC in this benchmark, leading to `request to GC non-deleted, latest value of "test"` errors.

Release note: None
100661: kv: deflake BenchmarkMVCCGCWithForegroundTraffic r=irfansharif a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <[email protected]>