Conversation

@tbg
Member

@tbg tbg commented Oct 10, 2022

go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f

go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead.
go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0
go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66

This picks up

- etcd-io/etcd#14413
- etcd-io/etcd#14538

Compared single-node performance on gceworker via

#!/bin/bash
set -euxo pipefail
# Stop any running cockroach process and wipe the data directory.
pkill -9 cockroach || true
rm -rf cockroach-data
# $1 selects the binary under test, e.g. ./cockroach-master or ./cockroach-pr.
cr=./cockroach-$1
$cr start-single-node --background --insecure
$cr workload init kv
# Write-only kv workload with small values, rate-limited to 2k ops/s for 10 minutes.
$cr workload run kv --splits 100 --max-rate 2000 --duration 10m --read-percent 0 --min-block-bytes 10 --max-block-bytes 10 | tee $1.txt
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0        1199604         1999.3      6.7      7.1     10.0     11.0     75.5  write #master
  600.0s        0        1199614         1999.4      6.8      7.1     10.0     11.0     79.7  write #PR

Closes #87264.

  • make it build
  • remove the maxIndex param and handling from Task.AckCommittedEntriesBeforeApplication
  • check that single node write latencies don't regress

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg mentioned this pull request Oct 10, 2022
3 tasks
@tbg tbg requested a review from erikgrinaker October 10, 2022 08:51
@tbg tbg requested a review from nvb October 10, 2022 10:19
@tbg

This comment was marked as resolved.

@tbg tbg marked this pull request as ready for review October 12, 2022 07:25
@tbg tbg requested review from a team as code owners October 12, 2022 07:25
Contributor

@erikgrinaker erikgrinaker left a comment


For fellow readers, v3.6.0-alpha.0 is the etcd version that we're importing Raft from, not the library version. There is no library version (which presents its own set of issues, but here we are). At least it's early in the release cycle, so we have time to catch issues.

@tbg
Member Author

tbg commented Oct 12, 2022

We're picking up the following here:

$ git log d19116e6ee66 --not e51c697ec6e8 --oneline --merges . | sed 's~#~https://github.com/etcd-io/etcd/pull/~' | sed 's~^~- [ ] ~' | pbcopy

@erikgrinaker
Contributor

That all looks fine at a cursory glance.

@tbg
Member Author

tbg commented Oct 12, 2022

I just reviewed them in-depth and checked them all off.

@erikgrinaker
Contributor

Yeah, I clicked through as well, all good.

@tbg tbg force-pushed the tmpdelme branch 2 times, most recently from 7f90e89 to e7250df Compare October 12, 2022 08:11
@tbg
Member Author

tbg commented Oct 12, 2022

Ok, looks like I have another odyssey to go through:

https://cockroachlabs.slack.com/archives/CJ0H8Q97C/p1665564865598529

@tbg tbg force-pushed the tmpdelme branch 2 times, most recently from 6a0e377 to a8b45c9 Compare October 12, 2022 10:26
@tbg
Member Author

tbg commented Oct 12, 2022

Ok, it now builds, but there are a bunch of test failures. Going to look into them.

@tbg tbg force-pushed the tmpdelme branch 2 times, most recently from 3f7cbbd to 4237b66 Compare October 13, 2022 12:29
@tbg tbg requested a review from erikgrinaker October 13, 2022 12:30
@tbg
Member Author

tbg commented Oct 13, 2022

Tests are fixed, at least the ones that I saw flaking. PTAL

Contributor

@knz knz left a comment


:lgtm: with a question. @erikgrinaker you probably want a second look too.

Reviewed 1 of 1 files at r1, 28 of 28 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica_application_state_machine_test.go line 260 at r2 (raw file):

		{
			k := tc.repl.Desc().EndKey.AsRawKey().Prevish(10)

What is this new bit of test code helping with?

@tbg
Member Author

tbg commented Oct 13, 2022

Patched up a bunch of tests, but there are more. Going to take this for a spin for a few hours on a gceworker:

/dev test --stress-args '-bazel -maxfails 50 -p 5 -stderr=false' --stress ./pkg/kv/kvserver/ 2>&1 > stress.txt

@tbg
Member Author

tbg commented Oct 28, 2022

Re: Nathan's comment about benchmarking with early ack disabled, I'm running with this diff:

diff --git a/pkg/kv/kvserver/replica_application_cmd.go b/pkg/kv/kvserver/replica_application_cmd.go
index e6cbbc0424..67879c3329 100644
--- a/pkg/kv/kvserver/replica_application_cmd.go
+++ b/pkg/kv/kvserver/replica_application_cmd.go
@@ -153,7 +153,7 @@ func (c *replicatedCmd) CanAckBeforeApplication() bool {
        // We don't try to ack async consensus writes before application because we
        // know that there isn't a client waiting for the result.
        req := c.proposal.Request
-       return req.IsIntentWrite() && !req.AsyncConsensus
+       return req.IsIntentWrite() && !req.AsyncConsensus && false
 }

 // AckSuccess implements the apply.CheckedCommand interface.

and getting similar latencies. So it's likely that the benchmark isn't capturing what I think it is. Thanks for the callout! Will need to look into this more.

For reference, the PR that introduced early acks is #38954.
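
As an aside, the `&& false` edit above can also be expressed as a toggle. This is purely a sketch under assumptions: the env var name and the standalone predicate are made up for illustration and are not an actual CockroachDB knob.

```go
package main

import (
	"fmt"
	"os"
)

// Hypothetical sketch, not actual CockroachDB code: gate the early-ack
// decision behind an env var (name invented here) so this kind of
// experiment doesn't require editing the source.
var earlyAcksDisabled = os.Getenv("COCKROACH_DISABLE_EARLY_ACKS") != ""

// canAckBeforeApplication mirrors the shape of the predicate in the diff
// above: only a plain intent write with a waiting (non-async) client
// qualifies for an early ack, and the env var can force it off.
func canAckBeforeApplication(isIntentWrite, asyncConsensus bool) bool {
	return isIntentWrite && !asyncConsensus && !earlyAcksDisabled
}

func main() {
	fmt.Println(canAckBeforeApplication(true, false)) // true unless the env var is set
}
```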

@tbg
Member Author

tbg commented Oct 28, 2022

Hmm, I think it's more that I botched the initial experiment. I just re-ran the master (without a diff) and it's a lot better:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0        1196104         1993.5      4.6      4.7      6.6      7.3    738.2 write master (redo)
  600.0s        0        1195939         1993.2      4.5      4.7      6.6      7.1    771.8 write master (no early ack)
  600.0s        0        1199614         1999.4      6.8      7.1     10.0     11.0     79.7  write PR
  600.0s        0        1199977         2000.0      7.0      7.6     10.0     11.0     30.4 write PR (redo)

I need to revisit this properly. From the looks of it, we're now regressing to something much worse than "never ack anything early". In fact, it doesn't seem to matter for this test whether we ack early, so maybe it isn't even hitting that path; not sure.

@tbg
Member Author

tbg commented Oct 31, 2022

Ran ./pkg/cmd/roachtest/roachstress.sh -u -c 1 kv0/enc=false/nodes=1/cpu=32 on the PR and the merge-base.

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 1800.0s        0       37890867        21050.5      3.0      3.3      5.2      7.1     28.3    PR
 1800.0s        0       41961889        23312.2      2.7      2.6      4.7      6.6    104.9    master base

@nvanbenschoten clearly we're regressing here in the single-node case. I did check (using BenchmarkSQL/Cockroach/Insert/1) that we're hitting the early-ack path.

I'm not sure why the regression happens. We only early-ack the previous raft handling loop's index, so previously it would be

  • handleRaftReady runs, appends, applies, lastIndex now 100, acks 100
  • repeat

Now it's

  • handleRaftReady runs, appends, lastIndex now 100
  • handleRaftReady runs, early acks 100, applies
  • repeat

So naively one could hope for the new version to be faster (we no longer wait for application, i.e. we're actually using early acks now), though in a saturated pipeline we will be waiting for someone else's entries instead of our own, so it should all roughly cancel out. Of course we're also doing more work, since we're pulling the entries into a Ready twice, and we need to wait for an extra raft scheduler round-trip.
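
To make the ordering shift concrete, here is a toy trace model. It is purely illustrative: none of these functions or names exist in the codebase; it only shows where the ack moves relative to append and apply.

```go
package main

import "fmt"

// Old ordering: one handleRaftReady pass appends index n, applies it, and acks it.
func handleRaftReadyOld(n int) {
	fmt.Printf("old pass: append %d, apply %d, ack %d\n", n, n, n)
}

// New ordering: a pass first early-acks the index appended by the *previous*
// pass (before applying it), then appends the next index. Each ack therefore
// waits for one extra raft scheduler round-trip.
func handleRaftReadyNew(n, prev int) {
	if prev >= 0 {
		fmt.Printf("new pass: early-ack %d, ", prev)
	} else {
		fmt.Printf("new pass: ")
	}
	fmt.Printf("append %d, apply committed entries\n", n)
}

func main() {
	for n := 100; n < 103; n++ {
		handleRaftReadyOld(n)
	}
	prev := -1
	for n := 100; n < 103; n++ {
		handleRaftReadyNew(n, prev)
		prev = n
	}
}
```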

It would be nice to confirm exactly why we're slowing down, but I'm finding it difficult to pick the right tool for the job. Would you look at CPU profiles? The difference in performance shows up with --max-rate as well, so I'm not sure the "we're doing more work" theory is very convincing.

I'm also curious if you know what kind of regression is acceptable for the single node case.

I'm running multi-node experiments now to make sure these stay comparable, as we expect them to.

@tbg
Member Author

tbg commented Nov 1, 2022

No regression on multi-node.

kv0/enc=false/nodes=3/cpu=32

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 1800.0s        0       67889324        37716.3      5.1      4.7      9.4     12.6    352.3 # master base
 1800.0s        0       67791909        37662.2      5.1      4.7      9.4     12.6    335.5 # PR

@tbg
Member Author

tbg commented Nov 1, 2022

Hmm, now I don't know what to believe any more. Was trying to repro the difference in the single-node test on gceworker, but now it all looks the same.

Going to give the kv0 roachtest another run for the money.

# --max-rate=1000 --sequential
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   70.0s        0          69997          999.9      4.0      4.1      6.6      8.9     23.1 master
   70.0s        0          69998         1000.0      4.1      4.1      6.6      8.9     17.8 PR
# --max-rate=1000
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   70.0s        0          69996          999.9      4.3      4.5      5.8      7.9     14.2 master
   70.0s        0          69997         1000.0      4.3      4.5      5.5      7.6     15.7 PR

# --max-rate=2000
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   70.0s        0         139992         1999.9      4.5      4.5      6.6      8.4     15.7 master
   70.0s        0         139993         1999.9      4.5      4.5      6.8      8.9     18.9 PR

# --max-rate=3000
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   70.0s        0         279945         3999.1      5.0      4.7      7.9      9.4     18.9 master
   70.0s        0         279949         3999.3      5.0      4.7      7.9      9.4     19.9

# --max-rate=inf
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
   70.0s        0         613579         8765.3      5.5      5.0      9.4     10.5     28.3 master
   70.0s        0         621244         8874.9      5.4      4.7      9.4     10.5     30.4 PR

@tbg
Member Author

tbg commented Nov 1, 2022

 1800.0s        0       42653967        23696.6      2.7      2.8      4.7      6.6     46.1 PR
 1800.0s        0       39263334        21813.0      2.9      3.1      5.2      7.1     25.2 master

Okay maybe just noise. Going to 5x these tomorrow.

@tbg
Member Author

tbg commented Nov 2, 2022

$ for f in artifacts-$t/kv0/enc\=false/nodes\=1/cpu\=32/run_*/*run_kv.log; do tail -n 3 $f | head -n 2; done
t=pr:
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 1800.0s        0       42763708        23757.6      2.7      2.8      4.7      6.3     37.7
 1800.0s        0       42653967        23696.6      2.7      2.8      4.7      6.6     46.1
 1800.0s        0       42054718        23363.7      2.7      2.8      4.7      6.8    209.7
 1800.0s        0       41959800        23311.0      2.7      2.8      4.7      6.8     75.5
 1800.0s        0       42693432        23718.6      2.7      2.8      4.7      6.3     26.2
 1800.0s        0       42655456        23697.5      2.7      2.8      4.7      6.3     27.3

t=master:

$ for f in artifacts-$t/kv0/enc\=false/nodes\=1/cpu\=32/run_*/*run_kv.log; do tail -n 3 $f | head -n 2; done
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 1800.0s        0       44175250        24541.8      2.6      2.6      4.7      6.3     29.4
 1800.0s        0       39263334        21813.0      2.9      3.1      5.2      7.1     25.2
 1800.0s        0       44116001        24508.9      2.6      2.6      4.7      6.3     28.3
 1800.0s        0       43811266        24339.6      2.6      2.6      4.7      6.3     31.5
 1800.0s        0       43845262        24358.5      2.6      2.6      4.7      6.3     25.2

Worse tail latency, as expected (since we now need another round of goroutine scheduling to ack the command), but otherwise this looks comparable enough.

tbg added 5 commits November 2, 2022 17:31
Early-acking splits, merges, etc. isn't useful. At best it is
inconsequential, but at worst it causes flakes, because the split hasn't
actually completed by the time it is acked and so isn't yet reflected in
the descriptor, whose in-memory copy is frequently accessed during
allocation decisions.

Release note: None
```
go get go.etcd.io/etcd/raft/v3@d19116e6ee66e52a5fd8cce2e10f9422fb80e42f

go: downloading go.etcd.io/etcd/raft/v3 v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
go: module github.com/golang/protobuf is deprecated: Use the "google.golang.org/protobuf" module instead.
go: upgraded go.etcd.io/etcd/api/v3 v3.5.0 => v3.6.0-alpha.0
go: upgraded go.etcd.io/etcd/raft/v3 v3.0.0-20210320072418-e51c697ec6e8 => v3.6.0-alpha.0.0.20221009201006-d19116e6ee66
```

This picks up

- etcd-io/etcd#14413
- etcd-io/etcd#14538

Closes cockroachdb#87264.

Release note: None
Contributor

@nvb nvb left a comment


:lgtm: mod our discussion on Slack about one more round of benchmarking with early-ack disabled.

Reviewed 4 of 38 files at r15, 22 of 23 files at r17, 10 of 10 files at r18, 2 of 2 files at r19, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @tbg)

@tbg
Member Author

tbg commented Nov 2, 2022

:lgtm: mod our discussion on Slack about one more round of benchmarking with early-ack disabled.

Ran this and it meshes with our understanding: disabling early acks makes no difference on master, but on this PR it causes a significant p50 bump.

p50 for --max-rate 2000 --duration 70s --read-percent 0 --min-block-bytes 10 --max-block-bytes 10

master 4.9ms
master w/o early ack 5.0ms
PR 4.9ms
PR w/o early ack 6.3ms
Details
$ for typ in {master,pr}{,-noearlyack}; do ./run.sh $typ; done 2>&1 | tee experiment.log
$ grep -A 3 'Highest' experiment.log
Highest sequence written: 139947. Can be passed as --write-seq=R139947 to the next run.

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   70.0s        0         139939         1999.1      4.9      4.7      7.3      9.4     46.1  write
--
Highest sequence written: 139966. Can be passed as --write-seq=R139966 to the next run.

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   70.0s        0         139954         1999.3      5.0      5.0      7.6     10.0     37.7  write
--
Highest sequence written: 139955. Can be passed as --write-seq=R139955 to the next run.

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   70.0s        0         139946         1999.2      4.9      4.7      7.3     10.0     41.9  write
--
Highest sequence written: 139972. Can be passed as --write-seq=R139972 to the next run.

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
   70.0s        0         139957         1999.4      6.3      5.5     10.5     12.6     35.7  write

We were reusing a `*BatchRequest` across two `tc.Sender().Send` calls,
which is now racy: due to early command acks, raft may still be applying
the first command (and thus accessing the batch) while we're reusing it.

```
==================
WARNING: DATA RACE
Read at 0x00c009afb910 by goroutine 905193:
  github.com/cockroachdb/cockroach/pkg/roachpb.(*BatchRequest).hasFlag()
      github.com/cockroachdb/cockroach/pkg/roachpb/pkg/roachpb/batch.go:402 +0x34
  github.com/cockroachdb/cockroach/pkg/roachpb.(*BatchRequest).AppliesTimestampCache()
      github.com/cockroachdb/cockroach/pkg/roachpb/pkg/roachpb/batch.go:160 +0x7c
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaAppBatch).assertNoWriteBelowClosedTimestamp()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:1094 +0x4c
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaAppBatch).Stage()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:513 +0x24c
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.Batch.Stage-fm()
      <autogenerated>:1 +0x68
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCmdIter()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:184 +0x140
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:274 +0x110
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:241 +0xa0
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1043 +0x1818
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:664 +0xec
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:641 +0xfc
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:333 +0x27c
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker-fm()
      <autogenerated>:1 +0x44
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
      github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:489 +0x14c

Previous write at 0x00c009afb910 by goroutine 905151:
  github.com/cockroachdb/cockroach/pkg/roachpb.(*BatchRequest).Add()
      github.com/cockroachdb/cockroach/pkg/roachpb/pkg/roachpb/batch.go:634 +0x128
  github.com/cockroachdb/cockroach/pkg/kv/kvserver.TestErrorsDontCarryWriteTooOldFlag()
      github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_test.go:4660 +0x84c
  testing.tRunner()
      GOROOT/src/testing/testing.go:1446 +0x188
  testing.(*T).Run.func1()
      GOROOT/src/testing/testing.go:1493 +0x40

```

Release note: None
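
For context, here is a minimal toy sketch of the pattern behind this fix, using made-up stand-in types (the real code uses `roachpb.BatchRequest` and the test harness's sender, which are not used here): allocate a fresh batch per send instead of mutating a shared one while application may still be reading it.

```go
package main

import "fmt"

// Toy stand-ins, purely illustrative.
type request struct{ key string }
type batchRequest struct{ reqs []request }

func (ba *batchRequest) add(r request) { ba.reqs = append(ba.reqs, r) }

// send mimics early acks: it returns to the caller while a background
// goroutine ("application") may still be reading the batch.
func send(ba *batchRequest, done chan<- struct{}) {
	go func() {
		_ = len(ba.reqs) // application still reads ba after send returned
		close(done)
	}()
}

func main() {
	// Racy pattern (what the test used to do): reuse one batch across sends,
	// mutating it via add() while application may still be reading it.
	// Fixed pattern, shown below: a fresh batch per send.
	done1, done2 := make(chan struct{}), make(chan struct{})

	ba1 := &batchRequest{}
	ba1.add(request{key: "a"})
	send(ba1, done1)

	ba2 := &batchRequest{} // fresh batch instead of reusing ba1
	ba2.add(request{key: "b"})
	send(ba2, done2)

	<-done1
	<-done2
	fmt.Println("no batch is shared mutably across sends")
}
```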
@tbg
Member Author

tbg commented Nov 2, 2022

bors r=nvanbenschoten

@craig
Contributor

craig bot commented Nov 3, 2022

This PR was included in a batch that timed out, it will be automatically retried

@craig
Contributor

craig bot commented Nov 3, 2022

Build succeeded:

@craig craig bot merged commit 9c9d55d into cockroachdb:master Nov 3, 2022
@tbg tbg deleted the tmpdelme branch November 3, 2022 11:15
nvb added a commit to nvb/cockroach that referenced this pull request Apr 4, 2023
Fixes cockroachdb#96266.

The test became flaky after cockroachdb#89632, which made it possible for
single-replica tests to see the effects of pre-application raft
proposal acks. This was tripping up the MVCC GC in this benchmark,
leading to `request to GC non-deleted, latest value of "test"`
errors.

Release note: None
craig bot pushed a commit that referenced this pull request Apr 7, 2023
100661: kv: deflake BenchmarkMVCCGCWithForegroundTraffic r=irfansharif a=nvanbenschoten

Fixes #96266.

The test became flaky after #89632, which made it possible for single-replica tests to see the effects of pre-application raft proposal acks. This was tripping up the MVCC GC in this benchmark, leading to `request to GC non-deleted, latest value of "test"` errors.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>