
Conversation

@fuweid (Member) commented Jan 12, 2024

Unlike SnapshotWithVersion, client.Snapshot doesn't wait for the first response, so the server may only open the db after we have already closed the connection or shut down the server. Reading a few bytes from the stream ensures the server has opened boltdb first.

Closes #17223


Unlike SnapshotWithVersion, client.Snapshot doesn't wait for the first
response, so the server may only open the db after we close the connection or
shut down the server. Reading a few bytes ensures the server has opened boltdb.

Signed-off-by: Wei Fu <[email protected]>
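
For illustration, a minimal sketch of the idea behind the change, written against the public clientv3 API; the endpoint and the number of bytes read are arbitrary choices for this sketch, and it is not the exact diff in this PR:

package main

import (
	"context"
	"io"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint and dial timeout are arbitrary values for this sketch.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	rc, err := cli.Snapshot(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	// Unlike SnapshotWithVersion, Snapshot returns before the first response
	// arrives. Reading a small prefix blocks until the server has produced
	// the first chunk, i.e. until it has opened the boltdb backend.
	buf := make([]byte, 16)
	if _, err := io.ReadFull(rc, buf); err != nil {
		log.Fatal(err)
	}

	// Only after this point is it safe for a test to cancel the stream or
	// tear the cluster down without racing with the server opening the db.
}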
@serathius (Member) commented:

/retest

@serathius (Member) commented Jan 12, 2024

Fixes the issue on my side.

From

...
/tmp/go-stress-20240112T174046-3291411157
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xb5c146]

goroutine 113 [running]:
go.etcd.io/bbolt.(*Cursor).seek(0xc00092cd68, {0x196b42c?, 0x40?, 0x101e160?})
        /usr/local/google/home/siarkowicz/pkg/mod/go.etcd.io/[email protected]/cursor.go:159 +0x26
go.etcd.io/bbolt.(*Bucket).Bucket(0xc00019e0f8, {0x196b42c, 0x4, 0x4126c5?})
        /usr/local/google/home/siarkowicz/pkg/mod/go.etcd.io/[email protected]/bucket.go:105 +0xb9
go.etcd.io/bbolt.(*Tx).Bucket(...)
        /usr/local/google/home/siarkowicz/pkg/mod/go.etcd.io/[email protected]/tx.go:102
go.etcd.io/etcd/server/v3/storage/backend.(*batchTx).UnsafeRange(0xc000428b00, {0x122ff60, 0x19d2720}, {0x196c4c0, 0x10, 0x10}, {0x0, 0x0, 0x0}, 0x0)
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/server/storage/backend/batch_tx.go:174 +0x91
go.etcd.io/etcd/server/v3/storage/schema.UnsafeReadConsistentIndex({0x7f40fc5d4318, 0xc000428b00})
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/server/storage/schema/cindex.go:41 +0x5d
go.etcd.io/etcd/server/v3/storage/schema.unsafeUpdateConsistentIndex.func1()
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/server/storage/schema/cindex.go:80 +0x3e
go.etcd.io/etcd/client/pkg/v3/verify.Verify(0xc00092d078)
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/client/pkg/verify/verify.go:71 +0x33
go.etcd.io/etcd/server/v3/storage/schema.unsafeUpdateConsistentIndex({0x1234480, 0xc000428b00}, 0x7, 0x2, 0x0)
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/server/storage/schema/cindex.go:79 +0xd7
go.etcd.io/etcd/server/v3/storage/schema.UnsafeUpdateConsistentIndex(...)
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/server/storage/schema/cindex.go:67
go.etcd.io/etcd/server/v3/etcdserver/cindex.(*consistentIndex).UnsafeSave(0x4271c5?, {0x1234480?, 0xc000428b00?})
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/server/etcdserver/cindex/cindex.go:121 +0x30
go.etcd.io/etcd/server/v3/storage.(*BackendHooks).OnPreCommitUnsafe(0xc00012a630, {0
…
40s: 1485 runs so far, 6 failures (0.40%)

To

$ stress  ./clientv3.test -test.run=TestMaintenanceSnapshotCancel 
5s: 101 runs so far, 0 failures
10s: 217 runs so far, 0 failures
15s: 336 runs so far, 0 failures
20s: 438 runs so far, 0 failures
25s: 552 runs so far, 0 failures
30s: 657 runs so far, 0 failures
35s: 766 runs so far, 0 failures
40s: 874 runs so far, 0 failures
45s: 979 runs so far, 0 failures
50s: 1089 runs so far, 0 failures
55s: 1195 runs so far, 0 failures
1m0s: 1298 runs so far, 0 failures
1m5s: 1403 runs so far, 0 failures
1m10s: 1510 runs so far, 0 failures
1m15s: 1616 runs so far, 0 failures

@ahrtr (Member) commented Jan 13, 2024

Thanks @fuweid for the fix.

Are you sure this is only a test issue? As mentioned in #17223, there should not be any panic when processing client requests. Probably we should also add some protection to prevent such a panic.
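
For context, a toy sketch of what such protection could look like; this is a hypothetical shape only, not etcd code, and all names below are made up for the example:

package main

import (
	"errors"
	"fmt"
)

var errBackendClosed = errors.New("backend already closed")

// backend stands in for the storage backend in this toy example.
type backend struct {
	closed bool
	data   map[string]string
}

func (b *backend) Close() {
	b.closed = true
	b.data = nil
}

// rangeKeys stands in for a read such as UnsafeRange in the panic above: it
// checks for a closed backend and fails gracefully instead of panicking.
func (b *backend) rangeKeys(key string) (string, error) {
	if b.closed {
		return "", errBackendClosed
	}
	return b.data[key], nil
}

func main() {
	b := &backend{data: map[string]string{"k": "v"}}
	b.Close()
	if _, err := b.rangeKeys("k"); err != nil {
		fmt.Println("snapshot aborted:", err) // surfaced as an error, not a crash
	}
}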

@fuweid (Member, Author) commented Jan 14, 2024

Hi @ahrtr, the panic happened after we closed the etcd server.

    cluster.go:1421: ========= Cluster termination started =====================
    logger.go:130: 2024-01-09T15:30:32.412Z	INFO	grpc	[[core] [Channel #862] Channel Connectivity change to SHUTDOWN]
    logger.go:130: 2024-01-09T15:30:32.412Z	INFO	grpc	[[core] [Channel #862] Closing the name resolver]
    logger.go:130: 2024-01-09T15:30:32.412Z	INFO	grpc	[[core] [Channel #862] ccBalancerWrapper: closing]
    logger.go:130: 2024-01-09T15:30:32.413Z	INFO	grpc	[[core] [Channel #862 SubChannel #863] Subchannel Connectivity change to SHUTDOWN]
    logger.go:130: 2024-01-09T15:30:32.413Z	INFO	grpc	[[core] [Channel #862 SubChannel #863] Subchannel deleted]
    logger.go:130: 2024-01-09T15:30:32.413Z	INFO	grpc	[[transport] [client-transport 0xc0006866c0] Closing: rpc error: code = Canceled desc = grpc: the client connection is closing]
    logger.go:130: 2024-01-09T15:30:32.414Z	INFO	grpc	[[transport] [server-transport 0xc000753380] Closing: EOF]
    logger.go:130: 2024-01-09T15:30:32.414Z	INFO	grpc	[[core] [Channel #862] Channel deleted]
    logger.go:130: 2024-01-09T15:30:32.414Z	INFO	grpc	[[transport] [server-transport 0xc000753380] loopyWriter exiting with error: transport closed by client]
    logger.go:130: 2024-01-09T15:30:32.414Z	INFO	grpc	[[transport] [client-transport 0xc0006866c0] loopyWriter exiting with error: transport closed by client]
    logger.go:130: 2024-01-09T15:30:32.414Z	INFO	m0	terminating a member	{"member": "m0", "name": "m0", "advertise-peer-urls": ["unix://127.0.0.1:2125409716"], "listen-client-urls": ["unix://127.0.0.1:2125509716"], "grpc-url": "unix://localhost:m0"}
    logger.go:130: 2024-01-09T15:30:32.415Z	INFO	grpc	[[core] [Server #860 ListenSocket #861] ListenSocket deleted]
    logger.go:130: 2024-01-09T15:30:32.415Z	INFO	m0	server has stopped; stopping cluster version's monitor	{"member": "m0"}
    logger.go:130: 2024-01-09T15:30:32.415Z	INFO	m0	server has stopped; stopping storage version's monitor	{"member": "m0"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa6c27c]

goroutine 19029 [running]:

Both the client and the server have closed the connection, but the snapshot streaming handler is still running. The cleanup then deletes all the data, and the snapshot handler isn't aware of that.

Integration test cases don't spawn a new process, so even after the server has been closed, its request handlers can still run. IMO, it's unlikely to hit this issue in production or in e2e tests, because there the etcd process exits.

I'm not sure it's worth adding extra protection right now.

@ahrtr (Member) commented Jan 14, 2024

IMO, it's unlikely to hit this issue in production or in e2e tests, because there the etcd process exits.

I think it might be easy to reproduce this issue in an e2e test.

  • First, have a client submit a snapshot request, and on the server side add a sleep failpoint before conducting the real snapshot;
  • Second, shut down the etcd member, with another sleep failpoint after the backend has been closed but before the process exits;
  • Eventually, when the delayed snapshot continues to run, it may panic.

I haven't had time to verify this idea; please feel free to investigate and create a test case to reproduce it.
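
A self-contained toy that plays out these steps with plain sleeps standing in for the proposed failpoints; none of the names below are real etcd symbols or existing failpoints, and the unsynchronized access is deliberate to mirror the bug:

package main

import (
	"fmt"
	"time"
)

type db struct {
	data map[string]string
}

type server struct {
	backend *db
}

// snapshot stands in for the server-side snapshot handler; the sleep models
// the proposed "before conducting the real snapshot" failpoint.
func (s *server) snapshot(done chan<- struct{}) {
	time.Sleep(100 * time.Millisecond)
	// Mirrors the UnsafeRange call in the stack trace: no guard against a
	// backend that shutdown has already released.
	fmt.Println("snapshot sees", len(s.backend.data), "keys")
	close(done)
}

// shutdown stands in for member termination; the sleep models the proposed
// "after the backend has been closed but before the process exits" failpoint.
func (s *server) shutdown() {
	s.backend = nil
	time.Sleep(200 * time.Millisecond)
}

func main() {
	s := &server{backend: &db{data: map[string]string{"k": "v"}}}

	done := make(chan struct{})
	go s.snapshot(done) // step 1: a client submits a snapshot request
	s.shutdown()        // step 2: the member shuts down while the snapshot is delayed

	<-done // never completes: the delayed snapshot dereferences nil and panics
}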

@fuweid (Member, Author) commented Jan 14, 2024

@ahrtr I think that's out of the scope of this pull request. I can verify it in a follow-up. Does that make sense to you?

@ahrtr (Member) commented Jan 14, 2024

OK, let's keep the issue #17223 open for now.

ahrtr merged commit c8b4b16 into etcd-io:main on Jan 14, 2024.
fuweid deleted the fix-17223 branch on January 14, 2024 at 13:43.


Development

Successfully merging this pull request may close this issue: UnsafeRange panicking during shutdown (#17223).
