
Conversation

@fuweid (Member) commented Jan 12, 2024

Unlike SnapshotWithVersion, client.Snapshot doesn't wait for the first response, so the server may only open the db after we have already closed the connection or shut down the server. Reading a few bytes from the stream ensures the server has opened boltdb first.

Closes #17223


Unlike SnapshotWithVersion, client.Snapshot doesn't wait for the first
response, so the server may only open the db after we close the connection or
shut down the server. Reading a few bytes ensures the server has opened boltdb.

Signed-off-by: Wei Fu <[email protected]>
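
For illustration, a minimal sketch of the idea behind the change, written against the public clientv3 API; the endpoint and the number of bytes read are arbitrary choices for this sketch, and it is not the exact diff in this PR:

package main

import (
	"context"
	"io"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint and dial timeout are arbitrary values for this sketch.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	rc, err := cli.Snapshot(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	// Unlike SnapshotWithVersion, Snapshot returns before the first response
	// arrives. Reading a small prefix blocks until the server has produced
	// the first chunk, i.e. until it has opened the boltdb backend.
	buf := make([]byte, 16)
	if _, err := io.ReadFull(rc, buf); err != nil {
		log.Fatal(err)
	}

	// Only after this point is it safe for a test to cancel the stream or
	// tear the cluster down without racing with the server opening the db.
}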
@serathius (Member) commented:

/retest

@serathius (Member) commented Jan 12, 2024

Fixes the issue on my side.

From

...
/tmp/go-stress-20240112T174046-3291411157
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xb5c146]

goroutine 113 [running]:
go.etcd.io/bbolt.(*Cursor).seek(0xc00092cd68, {0x196b42c?, 0x40?, 0x101e160?})
        /usr/local/google/home/siarkowicz/pkg/mod/go.etcd.io/[email protected]/cursor.go:159 +0x26
go.etcd.io/bbolt.(*Bucket).Bucket(0xc00019e0f8, {0x196b42c, 0x4, 0x4126c5?})
        /usr/local/google/home/siarkowicz/pkg/mod/go.etcd.io/[email protected]/bucket.go:105 +0xb9
go.etcd.io/bbolt.(*Tx).Bucket(...)
        /usr/local/google/home/siarkowicz/pkg/mod/go.etcd.io/[email protected]/tx.go:102
go.etcd.io/etcd/server/v3/storage/backend.(*batchTx).UnsafeRange(0xc000428b00, {0x122ff60, 0x19d2720}, {0x196c4c0, 0x10, 0x10}, {0x0, 0x0, 0x0}, 0x0)
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/server/storage/backend/batch_tx.go:174 +0x91
go.etcd.io/etcd/server/v3/storage/schema.UnsafeReadConsistentIndex({0x7f40fc5d4318, 0xc000428b00})
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/server/storage/schema/cindex.go:41 +0x5d
go.etcd.io/etcd/server/v3/storage/schema.unsafeUpdateConsistentIndex.func1()
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/server/storage/schema/cindex.go:80 +0x3e
go.etcd.io/etcd/client/pkg/v3/verify.Verify(0xc00092d078)
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/client/pkg/verify/verify.go:71 +0x33
go.etcd.io/etcd/server/v3/storage/schema.unsafeUpdateConsistentIndex({0x1234480, 0xc000428b00}, 0x7, 0x2, 0x0)
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/server/storage/schema/cindex.go:79 +0xd7
go.etcd.io/etcd/server/v3/storage/schema.UnsafeUpdateConsistentIndex(...)
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/server/storage/schema/cindex.go:67
go.etcd.io/etcd/server/v3/etcdserver/cindex.(*consistentIndex).UnsafeSave(0x4271c5?, {0x1234480?, 0xc000428b00?})
        /usr/local/google/home/siarkowicz/src/go.etcd.io/etcd/server/etcdserver/cindex/cindex.go:121 +0x30
go.etcd.io/etcd/server/v3/storage.(*BackendHooks).OnPreCommitUnsafe(0xc00012a630, {0
…
40s: 1485 runs so far, 6 failures (0.40%)

To

$ stress  ./clientv3.test -test.run=TestMaintenanceSnapshotCancel 
5s: 101 runs so far, 0 failures
10s: 217 runs so far, 0 failures
15s: 336 runs so far, 0 failures
20s: 438 runs so far, 0 failures
25s: 552 runs so far, 0 failures
30s: 657 runs so far, 0 failures
35s: 766 runs so far, 0 failures
40s: 874 runs so far, 0 failures
45s: 979 runs so far, 0 failures
50s: 1089 runs so far, 0 failures
55s: 1195 runs so far, 0 failures
1m0s: 1298 runs so far, 0 failures
1m5s: 1403 runs so far, 0 failures
1m10s: 1510 runs so far, 0 failures
1m15s: 1616 runs so far, 0 failures

@ahrtr (Member) commented Jan 13, 2024

Thanks @fuweid for the fix.

Are you sure this is only a test issue? As mentioned in #17223, there should not be any panic when processing client requests. Probably we should also add some protection to prevent such a panic.
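
For context, a toy sketch of what such protection could look like; this is a hypothetical shape only, not etcd code, and all names below are made up for the example:

package main

import (
	"errors"
	"fmt"
)

var errBackendClosed = errors.New("backend already closed")

// backend stands in for the storage backend in this toy example.
type backend struct {
	closed bool
	data   map[string]string
}

func (b *backend) Close() {
	b.closed = true
	b.data = nil
}

// rangeKeys stands in for a read such as UnsafeRange in the panic above: it
// checks for a closed backend and fails gracefully instead of panicking.
func (b *backend) rangeKeys(key string) (string, error) {
	if b.closed {
		return "", errBackendClosed
	}
	return b.data[key], nil
}

func main() {
	b := &backend{data: map[string]string{"k": "v"}}
	b.Close()
	if _, err := b.rangeKeys("k"); err != nil {
		fmt.Println("snapshot aborted:", err) // surfaced as an error, not a crash
	}
}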

@fuweid (Member, Author) commented Jan 14, 2024

Hi @ahrtr, the panic happened after we closed the etcd server.

    cluster.go:1421: ========= Cluster termination started =====================
    logger.go:130: 2024-01-09T15:30:32.412Z	INFO	grpc	[[core] [Channel #862] Channel Connectivity change to SHUTDOWN]
    logger.go:130: 2024-01-09T15:30:32.412Z	INFO	grpc	[[core] [Channel #862] Closing the name resolver]
    logger.go:130: 2024-01-09T15:30:32.412Z	INFO	grpc	[[core] [Channel #862] ccBalancerWrapper: closing]
    logger.go:130: 2024-01-09T15:30:32.413Z	INFO	grpc	[[core] [Channel #862 SubChannel #863] Subchannel Connectivity change to SHUTDOWN]
    logger.go:130: 2024-01-09T15:30:32.413Z	INFO	grpc	[[core] [Channel #862 SubChannel #863] Subchannel deleted]
    logger.go:130: 2024-01-09T15:30:32.413Z	INFO	grpc	[[transport] [client-transport 0xc0006866c0] Closing: rpc error: code = Canceled desc = grpc: the client connection is closing]
    logger.go:130: 2024-01-09T15:30:32.414Z	INFO	grpc	[[transport] [server-transport 0xc000753380] Closing: EOF]
    logger.go:130: 2024-01-09T15:30:32.414Z	INFO	grpc	[[core] [Channel #862] Channel deleted]
    logger.go:130: 2024-01-09T15:30:32.414Z	INFO	grpc	[[transport] [server-transport 0xc000753380] loopyWriter exiting with error: transport closed by client]
    logger.go:130: 2024-01-09T15:30:32.414Z	INFO	grpc	[[transport] [client-transport 0xc0006866c0] loopyWriter exiting with error: transport closed by client]
    logger.go:130: 2024-01-09T15:30:32.414Z	INFO	m0	terminating a member	{"member": "m0", "name": "m0", "advertise-peer-urls": ["unix://127.0.0.1:2125409716"], "listen-client-urls": ["unix://127.0.0.1:2125509716"], "grpc-url": "unix://localhost:m0"}
    logger.go:130: 2024-01-09T15:30:32.415Z	INFO	grpc	[[core] [Server #860 ListenSocket #861] ListenSocket deleted]
    logger.go:130: 2024-01-09T15:30:32.415Z	INFO	m0	server has stopped; stopping cluster version's monitor	{"member": "m0"}
    logger.go:130: 2024-01-09T15:30:32.415Z	INFO	m0	server has stopped; stopping storage version's monitor	{"member": "m0"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa6c27c]

goroutine 19029 [running]:

Both the client and the server have closed the connection, but the snapshot streaming handler is still running. The cleanup then deletes all the data, and the snapshot handler isn't aware of that.

Integration test cases don't spawn a new process, so even after the server has been closed, its request handlers can still run. IMO, it's unlikely to hit this issue in production or in e2e tests, because there the etcd process exits.

I'm not sure it's worth adding extra protection right now.

@ahrtr (Member) commented Jan 14, 2024

IMO, it's unlikely to hit this issue in production or in e2e tests, because there the etcd process exits.

I think it might be easy to reproduce this issue in an e2e test.

  • First, have a client submit a snapshot request, and on the server side add a sleep failpoint before conducting the real snapshot;
  • Second, shut down the etcd member, with another sleep failpoint after the backend has been closed but before the process exits;
  • Eventually, when the delayed snapshot continues to run, it may panic.

I haven't had time to verify this idea; please feel free to investigate and create a test case to reproduce it.
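
A self-contained toy that plays out these steps with plain sleeps standing in for the proposed failpoints; none of the names below are real etcd symbols or existing failpoints, and the unsynchronized access is deliberate to mirror the bug:

package main

import (
	"fmt"
	"time"
)

type db struct {
	data map[string]string
}

type server struct {
	backend *db
}

// snapshot stands in for the server-side snapshot handler; the sleep models
// the proposed "before conducting the real snapshot" failpoint.
func (s *server) snapshot(done chan<- struct{}) {
	time.Sleep(100 * time.Millisecond)
	// Mirrors the UnsafeRange call in the stack trace: no guard against a
	// backend that shutdown has already released.
	fmt.Println("snapshot sees", len(s.backend.data), "keys")
	close(done)
}

// shutdown stands in for member termination; the sleep models the proposed
// "after the backend has been closed but before the process exits" failpoint.
func (s *server) shutdown() {
	s.backend = nil
	time.Sleep(200 * time.Millisecond)
}

func main() {
	s := &server{backend: &db{data: map[string]string{"k": "v"}}}

	done := make(chan struct{})
	go s.snapshot(done) // step 1: a client submits a snapshot request
	s.shutdown()        // step 2: the member shuts down while the snapshot is delayed

	<-done // never completes: the delayed snapshot dereferences nil and panics
}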

@fuweid (Member, Author) commented Jan 14, 2024

@ahrtr I think that's out of the scope of this pull request. I can verify it in a follow-up. Does that make sense to you?

@ahrtr (Member) commented Jan 14, 2024

OK, let's keep the issue #17223 open for now.

ahrtr merged commit c8b4b16 into etcd-io:main on Jan 14, 2024.
fuweid deleted the fix-17223 branch on January 14, 2024 at 13:43.


Development

Successfully merging this pull request may close this issue: UnsafeRange panicking during shutdown (#17223).
