
Conversation

@fuweid (Member) commented Apr 15, 2024

The gRPC server supports using GracefulStop to drain all in-flight RPCs, including streaming RPCs.

When we start the gRPC server in non-cmux mode (non-TLS, or TLS + gRPC-only), we always invoke GracefulStop to drain requests.

For cmux mode (gRPC.ServeHTTP), the connection is maintained by the HTTP server, so the gRPC server is unable to send the GOAWAY control frame to the client. By default it therefore force-closes all connections and does not drain requests.
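For context, a minimal sketch of the single-port pattern in question, simplified from the common grpc-go ServeHTTP recipe rather than etcd's exact wiring (the name `mixedHandler` is illustrative):

```go
package sketch

import (
	"net/http"
	"strings"

	"google.golang.org/grpc"
)

// mixedHandler routes HTTP/2 gRPC requests to the gRPC server and
// everything else to the regular HTTP mux. Because the net/http server
// owns the connection, grpc.Server never controls the HTTP/2 transport
// and cannot send GOAWAY during GracefulStop.
func mixedHandler(gs *grpc.Server, httpMux http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.ProtoMajor == 2 &&
			strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc") {
			gs.ServeHTTP(w, r)
			return
		}
		httpMux.ServeHTTP(w, r)
	})
}
```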

gRPC v1.61.0 introduces a new experimental option, WaitForHandlers, which makes gRPC.Stop() block until all in-flight RPCs finish. This patch uses WaitForHandlers for cmux mode's graceful shutdown.
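A minimal sketch of enabling the option, assuming grpc-go >= v1.61.0 (the listen address and shutdown flow are illustrative, not this patch's exact code):

```go
package main

import (
	"net"

	"google.golang.org/grpc"
)

func main() {
	// WaitForHandlers is experimental as of grpc-go v1.61.0: with it set,
	// Stop blocks until all in-flight RPC handlers have returned, even
	// though the connections themselves are force-closed.
	gs := grpc.NewServer(grpc.WaitForHandlers(true))

	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go gs.Serve(lis)

	// In cmux mode GracefulStop cannot drain (no GOAWAY), so shutdown
	// relies on Stop plus WaitForHandlers instead.
	gs.Stop()
}
```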

This patch also introduces the v3rpcBeforeSnapshot failpoint, which is used to verify cmux mode's graceful shutdown behaviour.


@k8s-ci-robot commented

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

The gRPC server supports using GracefulStop to drain all in-flight RPCs,
including streaming RPCs.

When we start the gRPC server in non-cmux mode (non-TLS, or
TLS + gRPC-only), we always invoke GracefulStop to drain requests.

For cmux mode (gRPC.ServeHTTP), the connection is maintained by the HTTP
server, so the gRPC server is unable to send the GOAWAY control frame to
the client. By default it therefore force-closes all connections and
does not drain requests.

gRPC v1.61.0 introduces a new experimental option, `WaitForHandlers`,
which makes gRPC.Stop() block until all in-flight RPCs finish. This
patch uses `WaitForHandlers` for cmux mode's graceful shutdown.

This patch also introduces the `v3rpcBeforeSnapshot` failpoint, which is
used to verify cmux mode's graceful shutdown behaviour.

For the TestAuthGracefulDisable (tests/common) case, the timeout is
increased from 10s to 15s because we now attempt a graceful shutdown
after the connection is closed, which takes more time than before.

Signed-off-by: Wei Fu <[email protected]>
@fuweid fuweid force-pushed the fix-shutdown-issue branch from 18a4dc8 to ac95dd7 Compare April 15, 2024 07:43
@fuweid fuweid marked this pull request as ready for review April 15, 2024 08:26
@fuweid (Member Author) commented Apr 15, 2024

if httpEnabled {
	gs.Stop()
} else {
	gs.GracefulStop()
}
Contributor commented:

What happens if the GracefulStop takes too long?

fuweid (Member Author) replied:

GracefulStop might hang when there is a deadlock while applying changes. I ran into an issue where the Snapshot streaming RPC sent data at a very slow rate. GracefulStop is blocked until Snapshot finishes, but the connection is already closed: the readiness probe fails and then kubelet sends SIGKILL. If there is no probe to force-kill it, the server stays blocked until all RPCs finish.
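One way to cap that wait, sketched here as an assumption rather than as part of this patch (the helper name and timeout value are made up):

```go
package sketch

import (
	"time"

	"google.golang.org/grpc"
)

// stopWithTimeout bounds GracefulStop with a timer and falls back to
// Stop, so a stuck stream (e.g. a very slow Snapshot) cannot block
// shutdown forever.
func stopWithTimeout(gs *grpc.Server, d time.Duration) {
	done := make(chan struct{})
	go func() {
		gs.GracefulStop() // blocks until all RPCs, including streams, finish
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(d):
		gs.Stop() // force-close whatever is still in flight
	}
}
```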

watchCh := rootAuthClient.Watch(wCtx, "key", config.WatchOptions{Revision: 1})
wantedLen := 1
-	watchTimeout := 10 * time.Second
+	watchTimeout := 15 * time.Second
Contributor commented:

Why is this test affected?

fuweid (Member Author) replied:

Because I changed gs.Stop to gs.GracefulStop in the serving code's cleanup.

etcd/server/embed/serve.go

Lines 159 to 163 in 0cd5999

defer func(gs *grpc.Server) {
	if err != nil {
		sctx.lg.Warn("stopping insecure grpc server due to error", zap.Error(err))
		gs.Stop()
		sctx.lg.Warn("stopped insecure grpc server due to error", zap.Error(err))

etcd calls http.Server.Shutdown during stopServers.

etcd/server/embed/etcd.go

Lines 458 to 465 in 0cd5999

func stopServers(ctx context.Context, ss *servers) {
	// first, close the http.Server
	if ss.http != nil {
		ss.http.Shutdown(ctx)
	}
	if ss.grpc == nil {
		return
	}

That call closes the net.Listener, so both http.Serve and grpc.Serve exit because they are using a closed connection. Before this patch, we always called gs.Stop in the following code, which conflicts with the stopServers logic that wants a graceful shutdown. So I changed it, and shutdown takes a little longer than before. Hope that helps.

etcd/server/embed/serve.go

Lines 159 to 163 in 0cd5999

defer func(gs *grpc.Server) {
	if err != nil {
		sctx.lg.Warn("stopping insecure grpc server due to error", zap.Error(err))
		gs.Stop()
		sctx.lg.Warn("stopped insecure grpc server due to error", zap.Error(err))

@ahrtr (Member) commented Apr 20, 2024

Based on my discussion with @fuweid last week, we might not need this PR?

  • Only certain gRPC versions (1.60?) have the problem; the current 3.5 and 3.4 releases do not depend on gRPC 1.60.
  • Also, GracefulStop has no timeout, so in theory it may block forever.

@fuweid (Member Author) commented Apr 21, 2024

Hi @ahrtr

Only certain gRPC version (1.60?) has the problem. Current 3.5 and 3.4 are not depending on gRPC 1.60

grpc/grpc-go#6922 fixed the GracefulStop issue introduced in v1.60.

In cmux mode, HTTP handlers and gRPC handlers share one port, and we don't have GracefulStop there.
IMO, we should support GracefulStop in cmux mode, because we already support it for the other modes, like separate ports and non-TLS (http://) listeners.

also GracefulStop has no timeout, in theory it may be blocked forever.

I think we can create a new issue to track this, for example adding a new flag for a GracefulStop timeout.
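For instance, a hypothetical flag along these lines (the name and default are invented for illustration; no such flag exists in etcd):

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// Hypothetical flag for the proposed follow-up issue.
	timeout := flag.Duration("experimental-graceful-stop-timeout",
		10*time.Second, "upper bound on GracefulStop before forcing Stop")
	flag.Parse()
	fmt.Println("would bound GracefulStop at", *timeout)
}
```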

@ahrtr (Member) commented Apr 21, 2024

we should support cmux mode with gracefulstop

We are going to get rid of cmux. Let's revisit this after we finish that. Thanks.

@fuweid fuweid marked this pull request as draft April 23, 2024 10:49
@stale stale bot commented Apr 26, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 26, 2025
@fuweid fuweid removed the stale label Apr 26, 2025
@github-actions bot commented

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 22, 2025
@k8s-ci-robot commented

@fuweid: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci-etcd-robustness-release34-amd64 | ac95dd7 | link | true | /test ci-etcd-robustness-release34-amd64
ci-etcd-robustness-release36-amd64 | ac95dd7 | link | true | /test ci-etcd-robustness-release36-amd64
ci-etcd-robustness-release35-amd64 | ac95dd7 | link | true | /test ci-etcd-robustness-release35-amd64
pull-etcd-govulncheck-main | ac95dd7 | link | true | /test pull-etcd-govulncheck-main
pull-etcd-govulncheck | ac95dd7 | link | true | /test pull-etcd-govulncheck

Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@fuweid fuweid closed this Aug 28, 2025
