*: fix GracefulStop issue when using cmux for TLS #17790
Conversation
Skipping CI for Draft Pull Request.
The gRPC server supports using GracefulStop to drain all in-flight RPCs, including streaming RPCs. When we start the gRPC server in non-cmux mode (non-TLS, or TLS with gRPC only), we always invoke GracefulStop to drain requests.

In cmux mode (gRPC.ServeHTTP), the connection is maintained by the HTTP server, so the gRPC server is unable to send the GOAWAY control frame to the client. As a result, it always force-closes all connections and does not drain requests by default. gRPC v1.61.0 introduces the new experimental option `WaitForHandlers`, which blocks gRPC.Stop() until all RPCs finish. This patch uses `WaitForHandlers` for cmux mode's graceful shutdown.

This patch also introduces the `v3rpcBeforeSnapshot` failpoint, which is used to verify cmux mode's graceful shutdown behaviour.

For the TestAuthGracefulDisable (tests/common) case, the timeout is increased from 10s to 15s because we now attempt a graceful shutdown after the connection is closed, which takes more time than before.

Signed-off-by: Wei Fu <[email protected]>
if httpEnabled {
	gs.Stop()
} else {
	gs.GracefulStop()
}
What happens if the GracefulStop takes too long?
GracefulStop might hang when there is a deadlock while applying changes. I ran into an issue where the Snapshot streaming RPC sends data at a very slow rate; GracefulStop is blocked until the Snapshot finishes, even though the connection is already closed. The readiness probe then fails and the kubelet sends SIGKILL. If there is no probe to force-kill it, the server stays blocked until all RPCs finish.
	watchCh := rootAuthClient.Watch(wCtx, "key", config.WatchOptions{Revision: 1})
	wantedLen := 1
-	watchTimeout := 10 * time.Second
+	watchTimeout := 15 * time.Second
Why is this test affected?
Because I changed gs.Stop to gs.GracefulStop in the serving code's cleanup.
Lines 159 to 163 in 0cd5999
	defer func(gs *grpc.Server) {
		if err != nil {
			sctx.lg.Warn("stopping insecure grpc server due to error", zap.Error(err))
			gs.Stop()
			sctx.lg.Warn("stopped insecure grpc server due to error", zap.Error(err))
etcd calls http.Shutdown during stopServers.
Lines 458 to 465 in 0cd5999
func stopServers(ctx context.Context, ss *servers) {
	// first, close the http.Server
	if ss.http != nil {
		ss.http.Shutdown(ctx)
	}
	if ss.grpc == nil {
		return
	}
That call closes the net.Listener, so both http.Serve and grpc.Serve exit because they are using a closed connection. Before this patch, we always called gs.Stop in the following code, which conflicts with the stopServers logic that wants a graceful shutdown. So I changed it, and shutdown takes a little longer than before. Hope that helps.
Lines 159 to 163 in 0cd5999
	defer func(gs *grpc.Server) {
		if err != nil {
			sctx.lg.Warn("stopping insecure grpc server due to error", zap.Error(err))
			gs.Stop()
			sctx.lg.Warn("stopped insecure grpc server due to error", zap.Error(err))
Based on my discussion with @fuweid last week, we might not need this PR?
Hi @ahrtr
grpc/grpc-go#6922 fixed the GracefulStop issue introduced in v1.60. For cmux mode, where HTTP handlers and gRPC handlers share one port, we don't have GracefulStop.
I think we can create a new issue to track this, for example to add a new flag for the GracefulStop timeout.
We are going to get rid of cmux. Let's revisit this after we finish that. Thanks.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
@fuweid: The following tests failed.