grpc: hold ac.mu while calling resetTransport to prevent concurrent connection attempts #7390
Merged
4 commits
c3a3d1c Update conn state to prevent concurrent connection attempts (arjan-bal)
6214c9d Make callers of resetBackoff() lock the mutex (arjan-bal)
ff977b3 Add doc comment for resetTransportAndUnlock (arjan-bal)
76ef33f Merge remote-tracking branch 'source/master' into fix_conn_connect_race (arjan-bal)
We should probably file a bug for this; otherwise, won't it be a behavior change?
Can you look at #7365 (comment) for the details of this bug?
What change in behaviour are you concerned about?
Oh, I was just asking with respect to release notes, because the bug currently points to a test flake. Anyway, I just checked and it doesn't matter, because release notes refer to the fix PR and not the issue. In the release notes, though, we should prefix the package:
`balancer: Fix race condition that could lead to multiple transports being created in parallel`
Also, it looks like without your fix, there is a case where `resetTransport` can error out and return without updating the connectivity state:
grpc-go/clientconn.go, line 1237 in bdd707e
Maybe we can call `resetTransport()` in the same critical section instead of releasing the lock and acquiring it again in `resetTransport()`?
I don't see any benefit in adding the `acCtx.Err` check in `connect`, because even `resetTransport` sets the state to `Connecting` and releases the lock:
grpc-go/clientconn.go, lines 1262 to 1263 in bdd707e
This means that the context can be cancelled (and the addrConn subsequently shut down) after the channel is in the `connecting` state, even without the change.
IIUC we just need to ensure that we don't set the connecting state after the channel enters shutdown. The check for the shutdown state at the top should be enough protection to ensure the shutdown state comes only after we enter `connecting`.
We could rename `resetTransport` to `resetTransportLocked` and expect the callers to hold the lock while calling this method. However, `resetTransport` releases the lock temporarily. Add to this that `ac.updateAddrs` calls `resetTransport` in a new goroutine, so it can't hold the lock until `resetTransport` completes. It feels a little risky to make that change. I don't know for sure, but I feel we could end up in a situation where the lock is not released correctly, resulting in a deadlock.
I don't want to make that change as the first option.
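To make the deadlock concern concrete, here is a minimal sketch of a method that is entered with the mutex held and is responsible for releasing it on every return path. The `resource` type and the names `reset` and `resetAndUnlock` are hypothetical and only illustrate the convention; this is not the grpc-go implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// resource is a hypothetical type used only for illustration.
type resource struct {
	mu     sync.Mutex
	closed bool
	resets int
}

// resetAndUnlock must be called with r.mu held.
// It releases r.mu exactly once on every return path; missing an
// Unlock on any path would leave the mutex locked and block the
// next caller forever.
func (r *resource) resetAndUnlock() error {
	if r.closed {
		r.mu.Unlock() // early return still releases the lock
		return errors.New("resource closed")
	}
	r.resets++
	r.mu.Unlock()
	// Slow work that must not run under the lock would go here.
	return nil
}

// reset acquires the lock and hands it off to resetAndUnlock.
func (r *resource) reset() error {
	r.mu.Lock()
	return r.resetAndUnlock()
}

// lockIsFreeAfterReset reports whether the mutex can be acquired
// again after reset returns, for both the normal and the error path.
func lockIsFreeAfterReset(closed bool) bool {
	r := &resource{closed: closed}
	_ = r.reset()
	if !r.mu.TryLock() {
		return false
	}
	r.mu.Unlock()
	return true
}

func main() {
	fmt.Println(lockIsFreeAfterReset(false), lockIsFreeAfterReset(true))
}
```

The cost of this convention is that the lock and its unlock live in different functions, so each return path has to be audited by hand, which is the risk described above.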
From the code, it looks like in the case of `acCtx.Err`, `resetTransport()` doesn't update the state and returns, so the state will still be idle. After your fix, though, in the case of `acCtx.Err` the state will be updated to `connecting`. Am I missing something?
From my understanding, `ac.ctx` is used to control the creation of remote connections, while `ac.state` is used to synchronize all the state transitions for the `addrConn`. `ac.connect()` doesn't deal with creating remote connections, so it doesn't need to check `ac.ctx.Err()`. It needs to ensure the transition to `Connecting` is valid, which it does by locking the mutex and verifying that `ac.state != Shutdown`. `ac.ctx` is used to avoid doing throwaway work that takes significant time (creating a remote conn).
Please let me know if your understanding is different.
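The state-based guard described here can be sketched as follows. This is a simplified model under stated assumptions, not the grpc-go code: the `conn` type, its `attempts` counter, and `runConcurrentConnects` are hypothetical names introduced for illustration. The point is that the state check and the transition to connecting happen in one critical section, so concurrent callers cannot both start an attempt.

```go
package main

import (
	"fmt"
	"sync"
)

type state int

const (
	idle state = iota
	connecting
	shutdown
)

// conn is a hypothetical stand-in for an addrConn-like object.
type conn struct {
	mu       sync.Mutex
	state    state
	attempts int // number of connection attempts started
}

// connect starts an attempt only when the conn is idle. Because the
// check and the state update share one critical section, a second
// caller always observes connecting (or shutdown) and backs off.
func (c *conn) connect() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.state != idle {
		return false
	}
	c.state = connecting
	c.attempts++
	return true
}

// close moves the conn to shutdown; later connect calls are rejected.
func (c *conn) close() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.state = shutdown
}

// runConcurrentConnects calls connect from n goroutines and returns
// how many attempts were actually started.
func runConcurrentConnects(n int) int {
	c := &conn{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.connect()
		}()
	}
	wg.Wait()
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.attempts
}

func main() {
	fmt.Println(runConcurrentConnects(100))
}
```

In this model no context check is needed inside `connect`: a conn that has been shut down is rejected by the state check alone.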
Discussed offline: it doesn't matter if `resetTransport()` returns an error after the state has been updated to `connecting`.