-
Notifications
You must be signed in to change notification settings - Fork 4.6k
grpclb: teach the manual resolver to handle restarts #6635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
balancer/grpclb/grpclb_util.go
Outdated
| // When ResolveNow() is called, it calls ResolveNow() on the parent ClientConn, | ||
| // so when grpclb client lose contact with remote balancers, the parent | ||
| // ClientConn's resolver will re-resolve. | ||
| type lbManualResolver struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... this has its own manual resolver too? Can we reuse the one in resolver/manual?
If not, we probably should fix that to be tolerant to idleness by setting bootstrapState here:
grpc-go/resolver/manual/manual.go
Line 102 in 62726d4
| err := r.CC.UpdateState(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can reuse the manual resolver here with some changes. I will send a separate PR for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to use the manual resolver, but we still need this lbManualResolver type since it provides some level of wrapping of the one particular piece of functionality that is provided here. Otherwise, we will have to move all this functionality to the top-level lbBalancer type and also store the resolver.ClientConn in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it provides some level of wrapping of the one particular piece of functionality that is provided here
...which is what, specifically? Calling ResolveNow on the balancer.ClientConn when the ResolveNowCallback is invoked? Does that necessitate a wrapping type though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. We don't need this type at all. Got it to work by just using the manual.Resolver.
balancer/grpclb/grpclb_util.go
Outdated
| r.stateUpdateRecv = true | ||
| r.stateMu.Unlock() | ||
|
|
||
| r.ccr.UpdateState(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... so what if the channel this is attached to went idle and then UpdateState is called during that time? This will just be ignored? Maybe that's fine? But there's definitely a race here with leaving idleness now, with Build being called again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ccr here is the resolver.ClientConn passed to the manual resolver, and is the one implemented by the resolver_conn_wrapper on the grpclb channel. The parent channel going IDLE should not affect any of this interaction. If the grpclb channel went IDLE, we shouldn't have this resolver active.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the grpclb channel went IDLE, we shouldn't have this resolver active.
I'm not sure what this means. If the grpclb channel is idle (due to a bug, since we always have an active stream) then we could still try to update it while it's exiting idle, causing a race between calling the old CC's UpdateState and the new CC being set in ccr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have this type anymore. But thinking about the case you are mentioning, let's say the following happens:
grpclbchannel isIDLEbecause of a bug.- the parent
grpclbLB policy receives aClientConnUpdateand it has a new set of balancer addresses, and it callsUpdateStateon the manual resolver - At the same time, the
grpclbchannel is exiting IDLE.- This will cause a new
manual.Resolverto be built and will update theresolver.ClientConninside it.
- This will cause a new
- If the call to
UpdateStatehappens first, then forwarding of that update to the oldresolver.ClientConnwithin themanual.Resolverwill have no effect. But it will still update thelastSeenStatefield. So, when thegrpclbchannel eventually exits IDLE and a newmanual.Resolveris built, it will still send this most recentlastSeenStateto the channel. - If the
grpclbchannel exits IDLE first, we will see two updates pushed on to the channel. One with the previouslastSeenStateand one with the new state pushed by thegrpclbLB policy.
Eventually, both are consistent. Am I still missing something?
5d8f0ae to
9b88f99
Compare
easwars
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
balancer/grpclb/grpclb_util.go
Outdated
| // When ResolveNow() is called, it calls ResolveNow() on the parent ClientConn, | ||
| // so when grpclb client lose contact with remote balancers, the parent | ||
| // ClientConn's resolver will re-resolve. | ||
| type lbManualResolver struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to use the manual resolver, but we still need this lbManualResolver type since it provides some level of wrapping of the one particular piece of functionality that is provided here. Otherwise, we will have to move all this functionality to the top-level lbBalancer type and also store the resolver.ClientConn in there.
balancer/grpclb/grpclb_util.go
Outdated
| r.stateUpdateRecv = true | ||
| r.stateMu.Unlock() | ||
|
|
||
| r.ccr.UpdateState(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the grpclb channel went IDLE, we shouldn't have this resolver active.
I'm not sure what this means. If the grpclb channel is idle (due to a bug, since we always have an active stream) then we could still try to update it while it's exiting idle, causing a race between calling the old CC's UpdateState and the new CC being set in ccr.
balancer/grpclb/grpclb_util.go
Outdated
| r.ccb.ResolveNow(o) | ||
| mr.BuildCallback = func(_ resolver.Target, ccr resolver.ClientConn, _ resolver.BuildOptions) { | ||
| r.mu.Lock() | ||
| r.ccr = ccr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this functionality (and UpdateState) handled by the manual resolver internals? Or is there something different here that I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, it is. I didn't realize it was being handled there. So, we don't this type anymore.
balancer/grpclb/grpclb_util.go
Outdated
| resolver.Builder // The underlying manual resolver builder. | ||
| resolver.Resolver // The underlying manual resolver. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this just embed *manual.Resolver instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have this type anymore.
balancer/grpclb/grpclb_util.go
Outdated
| mr.ResolveNowCallback = func(o resolver.ResolveNowOptions) { | ||
| r.ccb.ResolveNow(o) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just do: mr.ResolveNowCallback = r.ccb.ResolveNow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
balancer/grpclb/grpclb_util.go
Outdated
| // When ResolveNow() is called, it calls ResolveNow() on the parent ClientConn, | ||
| // so when grpclb client lose contact with remote balancers, the parent | ||
| // ClientConn's resolver will re-resolve. | ||
| type lbManualResolver struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it provides some level of wrapping of the one particular piece of functionality that is provided here
...which is what, specifically? Calling ResolveNow on the balancer.ClientConn when the ResolveNowCallback is invoked? Does that necessitate a wrapping type though?
| // The parent ClientConn should re-resolve when grpclb loses connection to the | ||
| // remote balancer. When the ClientConn inside grpclb gets a TransientFailure, | ||
| // it calls lbManualResolver.ResolveNow(), which calls parent ClientConn's | ||
| // ResolveNow, and eventually results in re-resolve happening in parent | ||
| // ClientConn's resolver (DNS for example). | ||
| // | ||
| // parent | ||
| // ClientConn | ||
| // +-----------------------------------------------------------------+ | ||
| // | parent +---------------------------------+ | | ||
| // | DNS ClientConn | grpclb | | | ||
| // | resolver balancerWrapper | | | | ||
| // | + + | grpclb grpclb | | | ||
| // | | | | ManualResolver ClientConn | | | ||
| // | | | | + + | | | ||
| // | | | | | | Transient | | | ||
| // | | | | | | Failure | | | ||
| // | | | | | <--------- | | | | ||
| // | | | <--------------- | ResolveNow | | | | ||
| // | | <--------- | ResolveNow | | | | | | ||
| // | | ResolveNow | | | | | | | ||
| // | | | | | | | | | ||
| // | + + | + + | | | ||
| // | +---------------------------------+ | | ||
| // +-----------------------------------------------------------------+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not worth keeping? It still seems largely accurate for now, except s/lbManualResolver/manual.Resolver/?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I retained only the last paragraph from this comment, and added that to the place where we set the ResolveNowCallback on the manual resolver. I feel that it conveys enough and that probably don't need this nicely drawn ASCII art. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems fine. This comment was in a pretty random place anyway.
An instance of
lbManualResolveris created when thegrpclbpolicy is built. After this, every time the LB policy receives new balancer addresses as part of its configuration, it pushes those addresses to thegrpclbchannel via this manual resolver.If the
grpclbchannel goesIDLE, which it technically should never because it always has a stream open to the remote balancer, grpc shuts down the manual name resolver and when it exitsIDLE, grpc recreates the name resolver. But at this point, there is no configuration update on the parentgrpclbLB policy. So, thegrpclbchannel is stuck forever waiting for addresses.This PR caches the most recent list of addresses received in the manual resolver, and replays the same when it is recreated (when exiting IDLE).
RELEASE NOTES:
grpclbchannel goes IDLE