[receiver/k8sclusterreceiver] Switch to standby mode when leader lease is lost #43084
Conversation
…e is lost Signed-off-by: Paulo Dias <[email protected]>
    kr.wg.Add(1)
    kr.mu.Unlock()
I'm not sure what this PR does other than adding extra synchronization safeguards (the wait group and the mutex)... Doesn't the existing implementation use the same "standby" approach with kr.cancel?
The conversation/idea came from this PR review.
If my understanding is correct, stopping the components completely could lead to a situation where the leader candidates from all Collector instances are stopped, leaving no one to acquire the leadership lock.
This PR puts the component in stand-by mode instead of entirely stopping it. That’s the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver. The additional synchronization here enables us to pause and resume safely, rather than stopping it completely. PTAL, and let me know if this matches your understanding.
I'm confused by this statement: "That's the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver."
Can you clarify what you mean by "fully tear down"?
Looking at the code, both approaches cancel the context driving the same goroutine. Both call initialize() which explicitly sets informerFactories = nil, destroying all cached state. Both require full cache resyncs on leadership changes.
What specific part of the receiver is "standing by" in the new approach that wasn't before? The synchronization primitives (WaitGroup/mutex) prevent race conditions, but they don't change the fundamental teardown/rebuild behavior.
Hmm, I see what the confusion is here. stopReceiver and Shutdown perform the same action internally (they cancel the receiver via kr.cancel()), but they are invoked in different lifecycles. At the same time, the leader elector is not stopped, and the start callback can still restart the receiver on leadership acquisition. That was not clear to me at #42330 (comment), which is why I thought the receiver instance was no longer a leader candidate after it was stopped.
I still find this behaviour a bit confusing and cryptic, but I'm not sure if or how it could be improved. Maybe just a comment within the code would help future readers. Whatever we decide should be consistent across components that use the leader elector extension.
ChrsMark left a comment:
LGTM, with 2 nits.
This needs another look from @dmitryax before it gets in.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
/label -stale
    rw.mu.RLock()
    has := len(rw.metadataConsumers) != 0 || rw.entityLogConsumer != nil
    rw.mu.RUnlock()
These changes are extra concurrency guardrails unrelated to this PR's goal. They should be done in a separate PR.
Description
Similar to what we did in #42330, this PR makes the behaviour consistent across components: when k8sleaderelector is used, the receiver is put in standby mode instead of being shut down.
Link to tracking issue
Fixes #42707
Testing
Tested locally and added new tests to cover the new behaviour.