Conversation

@paulojmdias
Member

Description

Similar to what we did in #42330, this PR keeps the behaviour consistent by putting the receiver in standby instead of shutting it down when k8sleaderelector is used.

Link to tracking issue

Fixes #42707

Testing

Tested locally and added new tests to cover the new behaviour.

@paulojmdias paulojmdias marked this pull request as ready for review September 30, 2025 22:04
@github-actions github-actions bot requested a review from povilasv September 30, 2025 22:04
Comment on lines 73 to 74
kr.wg.Add(1)
kr.mu.Unlock()
Member

@dmitryax dmitryax Oct 1, 2025

I'm not sure what this PR does other than adding extra synchronization safeguards (the wait group and the mutex)... Doesn't the existing implementation use the same "standby" approach with kr.cancel?

Member Author

The conversation/idea came from this PR review.

If my understanding is correct, stopping the components completely could lead to a situation where the leader candidates on all Collector instances are stopped and no one is left to acquire the leadership lock.

This PR puts the component in stand-by mode instead of entirely stopping it. That’s the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver. The additional synchronization here enables us to pause and resume safely, rather than stopping it completely. PTAL, and let me know if this matches your understanding.

Member

@dmitryax dmitryax Nov 6, 2025

I'm confused by this statement: "That's the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver."

Can you clarify what you mean by "fully tear down"?

Looking at the code, both approaches cancel the context driving the same goroutine. Both call initialize() which explicitly sets informerFactories = nil, destroying all cached state. Both require full cache resyncs on leadership changes.

What specific part of the receiver is "standing by" in the new approach that wasn't before? The synchronization primitives (WaitGroup/mutex) prevent race conditions, but they don't change the fundamental teardown/rebuild behavior.

Member

Hmm, I see where the confusion comes from. stopReceiver and Shutdown perform the same action internally (they cancel the receiver via kr.cancel()), but they are invoked at different points in the lifecycle. Meanwhile, the leader elector is not stopped, and the start callback can still restart the receiver when leadership is acquired. That was not clear to me at #42330 (comment), which is why I thought the receiver instance was no longer a leader candidate after it's stopped.

I still find this behaviour a bit confusing and cryptic, but I'm not sure if or how it could be improved. Maybe just adding comments in the code would help future readers. Whatever we decide, it should be consistent across components that use the leader elector extension.

Member

@ChrsMark ChrsMark left a comment

LGTM, with 2 nits.

@odubajDT odubajDT requested a review from dmitryax October 6, 2025 12:43
@atoulme
Contributor

atoulme commented Oct 14, 2025

this needs another look from @dmitryax before it gets in.

@github-actions
Contributor

github-actions bot commented Nov 1, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 1, 2025
@paulojmdias
Member Author

/label -stale

@github-actions github-actions bot removed the Stale label Nov 2, 2025
Comment on lines +347 to +349
rw.mu.RLock()
has := len(rw.metadataConsumers) != 0 || rw.entityLogConsumer != nil
rw.mu.RUnlock()
Member

These changes are extra concurrency guardrails unrelated to the PR. Should be done separately.
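
For readers following along, the guardrail in the quoted lines is the usual `sync.RWMutex` read-side pattern. The sketch below is a self-contained illustration only: the `watcher` type, the consumer signatures, and `setEntityLogConsumer` are hypothetical, and only the field names `metadataConsumers` and `entityLogConsumer` come from the diff.

```go
package main

import (
	"fmt"
	"sync"
)

// watcher is a hypothetical stand-in for the type that owns the consumers.
type watcher struct {
	mu                sync.RWMutex
	metadataConsumers []func(string)
	entityLogConsumer func(string)
}

// hasConsumers reads both fields under the read lock, so it cannot
// observe a half-finished update from a concurrent writer.
func (w *watcher) hasConsumers() bool {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return len(w.metadataConsumers) != 0 || w.entityLogConsumer != nil
}

// setEntityLogConsumer is a hypothetical writer that takes the write lock.
func (w *watcher) setEntityLogConsumer(c func(string)) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.entityLogConsumer = c
}

func main() {
	w := &watcher{}
	fmt.Println(w.hasConsumers()) // no consumers registered yet
	w.setEntityLogConsumer(func(string) {})
	fmt.Println(w.hasConsumers()) // one consumer registered
}
```

Because the pattern stands on its own like this, it can be reviewed independently of the standby change.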

Development

Successfully merging this pull request may close these issues.

[receiver/k8sclusterreceiver] Switch to standby mode when leader lease is lost

6 participants