[receiver/k8sclusterreceiver] Switch to standby mode when leader lease is lost #43084
Conversation
…e is lost Signed-off-by: Paulo Dias <[email protected]>
    kr.wg.Add(1)
    kr.mu.Unlock()
I'm not sure what this PR does other than adding extra synchronization safeguards (the wait group and the mutex)... Doesn't the existing implementation use the same "standby" approach with kr.cancel?
The conversation/idea came from this PR review.
If my understanding is correct, stopping the components completely could lead to a situation where the leader candidates from all Collector instances are stopped, leaving no one to acquire the leadership lock.
This PR puts the component in stand-by mode instead of entirely stopping it. That’s the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver. The additional synchronization here enables us to pause and resume safely, rather than stopping it completely. PTAL, and let me know if this matches your understanding.
I'm confused by this statement: "That's the key difference compared to the previous kr.cancel approach, which would fully tear down the receiver."
Can you clarify what you mean by "fully tear down"?
Looking at the code, both approaches cancel the context driving the same goroutine. Both call initialize() which explicitly sets informerFactories = nil, destroying all cached state. Both require full cache resyncs on leadership changes.
What specific part of the receiver is "standing by" in the new approach that wasn't before? The synchronization primitives (WaitGroup/mutex) prevent race conditions, but they don't change the fundamental teardown/rebuild behavior.
Hmm, I see what the confusion is here. stopReceiver and Shutdown perform the same action internally (they cancel the receiver via kr.cancel()), but they are invoked in different lifecycles. At the same time, the leader elector is not stopped, and the start callback can still restart the receiver on leadership acquisition. That was not clear to me at #42330 (comment), which is why I thought the receiver instance was no longer a leader candidate after it was stopped.
I still find this behaviour a bit confusing and cryptic, but I'm not sure if or how it could be improved. Maybe just a comment within the code would help future readers. Whatever we decide should be consistent across components that use the leader elector extension.
ChrsMark left a comment:
LGTM, with 2 nits.
This needs another look from @dmitryax before it gets in.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
/label -stale
    rw.mu.RLock()
    has := len(rw.metadataConsumers) != 0 || rw.entityLogConsumer != nil
    rw.mu.RUnlock()
These changes are extra concurrency guardrails unrelated to this PR's goal. They should be done in a separate PR.
Description
Similar to what we did in #42330, this PR makes the behaviour consistent across components: when k8sleaderelector is used, the receiver is put in standby mode instead of being shut down.
Link to tracking issue
Fixes #42707
Testing
Tested locally and added new tests to cover the new behaviour.