refactor: simplify leader election #2152

acuteaura · 2024-01-31T14:31:35Z

mostly based on the upstream example:
https://github.com/kubernetes/client-go/blob/master/examples/leader-election/main.go

this also incidentally fixes bugs where run returns nil (i.e. when failing to reach the Admin API) by causing the context to cancel via defer.
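
For readers unfamiliar with the pattern, here is a minimal, stdlib-only sketch of how a deferred cancel covers every return path. It is not the code from this PR: `run` is a simplified stand-in, and `errAdminAPI`, `tryRun`, and the `stopped` channel are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errAdminAPI is a hypothetical sentinel standing in for "Admin API unreachable".
var errAdminAPI = errors.New("admin API unreachable")

// run is a simplified stand-in for the controller's run function. The deferred
// cancel fires on every return path, so background work tied to ctx stops
// whether run fails early or finishes normally.
func run(parent context.Context, adminAPIReachable bool, stopped chan<- struct{}) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	// Background worker that lives until the context is cancelled.
	go func() {
		<-ctx.Done()
		close(stopped)
	}()

	if !adminAPIReachable {
		// Wrap the cause so callers can still match it with errors.Is.
		return fmt.Errorf("run: %w", errAdminAPI)
	}
	return nil
}

// tryRun calls run against an unreachable Admin API and reports whether the
// error was surfaced and whether the worker observed the cancellation.
func tryRun() (gotErr, workerStopped bool) {
	stopped := make(chan struct{})
	err := run(context.Background(), false, stopped)

	select {
	case <-stopped:
		workerStopped = true
	case <-time.After(time.Second):
		workerStopped = false
	}
	return errors.Is(err, errAdminAPI), workerStopped
}

func main() {
	gotErr, workerStopped := tryRun()
	fmt.Println(gotErr, workerStopped)
}
```

Because the cancel is deferred rather than called on selected branches, an early return cannot leave the worker running.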

Bugfix
Refactor

acuteaura · 2024-01-31T14:57:50Z

some remarks on this PR:

a bunch of code seems to treat leader status as "skipping writes" (via the Elector). this is pretty hard to reason about and prone to errors, so this PR changes it to a more traditional standby (with informers warmed up) that doesn't run any controllers.
alongside that, the controller now hard exits when it loses leader status. this likely only happens when a node netsplits from k8s apiserver, so it wouldn't be able to update things there anyway.
run got an error return value, because a lot of error conditions were just silently discarded. A future PR should improve these errors with some kind of wrapping and remove explicit logging in run itself so it all bubbles up
I did not touch the API server, because I'm not 100% sure about everything it does, but we should consider failing its readiness probe when the instance is not leader, if code in there relies on controllers running. That said, it did not have access to Elector before this PR either.
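
The lifecycle described in these remarks can be sketched roughly as follows. This is a stdlib-only illustration, not the PR's code: the `replica` type, its fields, and `lifecycle` are hypothetical. The method names `onStartedLeading` and `onStoppedLeading` mirror the callback names in client-go's `leaderelection.LeaderCallbacks`. The exit function is injected so the sketch can observe the exit code instead of terminating; a real binary would call `os.Exit` there.

```go
package main

import "fmt"

// replica models one controller instance. The type and its fields are
// hypothetical and exist only for this sketch.
type replica struct {
	informersWarm      bool
	controllersRunning bool
	exit               func(code int) // os.Exit in a real binary; injected here
}

// start brings the replica up as a standby: caches are warmed, but no
// controller runs and nothing is written.
func (r *replica) start() {
	r.informersWarm = true
}

// onStartedLeading starts the controllers only once the lease is held.
func (r *replica) onStartedLeading() {
	r.controllersRunning = true
}

// onStoppedLeading steps down by exiting the process instead of waiting for
// every controller to yield; a pod restart brings the replica back as a standby.
func (r *replica) onStoppedLeading() {
	r.exit(0)
}

// lifecycle walks one replica through standby, leader, and step-down, and
// reports what was observed at each stage.
func lifecycle() (standbyRuns, leaderRuns bool, exitCode int) {
	exitCode = -1 // sentinel: exit has not been called yet
	r := &replica{exit: func(code int) { exitCode = code }}

	r.start()
	standbyRuns = r.controllersRunning

	r.onStartedLeading()
	leaderRuns = r.controllersRunning

	r.onStoppedLeading()
	return standbyRuns, leaderRuns, exitCode
}

func main() {
	standbyRuns, leaderRuns, exitCode := lifecycle()
	fmt.Println(standbyRuns, leaderRuns, exitCode)
}
```

The point of the design is that "am I leader?" is decided in one place (start or stop the controllers) rather than checked before every write.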

Revolyssup · 2024-01-31T17:13:43Z

@acuteaura Can you fix the merge conflicts?

Revolyssup · 2024-02-01T07:41:16Z

@acuteaura Unit tests failing

Revolyssup · 2024-02-01T08:18:45Z

pkg/providers/controller.go

 				)
+				// rootCancel might be too slow, and controllers may have bugs that cause them to not yield
+				// the safest way to step down is to simply cause a pod restart
+				os.Exit(0)


Makes sense

Revolyssup · 2024-02-01T08:28:16Z

an e2e test related to leader election was also failing. I reran it

acuteaura · 2024-02-01T12:33:33Z

The unit test is testing specific strings in the output instead of functionality, I think it can go.

The e2e test.... I don't even know, it seems to fail before setup, but the Restart Count between the describe and the get are different and... it doesn't make sense to me why this test would fail its BeforeRun.

acuteaura · 2024-02-07T11:15:19Z

@Revolyssup could you have a look at this failed test? I'm not really sure how it could fail in the beforeEach step when that's all shared among tests. Or maybe re-run the tests to make sure this isn't a flake.

It's unfortunately not so trivial to run this test locally, but I'm getting to it...

zll600 · 2024-03-19T14:26:19Z

ping @Revolyssup

github-actions · 2024-05-20T01:26:02Z

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 30 days if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions · 2024-06-19T01:27:32Z

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

bstasz-bonrepublic · 2024-07-12T08:38:30Z

Hello,
@acuteaura Would you reopen this PR so that we can have new test result logs? I would like to look into this and possibly fix it.

acuteaura · 2024-07-12T11:26:35Z

I can't, but I can rebase and re-submit later.

acuteaura added 4 commits January 31, 2024 18:24

refactor: simplify leader election

425d2fe

mostly based on the upstream example: https://github.com/kubernetes/client-go/blob/master/examples/leader-election/main.go

clean up run function

6c9df3a

fix: return err in run

f374164

chore: spelling, comments

90f4a4c

acuteaura force-pushed the refactor/leader-election branch from 652edb7 to 90f4a4c Compare January 31, 2024 17:24

Revolyssup reviewed Feb 1, 2024

acuteaura mentioned this pull request Feb 8, 2024

bug: ingress-controller doesn't recover from failed sync #1980

Closed

Ben10k mentioned this pull request May 15, 2024

fix(leader-election): exit after Leader status is lost #2236

Closed

5 tasks

github-actions bot added the stale label May 20, 2024

github-actions bot closed this Jun 19, 2024

acuteaura mentioned this pull request Jul 12, 2024

fix: attempt to shut down when provider init fails #2263

Merged
