Skip to content

Conversation

@acuteaura
Copy link
Contributor

mostly based on the upstream example:
https://github.com/kubernetes/client-go/blob/master/examples/leader-election/main.go

this also incidentally fixes bugs where run returns nil (i.e. when failing to reach the Admin API) by causing the context to cancel via defer.

  • Bugfix
  • Refactor

@acuteaura
Copy link
Contributor Author

acuteaura commented Jan 31, 2024

some remarks on this PR:

  • a bunch of code seems to treat leader status as "skipping writes" (via the Elector). this is pretty hard to reason about and prone to errors, so this PR changes it to a more traditional standby (with informers warmed up) that doesn't run any controllers.
  • alongside that, the controller now hard exits when it loses leader status. this likely only happens when a node netsplits from k8s apiserver, so it wouldn't be able to update things there anyway.
  • run got an error return value, because a lot of error conditions were just silently discarded. A future PR should improve these errors with some kind of wrapping and remove explicit logging in run itself so it all bubbles up
  • I did not touch the API server, because I'm not 100% about all the things it does, but we should consider failing a readiness probe if it is not leader if code in there relies on controllers running - though it did not have access to Elector until this point either.

@Revolyssup
Copy link
Contributor

@acuteaura Can you fix the merge conflicts?

@acuteaura acuteaura force-pushed the refactor/leader-election branch from 652edb7 to 90f4a4c Compare January 31, 2024 17:24
@Revolyssup
Copy link
Contributor

@acuteaura Unit tests failing

)
// rootCancel might be to slow, and controllers may have bugs that cause them to not yield
// the safest way to step down is to simply cause a pod restart
os.Exit(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

@Revolyssup
Copy link
Contributor

an e2e test related to leader election was also failing. I reran it

@acuteaura
Copy link
Contributor Author

The unit test is testing specific strings in the output instead of functionality, I think it can go.

The e2e test.... I don't even know, it seems to fail before setup, but the Restart Count between the describe and the get are different and... it doesn't make sense to me why this test would fail its BeforeRun.

@acuteaura
Copy link
Contributor Author

@Revolyssup could you have a look at this failed test? I'm not really sure how it could fail in the beforeEach step when that's all shared among tests. Or maybe re-run the tests to make sure this isn't a flake.

It's unfortunately not so trivial to run this test locally, but I'm getting to it...

@zll600
Copy link
Contributor

zll600 commented Mar 19, 2024

ping @Revolyssup

@github-actions
Copy link

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 30 days if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label May 20, 2024
@github-actions
Copy link

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Jun 19, 2024
@bstasz-bonrepublic
Copy link

Hello,
@acuteaura Would you reopen this PR so that we can have new test result logs? I would like to look into this and possibly fix it.

@acuteaura
Copy link
Contributor Author

I can't, but I can rebase and re-submit later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants