fix(leader-election): exit after Leader status is lost #2236

Ben10k · 2024-05-15T10:22:34Z

This PR is a subset of @acuteaura's PR #2152.
The current problem (v1.8.2) is that whenever a controller loses its leader status, it does not exit gracefully and instead fails silently. To prevent this, an os.Exit call has been added so the controller shuts itself down and relies on Kubernetes to bring it back up.

Type of change:

Bugfix

What this PR does / why we need it:

ingress-controller doesn't recover from failed sync
#1980

Pre-submission checklist:

Did you explain what problem this PR solves, or what new features have been added?
Have you added corresponding test cases?
Have you modified the corresponding document?
Is this PR backward compatible? If it is not backward compatible, please discuss on the mailing list first

acuteaura · 2024-05-15T11:14:03Z

okay, sorry, i should've been more clear. the leader status is never lost, and that's the problem. the controller holds on to it indefinitely (at wg.Wait()), even though it really shouldn't.

you can keep this bit of code and then call rootCancel after run is invoked here, because that controller should never be returning when there's no error or the context is cancelled, and the context that keeps the leader election loop alive is derived from it.

Ben10k · 2024-05-16T07:59:06Z

I have just pushed an update. @acuteaura can you take a look?

acuteaura · 2024-05-16T08:17:31Z

Hm, I hadn't considered that the process also calls run and starts all the controllers when it's not leader (at least without my PR), so you're probably better off just hard-exiting instead of cancelling the context, because a follower node would never hit OnStoppedLeading. This will leave you with no leader until the lease expires, which is short enough (?!) so it shouldn't be a problem (it's essentially a "pod vanishes without trace" reenactment for the failover).

Leaves you with a chance of killing the E2E suite again though, because this function can never cleanly exit now, so you may actually have to cherry-pick the changes from run (adding the error return value, returning errors) so you can gate the os.Exit on that.

acuteaura · 2024-05-16T08:18:17Z

@Revolyssup Would this arguably unclean but more reliable fix work for you?

wofr · 2024-07-01T07:12:22Z

Would be great if this fix could be merged. We are running on GKE with out-of-date updates. After each update, we have a fifty-fifty chance of needing a manual restart of the ingress-controller.

bstasz-bonrepublic · 2024-07-12T07:12:23Z

Hello,
@acuteaura @Revolyssup What is missing or needed here to get this PR merged? Currently we are not able to deploy more than one instance, as the leader election will fail within 1-2 days.

Looking at the commit history, it feels like Apisix Ingress Controller is in maintenance mode. Is this a correct assessment or just deemed feature complete?

acuteaura · 2024-07-12T07:26:35Z

this one's not correct. the correct way would be to revive #2152 and fix the e2e test. or at least cherry-pick run to have an error return value and hard exit when it's not nil so it doesn't just... give up when the server isn't available at boot.

i wouldn't consider this project "dead" or "in maintenance mode", it's just very driven by individual contributors implementing what they need and some extra volunteers. if you or @wofr need this now and not someday, I'd suggest you PR it.

github-actions · 2024-09-11T01:28:16Z

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 30 days if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions · 2024-10-11T01:28:26Z

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

fix(leader-election): exit after Leader status is lost

d513e59

fix(leader-election): cancel context when c.run() returns

bbda1fa

acuteaura mentioned this pull request Jul 12, 2024

fix: attempt to shut down when provider init fails #2263

Merged

github-actions bot added the stale label Sep 11, 2024

github-actions bot closed this Oct 11, 2024


4 participants
