Skip to content

Conversation

@Ben10k
Copy link

@Ben10k Ben10k commented May 15, 2024

This PR is a subset of @acuteaura's PR #2152
The current problem (v1.8.2) is that whenever a controller loses it's leader status, it does not exit gracefully, thus it fails silently. In order to prevent this, an os.exit has been implemented to shut itself down, and depend on Kubernetes to bring it back up.

Type of change:

  • Bugfix

What this PR does / why we need it:

ingress-controller doesn't recover from failed sync
#1980

Pre-submission checklist:

  • Did you explain what problem does this PR solve? Or what new features have been added?
  • Have you added corresponding test cases?
  • Have you modified the corresponding document?
  • Is this PR backward compatible? If it is not backward compatible, please discuss on the mailing list first

@acuteaura
Copy link
Contributor

acuteaura commented May 15, 2024

okay, sorry, i should've been more clear. the leader status is never lost, and that's the problem. the controller holds on to it indefinitely (at wg.Wait()), even though it really shouldn't.

you can keep this bit of code and then call rootCancel after run is invoked here, because that controller should never be returning when there's no error or the context is cancelled, and the context that keeps the leader election loop alive is derived from it.

@Ben10k
Copy link
Author

Ben10k commented May 16, 2024

I have just pushed an update. @acuteaura can you take a look?

@acuteaura
Copy link
Contributor

acuteaura commented May 16, 2024

Hm, I hadn't considered that the process also calls run and starts all the controller when it's not leader (at least without my PR), so you're probably better off just hard-exiting instead of cancelling the context, because a follower node would never hit OnStoppedLeading. This will leave you with no leader until the lease expires, which is short enough (?!) so it shouldn't be a problem (it's essentially a "pod vanishes without trace" reenactment for the failover).

Leaves you with a chance of killing the E2E suite again though, because this function can never cleanly exit now, so you may actually have cherrypick the changes from run (adding the error return value, returning errors) so you can gate the os.Exit on that.

@acuteaura
Copy link
Contributor

@Revolyssup Would this arguably unclean but more reliable fix work for you?

@wofr
Copy link

wofr commented Jul 1, 2024

Would be great if this fix could be merged. We are running on GKE with out-of-date updates. After each update, we have a fifty-fifty chance of needing a manual restart of the ingress-controller.

@bstasz-bonrepublic
Copy link

Hello,
@acuteaura @Revolyssup What is missing or needed here to get this PR merged? Currently we are not able to deploy more than 1 instances as the Leader election will fail in 1-2 days.

Looking at the commit history, it feels like Apisix Ingress Controller is in maintenance mode. Is this a correct assessment or just deemed feature complete?

@acuteaura
Copy link
Contributor

acuteaura commented Jul 12, 2024

this one's not correct. the correct way would be to revivie #2152 and fix the e2e test. or at least cherry-pick run to have an error return value and hard exit when it's not nil so it doesn't just... give up when the server isn't available at boot.

i wouldn't consider this project "dead" or "in maintenance mode", it's just very driven by individual contributors implementing what they need and some extra volunteers. if you or @wofr need this now and not someday, I'd suggest you PR it.

@github-actions
Copy link

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 30 days if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Sep 11, 2024
@github-actions
Copy link

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants