
fix: skip NodeEvaluation upsert when evaluation did not run #218

Open
sahitya-chandra wants to merge 2 commits into kubernetes-sigs:main from sahitya-chandra:fix/empty-nodeevaluation-status-patch

Conversation

@sahitya-chandra
Contributor

@sahitya-chandra sahitya-chandra commented May 6, 2026

Description

processNodeAgainstAllRules unconditionally upserted a NodeEvaluation entry for the current node after evaluateRuleForNode, even on the failure path. When taint patching returned a non-conflict error (e.g. RBAC denial, persistent API failure), evaluateRuleForNode exited before updateNodeEvaluationStatus ran, so the in-memory rule had no evaluation for that node and the upsert wrote a zero-value NodeEvaluation{NodeName: ""}. NodeEvaluation.NodeName is annotated with MinLength=1 and a hostname Pattern, so the API server rejected the whole Status().Patch with a 422, and the FailedNodes update bundled into the same patch was lost with it.
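To make the failure mode concrete, here is a minimal sketch of why one invalid entry takes the FailedNodes update down with it. The types `NodeEvaluation` and `StatusPatch` and the helpers `validate` and `apply` are hypothetical stand-ins, not the controller's or the API server's code. Only the MinLength=1 rule is modeled; the hostname Pattern is left out because it is not quoted here.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeEvaluation and StatusPatch are simplified, hypothetical stand-ins
// for the real API types; only the fields needed for the illustration exist.
type NodeEvaluation struct{ NodeName string }

type StatusPatch struct {
	NodeEvaluations []NodeEvaluation
	FailedNodes     []string
}

var errTooShort = errors.New("must be at least 1 character")

// validate models only the MinLength=1 rule on NodeName.
func validate(p StatusPatch) error {
	for i, e := range p.NodeEvaluations {
		if len(e.NodeName) < 1 {
			return fmt.Errorf("nodeEvaluations[%d].nodeName: %w", i, errTooShort)
		}
	}
	return nil
}

// apply is all-or-nothing, like a single status patch request:
// if validation fails, stored is left completely unchanged.
func apply(stored *StatusPatch, p StatusPatch) error {
	if err := validate(p); err != nil {
		return err
	}
	*stored = p
	return nil
}

func main() {
	stored := StatusPatch{}
	patch := StatusPatch{
		NodeEvaluations: []NodeEvaluation{{}}, // zero-value entry, NodeName == ""
		FailedNodes:     []string{"node-a"},   // valid, but bundled in the same patch
	}
	err := apply(&stored, patch)
	fmt.Println("error:", err)
	fmt.Println("failed nodes recorded:", len(stored.FailedNodes))
}
```

In this model the FailedNodes entry is valid on its own, yet it never reaches `stored` because the empty NodeName fails the request as a whole.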

This change captures the error from evaluateRuleForNode and only runs the NodeEvaluation upsert when the evaluation succeeded. On the failure path the persisted entry is left untouched and only FailedNodes is updated. This also avoids overwriting a fresh persisted entry with a stale one from the rule cache when a separate reconcile has updated status since the cache was last refreshed.

Related Issue

Fixes #217

Type of Change

/kind bug

Testing

  • make test passes locally (envtest, Kubernetes 1.34); controller package coverage rose from 72.2% to 74.5%.
  • make lint passes locally.
  • Regression test 1: the test uses a fake client that fails Patch on the node, runs NodeReconciler.Reconcile, and asserts that the FailedNodes entry lands, that no empty NodeEvaluation slips in, and that an unrelated pre-existing NodeEvaluation is preserved. Without the fix, the test fails on the empty-NodeName assertion.
  • Regression test 2 covers the stale-cache case: the persisted status has TaintStatus=Absent, the cached rule snapshot has a stale TaintStatus=Present, evaluation fails, and the persisted entry must stay Absent.
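The two regression scenarios can be modeled without envtest. This is a simplified sketch, not the tests in this PR: `find`, `upsert`, and `afterFailedEvaluation` are hypothetical helpers, and the `guarded` flag switches between the pre-fix and post-fix behavior on the failure path.

```go
package main

import "fmt"

// NodeEvaluation is a simplified, hypothetical stand-in for the API type.
type NodeEvaluation struct {
	NodeName    string
	TaintStatus string
}

// find returns the entry for node, or the zero value and false if absent.
func find(evals []NodeEvaluation, node string) (NodeEvaluation, bool) {
	for _, e := range evals {
		if e.NodeName == node {
			return e, true
		}
	}
	return NodeEvaluation{}, false
}

// upsert returns a copy of evals with e replaced (matched by NodeName) or appended.
func upsert(evals []NodeEvaluation, e NodeEvaluation) []NodeEvaluation {
	out := append([]NodeEvaluation(nil), evals...)
	for i := range out {
		if out[i].NodeName == e.NodeName {
			out[i] = e
			return out
		}
	}
	return append(out, e)
}

// afterFailedEvaluation models what reaches persisted status when evaluation
// fails. With guarded=false it reproduces the pre-fix behavior of upserting
// whatever the cache holds for the node, including nothing at all.
func afterFailedEvaluation(persisted, cached []NodeEvaluation, node string, guarded bool) []NodeEvaluation {
	if guarded {
		return persisted
	}
	fromCache, _ := find(cached, node) // zero value when the cache has no entry
	return upsert(persisted, fromCache)
}

func main() {
	persisted := []NodeEvaluation{{NodeName: "node-a", TaintStatus: "Absent"}}
	staleCache := []NodeEvaluation{{NodeName: "node-a", TaintStatus: "Present"}}

	fmt.Println("pre-fix, stale cache: ", afterFailedEvaluation(persisted, staleCache, "node-a", false))
	fmt.Println("pre-fix, empty cache: ", afterFailedEvaluation(persisted, nil, "node-a", false))
	fmt.Println("post-fix, stale cache:", afterFailedEvaluation(persisted, staleCache, "node-a", true))
}
```

The empty-cache case corresponds to regression test 1 (an entry with an empty NodeName is appended), and the stale-cache case corresponds to regression test 2 (Absent is overwritten with Present).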

Checklist

  • make test passes
  • make lint passes

Does this PR introduce a user-facing change?

Fix a bug where a transient taint patch failure on a node could drop the corresponding FailedNodes entry from a NodeReadinessRule's status, or overwrite a fresh persisted NodeEvaluation with a stale one from the controller's rule cache.

processNodeAgainstAllRules unconditionally wrote a NodeEvaluation entry
for the node it just processed, even when evaluateRuleForNode returned
an error before updateNodeEvaluationStatus could populate one. The
zero-value NodeEvaluation either clobbered a valid prior entry or
appended one with an empty NodeName. The CRD requires NodeName
MinLength=1, so the API server rejected the whole status patch with
422, and the FailedNodes update bundled into the same patch was lost
along with it.

Skip the upsert when the in-memory rule has no evaluation for this
node, and let the FailedNodes update through on its own. Add a
regression test that fails Patch on the node, asserts FailedNodes is
recorded, and asserts no empty NodeEvaluation slips into status.
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 6, 2026
@netlify

netlify Bot commented May 6, 2026

Deploy Preview for node-readiness-controller canceled.

Name Link
🔨 Latest commit 339a0ac
🔍 Latest deploy log https://app.netlify.com/projects/node-readiness-controller/deploys/69faa9b0465afe00084a07da

@k8s-ci-robot k8s-ci-robot requested a review from ajaysundark May 6, 2026 02:30
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sahitya-chandra
Once this PR has been reviewed and has the lgtm label, please assign ajaysundark for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from haircommander May 6, 2026 02:30
@k8s-ci-robot
Contributor

Hi @sahitya-chandra. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 6, 2026

Development

Successfully merging this pull request may close these issues.

processNodeAgainstAllRules can write an empty NodeEvaluation that fails CRD validation and drops FailedNodes updates

2 participants