WIP: fix(e2e): add retry logic to fix flaky tests#6249

Open
jimyag wants to merge 4 commits intokubeovn:masterfrom
jimyag:fix/flaky-e2e-tests

@jimyag jimyag commented Feb 2, 2026

Summary

Add reconnection and retry mechanisms to address two CI instabilities that cause intermittent test failures.

Changes

1. Fix libovsdb "cache inconsistent" error (test/e2e/framework/ovn_address_set.go)

  • Modified WaitForAddressSetCondition to check client connection status before each poll iteration
  • Reconnect automatically when disconnected or on List errors
  • Return nil error to continue polling instead of failing immediately
  • Added getOVNNbAddr() helper function to separate address resolution logic

2. Fix HA test timing issue (test/e2e/ha/ha_test.go)

  • Added wait.PollUntilContextTimeout with 60s timeout to getDbSidsFromClusterStatus
  • Retry when cluster hasn't fully converged after scale up/down
  • Check server count consistency across all pods before returning

Root Cause Analysis

  1. libovsdb cache inconsistent: The E2E test framework creates the OVN NB client once and reuses it. When the libovsdb cache becomes inconsistent (due to rapid OVN NB database changes), the client disconnects, and there was no recovery mechanism.

  2. HA test timing: The getDbSidsFromClusterStatus function checks cluster status immediately after scale operations, without waiting for the OVN Raft cluster to fully converge, which causes intermittent failures.

Test Plan

  • CI passes on this PR
  • The previously flaky tests (Kube-OVN Conformance E2E, Kube-OVN HA E2E) should be more stable

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Feb 2, 2026
@gemini-code-assist

Summary of Changes

Hello @jimyag, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the stability of end-to-end tests by integrating resilient retry and reconnection strategies. It targets common sources of test flakiness, such as transient OVN NB client disconnections and race conditions during OVN cluster state convergence, ensuring more reliable test outcomes without altering core application logic.

Highlights

  • OVN NB Client Reconnection Logic: Implemented robust reconnection and retry mechanisms within WaitForAddressSetCondition in test/e2e/framework/ovn_address_set.go. The OVN NB client now checks its connection status before each poll iteration, automatically reconnecting if disconnected or if a List operation fails, preventing immediate test failures due to inconsistent libovsdb cache.
  • HA Test Cluster Convergence Retry: Introduced wait.PollUntilContextTimeout with a 60-second timeout in getDbSidsFromClusterStatus in test/e2e/ha/ha_test.go. This change adds retry logic to ensure the OVN Raft cluster has fully converged and server counts are consistent across all pods after scale operations, addressing timing-related flakiness in HA tests.
  • Refactored OVN NB Address Resolution: Extracted the logic for resolving the OVN NB address into a new helper function getOVNNbAddr() to improve code organization and reusability.

@dosubot dosubot bot added the test automation tests label Feb 2, 2026
@jimyag jimyag force-pushed the fix/flaky-e2e-tests branch from e5c7b7e to c1d8ecf Compare February 2, 2026 09:53

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces retry logic to address flakiness in E2E tests, which is a great improvement for CI stability. The changes in test/e2e/framework/ovn_address_set.go correctly add a reconnection mechanism for the libovsdb client within a polling loop, handling both connection and list errors gracefully. Similarly, the modifications in test/e2e/ha/ha_test.go effectively use wait.PollUntilContextTimeout to ensure the OVN cluster has fully converged before proceeding with checks. The logic in both cases is sound. I have one minor suggestion for ha_test.go to improve code clarity. Overall, this is a solid contribution to improving test reliability.

coveralls commented Feb 2, 2026

Pull Request Test Coverage Report for Build 21614972697

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 28 (0.0%) changed or added relevant lines in 1 file are covered.
  • 6 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.01%) to 23.029%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/controller/vip.go | 0 | 28 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controller/vip.go | 6 | 0.0% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 21585158152: | -0.01% |
| Covered Lines: | 12411 |
| Relevant Lines: | 53892 |

💛 - Coveralls

@jimyag jimyag force-pushed the fix/flaky-e2e-tests branch from 9b913c4 to 1c4bbb3 Compare February 2, 2026 10:10
Add reconnection and retry mechanisms to address two CI instabilities:

1. libovsdb cache inconsistent error in WaitForAddressSetCondition:
   - Check client connection status before each poll iteration
   - Reconnect automatically when disconnected or on List errors
   - Return nil error to continue polling instead of failing immediately

2. HA test timing issue in getDbSidsFromClusterStatus:
   - Add wait.PollUntilContextTimeout with 60s timeout
   - Retry when cluster hasn't fully converged after scale up/down
   - Check server count consistency across all pods before returning

Signed-off-by: jimyag <[email protected]>

Remove the unused helper function flagged by the golangci-lint unused check.

Signed-off-by: jimyag <[email protected]>

Remove the intermediate expectedServerCount variable and use
int(*deploy.Spec.Replicas) directly for clarity.

Signed-off-by: jimyag <[email protected]>

@jimyag jimyag force-pushed the fix/flaky-e2e-tests branch from eca0fc4 to 54b1278 Compare February 2, 2026 10:59
When a VIP is created by a user (via kubectl or a test), handleAddVirtualIP
is called, but it did not ensure that the finalizer was added. The else
branch of createOrUpdateVipCR only updates labels and status and does not
add the finalizer.

This caused the flaky test "Test vip subnet status update with finalizer"
to fail, because the test's WaitToBeReady only waits for Status.V4ip to be
set, which happens before the finalizer is added.

Changes:
- Call handleAddOrUpdateVipFinalizer at the end of handleAddVirtualIP
  for all VIP types to ensure the finalizer is always present
- Add detailed V(3) level logs for debugging finalizer operations
- Skip the unnecessary patch when the finalizer is already up to date
- Fix error message typo (ovn eip -> vip)
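
For illustration only, the "skip the patch when the finalizer is already up to date" idea can be sketched as a small pure function. `ensureFinalizer` and the finalizer name below are hypothetical and are not taken from the kube-ovn code; the caller would issue a patch only when the second return value is true.

```go
package main

import (
	"fmt"
	"slices"
)

// ensureFinalizer returns a finalizer list that contains name, plus a flag
// telling the caller whether anything changed (and so whether a patch is needed).
func ensureFinalizer(finalizers []string, name string) ([]string, bool) {
	if slices.Contains(finalizers, name) {
		return finalizers, false // already up to date: no patch required
	}
	// Clone first so the caller's slice is never modified in place.
	return append(slices.Clone(finalizers), name), true
}

func main() {
	const fin = "example.com/vip-finalizer" // hypothetical finalizer name

	fs, changed := ensureFinalizer(nil, fin)
	fmt.Println(fs, changed)

	fs, changed = ensureFinalizer(fs, fin)
	fmt.Println(fs, changed)
}
```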
@jimyag jimyag changed the title fix(e2e): add retry logic to fix flaky tests WIP: fix(e2e): add retry logic to fix flaky tests Feb 3, 2026