WIP: fix(e2e): add retry logic to fix flaky tests by jimyag · Pull Request #6249 · kubeovn/kube-ovn

jimyag · 2026-02-02T09:52:18Z

Summary

Add reconnection and retry mechanisms to address two CI instabilities that cause intermittent test failures.

Changes

1. Fix libovsdb cache inconsistent error (test/e2e/framework/ovn_address_set.go)

Modified WaitForAddressSetCondition to check client connection status before each poll iteration
Reconnect automatically when disconnected or on List errors
Return nil error to continue polling instead of failing immediately
Added getOVNNbAddr() helper function to separate address resolution logic
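
The reconnect-and-continue behaviour listed above can be sketched roughly as follows. This is a stdlib-only illustration, not the framework code: the `nbClient` interface, `flakyClient`, `pollOnce`, and `pollWithReconnect` are hypothetical names, and the real change builds on the existing libovsdb client and the Kubernetes `wait` helpers.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// nbClient is a hypothetical stand-in for the OVN NB client used by the tests.
type nbClient interface {
	Connected() bool
	Connect(ctx context.Context) error
	List(ctx context.Context) ([]string, error)
}

// pollOnce runs a single poll iteration. It never returns an error:
// connection or List failures trigger a reconnect and report "not done yet".
func pollOnce(ctx context.Context, c nbClient, cond func([]string) bool) bool {
	if !c.Connected() {
		if err := c.Connect(ctx); err != nil {
			return false // retry on the next tick
		}
	}
	items, err := c.List(ctx)
	if err != nil {
		_ = c.Connect(ctx) // the cache may be inconsistent; reconnect and retry
		return false
	}
	return cond(items)
}

// pollWithReconnect polls until cond holds or ctx expires.
func pollWithReconnect(ctx context.Context, c nbClient, interval time.Duration, cond func([]string) bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if pollOnce(ctx, c, cond) {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// flakyClient fails its first List call and drops the connection (assumed behaviour for the demo).
type flakyClient struct {
	connected       bool
	connects, lists int
}

func (f *flakyClient) Connected() bool { return f.connected }

func (f *flakyClient) Connect(context.Context) error {
	f.connected = true
	f.connects++
	return nil
}

func (f *flakyClient) List(context.Context) ([]string, error) {
	f.lists++
	if f.lists == 1 {
		f.connected = false
		return nil, errors.New("cache inconsistent")
	}
	return []string{"10.0.0.1"}, nil
}

// demoReconnect drives the loop against the flaky client.
func demoReconnect() (connects, lists int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	c := &flakyClient{}
	err = pollWithReconnect(ctx, c, 5*time.Millisecond, func(items []string) bool { return len(items) == 1 })
	return c.connects, c.lists, err
}

func main() {
	connects, lists, err := demoReconnect()
	fmt.Println(connects, lists, err)
}
```

Swallowing the error inside the poll step is what keeps the loop alive across a transient disconnect. The overall timeout still bounds how long a genuinely broken connection can stall the test.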

2. Fix HA test timing issue (test/e2e/ha/ha_test.go)

Added wait.PollUntilContextTimeout with 60s timeout to getDbSidsFromClusterStatus
Retry when cluster hasn't fully converged after scale up/down
Check server count consistency across all pods before returning
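
As a rough illustration of the convergence check described above, the sketch below retries until every pod reports the expected number of servers. It uses only the standard library; `clusterStatus`, `converged`, and `waitForConvergence` are hypothetical names, and the actual test uses wait.PollUntilContextTimeout against real cluster status output.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// clusterStatus maps a pod name to the server IDs that pod reports (hypothetical shape).
type clusterStatus map[string][]string

// converged reports whether every pod sees exactly `replicas` servers.
func converged(status clusterStatus, replicas int) bool {
	if len(status) == 0 {
		return false
	}
	for _, sids := range status {
		if len(sids) != replicas {
			return false
		}
	}
	return true
}

// waitForConvergence retries get until the cluster view is consistent or ctx expires.
func waitForConvergence(ctx context.Context, interval time.Duration, replicas int, get func() (clusterStatus, error)) (clusterStatus, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		// Errors and partial views are treated as "not yet" and retried.
		if status, err := get(); err == nil && converged(status, replicas) {
			return status, nil
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-ticker.C:
		}
	}
}

// demoConvergence simulates a cluster that converges on the third status query.
func demoConvergence() (calls int, err error) {
	full := []string{"a", "b", "c"}
	get := func() (clusterStatus, error) {
		calls++
		switch calls {
		case 1: // pod-0 has not learned about the third server yet
			return clusterStatus{"pod-0": {"a", "b"}, "pod-1": full, "pod-2": full}, nil
		case 2: // transient failure while querying a pod
			return nil, errors.New("exec failed")
		default:
			return clusterStatus{"pod-0": full, "pod-1": full, "pod-2": full}, nil
		}
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_, err = waitForConvergence(ctx, 5*time.Millisecond, 3, get)
	return calls, err
}

func main() {
	calls, err := demoConvergence()
	fmt.Println(calls, err)
}
```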

Root Cause Analysis

libovsdb cache inconsistent: The E2E test framework creates the OVN NB client once and reuses it. When the libovsdb cache becomes inconsistent (due to rapid OVN NB database changes), the client disconnects, and there was no recovery mechanism.
HA test timing: The getDbSidsFromClusterStatus function checked cluster status immediately after scale operations, without waiting for the OVN Raft cluster to fully converge, which caused intermittent failures.

Test Plan

CI passes on this PR
The previously flaky tests (Kube-OVN Conformance E2E, Kube-OVN HA E2E) should be more stable

gemini-code-assist · 2026-02-02T09:52:36Z

Summary of Changes

Hello @jimyag, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the stability of end-to-end tests by integrating resilient retry and reconnection strategies. It targets common sources of test flakiness, such as transient OVN NB client disconnections and race conditions during OVN cluster state convergence, ensuring more reliable test outcomes without altering core application logic.

Highlights

OVN NB Client Reconnection Logic: Implemented robust reconnection and retry mechanisms within WaitForAddressSetCondition in test/e2e/framework/ovn_address_set.go. The OVN NB client now checks its connection status before each poll iteration, automatically reconnecting if disconnected or if a List operation fails, preventing immediate test failures due to inconsistent libovsdb cache.
HA Test Cluster Convergence Retry: Introduced wait.PollUntilContextTimeout with a 60-second timeout in getDbSidsFromClusterStatus in test/e2e/ha/ha_test.go. This change adds retry logic to ensure the OVN Raft cluster has fully converged and server counts are consistent across all pods after scale operations, addressing timing-related flakiness in HA tests.
Refactored OVN NB Address Resolution: Extracted the logic for resolving the OVN NB address into a new helper function getOVNNbAddr() to improve code organization and reusability.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces retry logic to address flakiness in E2E tests, which is a great improvement for CI stability. The changes in test/e2e/framework/ovn_address_set.go correctly add a reconnection mechanism for the libovsdb client within a polling loop, handling both connection and list errors gracefully. Similarly, the modifications in test/e2e/ha/ha_test.go effectively use wait.PollUntilContextTimeout to ensure the OVN cluster has fully converged before proceeding with checks. The logic in both cases is sound. I have one minor suggestion for ha_test.go to improve code clarity. Overall, this is a solid contribution to improving test reliability.

test/e2e/ha/ha_test.go

coveralls · 2026-02-02T09:57:26Z

Pull Request Test Coverage Report for Build 21614972697

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 28 (0.0%) changed or added relevant lines in 1 file are covered.
6 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-0.01%) to 23.029%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/controller/vip.go	0	28	0.0%

Files with Coverage Reduction	New Missed Lines	%
pkg/controller/vip.go	6	0.0%

Totals
Change from base Build 21585158152:	-0.01%
Covered Lines:	12411
Relevant Lines:	53892

💛 - Coveralls

Add reconnection and retry mechanisms to address two CI instabilities:

1. libovsdb cache inconsistent error in WaitForAddressSetCondition:
   - Check client connection status before each poll iteration
   - Reconnect automatically when disconnected or on List errors
   - Return nil error to continue polling instead of failing immediately
2. HA test timing issue in getDbSidsFromClusterStatus:
   - Add wait.PollUntilContextTimeout with 60s timeout
   - Retry when cluster hasn't fully converged after scale up/down
   - Check server count consistency across all pods before returning

Signed-off-by: jimyag <[email protected]>

When a VIP is created by user (via kubectl or test), handleAddVirtualIP is called but it didn't ensure the finalizer was added. The createOrUpdateVipCR function's else branch only updates labels and status without adding finalizer. This caused the flaky test "Test vip subnet status update with finalizer" to fail because the test's WaitToBeReady only waits for Status.V4ip to be set, which happens before the finalizer is added.

Changes:
- Call handleAddOrUpdateVipFinalizer at the end of handleAddVirtualIP for all VIP types to ensure the finalizer is always present
- Add detailed V(3) level logs for debugging finalizer operations
- Skip unnecessary patch when finalizer is already up to date
- Fix error message typo (ovn eip -> vip)
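
The "skip unnecessary patch" step in the commit message above amounts to an idempotent finalizer check. A minimal sketch of that idea follows, assuming finalizers are a plain string slice; `ensureFinalizer` and the finalizer name are hypothetical and do not reflect the controller's actual helper or constant.

```go
package main

import (
	"fmt"
	"slices"
)

// ensureFinalizer returns the finalizer list with name present and reports
// whether anything changed, so the caller can skip the patch when it did not.
func ensureFinalizer(finalizers []string, name string) ([]string, bool) {
	if slices.Contains(finalizers, name) {
		return finalizers, false // already up to date, no patch needed
	}
	// Clone first so the caller's slice is never mutated.
	return append(slices.Clone(finalizers), name), true
}

func main() {
	const name = "example.io/vip-finalizer" // hypothetical finalizer name

	finalizers, changed := ensureFinalizer(nil, name)
	fmt.Println(finalizers, changed)

	finalizers, changed = ensureFinalizer(finalizers, name)
	fmt.Println(finalizers, changed)
}
```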

dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Feb 2, 2026

dosubot bot added the test automation tests label Feb 2, 2026

jimyag force-pushed the fix/flaky-e2e-tests branch from e5c7b7e to c1d8ecf Compare February 2, 2026 09:53

gemini-code-assist bot reviewed Feb 2, 2026

test/e2e/ha/ha_test.go (review thread resolved)

jimyag force-pushed the fix/flaky-e2e-tests branch from 9b913c4 to 1c4bbb3 Compare February 2, 2026 10:10

jimyag added 3 commits February 2, 2026 18:59

fix(e2e): remove unused getOVNNbClient function

74ebcb0

Remove the unused helper function flagged by golangci-lint unused check. Signed-off-by: jimyag <[email protected]>

refactor(e2e): use deploy.Spec.Replicas directly as source of truth

54b1278

Remove intermediate expectedServerCount variable and use int(*deploy.Spec.Replicas) directly for clarity. Signed-off-by: jimyag <[email protected]>

jimyag force-pushed the fix/flaky-e2e-tests branch from eca0fc4 to 54b1278 Compare February 2, 2026 10:59

jimyag changed the title ~~fix(e2e): add retry logic to fix flaky tests~~ WIP: fix(e2e): add retry logic to fix flaky tests Feb 3, 2026

jimyag wants to merge 4 commits into kubeovn:master from jimyag:fix/flaky-e2e-tests
