WIP: fix(e2e): add retry logic to fix flaky tests#6249
WIP: fix(e2e): add retry logic to fix flaky tests#6249jimyag wants to merge 4 commits intokubeovn:masterfrom
Conversation
Summary of ChangesHello @jimyag, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the stability of end-to-end tests by integrating resilient retry and reconnection strategies. It targets common sources of test flakiness, such as transient OVN NB client disconnections and race conditions during OVN cluster state convergence, ensuring more reliable test outcomes without altering core application logic. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
e5c7b7e to
c1d8ecf
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces retry logic to address flakiness in E2E tests, which is a great improvement for CI stability. The changes in test/e2e/framework/ovn_address_set.go correctly add a reconnection mechanism for the libovsdb client within a polling loop, handling both connection and list errors gracefully. Similarly, the modifications in test/e2e/ha/ha_test.go effectively use wait.PollUntilContextTimeout to ensure the OVN cluster has fully converged before proceeding with checks. The logic in both cases is sound. I have one minor suggestion for ha_test.go to improve code clarity. Overall, this is a solid contribution to improving test reliability.
Pull Request Test Coverage Report for Build 21614972697Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
9b913c4 to
1c4bbb3
Compare
Add reconnection and retry mechanisms to address two CI instabilities: 1. libovsdb cache inconsistent error in WaitForAddressSetCondition: - Check client connection status before each poll iteration - Reconnect automatically when disconnected or on List errors - Return nil error to continue polling instead of failing immediately 2. HA test timing issue in getDbSidsFromClusterStatus: - Add wait.PollUntilContextTimeout with 60s timeout - Retry when cluster hasn't fully converged after scale up/down - Check server count consistency across all pods before returning Signed-off-by: jimyag <[email protected]>
Remove the unused helper function flagged by golangci-lint unused check. Signed-off-by: jimyag <[email protected]>
Remove intermediate expectedServerCount variable and use int(*deploy.Spec.Replicas) directly for clarity. Signed-off-by: jimyag <[email protected]>
eca0fc4 to
54b1278
Compare
When a VIP is created by user (via kubectl or test), handleAddVirtualIP is called but it didn't ensure the finalizer was added. The createOrUpdateVipCR function's else branch only updates labels and status without adding finalizer. This caused the flaky test "Test vip subnet status update with finalizer" to fail because the test's WaitToBeReady only waits for Status.V4ip to be set, which happens before the finalizer is added. Changes: - Call handleAddOrUpdateVipFinalizer at the end of handleAddVirtualIP for all VIP types to ensure the finalizer is always present - Add detailed V(3) level logs for debugging finalizer operations - Skip unnecessary patch when finalizer is already up to date - Fix error message typo (ovn eip -> vip)
Summary
Add reconnection and retry mechanisms to address two CI instabilities that cause intermittent test failures.
Changes
1. Fix libovsdb cache inconsistent error (
test/e2e/framework/ovn_address_set.go)WaitForAddressSetConditionto check client connection status before each poll iterationListerrorsnilerror to continue polling instead of failing immediatelygetOVNNbAddr()helper function to separate address resolution logic2. Fix HA test timing issue (
test/e2e/ha/ha_test.go)wait.PollUntilContextTimeoutwith 60s timeout togetDbSidsFromClusterStatusRoot Cause Analysis
libovsdb cache inconsistent: The E2E test framework creates OVN NB client once and reuses it. When the libovsdb cache becomes inconsistent (due to rapid OVN NB database changes), the client disconnects but there was no recovery mechanism.
HA test timing: The
getDbSidsFromClusterStatusfunction immediately checks cluster status after scale operations without waiting for the OVN Raft cluster to fully converge, causing intermittent failures.Test Plan
Kube-OVN Conformance E2E,Kube-OVN HA E2E) should be more stable