Skip to content

🔪 flaky and Zombienet tests#8600

Merged
athei merged 19 commits intomasterfrom
oty-shortcircuit-ci
May 22, 2025
Merged

🔪 flaky and Zombienet tests#8600
athei merged 19 commits intomasterfrom
oty-shortcircuit-ci

Conversation

@ggwpez
Copy link
Copy Markdown
Member

@ggwpez ggwpez commented May 21, 2025

Commenting out all flaky tests and tracking them here #48

Changes:

  • Disable flaky Rust tests by adding a new disabled feature. #[ignore] attribute is not possible since CI runs with --ignored

  • Disable all Zombienet tests

  • Waiting for CI what other tests fail.

Signed-off-by: Oliver Tale-Yazdi <[email protected]>
@ggwpez ggwpez added the T10-tests This PR/Issue is related to tests. label May 21, 2025
ggwpez added 2 commits May 21, 2025 19:39
Signed-off-by: Oliver Tale-Yazdi <[email protected]>
Signed-off-by: Oliver Tale-Yazdi <[email protected]>
@ggwpez ggwpez requested review from a team as code owners May 21, 2025 17:40
ggwpez added 5 commits May 21, 2025 19:42
Signed-off-by: Oliver Tale-Yazdi <[email protected]>
Signed-off-by: Oliver Tale-Yazdi <[email protected]>
Signed-off-by: Oliver Tale-Yazdi <[email protected]>
Signed-off-by: Oliver Tale-Yazdi <[email protected]>
@paritytech-workflow-stopper
Copy link
Copy Markdown

All GitHub workflows were cancelled due to failure one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/15184421552
Failed job name: fmt

Signed-off-by: Oliver Tale-Yazdi <[email protected]>
@pepoviola pepoviola added the R0-no-crate-publish-required The change does not require any crates to be re-published. label May 22, 2025
Signed-off-by: Oliver Tale-Yazdi <[email protected]>
@ggwpez ggwpez changed the title 🔪 flaky tests 🔪 flaky and Zombienet tests May 22, 2025
@ggwpez
Copy link
Copy Markdown
Member Author

ggwpez commented May 22, 2025

Initially this was disabling the flaky ones, why all zombienet tests now? Some of them were passing afaik.

Yes but i kept disabling more and more, so now I disabled all since there is no way to check if they are not flaky without running the CI many times.

Maybe we can go the other way and initially disable all and then enable the non-flaky ones one-by-one.

@bkchr
Copy link
Copy Markdown
Member

bkchr commented May 22, 2025

Maybe we could just disable the zombienet tests in the merge queue?

@athei
Copy link
Copy Markdown
Member

athei commented May 22, 2025

Maybe we could just disable the zombienet tests in the merge queue?

In the PR they are just noise. We always merge with them red because they almost always are. I see no reason to keep them.

@athei athei added this pull request to the merge queue May 22, 2025
Merged via the queue into master with commit e723cfa May 22, 2025
182 checks passed
@athei athei deleted the oty-shortcircuit-ci branch May 22, 2025 23:05
@alvicsam alvicsam mentioned this pull request May 23, 2025
github-merge-queue bot pushed a commit that referenced this pull request May 23, 2025
Since zombienet [has been
disabled](#8600) to
improve stability, it makes no sense to keep the GitLab configuration.
ordian added a commit that referenced this pull request May 27, 2025
* master: (99 commits)
  Snowbridge: Remove asset location check for compatibility (#8473)
  add poke_deposit extrinsic to pallet-bounties (#8382)
  litep2p/peerset: Reject non-reserved peers in the reserved-only mode (#8650)
  Charge deposit based on key length (#8648)
  [pallet-revive] make subscription task panic on error (#8587)
  tx/metrics: Add metrics for the RPC v2 `transactionWatch_v1_submitAndWatch` (#8345)
  Bridges: Fix - Improve try-state for pallet-xcm-bridge-hub (#8615)
  Introduce CreateBare, deprecated CreateInherent (#7597)
  Use hashbrown hashmap/hashset in validation context (#8606)
  ci: rm gitlab config (#8622)
  🔪 flaky and Zombienet tests (#8600)
  cumulus: adjust unincluded segment size metric buckets (#8617)
  Benchmark storage access on block validation (#8069)
  Revert 7934 es/remove tj changes (#8611)
  collator-protocol: add more collation observability (#8230)
  `fatxpool`: add fallback for ready at light (#8533)
  txpool: fix tx removal from unlocks set (#8500)
  XCMP weight metering: account for the MQ page position (#8344)
  fix epmb solution duplicate issue + add remote mining apparatus to epm (#8585)
  Fix generated address returned by Substrate RPC runtime call (#8504)
  ...
@alindima
Copy link
Copy Markdown
Contributor

alindima commented May 28, 2025

I see this PR disabled all zombienet tests, which I think was a bad idea.
They aren't just noise. Certainly not all of them were flaky and most of them were flaky for infrastructure reasons (running on spot instances or undersized hardware), based on my experience of running them locally to check. Us in the Parachains team have been working on porting them to zombienet-sdk and trying to make them more reliable.

I'd suggest to just not make them mandatory for merging the PR. We should still run them but take the results with a grain of salt. There are certainly some tests that are failing almost all the time (and we could disable these altogether maybe), but at least we also had plenty of tests that were failing extremely rarely (or never).
They are valuable even as informational only. Maybe not for everyone, depending on which area of the code you're usually touching

@bkchr
Copy link
Copy Markdown
Member

bkchr commented May 28, 2025

Us in the Parachains team have been working on porting them to zombienet-sdk and trying to make them more reliable.

Porting to zombienet-sdk will not change anything in regards to their flakiness.

They are valuable even as informational only.

No. we already have merged prs in the past with failing tests and then realizing that they actually broke. It is also not really easy too check them etc.

The only proper fix is to make them rock stable. Stuff like spot instances need to be detected by zombienet itself and not counted as an error. But we also had this discussion already multiple times and these tests are still flaky and it doesn't improve.

@alindima
Copy link
Copy Markdown
Contributor

Porting to zombienet-sdk will not change anything in regards to their flakiness.

We don't only do a direct translation. We try to test them thoroughly and investigate their flakyness. zombienet-sdk sometimes enables us to make more precise assertions.

No. we already have merged prs in the past with failing tests and then realizing that they actually broke. It is also not really easy too check them etc.

Right, there's a chance of merging some buggy code. But now we can almost be sure that we'll merge some buggy code

@alexggh
Copy link
Copy Markdown
Contributor

alexggh commented May 28, 2025

The only proper fix is to make them rock stable.

I don't think anyone disagrees here, but are we in a better situation with all of them disabled ?

@pepoviola
Copy link
Copy Markdown
Contributor

pepoviola commented May 28, 2025

Hi, I'm also agree on this

The only proper fix is to make them rock stable.

And we are working to reach that stability, just as update:

  • We are working on re-enable the dashboard (tracking tests and make it easy to visualize them)
  • Split the deployment and test phases to easily categorize (and re-try) tests when the deployment phase fail (related to infra issues).
  • Audit failing test , get the root cause (infra or logic related) and fix them.

Also, in order to get more information of the test but without causing friction we propose to only run them when certain label is present (so people form parachains/node sdk can run them easily) and also make them not required.

Any other feedback/opinion is wellcome.

@athei
Copy link
Copy Markdown
Member

athei commented May 28, 2025

I'd suggest to just not make them mandatory for merging the PR. We should still run them but take the results with a grain of salt.

They were never mandatory. That would have been insane. But you still had to wait for them to finish running. And I am not checking the results and nobody is checking the result of a job that fails > 50% of the time. I just merge just as anybody else. They provide ZERO value.

It makes no sense to have optional tests. Just make them solid or don't add them. We need to do better. But until then they just don't run in a PR.

@alindima
Copy link
Copy Markdown
Contributor

They were never mandatory. That would have been insane. But you still had to wait for them to finish running. And I am not checking the results and nobody is checking the result of a job that fails > 50% of the time. I just merge just as anybody else. They provide ZERO value.

I for one (and I assume others too) got a good memory of which particular tests are flaky (or at least you could check on other PRs which are the most common ones). Definitely not all of them are (especially to the 50% ratio that you mention).
If I'm writing a particularly large PR that adds a lot of code it's useful to have the CI run and for the couple of flaky tests that are reported I'd run them locally to make sure they are passing.
Me and others have been doing this for a while. It's frustrating but certainly better than just not running any of them.

I can understand that this is not valuable to everyone, but for some it is. Having the label is a good compromise for the time being. At least I can test my code and the code I'm reviewing.

@ggwpez
Copy link
Copy Markdown
Member Author

ggwpez commented May 28, 2025

I can understand that this is not valuable to everyone, but for some it is. Having the label is a good compromise for the time being. At least I can test my code and the code I'm reviewing.

Yea lets go with the label for now.

I for one (and I assume others too) got a good memory of which particular tests are flaky (or at least you could check on other PRs which are the most common ones).

But to all other people working on Polkadot SDK it is confusing AF. This is a team project!
New developers come in and think they made a mistake and get frustrated because CI is red.
Disabling all ZN tests is my way of escalating the issue since it has not been taken seriously for the last two years!

@alexggh
Copy link
Copy Markdown
Contributor

alexggh commented May 28, 2025

Disabling all ZN tests is my way of escalating the issue since it has not been taken seriously for the last two years!

You should've made it the other way around make them mandatory :D, so that everyone stops and fix them, but I guess we tried that as well :D.

Anyways, let's go with the label for now, but my instinct tells me this has the reverse impact where they get even lower priority.

github-merge-queue bot pushed a commit that referenced this pull request May 29, 2025
# Description
Zombienet tests are no longer automatically triggered with polkadot-sdk
CI because of their flakiness (see
#8600).
This PR allows to conditionally trigger Zombienet tests if label
'T18-zombienet_tests' is set.
This is to have an option to trigger them anyway until they are
stabilized.
It will help us to monitor the current status of the tests while
stabilizing.

---------

Co-authored-by: Javier Viola <[email protected]>
Co-authored-by: Javier Viola <[email protected]>
@bkontur bkontur mentioned this pull request Jun 9, 2025
3 tasks
pgherveou pushed a commit that referenced this pull request Jun 11, 2025
Commenting out all flaky tests and tracking them here
#48

Changes:
- Disable flaky Rust tests by adding a new disabled feature. `#[ignore]`
attribute is not possible since CI runs with `--ignored`
- Disable all Zombienet tests

- [ ] Waiting for CI what other tests fail.

---------

Signed-off-by: Oliver Tale-Yazdi <[email protected]>
pgherveou pushed a commit that referenced this pull request Jun 11, 2025
Since zombienet [has been
disabled](#8600) to
improve stability, it makes no sense to keep the GitLab configuration.
pgherveou pushed a commit that referenced this pull request Jun 11, 2025
# Description
Zombienet tests are no longer automatically triggered with polkadot-sdk
CI because of their flakiness (see
#8600).
This PR allows to conditionally trigger Zombienet tests if label
'T18-zombienet_tests' is set.
This is to have an option to trigger them anyway until they are
stabilized.
It will help us to monitor the current status of the tests while
stabilizing.

---------

Co-authored-by: Javier Viola <[email protected]>
Co-authored-by: Javier Viola <[email protected]>
alvicsam pushed a commit that referenced this pull request Oct 17, 2025
Commenting out all flaky tests and tracking them here
#48

Changes:
- Disable flaky Rust tests by adding a new disabled feature. `#[ignore]`
attribute is not possible since CI runs with `--ignored`
- Disable all Zombienet tests

- [ ] Waiting for CI what other tests fail.

---------

Signed-off-by: Oliver Tale-Yazdi <[email protected]>
alvicsam added a commit that referenced this pull request Oct 17, 2025
Since zombienet [has been
disabled](#8600) to
improve stability, it makes no sense to keep the GitLab configuration.
alvicsam pushed a commit that referenced this pull request Oct 17, 2025
# Description
Zombienet tests are no longer automatically triggered with polkadot-sdk
CI because of their flakiness (see
#8600).
This PR allows to conditionally trigger Zombienet tests if label
'T18-zombienet_tests' is set.
This is to have an option to trigger them anyway until they are
stabilized.
It will help us to monitor the current status of the tests while
stabilizing.

---------

Co-authored-by: Javier Viola <[email protected]>
Co-authored-by: Javier Viola <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

R0-no-crate-publish-required The change does not require any crates to be re-published. T10-tests This PR/Issue is related to tests.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants