Skip to content

Conversation

@DilumAluthge
Copy link
Member

@DilumAluthge DilumAluthge commented Oct 22, 2021

Follow-up to #42499


This pull request is intended to fix the following test failure, which I am seeing very rarely in the tester_linux64_mt Buildkite job:

Distributed                         (1) |        started at 2021-10-22T10:10:08.433
Test Failed at /cache/build/amdci7-2/julialang/julia-master/julia-609a4a0a1e/share/julia/stdlib/v1.8/Distributed/test/distributed_exec.jl:270
  Expression: remotecall_fetch((k->begin
                haskey(Distributed.PGRP.refs, k)
            end), wid1, rrid) == true
   Evaluated: false == true
ERROR: LoadError: There was an error during testing
in expression starting at /cache/build/amdci7-2/julialang/julia-master/julia-609a4a0a1e/share/julia/stdlib/v1.8/Distributed/test/distributed_exec.jl:263
Distributed                         (1) |         failed at 2021-10-22T10:10:33.207

Full log: https://buildkite.com/julialang/julia-master/builds/5028#dfdf3a45-5510-402b-99d4-9daa860f0f09

@DilumAluthge DilumAluthge added parallelism Parallel or distributed computation test This change adds or pertains to unit tests backport 1.6 Change should be backported to release-1.6 backport 1.7 labels Oct 22, 2021
@DilumAluthge
Copy link
Member Author

I'm not sure if this change actually makes any sense.

Copy link
Member

@tkf tkf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's from multi-threaded tests, I don't think this is unexpected? I think it'd be better to move tester_linux64_mt (or only Distributed's tests) to experimental or skip this test when nthreads()> 1.

I think moving multithreaded tests for Distributed to experimental would be better than skipping it since we can do some post-hoc analysis on how much flaky it has been.

@DilumAluthge
Copy link
Member Author

Hmmm. I definitely don't want to mark the entire tester_linux64_mt job as experimental, since all other test sets pass

I guess we could make a separate job that specifically runs the Distributed job with multithreading enabled, and mark that job as experimental.

@DilumAluthge
Copy link
Member Author

I guess we could make a separate job that specifically runs the Distributed job with multithreading enabled, and mark that job as experimental.

Hmmm. This doesn't fix the case where someone runs the tests locally with e.g. JULIA_NUM_THREADS=16 ./julia test/runtests.jl all.

If this test is known/expected to sometimes fail when Julia was started with multiple threads, I think I would prefer to go with your suggestion of skipping those specific tests if Threads.nthreads > 1.

@DilumAluthge
Copy link
Member Author

Actually, we can do both. We can skip it by default, but have a Buildkite job that doesn't skip it.

@DilumAluthge
Copy link
Member Author

I'll do this in two parts:

The first part (skip these tests when Threads.nthreads() > 1, whether we are on CI or running locally) is here: #42764

I'll make the PR for the second part (experimental Buildkite job that runs the Distributed test suite with Threads.nthreads() > 1) once #42764 is merged.

@DilumAluthge DilumAluthge deleted the dpa/distributed-test-poll-while branch October 25, 2021 03:34
@KristofferC KristofferC removed backport 1.6 Change should be backported to release-1.6 backport 1.7 labels Oct 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parallelism Parallel or distributed computation test This change adds or pertains to unit tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants