
Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jul 5, 2020

What changes were proposed in this pull request?

This PR aims to reduce the required test resources in WorkerDecommissionExtendedSuite.

Why are the changes needed?

When the Jenkins farm is crowded, the following failure currently happens here:

```
java.util.concurrent.TimeoutException: Can't find 20 executors before 60000 milliseconds elapsed
	at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:326)
	at org.apache.spark.scheduler.WorkerDecommissionExtendedSuite.$anonfun$new$2(WorkerDecommissionExtendedSuite.scala:45)
```
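For context, the timeout comes from the suite waiting for every executor of the local cluster to register before exercising decommission; on a crowded host, 20 executor JVMs may simply not come up within the 60-second budget, so asking for fewer executors is the straightforward fix. Below is a minimal sketch of that wait pattern, not the suite itself: the object name is made up, the `TestUtils.waitUntilExecutorsUp(sc, numExecutors, timeoutMs)` shape is assumed from the stack trace above, and the sketch declares an `org.apache.spark` sub-package because the helper is test-only.

```scala
package org.apache.spark.scheduler

import org.apache.spark.{SparkConf, SparkContext, TestUtils}

// Hedged sketch of the failing pattern: start a small local cluster and wait
// for all of its executors to register before doing any decommission testing.
object WaitForExecutorsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WorkerDecommissionExtendedSuite-sketch")
      // 10 workers, 1 core and 512 MB each: half the JVMs the old config launched.
      .setMaster("local-cluster[10,1,512]")
    val sc = new SparkContext(conf)
    try {
      // With 20 executors this wait exceeded the 60 s budget on a busy Jenkins machine.
      TestUtils.waitUntilExecutorsUp(sc, 10, 60000L)
    } finally {
      sc.stop()
    }
  }
}
```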

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the Jenkins.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-32100][TESTS][FOLLOWUP] Reduce the required test resources [SPARK-32100][CORE][TESTS][FOLLOWUP] Reduce the required test resources Jul 5, 2020
@SparkQA

SparkQA commented Jul 5, 2020

Test build #124962 has finished for PR 29001 at commit 8b29c74.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jul 5, 2020

Test build #124965 has finished for PR 29001 at commit 8b29c74.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jul 5, 2020

Test build #124976 has finished for PR 29001 at commit 8b29c74.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jul 6, 2020

Test build #124980 has finished for PR 29001 at commit 8b29c74.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jul 6, 2020

Test build #124996 has finished for PR 29001 at commit 8b29c74.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Thank you, @HyukjinKwon !

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Jul 6, 2020

Test build #124998 has finished for PR 29001 at commit 8b29c74.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
 private val conf = new org.apache.spark.SparkConf()
   .setAppName(getClass.getName)
-  .set(SPARK_MASTER, "local-cluster[20,1,512]")
+  .set(SPARK_MASTER, "local-cluster[10,1,512]")
```
Member

Any particular reason we need 10 or 20 executors? It's still too many compared to other tests, which usually use only 2 or 3. cc @holdenk

Contributor

So I think this test came from a situation where we were experiencing a deadlock, and we wanted to make sure we re-created the potential deadlock, which happened when we decommissioned most of the executors. Now, this deadlock never made it into OSS Spark, but having the test here to catch it just in case is good. I think we could catch the same deadlock with 5 executors and decommissioning 4 of them, but @dongjoon-hyun is the one who found this potential issue, so I'll let him clarify :)

Member

Thank you for your explanation.

@holdenk
Contributor

holdenk commented Jul 6, 2020

LGTM, thanks for working around the crowded Jenkins env :) I think we can explore reducing it further in a follow-on, but removing causes of test flakiness sooner rather than later is better for everyone working on Spark.

@dongjoon-hyun
Member Author

Thank you, @holdenk and @Ngone51. I'll reduce it to 5 in the community (OSS) version.

HyukjinKwon pushed a commit that referenced this pull request Jul 6, 2020
### What changes were proposed in this pull request?

This PR aims to disable dependency tests(test-dependencies.sh) from Jenkins.

### Why are the changes needed?

- First of all, GitHub Actions already provides the same test capability and is more stable.
- Second, `test-dependencies.sh` currently fails very frequently in the AmpLab Jenkins environment. For example, in the following unrelated PR, it failed 5 times within 6 hours.
   - #29001

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins without `test-dependencies.sh` invocation.

Closes #29004 from dongjoon-hyun/SPARK-32178.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@Ngone51
Member

Ngone51 commented Jul 6, 2020

+1, LGTM

@dongjoon-hyun
Member Author

This is a test-only PR, and I verified it manually.

```
$ build/sbt "core/testOnly *.WorkerDecommissionExtendedSuite"
...
[info] WorkerDecommissionExtendedSuite:
[info] - Worker decommission and executor idle timeout (14 seconds, 679 milliseconds)
[info] - Decommission 4 executors from 5 executors in total (21 seconds, 740 milliseconds)
[info] ScalaTest
[info] Run completed in 38 seconds, 271 milliseconds.
[info] Total number of tests run: 2
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

@dongjoon-hyun
Member Author

Merged to master.

@SparkQA

SparkQA commented Jul 6, 2020

Test build #125010 has finished for PR 29001 at commit 3790334.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
This PR aims to reduce the required test resources in WorkerDecommissionExtendedSuite.

When the Jenkins farm is crowded, the following failure currently happens [here](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2-hive-2.3/890/testReport/junit/org.apache.spark.scheduler/WorkerDecommissionExtendedSuite/Worker_decommission_and_executor_idle_timeout/):
```
java.util.concurrent.TimeoutException: Can't find 20 executors before 60000 milliseconds elapsed
	at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:326)
	at org.apache.spark.scheduler.WorkerDecommissionExtendedSuite.$anonfun$new$2(WorkerDecommissionExtendedSuite.scala:45)
```

No.

Pass the Jenkins.

Closes apache#29001 from dongjoon-hyun/SPARK-32100-2.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>