Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to remove the cloned pyspark-coverage-site repo.

It doesn't look like a problem in the PR builder, but somehow it's problematic in spark-master-test-sbt-hadoop-2.7.

How was this patch tested?

Jenkins.

@HyukjinKwon
Member Author

Hey @shaneknapp, sorry for bugging you about this multiple times :( ..

It looks like, at https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/ws/, pyspark-coverage-site is left over for some reason. I thought Jenkins cleaned up the workspace after each test (maybe I misunderstood?). It doesn't seem to be a problem in the PR builder.

Anyhow, I added one line to remove the cloned repo, to be 100% safe. However, I think I need your guidance here.
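
For context, the proposed change boils down to a single cleanup step once the coverage HTMLs have been published - a minimal sketch in shell, assuming the clone lives at the workspace root (the exact file and wording in the PR may differ):

```bash
# Hypothetical sketch of the cleanup this PR proposes, not the literal diff:
# once the coverage HTMLs have been pushed, remove the local clone so a later
# build (and its RAT license check) never sees these files.
rm -rf pyspark-coverage-site
```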

@shaneknapp
Contributor

i don't think this is necessary at all. the build clones the git repo, issues a `git clean -fdx`, and then runs `dev/run-tests`.
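
For reference, the execution step described here amounts to something like the following (a sketch of the Jenkins shell step as described; the real job config is not shown in this thread):

```bash
# Assumed shape of the build's execution step, per the comment above:
git clean -fdx     # remove all untracked and ignored files from the workspace
./dev/run-tests    # then run the full test suite
```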

@HyukjinKwon
Member Author

Yes .. so I was wondering why pyspark-coverage-site is left over in the spark-master-test-sbt-hadoop-2.7 build specifically. The build is failing because the pyspark-coverage-site files are checked for license headers, which happens in the first phase of the build ..

pyspark-coverage-site is cloned almost at the last step, so it shouldn't be there. I suspected the previous build might affect it somehow .. or I am misunderstanding this completely.

@shaneknapp
Contributor

ok, so this is weird... i ran a build (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5473/console) that just failed, and i put an `ls` in the execution step to see what's left after the `git clean -fdx` runs.

the build failed, which is definitely odd, so i logged in to the machine and manually ran `git clean -fdx`... and that didn't work at all:

```
-bash-4.1$ git clean -fdx
Removing R/lib/
Removing R/pkg/man/
Removing dev/sparktestsupport/__init__.pyc
Removing dev/sparktestsupport/modules.pyc
Removing dev/sparktestsupport/shellutils.pyc
Removing dev/sparktestsupport/toposort.pyc
Removing lib/
Removing pyspark-coverage-site/
Removing target/
-bash-4.1$ cd pyspark-coverage-site/
-bash-4.1$ ls
<ALL OF THE PYSPARK COVERAGE FILES>
-bash-4.1$ cd ..
-bash-4.1$ ls -ld pyspark-coverage-site/
drwxr-xr-x. 3 jenkins jenkins 12288 Feb  1 02:31 pyspark-coverage-site/
-bash-4.1$ git clean -fdx
Removing pyspark-coverage-site/
-bash-4.1$ git clean -fdx
Removing pyspark-coverage-site/
-bash-4.1$ chattr -i pyspark-coverage-site/
-bash-4.1$ git clean -fdx
Removing pyspark-coverage-site/
-bash-4.1$ git clean -fdx
Removing pyspark-coverage-site/
-bash-4.1$ cd pyspark-coverage-site/
-bash-4.1$ ls
<ALL OF THE PYSPARK COVERAGE FILES>
```

umm.... wtf? i have never seen behavior like this before, and since it's my day off, i really don't feel like doing a deep dive into why this is happening. to that end, i ticked the 'clean workspace before build' box, and that got us past the RAT checks.

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5474/console

i'll dive in w/strace later and figure out why the `git clean -fdx` didn't do the job.
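
A couple of diagnostics that kind of dive might involve - hypothetical commands, not taken from the actual debugging session:

```bash
# Trace the filesystem syscalls git clean issues (unlink/unlinkat/rmdir) to see
# whether removal of the directory is attempted and silently fails:
strace -f -e trace=unlink,unlinkat,rmdir git clean -fdx 2>&1 | grep coverage

# Check for immutable/append-only attributes (the chattr -i attempt above was
# probing the same idea):
lsattr -d pyspark-coverage-site/

# Check whether the directory is actually a separate mount point:
mount | grep pyspark-coverage-site
```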

@shaneknapp
Contributor

also, this PR isn't needed. i'd rather keep any filesystem ops in the build config, not the test code itself.

@HyukjinKwon
Member Author

Ah, okie. Let me close this then. Thanks for taking a look - I was wondering if there's any way to cope with it within the code.

@HyukjinKwon HyukjinKwon closed this Feb 2, 2019
@SparkQA

SparkQA commented Feb 2, 2019

Test build #102017 has finished for PR 23729 at commit a492263.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon referenced this pull request Jun 24, 2019
## What changes were proposed in this pull request?

When running FlatMapGroupsInPandasExec or AggregateInPandasExec, the shuffle uses the default number of partitions, 200, from "spark.sql.shuffle.partitions". If the data is small, e.g. in testing, many of the partitions will be empty but are treated just the same.

This PR checks that the `mapPartitionsInternal` iterator is non-empty before calling `ArrowPythonRunner` to start computation on the iterator.

## How was this patch tested?

Existing tests. Also ran the following benchmark, a simple example where most partitions are empty:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

df = spark.createDataFrame(
     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
     ("id", "v"))

pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

df.groupby("id").apply(normalize).count()
```

**Before**
```
In [4]: %timeit df.groupby("id").apply(normalize).count()
1.58 s ± 62.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 29.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 37.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

**After this Change**
```
In [2]: %timeit df.groupby("id").apply(normalize).count()
646 ms ± 89.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df.groupby("id").apply(normalize).count()
408 ms ± 84.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit df.groupby("id").apply(normalize).count()
381 ms ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Closes #24926 from BryanCutler/pyspark-pandas_udf-map-agg-skip-empty-parts-SPARK-28128.

Authored-by: Bryan Cutler <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@HyukjinKwon HyukjinKwon reopened this Jun 24, 2019
@HyukjinKwon HyukjinKwon changed the title from [WIP][SPARK-7721][INFRA][FOLLOW-UP] Remove cloned coverage repo after posting HTMLs to [SPARK-7721][INFRA][FOLLOW-UP] Remove cloned coverage repo after posting HTMLs Jun 24, 2019
@HyukjinKwon
Member Author

Let me leave this open in case we want it. cc @shaneknapp and @srowen.

@SparkQA

SparkQA commented Jun 24, 2019

Test build #106836 has finished for PR 23729 at commit 2e4846a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp
Contributor

this PR is fine, but i really like the approach in #24950 better.

@SparkQA

SparkQA commented Jun 24, 2019

Test build #4808 has finished for PR 23729 at commit 2e4846a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp
Contributor

since this was fixed in #24950 (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/6046/console), i would recommend that we close this and not merge the changes in.

```
========================================================================
Running Apache RAT checks
========================================================================
Attempting to fetch rat
RAT checks passed.
```

@SparkQA

SparkQA commented Jun 24, 2019

Test build #4809 has finished for PR 23729 at commit 2e4846a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jun 24, 2019

I'm OK either way on this one. It seems like good cleanup but isn't essential now.

@shaneknapp
Contributor

> I'm OK either way on this one. It seems like good cleanup but isn't essential now.

same.

@HyukjinKwon
Member Author

I don't feel strongly about this PR either. Let me just merge it to be safer - it shouldn't hurt anything in any way ...

@shaneknapp
Contributor

shaneknapp commented Jun 25, 2019

i had to manually clean up all of the workers and remove existing pyspark-coverage-site dirs to get the build green again.

why `git clean -fdx` isn't actually removing that dir is beyond me. :\
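
(One plausible explanation, not confirmed anywhere in this thread: pyspark-coverage-site is itself a git clone, and `git clean` with a single `-f` refuses to delete untracked directories that contain their own `.git`; removing a nested repository requires a double force. A quick check, assuming that is the cause:)

```bash
# git clean with one -f skips untracked nested git repositories; older git
# versions may even print "Removing ..." without actually deleting them.
# The double force is needed to remove a nested clone:
git clean -ffdx
```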

for an example, see: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/6047/console

this is the output of the `git clean -fdx` command:

```
[spark-master-test-sbt-hadoop-2.7] $ /bin/bash /tmp/hudson4664062016256370877.sh
Removing R/lib/
Removing R/pkg/man/
Removing dev/sparktestsupport/__init__.pyc
Removing dev/sparktestsupport/modules.pyc
Removing dev/sparktestsupport/shellutils.pyc
Removing dev/sparktestsupport/toposort.pyc
Removing lib/
Removing pyspark-coverage-site/
Removing target/
```

great! pyspark-coverage-site removed! yay!

however:

```
========================================================================
Running Apache RAT checks
========================================================================
Attempting to fetch rat
Could not find Apache license headers in the following files:
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/index.html
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/jquery.ba-throttle-debounce.min.js
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/jquery.hotkeys.js
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/jquery.isonscreen.js
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/jquery.min.js
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/jquery.tablesorter.min.js
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/pyspark_cloudpickle_py.html
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/pyspark_heapq3_py.html
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/pyspark_join_py.html
```

i ran this command to clean stuff up:

```
pssh -h centos_workers.txt -i "rm -rf workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/"
```

then i launched a build, and it passed the RAT check:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/6056/

the coverage command is run AFTER the RAT checks, so we shouldn't have this problem no matter what. why that dir is sticking around is beyond me.

i'll keep an eye on this over the course of this week and if necessary futz more w/the build configs to make things behave. :(
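
As a belt-and-braces option on top of the workspace cleanup, the directory could also be excluded from the RAT checks themselves. Spark keeps its RAT exclusions in `dev/.rat-excludes`, so the guard would be a one-line addition (an assumption about a possible mitigation, not something done in this thread):

```bash
# Hypothetical: exclude the coverage site clone from Apache RAT license checks.
echo "pyspark-coverage-site" >> dev/.rat-excludes
```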

@HyukjinKwon HyukjinKwon deleted the followup-coverage branch March 3, 2020 01:19