Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to remove the cloned pyspark-coverage-site repo.

It doesn't look like a problem in the PR builder, but somehow it's problematic in spark-master-test-sbt-hadoop-2.7.

How was this patch tested?

Jenkins.

@HyukjinKwon
Member Author

Hey @shaneknapp, sorry for bugging you about this multiple times :( ..

It looks like, at https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/ws/, pyspark-coverage-site is left over for some reason. I thought Jenkins cleaned up the workspace after each test (maybe I misunderstood?). It doesn't seem to be a problem in the PR builder.

Anyhow, I added one line to remove the cloned repo, to be 100% safe. However, I think I need your guidance here.
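
For context, the proposed change boils down to a single cleanup step once the coverage HTMLs have been published - a minimal sketch in shell, assuming the clone lives at the workspace root (the exact file and wording in the PR may differ):

```bash
# Hypothetical sketch of the cleanup this PR proposes, not the literal diff:
# once the coverage HTMLs have been pushed, remove the local clone so a later
# build (and its RAT license check) never sees these files.
rm -rf pyspark-coverage-site
```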

@shaneknapp
Contributor

i don't think this is necessary at all. the build clones the git repo, issues a `git clean -fdx`, and then runs `dev/run-tests`.
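
For reference, the execution step described here amounts to something like the following (a sketch of the Jenkins shell step as described; the real job config is not shown in this thread):

```bash
# Assumed shape of the build's execution step, per the comment above:
git clean -fdx     # remove all untracked and ignored files from the workspace
./dev/run-tests    # then run the full test suite
```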

@HyukjinKwon
Member Author

Yes .. so I was wondering why pyspark-coverage-site is left over in the spark-master-test-sbt-hadoop-2.7 build specifically. The build is failing because the pyspark-coverage-site files are checked for license headers, which happens in the first phase of the build ..

pyspark-coverage-site is cloned almost at the last step, so it shouldn't be there. I suspected the previous build might affect it somehow .. or I am misunderstanding this completely.

@shaneknapp
Contributor

ok, so this is weird... i ran a build (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5473/console) that just failed, and i put an `ls` in the execution step to see what's left after the `git clean -fdx` runs.

the build failed, which is definitely odd, so i logged in to the machine and manually ran `git clean -fdx`... and that didn't work at all:

```
-bash-4.1$ git clean -fdx
Removing R/lib/
Removing R/pkg/man/
Removing dev/sparktestsupport/__init__.pyc
Removing dev/sparktestsupport/modules.pyc
Removing dev/sparktestsupport/shellutils.pyc
Removing dev/sparktestsupport/toposort.pyc
Removing lib/
Removing pyspark-coverage-site/
Removing target/
-bash-4.1$ cd pyspark-coverage-site/
-bash-4.1$ ls
<ALL OF THE PYSPARK COVERAGE FILES>
-bash-4.1$ cd ..
-bash-4.1$ ls -ld pyspark-coverage-site/
drwxr-xr-x. 3 jenkins jenkins 12288 Feb  1 02:31 pyspark-coverage-site/
-bash-4.1$ git clean -fdx
Removing pyspark-coverage-site/
-bash-4.1$ git clean -fdx
Removing pyspark-coverage-site/
-bash-4.1$ chattr -i pyspark-coverage-site/
-bash-4.1$ git clean -fdx
Removing pyspark-coverage-site/
-bash-4.1$ git clean -fdx
Removing pyspark-coverage-site/
-bash-4.1$ cd pyspark-coverage-site/
-bash-4.1$ ls
<ALL OF THE PYSPARK COVERAGE FILES>
```

umm.... wtf? i have never seen behavior like this before, and since it's my day off, i really don't feel like doing a deep dive into why this is happening. to that end, i ticked the 'clean workspace before build' box, and that got us past the RAT checks.

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5474/console

i'll dive in w/strace later and figure out why the `git clean -fdx` didn't do the job.
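
A couple of diagnostics that kind of dive might involve - hypothetical commands, not taken from the actual debugging session:

```bash
# Trace the filesystem syscalls git clean issues (unlink/unlinkat/rmdir) to see
# whether removal of the directory is attempted and silently fails:
strace -f -e trace=unlink,unlinkat,rmdir git clean -fdx 2>&1 | grep coverage

# Check for immutable/append-only attributes (the chattr -i attempt above was
# probing the same idea):
lsattr -d pyspark-coverage-site/

# Check whether the directory is actually a separate mount point:
mount | grep pyspark-coverage-site
```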

@shaneknapp
Contributor

also, this PR isn't needed. i'd rather keep any filesystem ops in the build config, not the test code itself.

@HyukjinKwon
Member Author

Ah, okie. Let me close this then. Thanks for taking a look - I was wondering if there's any way to cope with it within the code.

@HyukjinKwon HyukjinKwon closed this Feb 2, 2019
@SparkQA

SparkQA commented Feb 2, 2019

Test build #102017 has finished for PR 23729 at commit a492263.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon referenced this pull request Jun 24, 2019
## What changes were proposed in this pull request?

When running FlatMapGroupsInPandasExec or AggregateInPandasExec, the shuffle uses the default number of partitions, 200, from "spark.sql.shuffle.partitions". If the data is small, e.g. in testing, many of the partitions will be empty but are treated just the same.

This PR checks that the `mapPartitionsInternal` iterator is non-empty before calling `ArrowPythonRunner` to start computation on the iterator.

## How was this patch tested?

Existing tests. Also ran the following benchmark, a simple example where most partitions are empty:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

df = spark.createDataFrame(
     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
     ("id", "v"))

pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

df.groupby("id").apply(normalize).count()
```

**Before**
```
In [4]: %timeit df.groupby("id").apply(normalize).count()
1.58 s ± 62.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 29.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 37.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

**After this Change**
```
In [2]: %timeit df.groupby("id").apply(normalize).count()
646 ms ± 89.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df.groupby("id").apply(normalize).count()
408 ms ± 84.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit df.groupby("id").apply(normalize).count()
381 ms ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Closes #24926 from BryanCutler/pyspark-pandas_udf-map-agg-skip-empty-parts-SPARK-28128.

Authored-by: Bryan Cutler <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@HyukjinKwon HyukjinKwon reopened this Jun 24, 2019
@HyukjinKwon HyukjinKwon changed the title from [WIP][SPARK-7721][INFRA][FOLLOW-UP] Remove cloned coverage repo after posting HTMLs to [SPARK-7721][INFRA][FOLLOW-UP] Remove cloned coverage repo after posting HTMLs Jun 24, 2019
@HyukjinKwon
Member Author

Let me leave this open in case we want it. cc @shaneknapp and @srowen.

@SparkQA

SparkQA commented Jun 24, 2019

Test build #106836 has finished for PR 23729 at commit 2e4846a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp
Contributor

this PR is fine, but i really like the approach in #24950 better.

@SparkQA

SparkQA commented Jun 24, 2019

Test build #4808 has finished for PR 23729 at commit 2e4846a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp
Contributor

since this was fixed in #24950 (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/6046/console), i would recommend that we close this and not merge the changes in.

```
========================================================================
Running Apache RAT checks
========================================================================
Attempting to fetch rat
RAT checks passed.
```

@SparkQA

SparkQA commented Jun 24, 2019

Test build #4809 has finished for PR 23729 at commit 2e4846a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jun 24, 2019

I'm OK either way on this one. It seems like good cleanup but isn't essential now.

@shaneknapp
Contributor

> I'm OK either way on this one. It seems like good cleanup but isn't essential now.

same.

@HyukjinKwon
Member Author

I don't feel strongly about this PR either. Let me just merge it to be safer - it shouldn't hurt anything in any way ...

@shaneknapp
Contributor

shaneknapp commented Jun 25, 2019

i had to manually clean up all of the workers and remove existing pyspark-coverage-site dirs to get the build green again.

why `git clean -fdx` isn't actually removing that dir is beyond me. :\
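
(One plausible explanation, not confirmed anywhere in this thread: pyspark-coverage-site is itself a git clone, and `git clean` with a single `-f` refuses to delete untracked directories that contain their own `.git`; removing a nested repository requires a double force. A quick check, assuming that is the cause:)

```bash
# git clean with one -f skips untracked nested git repositories; older git
# versions may even print "Removing ..." without actually deleting them.
# The double force is needed to remove a nested clone:
git clean -ffdx
```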

for an example, see: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/6047/console

this is the output of the `git clean -fdx` command:

```
[spark-master-test-sbt-hadoop-2.7] $ /bin/bash /tmp/hudson4664062016256370877.sh
Removing R/lib/
Removing R/pkg/man/
Removing dev/sparktestsupport/__init__.pyc
Removing dev/sparktestsupport/modules.pyc
Removing dev/sparktestsupport/shellutils.pyc
Removing dev/sparktestsupport/toposort.pyc
Removing lib/
Removing pyspark-coverage-site/
Removing target/
```

great! pyspark-coverage-site removed! yay!

however:

```
========================================================================
Running Apache RAT checks
========================================================================
Attempting to fetch rat
Could not find Apache license headers in the following files:
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/index.html
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/jquery.ba-throttle-debounce.min.js
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/jquery.hotkeys.js
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/jquery.isonscreen.js
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/jquery.min.js
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/jquery.tablesorter.min.js
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/pyspark_cloudpickle_py.html
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/pyspark_heapq3_py.html
 !????? /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/pyspark_join_py.html
```

i ran this command to clean stuff up:

```
pssh -h centos_workers.txt -i "rm -rf workspace/spark-master-test-sbt-hadoop-2.7/pyspark-coverage-site/"
```

then i launched a build, and it passed the RAT check:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/6056/

the coverage command is run AFTER the RAT checks, so we shouldn't have this problem no matter what. why that dir is sticking around is beyond me.

i'll keep an eye on this over the course of this week and if necessary futz more w/the build configs to make things behave. :(
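
As a belt-and-braces option on top of the workspace cleanup, the directory could also be excluded from the RAT checks themselves. Spark keeps its RAT exclusions in `dev/.rat-excludes`, so the guard would be a one-line addition (an assumption about a possible mitigation, not something done in this thread):

```bash
# Hypothetical: exclude the coverage site clone from Apache RAT license checks.
echo "pyspark-coverage-site" >> dev/.rat-excludes
```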

@HyukjinKwon HyukjinKwon deleted the followup-coverage branch March 3, 2020 01:19