[SPARK-7721][INFRA][FOLLOW-UP] Remove cloned coverage repo after posting HTMLs #23729
Conversation
Authored-by: shane knapp <[email protected]> Signed-off-by: shane knapp <[email protected]>
Hey, @shaneknapp, sorry for bugging you about this multiple times :( .. Anyhow, I added one line to remove the cloned repo to be 100% safe. However, I think I need your guidance here.
i don't think this is necessary at all. the build clones the git repo, issues a
Yes .. so I was wondering why
ok, so this is weird... i ran a build (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5473/console) that just failed, and i put an the build failed, which is definitely odd, so i logged in to the machine and manually ran

umm.... wtf? i have never seen behavior like this before, and since it's my day off, i really don't feel like the deep dive into why this is happening. to that end, i ticked the 'clean workspace before build' box, and that got us past the RAT checks. https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5474/console

i'll dive in w/strace later and figure out why the
also, this PR isn't needed. i'd rather keep any filesystem ops in the build config, not the test code itself.
Ah, okie. Let me close this then. Thanks for taking a look - I was wondering if there's any way to cope with it within the code.
Test build #102017 has finished for PR 23729 at commit
## What changes were proposed in this pull request?
When running `FlatMapGroupsInPandasExec` or `AggregateInPandasExec`, the shuffle uses the default number of partitions (200) from `spark.sql.shuffle.partitions`. If the data is small, e.g. in testing, many of the partitions will be empty but are treated just the same.
This PR checks the `mapPartitionsInternal` iterator to be non-empty before calling `ArrowPythonRunner` to start computation on the iterator.
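The general idea can be sketched outside Spark, purely as an illustration: peek at the partition's iterator and only start the (expensive) computation when it yields at least one element. The `process_partition` helper below is hypothetical and is not Spark's actual internal code; it just demonstrates the peek-then-chain pattern.

```python
from itertools import chain

def process_partition(iterator, compute):
    # Peek at the iterator: if the partition is empty, skip starting
    # the expensive computation entirely and return an empty result.
    first = next(iterator, None)
    if first is None:
        return iter([])
    # Re-attach the consumed element and run the real computation.
    return compute(chain([first], iterator))

# Hypothetical usage: only non-empty partitions pay the startup cost.
empty_result = list(process_partition(iter([]), lambda it: (x * 2 for x in it)))
full_result = list(process_partition(iter([1, 2, 3]), lambda it: (x * 2 for x in it)))
```

In Spark's case the `compute` step corresponds to launching the Python worker via `ArrowPythonRunner`, which is exactly the startup cost the change avoids for empty partitions.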
## How was this patch tested?
Existing tests. Ran the following benchmark, a simple example where most partitions are empty:
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
df = spark.createDataFrame(
[(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
("id", "v"))
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
v = pdf.v
return pdf.assign(v=(v - v.mean()) / v.std())
df.groupby("id").apply(normalize).count()
```
**Before**
```
In [4]: %timeit df.groupby("id").apply(normalize).count()
1.58 s ± 62.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 29.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 37.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
**After this Change**
```
In [2]: %timeit df.groupby("id").apply(normalize).count()
646 ms ± 89.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit df.groupby("id").apply(normalize).count()
408 ms ± 84.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit df.groupby("id").apply(normalize).count()
381 ms ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Closes #24926 from BryanCutler/pyspark-pandas_udf-map-agg-skip-empty-parts-SPARK-28128.
Authored-by: Bryan Cutler <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Let me leave this open in case we want it. cc @shaneknapp and @srowen.
Test build #106836 has finished for PR 23729 at commit
this PR is fine, but i really like the approach in #24950 better.
Test build #4808 has finished for PR 23729 at commit
since this was fixed in #24950 (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/6046/console), i would recommend that we close this and not merge the changes in.
Test build #4809 has finished for PR 23729 at commit
I'm OK either way on this one. It seems like good cleanup but isn't essential now.

same.
I don't feel strongly about this PR either. Let me just merge this one to be safe - it shouldn't hurt anything ...
i had to manually clean up all of the workers and remove existing

why for an example, see: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/6047/console

this is the output of the

great! however: i ran this command to clean stuff up:

then i launched a build, and it passed the RAT check:

the coverage command is run AFTER the RAT checks, so we shouldn't have this problem no matter what. why that dir is sticking around is beyond me. i'll keep an eye on this over the course of this week and if necessary futz more w/the build configs to make things behave. :(
## What changes were proposed in this pull request?
This PR proposes to remove the cloned `pyspark-coverage-site` repo. It doesn't look like a problem in the PR builder, but somehow it's problematic in `spark-master-test-sbt-hadoop-2.7`.
## How was this patch tested?
Jenkins.
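The cleanup this PR adds amounts to deleting the cloned working copy once the coverage HTMLs have been published, so nothing lingers in the Jenkins workspace. A minimal shell sketch follows; the directory name `pyspark-coverage-site` is taken from the PR description, and the clone/publish steps are elided since the exact commands aren't shown here.

```shell
#!/bin/sh
# Hypothetical sketch, not the actual build script: publish the
# coverage HTMLs, then remove the cloned repo afterwards.
set -e
REPO_DIR="pyspark-coverage-site"   # name from the PR description

# ... clone the repo and push the generated coverage HTMLs here ...

# Remove the cloned repo so the workspace stays clean for later
# steps (e.g. the RAT license check).
rm -rf "$REPO_DIR"
```

Running the removal unconditionally is safe: `rm -rf` succeeds even when the directory is already absent, so the step cannot fail a build that never created the clone.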