
Conversation

@JulienPeloton
Contributor

Following [SPARK-26024](https://issues.apache.org/jira/browse/SPARK-26024), I noticed that the number of elements in each partition after repartitioning with `df.repartitionByRange` can vary for the same setup:

```scala
// Assuming a spark-shell session, where a SparkSession named `spark`
// and its implicits are already available.
import scala.util.Random
import org.apache.spark.sql.functions.col

// Shuffle the numbers from 0 to 1000, and make a DataFrame
val df = Random.shuffle(0.to(1000)).toDF("val")

// Repartition it into 3 partitions,
// count the number of elements in each partition, collect the counts,
// and repeat several times
for (i <- 0 to 9) {
  val counts = df.repartitionByRange(3, col("val"))
    .mapPartitions(part => Iterator(part.size))
    .collect()
  println(counts.toList)
}
// -> the number of elements in each partition varies
```

This is expected: for performance reasons, this method uses sampling to estimate the ranges (with a default sample size of 100 per output partition). Hence, the output may not be consistent, since sampling can return different values. But the documentation did not mention it at all, which led to misunderstanding.
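
For completeness, the sample size is exposed as a SQL config and can be inspected or adjusted at runtime. The sketch below reuses the `df` defined above and assumes an active `SparkSession` named `spark`; it is only illustrative and not part of this patch. A larger sample tends to make the estimated range boundaries more stable, although the exact partition sizes are still not guaranteed to be identical across runs.

```scala
// Default sample size per partition used to estimate the range boundaries (100 by default).
println(spark.conf.get("spark.sql.execution.rangeExchange.sampleSizePerPartition"))

// A larger sample gives more stable (but still not deterministic) boundaries,
// at the cost of a slightly more expensive planning step before the shuffle.
spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", "10000")

val stableCounts = df.repartitionByRange(3, col("val"))
  .mapPartitions(part => Iterator(part.size))
  .collect()
println(stableCounts.toList)
```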

What changes were proposed in this pull request?

Update the documentation (Spark & PySpark) to mention the impact of `spark.sql.execution.rangeExchange.sampleSizePerPartition` on the resulting partitioned DataFrame.

* When no explicit sort order is specified, "ascending nulls first" is assumed.
* Note, the rows are not sorted in each partition of the resulting Dataset.
*
* [SPARK-26024] Note that due to performance reasons this method uses sampling to
Member


We can drop [SPARK-26024] here.

Contributor Author


Thanks. Done.

* [SPARK-26024] Note that due to performance reasons this method uses sampling to
* estimate the ranges. Hence, the output may not be consistent, since sampling can return
* different values. The sample size can be controlled by setting the value of the parameter
* {{spark.sql.execution.rangeExchange.sampleSizePerPartition}}.
Member


`spark.sql.execution.rangeExchange.sampleSizePerPartition`.

Contributor Author


Thanks. Done.

At least one partition-by expression must be specified.
When no explicit sort order is specified, "ascending nulls first" is assumed.
[SPARK-26024] Note that due to performance reasons this method uses sampling to
Member


"[SPARK-26024]" can be removed too.

* Note, the rows are not sorted in each partition of the resulting Dataset.
*
*
* [SPARK-26024] Note that due to performance reasons this method uses sampling to
Member


ditto.

@viirya
Member

viirya commented Nov 16, 2018

cc @cloud-fan

@JulienPeloton
Contributor Author

@viirya OK all references to SPARK-26024 removed from the doc.

@cloud-fan
Contributor

ok to test

* Note that due to performance reasons this method uses sampling to estimate the ranges.
* Hence, the output may not be consistent, since sampling can return different values.
* The sample size can be controlled by setting the value of the parameter
* `spark.sql.execution.rangeExchange.sampleSizePerPartition`.
Contributor


It's not a parameter but a config, so I'd like to propose:

The sample size can be controlled by the config `xxx`

Contributor Author


@cloud-fan the sentence has been changed according to your suggestion (in both Spark & PySpark).

@SparkQA

SparkQA commented Nov 19, 2018

Test build #98987 has finished for PR 23025 at commit 654fed9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

At least one partition-by expression must be specified.
When no explicit sort order is specified, "ascending nulls first" is assumed.
Note that due to performance reasons this method uses sampling to estimate the ranges.
Member


Besides Python, we also have repartitionByRange API in R. Can you also update it?

Contributor Author


Oh right, I missed it! Pushed.

@SparkQA

SparkQA commented Nov 19, 2018

Test build #98992 has finished for PR 23025 at commit 7ca4821.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 19, 2018

Test build #98991 has finished for PR 23025 at commit f829dfe.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Nov 19, 2018

retest this please.

@SparkQA

SparkQA commented Nov 19, 2018

Test build #98995 has finished for PR 23025 at commit 7ca4821.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@JulienPeloton
Contributor Author

Thanks all for the reviews!

@asfgit closed this in 35c5516 on Nov 20, 2018
#' using \code{spark.sql.shuffle.partitions} as number of partitions.}
#'}
#'
#' At least one partition-by expression must be specified.
Member


This won't be formatted correctly in the R doc, because an "empty line" is significant there. L769 should be removed to ensure this text ends up in the description.

Member


I see. What about line 761? I see several docs around here with empty lines (829, 831 below); are those different? These comments are secondary, but I guess they belong in the public docs as much as anything.

Member

@felixcheung Nov 28, 2018


761 is also significant, but correct.

Essentially:

  1. The first line of the block is the title (L760).
  2. The text after the first "empty line" is the description (L762).
  3. The text after another "empty line" is the "details" note, which is stashed all the way at the bottom of the doc page.

So generally you want the "important" part of the description on top, not in the "details" section, because it is easily missed.

This is the most common pattern in this code base. There is another, where multiple functions are documented together as a group, e.g. the collection SQL functions (in functions.R). Other, finer control is possible as well but is not used today in this code base.

Similarly, L829 is good; L831 is a bit fuzzy. I'd personally prefer to drop L831 and keep the whole text in the description section of the doc. Generally, if the doc text starts with "Note that", I'm OK with it being in the "details" section.

Member


@felixcheung have a look at #23167

Contributor Author


@felixcheung Thanks, I did not know about this strict doc formatting rule in R.

@srowen Thanks for taking care of the fix!

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#23025 from JulienPeloton/SPARK-26024.

Authored-by: Julien <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>