[SPARK-23928][SQL] Add shuffle collection function. #21802
Conversation
cc @pkuwm
python/pyspark/sql/functions.py (outdated):

```python
"""
Collection function: Generates a random permutation of the given array.

.. note:: The function is non-deterministic because its results depends on order of rows which
```
Isn't it non-deterministic rather for the fact that the permutation is determined randomly?
Maybe this one would be better? "The function is non-deterministic because it produces
an unbiased permutation: every permutation is equally likely."
The permutation is determined randomly, but it is fixed for the same query plan if the order of rows is fixed, because the analyzer will assign a random seed to it.
Given the same input sequence, will this function always return the same permutation?
The seed is fixed during the analysis phase, so if we, say, collect() twice or more from the same DataFrame, we will get the same result:

```scala
val df = .. .select(shuffle('arr))
df.collect() == df.collect()
```

but if we create another DataFrame from the same input, we will get different results:

```scala
val df1 = .. .select(shuffle('arr))
val df2 = .. .select(shuffle('arr))
df1.collect() != df2.collect()
```

```scala
    null
  ).toDF("i")

  def checkResult1(): Unit = {
```
Maybe a different name for the method?
Test build #93231 has finished for PR 21802 at commit
Force-pushed 5817265 to 9081e2f.
Test build #93261 has finished for PR 21802 at commit
Jenkins, retest this please.
Test build #93268 has finished for PR 21802 at commit

Test build #93279 has finished for PR 21802 at commit
```scala
override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {
  case p if p.resolved => p
  case p => p transformExpressionsUp {
    case Shuffle(child, None) => Shuffle(child, Some(random.nextLong()))
```
This looks reasonable. Do we have any context about why we started doing this, instead of picking a seed at runtime?
Then can we use a single rule to assign seeds to these randomized functions?
Yeah, in Uuid we want to make sure the same query plan can return the same result. It is more deterministic between retries.
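The retry semantics under discussion can be sketched outside Spark in plain Python. This is illustrative only: `Shuffle` and `resolve_random_seed` below are hypothetical stand-ins for the Catalyst expression and the analyzer rule, which in reality rewrite query-plan trees.

```python
import random

class Shuffle:
    """Hypothetical stand-in for the Catalyst Shuffle expression:
    its seed stays None until 'analysis' assigns one."""
    def __init__(self, seed=None):
        self.seed = seed

    def eval(self, arr):
        # Evaluation is fully determined by the assigned seed.
        out = list(arr)
        random.Random(self.seed).shuffle(out)
        return out

def resolve_random_seed(expr, rng=random):
    # Analyzer-style rule: assign a concrete random seed exactly once,
    # so re-evaluating the same resolved plan is repeatable.
    if expr.seed is None:
        expr.seed = rng.randrange(1 << 63)
    return expr

plan = resolve_random_seed(Shuffle())
# Same resolved plan -> same result on retry.
assert plan.eval([1, 2, 3, 4]) == plan.eval([1, 2, 3, 4])
```

Two independently resolved plans get independent seeds, which mirrors why two DataFrames built from the same input generally shuffle differently.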
```scala
 * Returns a random permutation of the given array.
 *
 * This implementation uses the modern version of Fisher-Yates algorithm.
 * Reference: https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Modern_method
```
is it safe to use Fisher-Yates here? AFAIK we should not change the input value, in case it's used by other expressions and common subexpression elimination is enabled.
I copy the input array before starting the shuffle so as not to change the input value. Isn't it safe?
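The copy-then-shuffle approach can be sketched in Python (a minimal illustration of the safety argument, not the PR's Scala codegen): the defensive copy means the caller's array is never mutated, even if another expression shares it.

```python
import random

def shuffle_copy(arr, seed):
    """Shuffle a *copy* of arr with classic Fisher-Yates;
    the input list is left untouched."""
    out = list(arr)                       # defensive copy
    rnd = random.Random(seed)
    for i in range(len(out) - 1, 0, -1):  # swap-based Fisher-Yates on the copy
        j = rnd.randint(0, i)
        out[i], out[j] = out[j], out[i]
    return out

data = [1, 2, 3, 4]
shuffled = shuffle_copy(data, seed=7)
assert data == [1, 2, 3, 4]          # original unchanged: safe under sharing
assert sorted(shuffled) == data      # output is a permutation of the input
```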
if we create a new array, I guess there should be some simpler algorithms without swapping...
Oh, I see. Let me try.
In the latest commit, the "inside-out" version looks simpler without swapping (using just an assignment).
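The "inside-out" variant can be sketched in Python (illustrative only; the PR's actual implementation is Scala codegen): it fills a fresh output array with one assignment per element instead of swapping within the input.

```python
import random

def shuffle_inside_out(arr, seed):
    """Inside-out Fisher-Yates: builds the shuffled copy directly,
    never touching the input array."""
    rnd = random.Random(seed)
    out = [None] * len(arr)
    for i, x in enumerate(arr):
        j = rnd.randint(0, i)   # inclusive: 0 <= j <= i
        if j != i:
            out[i] = out[j]     # move the previous occupant of slot j up to i
        out[j] = x              # place the new element
    return out
```

Note there is no pre-fill of `out` with the input values and no swap; each step is a single placement, which is what makes it attractive when a new array is being created anyway.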
Do we really need full codegen for all of these collection functions? They seem pretty slow, and specialization with full codegen won't help perf that much (and might even hurt by blowing up the code size), right?
Test build #93373 has finished for PR 21802 at commit
```scala
assert(evaluateWithUnsafeProjection(Shuffle(ai0, seed1)) ===
  evaluateWithUnsafeProjection(Shuffle(ai0, seed1)))

val seed2 = Some(r.nextLong())
```
Do we need to ensure this property (different seeds must generate different results)?
We likely expect this property; however, I think this test is too strict.
I also followed the test for Uuid in MiscExpressionsSuite.scala#L46-L68 here.
@viirya WDYT about this?
I think this is what we expect. The result is decided by the random seed, so if using different random seeds, the results should be different.
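For intuition, the seed-determines-result behavior can be sketched with Python's seeded RNG (illustrative only; Spark uses its own RNG). The same seed always reproduces the same permutation; distinct seeds merely make a collision unlikely (probability 1/n! for a length-n array with distinct elements), which is exactly why a "different seeds must differ" assertion is strict rather than guaranteed.

```python
import random

def seeded_permutation(arr, seed):
    """Shuffle a copy of arr, fully determined by the given seed."""
    out = list(arr)
    random.Random(seed).shuffle(out)
    return out

base = list(range(8))
# Same seed -> identical permutation: the property the retry semantics rely on.
assert seeded_permutation(base, 42) == seeded_permutation(base, 42)
# Different seeds *usually* differ (collision chance 1/8! here),
# but that is probabilistic, not guaranteed.
```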
Test build #93493 has finished for PR 21802 at commit
@rxin Generally I agree with you, but currently the whole-stage-codegen doesn't support
```scala
val isPrimitiveType = CodeGenerator.isPrimitiveType(elementType)

val numElements = ctx.freshName("numElements")
val arrayData = ctx.freshName("arrayData")
```
nit: we don't need the arrayData variable, we can assign ev.value directly.
Actually, we need a new variable to use ctx.createUnsafeArray(), which declares a new variable in it for now, whereas ev.value is already declared.
```scala
case class RandomIndicesGenerator(randomSeed: Long) {
  private val random = new MersenneTwister(randomSeed)

  def getNextIndices(length: Int): Array[Int] = {
```
It should take an Array[Int], to save the array creation.
We need to create an array to store the shuffled indices anyway. If we want to pass in an array to be shuffled, we need to create it and fill it with 0 until n before calling this. But with this implementation, we don't need to fill in the numbers prior to the shuffle, thanks to the "inside-out" version of the Fisher-Yates algorithm. WDYT?
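The no-pre-fill point can be sketched with a hypothetical Python analogue of getNextIndices (the real one is Scala and uses MersenneTwister): running inside-out Fisher-Yates over the identity sequence 0..length-1 means the freshly allocated zero array never needs a separate fill pass.

```python
import random

def next_indices(length, rnd):
    """Return a random permutation of 0..length-1 without pre-filling:
    inside-out Fisher-Yates applied to the identity sequence."""
    idx = [0] * length          # slot 0 already holds the value 0
    for i in range(1, length):  # conceptually "inserting" the value i
        j = rnd.randint(0, i)
        idx[i] = idx[j]         # move the old occupant of slot j up to i
        idx[j] = i              # place the new value i
    return idx
```

The single allocation plus one assignment pair per element replaces the usual "fill 0..n-1, then swap" two-pass scheme.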
LGTM

1 similar comment

LGTM
python/pyspark/sql/functions.py (outdated):

```python
"""
Collection function: Generates a random permutation of the given array.

.. note:: The function is non-deterministic because its results depends on order of rows which
```
Typo: "results depends", found while reading this one.
```scala
       [3, 1, 5, 20]
      > SELECT _FUNC_(array(1, 20, null, 3));
       [20, null, 3, 1]
  """, since = "2.4.0")
```
We could add a note here too.
```scala
 * Returns a random permutation of the given array.
 *
 * @group collection_funcs
 * @since 2.4.0
```
Shall we match the documentation here as well?
```python
.. note:: The function is non-deterministic because its results depends on order of rows which
    may be non-deterministic after a shuffle.

:param col: name of column or expression
```
Python doctest looks missing.
Looks good to me too
Test build #93661 has finished for PR 21802 at commit
Thanks! merging to master.
What changes were proposed in this pull request?
This PR adds a new collection function: shuffle. It generates a random permutation of the given array. This implementation uses the "inside-out" version of Fisher-Yates algorithm.
How was this patch tested?
New tests are added to CollectionExpressionsSuite.scala and DataFrameFunctionsSuite.scala.