Skip to content

Conversation

@ueshin
Copy link
Member

@ueshin ueshin commented Jul 18, 2018

What changes were proposed in this pull request?

This PR adds a new collection function: shuffle. It generates a random permutation of the given array. This implementation uses the "inside-out" version of Fisher-Yates algorithm.

How was this patch tested?

New tests are added to CollectionExpressionsSuite.scala and DataFrameFunctionsSuite.scala.

@ueshin
Copy link
Member Author

ueshin commented Jul 18, 2018

cc @pkuwm

"""
Collection function: Generates a random permutation of the given array.
.. note:: The function is non-deterministic because its results depends on order of rows which
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it non-deterministic rather for the fact that the permutation is determined randomly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this one would be better? "The function is non-deterministic because it produces
an unbiased permutation: every permutation is equally likely."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The permutation is determined randomly but it is determined for the same query plan if the order of rows is determined, because the analyzer will assign a random seed for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given a same input sequence, will this function always return the same permutation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The seed is fixed when analysis phase, so if we, say, collect() twice or more from the same DataFrame, we will get the same result:

val df = .. .select(shuffle('arr))
df.collect() == df.collect()

but if we create another DataFrame from the same input, we will get different results:

val df1 = .. .select(shuffle('arr))
val df2 = .. .select(shuffle('arr))
df1.collect() != df2.collect()

null
).toDF("i")

def checkResult1(): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a different name for the method?

@SparkQA
Copy link

SparkQA commented Jul 18, 2018

Test build #93231 has finished for PR 21802 at commit b4cbb55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Shuffle(child: Expression, randomSeed: Option[Long] = None)

@ueshin ueshin force-pushed the issues/SPARK-23928/shuffle branch from 5817265 to 9081e2f Compare July 19, 2018 06:25
@SparkQA
Copy link

SparkQA commented Jul 19, 2018

Test build #93261 has finished for PR 21802 at commit 9081e2f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member Author

ueshin commented Jul 19, 2018

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Jul 19, 2018

Test build #93268 has finished for PR 21802 at commit 9081e2f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 19, 2018

Test build #93279 has finished for PR 21802 at commit f38b698.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member Author

ueshin commented Jul 20, 2018

cc @cloud-fan @gatorsmile

override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {
case p if p.resolved => p
case p => p transformExpressionsUp {
case Shuffle(child, None) => Shuffle(child, Some(random.nextLong()))
Copy link
Contributor

@cloud-fan cloud-fan Jul 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks reasonable, do we have any context about why we start doing this? instead of picking a seed at runtime

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the implementation of Uuid and when we started doing this was at #20861. cc @viirya

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then can we use a single rule to assign seed to these randomized functions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in Uuid we want to make sure the same query plan can return the same result. It is more deterministic between retries.

* Returns a random permutation of the given array.
*
* This implementation uses the modern version of Fisher-Yates algorithm.
* Reference: https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Modern_method
Copy link
Contributor

@cloud-fan cloud-fan Jul 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it safe to use Fisher-Yates here? AFAIK we should not change the input value, in case it's used by other expressions and common subexpression elimination is enabled.

Copy link
Member Author

@ueshin ueshin Jul 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copy the input array before starting shuffle not to change the input value. Isn't it safe?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we create a new array, I guess there should be some simpler algorithms without swapping...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Let me try.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the latest commit, "inside-out" looks simpler without swapping (using just an assignment).

@rxin
Copy link
Contributor

rxin commented Jul 20, 2018

Do we really need full codegen for all of these collection functions? They seem pretty slow and specialization with full codegen won't help perf that much (and might even hurt by blowing up the code size) right?

@SparkQA
Copy link

SparkQA commented Jul 21, 2018

Test build #93373 has finished for PR 21802 at commit 2ca1230.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class RandomIndicesGenerator(randomSeed: Long)

assert(evaluateWithUnsafeProjection(Shuffle(ai0, seed1)) ===
evaluateWithUnsafeProjection(Shuffle(ai0, seed1)))

val seed2 = Some(r.nextLong())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to ensure this property (different seeds must generate different result)?
We likely expect this property. However, I think that this test is too strict.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also followed a test for Uuid MiscExpressionsSuite.scala#L46-L68 here.
@viirya WDYT about this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is what we expect. The result is decided by the random seed. So if using different random seeds, I think the results should be different.

@SparkQA
Copy link

SparkQA commented Jul 24, 2018

Test build #93493 has finished for PR 21802 at commit c56ecc5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

@rxin Generally I agree with you, but currently the whole-stage-codegen doesn't support CodegenFallback, which means, if we don't implement codegen, this expression will stop WSC and hurt perf a lot.

val isPrimitiveType = CodeGenerator.isPrimitiveType(elementType)

val numElements = ctx.freshName("numElements")
val arrayData = ctx.freshName("arrayData")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we don't need the arrayData variable, we can assign ev.value directly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we need a new variable to use ctx.createUnsafeArray() which declares a new variable in it for now whereas ev.value is already declared.

case class RandomIndicesGenerator(randomSeed: Long) {
private val random = new MersenneTwister(randomSeed)

def getNextIndices(length: Int): Array[Int] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should take a Array[Int], to save the array creation.

Copy link
Member Author

@ueshin ueshin Jul 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to create an array to store the shuffled indices anyway. If we want to pass an array to be shuffled, we need to create the array and fill it with 0 until n before we call this. But with this implementation, we don't need to fill the numbers prior to shuffle thanks to the "inside-out" version of Fisher-Yates algorithm. WDYT?

@cloud-fan
Copy link
Contributor

LGTM

1 similar comment
@kiszk
Copy link
Member

kiszk commented Jul 27, 2018

LGTM

"""
Collection function: Generates a random permutation of the given array.
.. note:: The function is non-deterministic because its results depends on order of rows which
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: results depends found while reading this one.

[3, 1, 5, 20]
> SELECT _FUNC_(array(1, 20, null, 3));
[20, null, 3, 1]
""", since = "2.4.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add note here too.

* Returns a random permutation of the given array.
*
* @group collection_funcs
* @since 2.4.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we match the documentation here as well?

.. note:: The function is non-deterministic because its results depends on order of rows which
may be non-deterministic after a shuffle.
:param col: name of column or expression
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python doctest looks missing.

@HyukjinKwon
Copy link
Member

Looks good to me too

@SparkQA
Copy link

SparkQA commented Jul 27, 2018

Test build #93661 has finished for PR 21802 at commit 4135690.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member Author

ueshin commented Jul 27, 2018

Thanks! merging to master.

@asfgit asfgit closed this in ef6c839 Jul 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants