@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Aug 11, 2025

What changes were proposed in this pull request?

This PR adds a createArray method to SparkCollectionUtils that uses java.util.Arrays.fill(), which is faster than Scala's Array.fill.
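For readers who want a concrete picture, here is a minimal sketch of how such a helper could delegate to java.util.Arrays.fill. It is not the code merged in this PR: the object name FillSketch, the set of handled element types, and the fallback loop are all assumptions made for illustration.

```scala
import java.util.Arrays
import scala.reflect.ClassTag

// Hypothetical illustration only; not the actual SparkCollectionUtils code.
object FillSketch {
  def createArray[T: ClassTag](size: Int, value: T): Array[T] = {
    // The ClassTag lets us allocate a properly typed (possibly primitive) array.
    val arr = new Array[T](size)
    // JVM arrays keep their element type at runtime, so we can dispatch
    // to the matching Arrays.fill overload.
    (arr: Any) match {
      case a: Array[Byte]   => Arrays.fill(a, value.asInstanceOf[Byte])
      case a: Array[Int]    => Arrays.fill(a, value.asInstanceOf[Int])
      case a: Array[Long]   => Arrays.fill(a, value.asInstanceOf[Long])
      case a: Array[Double] => Arrays.fill(a, value.asInstanceOf[Double])
      case a: Array[AnyRef] => Arrays.fill(a, value.asInstanceOf[AnyRef])
      case _ =>
        // Remaining primitive types: plain loop (an assumption of this sketch).
        var i = 0
        while (i < size) { arr(i) = value; i += 1 }
    }
    arr
  }
}
```

Unlike Array.fill, whose element argument is by-name and re-evaluated for every slot, this sketch takes the value once, which is what allows a bulk fill.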

Apache Spark calls Array.fill() in many places.

$ git grep Array.fill | wc -l
     530

As the following example shows, the new method is much faster.

val df = spark.range(1, 1024, 1, 1).map { _ =>
  val byteData = Array.fill[Byte](5 * 1024 * 1024)('X')
  byteData
}.toDF()

scala> spark.time((1 until 1024).map(Array.fill[Byte](5 * 1024 * 1024)('X')).size)
Time taken: 15 ms
val res0: Int = 1023

scala> spark.time((1 until 1024).map(org.apache.spark.util.SparkCollectionUtils.createArray[Byte](5 * 1024 * 1024, 'X')).size)
Time taken: 0 ms
val res1: Int = 1023
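One detail when reading the two spark.time calls above: map receives an array value, not a function literal. As far as I can tell, Scala then wraps the array and uses it as an index lookup, so the array is built once rather than once per element. The sketch below uses a hypothetical MapArgSketch object and a small array size to count how often the fill expression runs in each form.

```scala
// Hypothetical illustration; names and sizes are assumptions.
object MapArgSketch {
  // Number of times build() has been called since the last reset.
  var builds = 0

  def build(n: Int): Array[Byte] = {
    builds += 1
    Array.fill[Byte](n)(7)
  }

  // The array is built once, then used as an Int => Byte lookup.
  def onceThenIndex(): Int = {
    builds = 0
    (1 until 16).map(build(32)).size
    builds
  }

  // A function literal builds a fresh array for every element of the range.
  def perElement(): Int = {
    builds = 0
    (1 until 16).map(_ => build(32)).size
    builds
  }
}
```

To measure per-element allocation, the function-literal form (map(_ => ...)) is the one to use.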

Why are the changes needed?

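To provide a faster way to create pre-filled arrays. For example, filling a 2,000,000,000-element byte array takes roughly half the time: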

$ bin/spark-shell --driver-memory 12G
...
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.1.0-SNAPSHOT
      /_/

Using Scala version 2.13.16 (OpenJDK 64-Bit Server VM, Java 21.0.8)
...
scala> spark.time(Array.fill[Byte](2_000_000_000)(7))
Time taken: 387 ms

scala> spark.time(org.apache.spark.util.SparkCollectionUtils.createArray[Byte](2_000_000_000, 7))
Time taken: 190 ms
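The comparison above can also be tried without a Spark shell. The following is a rough sketch using System.nanoTime with a smaller array; the object name FillTimingSketch, the helper names, and the chosen size are assumptions. A single un-warmed run like this is only indicative, not a rigorous benchmark.

```scala
import java.util.Arrays

// Hypothetical standalone timing sketch; not part of the PR.
object FillTimingSketch {
  // Evaluates body once and returns (result, elapsed milliseconds).
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // Scala's Array.fill takes a by-name element, evaluated once per slot.
  def scalaFill(n: Int, v: Byte): Array[Byte] = Array.fill[Byte](n)(v)

  // Allocate, then bulk-fill with java.util.Arrays.fill.
  def javaFill(n: Int, v: Byte): Array[Byte] = {
    val arr = new Array[Byte](n)
    Arrays.fill(arr, v)
    arr
  }

  def main(args: Array[String]): Unit = {
    val n = 64 * 1024 * 1024 // 64 MiB per array
    val (a, scalaMs) = timed(scalaFill(n, 7))
    val (b, javaMs) = timed(javaFill(n, 7))
    println(s"Array.fill: $scalaMs ms, Arrays.fill: $javaMs ms, same contents: ${Arrays.equals(a, b)}")
  }
}
```

Timings will vary with JVM version, heap settings, and warm-up, so the numbers it prints should not be expected to match the ones reported above.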

Does this PR introduce any user-facing change?

No, this is a new utility method.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review August 11, 2025 05:23
Contributor

@peter-toth peter-toth left a comment

Looks good.

@dongjoon-hyun
Member Author

Thank you for the review and approval, @peter-toth!

Merged to master for Apache Spark 4.1.0.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-53241 branch August 11, 2025 13:56