[SPARK-14850][ML] convert primitive array from/to unsafe array directly in VectorUDT/MatrixUDT #12640
Conversation
|
Test build #56803 has finished for PR 12640 at commit
|
|
I ran the simple benchmark from the JIRA. The result for master is: 199306 ms. It's still much slower than the 1.4 version, but we have used the unsafe format since 1.5, and for this simple benchmark that has almost no execution and only serialization, it's reasonable for it to be slower. cc @mengxr |
|
Test build #56804 has finished for PR 12640 at commit
|
|
Test build #56843 has finished for PR 12640 at commit
|
|
@cloud-fan Could you also improve the conversion between DoubleArrayData and UnsafeArrayData using memory copy? |
|
@davies, it's a good point! |
|
@cloud-fan This is still much slower than 1.4, and adding more subclasses of ArrayData may prevent the JIT from inlining methods like
This could be very expensive for large arrays because it scans all elements, which is unnecessary to generate the hashCode.
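One cheaper pattern already used elsewhere in Spark (UnsafeRow.hashCode hashes its backing bytes with Murmur3 in a single pass) could apply here as well. A rough sketch follows; the class, field names, and seed are the editor's assumptions, not this PR's code.

import org.apache.spark.unsafe.hash.Murmur3_x86_32;

// Illustrative sketch only: hash the raw backing bytes once instead of
// visiting every element. Field names mirror UnsafeArrayData but are assumptions.
class ArrayHashCodeSketch {
  Object baseObject;
  long baseOffset;
  int sizeInBytes;

  @Override
  public int hashCode() {
    // Single pass over the bytes; no per-element getInt/getDouble calls.
    return Murmur3_x86_32.hashUnsafeBytes(baseObject, baseOffset, sizeInBytes, 42);
  }
}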
|
Had an offline discussion with @cloud-fan and we will try converting from/to UnsafeArrayData directly using memory copy and test its performance. |
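For context, here is a minimal sketch of that memory-copy idea (an editor's illustration, not the PR's final code), assuming the layout visible in the diff below: a 4-byte element count, a 4-byte offset slot per element, then the raw values.

import org.apache.spark.unsafe.Platform;

// Simplified sketch: build the backing bytes for a non-null double[] with one
// bulk copy for the values region. The real UnsafeArrayData format and its
// null handling are more involved; this only illustrates the copy itself.
class FromPrimitiveArraySketch {
  static byte[] fromDoubleArray(double[] arr) {
    final int headerSize = 4 + 4 * arr.length;            // numElements + offset region
    final byte[] data = new byte[headerSize + 8 * arr.length];

    Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET, arr.length);
    for (int i = 0; i < arr.length; i++) {
      // Offset of element i, relative to the start of the array data.
      Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET + 4 + 4L * i, headerSize + 8 * i);
    }
    // One memory copy for all values instead of a per-element write loop.
    Platform.copyMemory(arr, Platform.DOUBLE_ARRAY_OFFSET,
        data, Platform.BYTE_ARRAY_OFFSET + headerSize, 8L * arr.length);
    return data;
  }
}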
    return arrayCopy;
  }

  public int[] toPrimitiveIntArray() {
I didn't override toIntArray, but created this special method instead. This operation is dangerous: if some elements are null, we won't return 0 for them; we may crash instead. The reason is that we don't write null values; if an element is null, we simply mark it as null in the offset region and skip it. So, for example, the data size of an unsafe int array may be less than 4 * numElements, and the memory copy may read past the end and crash.
Ideally I think we need to improve the unsafe array format to handle primitive arrays better.
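To make the hazard concrete, here is a hedged sketch (editor's illustration; names, fields, and layout are assumptions) contrasting the safe per-element copy with the raw bulk copy described above.

import org.apache.spark.unsafe.Platform;

// Sketch of the two read-back strategies for an int array with the assumed
// layout [numElements][4-byte offset per element][values].
class ToIntArraySketch {
  Object baseObject;
  long baseOffset;
  int numElements;

  // Safe: checks each element's null slot and writes 0 for nulls.
  int[] toIntArraySafe() {
    int[] values = new int[numElements];
    for (int i = 0; i < numElements; i++) {
      values[i] = isNullAt(i) ? 0 : getInt(i);
    }
    return values;
  }

  // Unchecked: one bulk copy of the values region. If any element is null it was
  // never written, so the values region is shorter than 4 * numElements and this
  // copy reads past the end of the array data.
  int[] toIntArrayUnchecked() {
    int[] values = new int[numElements];
    Platform.copyMemory(baseObject, baseOffset + 4 + 4L * numElements,
        values, Platform.INT_ARRAY_OFFSET, 4L * numElements);
    return values;
  }

  // Stubs standing in for the real UnsafeArrayData accessors.
  boolean isNullAt(int i) { return false; }
  int getInt(int i) { return 0; }
}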
It would be hard to tell the difference between toPrimitiveIntArray and toIntArray by name and signature because both return primitive arrays. How about toIntArrayUnchecked? Please add JavaDoc to explain the difference.
|
@cloud-fan, @mengxr, it would be worth adding |
|
|
  public int[] toPrimitiveIntArray() {
    int[] result = new int[numElements];
    Platform.copyMemory(baseObject, baseOffset + 4 + 4 * numElements,
4 * -> 4L * to avoid overflow. Please check other places as well.

I don't quite understand the offsetRegionSize. Is it reserved for marking null values or handling variable-length elements in the future? This is quite expensive for primitive arrays.

nvm, I saw L364.
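A small worked example of the overflow being pointed out (editor's illustration with made-up numbers):

// Why the 4L matters: with int arithmetic, 4 * numElements wraps around before
// the result is widened to long, producing a bogus negative offset.
public class OffsetOverflowSketch {
  public static void main(String[] args) {
    int numElements = 600_000_000;  // hypothetical very large array
    long baseOffset = 16;

    long wrong = baseOffset + 4 + 4 * numElements;   // int multiply overflows: -1894967276
    long right = baseOffset + 4 + 4L * numElements;  // long multiply: 2400000020

    System.out.println(wrong);
    System.out.println(right);
  }
}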
|
@cloud-fan Could you also update the benchmark? |
|
Test build #57230 has finished for PR 12640 at commit
|
  serialize      380 /  392          0.0      379730.0       1.0X
  deserialize    138 /  142          0.0      137816.6       2.8X
  */
  benchmark.run()
result on master:
VectorUDT de/serialization: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
serialize 1414 / 1462 0.0 1414104.1 1.0X
deserialize 169 / 178 0.0 169323.7 8.4X
The serialize path is much faster now, but the deserialize path isn't; investigating.
I did a micro benchmark; toDoubleArray and the new toDoubleArrayUnchecked don't have much of a difference (the new one is only 20% faster). Maybe the JVM can optimize a simple while loop well?
def toDoubleArray(): Array[Double] = {
  val size = numElements()
  val values = new Array[Double](size)
  var i = 0
  while (i < size) {
    values(i) = getDouble(i)
    i += 1
  }
  values
}
cc @davies
I think so. Could you run the benchmark with more iterations to make sure that the C2 compiler can kick in (especially on Java 8)?
I reran the benchmark with 5x the iterations, but the result shows no difference.
Because we run the test multiple times and pick the best one, that's fine.
|
Test build #57237 has finished for PR 12640 at commit
|
|
Test build #57241 has finished for PR 12640 at commit
|
|
Test build #57239 has finished for PR 12640 at commit
|
|
Test build #57253 has finished for PR 12640 at commit
|
  var sum = 0
  var i = 0
  while (i < numRows) {
    sum += encoder.toRow(vectors(i)).numFields
Can we call VectorUDT.serialize directly instead of encoder.toRow?
It's different: VectorUDT.serialize only turns the user object into catalyst data, but the real serialization should also include converting the catalyst data into the unsafe format.
|
Test build #57320 has finished for PR 12640 at commit
|
|
Test build #57387 has finished for PR 12640 at commit
|
|
retest this please. |
|
LGTM pending Jenkins |
|
Test build #57399 has finished for PR 12640 at commit
|
|
retest this please. |
|
Test build #57403 has finished for PR 12640 at commit
|
|
LGTM. Merged into master. Thanks! |
|
|
  public static UnsafeArrayData fromPrimitiveArray(int[] arr) {
    if (arr.length > (Integer.MAX_VALUE - 4) / 8) {
      throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " +
Include (Integer.MAX_VALUE - 4) / 8 in the message so that the user knows the limit?
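A hedged sketch of what the suggested message might look like (the wording is the editor's, not the PR's):

// Illustrative only: surface the actual limit in the error message.
class FromPrimitiveArrayMessageSketch {
  static void checkLength(int[] arr) {
    final int limit = (Integer.MAX_VALUE - 4) / 8;   // max supported number of elements
    if (arr.length > limit) {
      throw new UnsupportedOperationException(
          "Cannot convert this array to unsafe format as it is too big: " +
          arr.length + " elements, the limit is " + limit);
    }
  }
}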
What changes were proposed in this pull request?

This PR adds fromPrimitiveArray and toPrimitiveArray in UnsafeArrayData, so that we can do the conversion much faster in VectorUDT/MatrixUDT.

How was this patch tested?

Existing tests and a new test suite, UnsafeArraySuite.
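As a usage illustration (editor's sketch, assuming the fromPrimitiveArray signature shown in the diff above and the existing toIntArray accessor on ArrayData), the round trip looks roughly like this:

import java.util.Arrays;

import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData;

public class RoundTripSketch {
  public static void main(String[] args) {
    int[] original = {1, 2, 3, 4};
    // Bulk-copy the primitive array into the unsafe format...
    UnsafeArrayData unsafe = UnsafeArrayData.fromPrimitiveArray(original);
    // ...and read it back out as a primitive array.
    int[] back = unsafe.toIntArray();
    System.out.println(Arrays.equals(original, back));  // true
  }
}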