Conversation

@cloud-fan
Contributor

@cloud-fan cloud-fan commented Apr 23, 2016

What changes were proposed in this pull request?

This PR adds fromPrimitiveArray and toPrimitiveArray in UnsafeArrayData, so that we can do the conversion much faster in VectorUDT/MatrixUDT.
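As a rough illustration of the idea (not Spark's actual UnsafeArrayData layout, which also tracks nulls and alignment), a bulk primitive-array conversion replaces a per-element loop with a single copy over a flat byte layout. The class and layout below are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: a 4-byte element count followed by the raw element
// data, so both directions are one bulk copy instead of an element-at-a-time
// loop. Spark's real format additionally carries a null-tracking region.
public class PrimitiveArrayCodec {
    public static byte[] fromPrimitiveArray(int[] arr) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 * arr.length)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(arr.length);        // header: element count
        buf.asIntBuffer().put(arr);    // bulk copy of the element region
        return buf.array();
    }

    public static int[] toPrimitiveArray(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        int[] result = new int[buf.getInt()];
        buf.asIntBuffer().get(result); // bulk copy back out
        return result;
    }
}
```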

How was this patch tested?

Existing tests and a new test suite, UnsafeArraySuite.

@SparkQA

SparkQA commented Apr 23, 2016

Test build #56803 has finished for PR 12640 at commit 4ce27ad.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class SpecializedArrayData extends ArrayData
    • class IntArrayData(val values: Array[Int]) extends SpecializedArrayData
    • class DoubleArrayData(val values: Array[Double]) extends SpecializedArrayData

@cloud-fan
Contributor Author

I ran the simple benchmark in JIRA:

sc.parallelize(0 until 1e4.toInt, 1).map { i =>
  (i, Vectors.dense(Array.fill(1e6.toInt)(1.0)))
}.toDF.rdd.count()

The result for master is: 199306 ms
The result after this PR: 69342 ms

It's still much slower than the 1.4 version, but we have used the unsafe format since 1.5, and for this simple benchmark, which does nearly no execution but only serialization, it's reasonable to be slower.

cc @mengxr

@SparkQA

SparkQA commented Apr 23, 2016

Test build #56804 has finished for PR 12640 at commit 55d3178.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class SpecializedArrayData extends ArrayData
    • class IntArrayData(val values: Array[Int]) extends SpecializedArrayData
    • class DoubleArrayData(val values: Array[Double]) extends SpecializedArrayData

@SparkQA

SparkQA commented Apr 24, 2016

Test build #56843 has finished for PR 12640 at commit 13d7eb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class SpecializedArrayData extends ArrayData
    • class IntArrayData(val values: Array[Int]) extends SpecializedArrayData
    • class DoubleArrayData(val values: Array[Double]) extends SpecializedArrayData

@davies
Contributor

davies commented Apr 27, 2016

@cloud-fan Could you also improve the conversion between DoubleArrayData and UnsafeArrayData using memory copy?

@cloud-fan
Contributor Author

@davies, that's a good point!

@mengxr
Contributor

mengxr commented Apr 28, 2016

@cloud-fan This is still much slower than 1.4, and adding more subclasses of ArrayData may prevent the JIT from inlining methods like getInt and getDouble. Is it easy to convert to UnsafeArrayData directly with a memory copy?

Contributor

This could be very expensive for large arrays because it scans all elements, which is unnecessary for generating the hashCode.

@mengxr
Contributor

mengxr commented Apr 28, 2016

Had an offline discussion with @cloud-fan and we will try converting from/to UnsafeArrayData directly using memory copy and test its performance.

@cloud-fan cloud-fan changed the title [SPARK-14850][ML] specialize array data for VectorUDT/MatrixUDT [SPARK-14850][ML] convert primitive array from/to unsafe array directly in VectorUDT/MatrixUDT Apr 28, 2016
return arrayCopy;
}

public int[] toPrimitiveIntArray() {
Contributor Author

I didn't override toIntArray, but created this special method instead. This operation is dangerous: if some elements are null, we won't return 0 but may crash instead. The reason is that we don't write null values; if an element is null, we simply mark it as null in the offset region and skip it. As a result, the data size of an unsafe int array may be less than 4 * numElements, and the memory copy may crash.

Ideally I think we need to improve unsafe array format to handle primitive array better.
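A minimal illustration of the hazard described above, using a hypothetical NullSkippingArray helper (not Spark code): because nulls are only marked in the offset region and never written to the data region, the data region can hold fewer than numElements values, so a blind 4 * numElements copy would read past its end.

```java
// Hypothetical sketch: count how many bytes the data region actually holds
// when null elements are skipped rather than written as 0.
public class NullSkippingArray {
    public static int dataRegionSize(Integer[] elements) {
        int written = 0;
        for (Integer e : elements) {
            if (e != null) written++;  // nulls are skipped, not written
        }
        return 4 * written;            // may be < 4 * elements.length
    }
}
```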

Contributor

It would be hard to tell the difference between toPrimitiveIntArray and toIntArray by name and signature because both return primitive arrays. How about toIntArrayUnchecked? Please add JavaDoc to explain the difference.

@kiszk
Member

kiszk commented Apr 28, 2016

@cloud-fan, @mengxr, it would be worth adding final to the declaration of UnsafeArrayData to encourage method inlining by the JIT compiler, as follows:
public final class UnsafeArrayData extends ArrayData


public int[] toPrimitiveIntArray() {
int[] result = new int[numElements];
Platform.copyMemory(baseObject, baseOffset + 4 + 4 * numElements,
Contributor

@mengxr mengxr Apr 28, 2016

  • 4 * -> 4L * to avoid overflow. Please check other places as well.
  • I don't quite understand the offsetRegionSize. Is it reserved for marking null values or handling variable-length elements in the future? This is quite expensive for primitive arrays. nvm, I saw L364.
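The 4 * vs 4L * point can be seen in isolation (hypothetical OverflowDemo class): with a large element count, the 32-bit multiplication wraps around before the result is widened, while multiplying by a long keeps the byte size correct.

```java
// Illustrates why 4 * numElements must be 4L * numElements:
// int math wraps around for large counts even if the result is stored in a long.
public class OverflowDemo {
    public static long sizeWrong(int numElements) { return 4 * numElements;  } // int math, may wrap
    public static long sizeRight(int numElements) { return 4L * numElements; } // long math
}
```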

@mengxr
Contributor

mengxr commented Apr 28, 2016

@cloud-fan Could you also update the benchmark?

@SparkQA

SparkQA commented Apr 28, 2016

Test build #57230 has finished for PR 12640 at commit f4d2cbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

serialize                                 380 /  392          0.0      379730.0       1.0X
deserialize                               138 /  142          0.0      137816.6       2.8X
*/
benchmark.run()
Contributor Author

@cloud-fan cloud-fan Apr 28, 2016

result on master:

VectorUDT de/serialization:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
serialize                                1414 / 1462          0.0     1414104.1       1.0X
deserialize                               169 /  178          0.0      169323.7       8.4X

Serialization is much faster now, but deserialization isn't; investigating.

Contributor Author

@cloud-fan cloud-fan Apr 28, 2016

I did a micro benchmark: toDoubleArray and the new toDoubleArrayUnchecked don't differ much (the new one is only about 20% faster). Maybe the JVM can optimize a simple while loop?

def toDoubleArray(): Array[Double] = {
  val size = numElements()
  val values = new Array[Double](size)
  var i = 0
  while (i < size) {
    values(i) = getDouble(i)
    i += 1
  }
  values
}

cc @davies
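The two variants under comparison can be sketched in isolation (hypothetical DoubleCopy class, standing in for the UnsafeArrayData-backed implementations): a per-element while loop, as in toDoubleArray above, versus a single bulk copy, as in the unchecked variant. Both produce the same result; the question is only which one the JIT optimizes better.

```java
// Hypothetical side-by-side of the two copy strategies being benchmarked.
public class DoubleCopy {
    public static double[] loopCopy(double[] backing) {
        double[] values = new double[backing.length];
        int i = 0;
        while (i < backing.length) {
            values[i] = backing[i];  // element-at-a-time, like getDouble(i)
            i += 1;
        }
        return values;
    }

    public static double[] bulkCopy(double[] backing) {
        double[] values = new double[backing.length];
        System.arraycopy(backing, 0, values, 0, backing.length); // one memory copy
        return values;
    }
}
```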

Contributor

I think so. Could you run the benchmark with more iterations to make sure the C2 compiler kicks in (especially on Java 8)?

Contributor Author

I reran the benchmark with 5x the iterations, but the result shows no difference.

Contributor

Since we run the test multiple times and pick the best one, that's fine.

@SparkQA

SparkQA commented Apr 28, 2016

Test build #57237 has finished for PR 12640 at commit a7b7694.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class UnsafeArrayData extends ArrayData
    • public final class UnsafeMapData extends MapData

@SparkQA

SparkQA commented Apr 28, 2016

Test build #57241 has finished for PR 12640 at commit d445022.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 28, 2016

Test build #57239 has finished for PR 12640 at commit f6964f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 28, 2016

Test build #57253 has finished for PR 12640 at commit c6c3584.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

var sum = 0
var i = 0
while (i < numRows) {
sum += encoder.toRow(vectors(i)).numFields
Contributor

Can we call VectorUDT.serialize directly instead of encoder.toRow?

Contributor Author

It's different: VectorUDT.serialize only turns the user object into catalyst data, but the real serialization should also include converting the catalyst data into the unsafe format.

@SparkQA

SparkQA commented Apr 29, 2016

Test build #57320 has finished for PR 12640 at commit 537e363.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 30, 2016

Test build #57387 has finished for PR 12640 at commit b10845c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please.

@mengxr
Contributor

mengxr commented Apr 30, 2016

LGTM pending Jenkins

@SparkQA

SparkQA commented Apr 30, 2016

Test build #57399 has finished for PR 12640 at commit b10845c.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please.

@SparkQA

SparkQA commented Apr 30, 2016

Test build #57403 has finished for PR 12640 at commit b10845c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Apr 30, 2016

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in 43b149f Apr 30, 2016

public static UnsafeArrayData fromPrimitiveArray(int[] arr) {
if (arr.length > (Integer.MAX_VALUE - 4) / 8) {
throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " +
Contributor

Include (Integer.MAX_VALUE - 4) / 8 in the message so that the user knows the limit ?
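A sketch of what that suggestion could look like (hypothetical ArrayLimitCheck helper, mirroring the check quoted above): precompute the limit once and include it in the exception message so the user knows the maximum supported length.

```java
// Hypothetical version of the guard with the computed limit in the message.
public class ArrayLimitCheck {
    public static final int MAX_ELEMENTS = (Integer.MAX_VALUE - 4) / 8;

    public static void checkLength(int length) {
        if (length > MAX_ELEMENTS) {
            throw new UnsupportedOperationException(
                "Cannot convert this array to unsafe format as it is too big: "
                + length + " elements, limit is " + MAX_ELEMENTS);
        }
    }
}
```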
