[SPARK-14850][ML] convert primitive array from/to unsafe array directly in VectorUDT/MatrixUDT #12640
Conversation
|
Test build #56803 has finished for PR 12640 at commit
|
|
I ran the simple benchmark from the JIRA. The result for master is: 199306 ms. It's still much slower than the 1.4 version, but we have used the unsafe format since 1.5, and for this simple benchmark that has almost no execution and only serialization, it's reasonable for it to be slower. cc @mengxr |
|
Test build #56804 has finished for PR 12640 at commit
|
|
Test build #56843 has finished for PR 12640 at commit
|
|
@cloud-fan Could you also improve the conversion between DoubleArrayData and UnsafeArrayData using memory copy? |
|
@davies, it's a good point! |
|
@cloud-fan This is still much slower than 1.4, and adding more subclasses of ArrayData may prevent the JIT from inlining methods like
This could be very expensive for large arrays because it scans all elements, which is unnecessary to generate the hashCode.
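One cheaper pattern already used elsewhere in Spark (UnsafeRow.hashCode hashes its backing bytes with Murmur3 in a single pass) could apply here as well. A rough sketch follows; the class, field names, and seed are the editor's assumptions, not this PR's code.

import org.apache.spark.unsafe.hash.Murmur3_x86_32;

// Illustrative sketch only: hash the raw backing bytes once instead of
// visiting every element. Field names mirror UnsafeArrayData but are assumptions.
class ArrayHashCodeSketch {
  Object baseObject;
  long baseOffset;
  int sizeInBytes;

  @Override
  public int hashCode() {
    // Single pass over the bytes; no per-element getInt/getDouble calls.
    return Murmur3_x86_32.hashUnsafeBytes(baseObject, baseOffset, sizeInBytes, 42);
  }
}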
|
Had an offline discussion with @cloud-fan and we will try converting from/to UnsafeArrayData directly using memory copy and test its performance. |
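For context, here is a minimal sketch of that memory-copy idea (an editor's illustration, not the PR's final code), assuming the layout visible in the diff below: a 4-byte element count, a 4-byte offset slot per element, then the raw values.

import org.apache.spark.unsafe.Platform;

// Simplified sketch: build the backing bytes for a non-null double[] with one
// bulk copy for the values region. The real UnsafeArrayData format and its
// null handling are more involved; this only illustrates the copy itself.
class FromPrimitiveArraySketch {
  static byte[] fromDoubleArray(double[] arr) {
    final int headerSize = 4 + 4 * arr.length;            // numElements + offset region
    final byte[] data = new byte[headerSize + 8 * arr.length];

    Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET, arr.length);
    for (int i = 0; i < arr.length; i++) {
      // Offset of element i, relative to the start of the array data.
      Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET + 4 + 4L * i, headerSize + 8 * i);
    }
    // One memory copy for all values instead of a per-element write loop.
    Platform.copyMemory(arr, Platform.DOUBLE_ARRAY_OFFSET,
        data, Platform.BYTE_ARRAY_OFFSET + headerSize, 8L * arr.length);
    return data;
  }
}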
    return arrayCopy;
  }

  public int[] toPrimitiveIntArray() {
I didn't override toIntArray, but created this special method instead. This operation is dangerous: if some elements are null, we won't return 0 for them; we may crash instead. The reason is that we don't write null values; if an element is null, we simply mark it as null in the offset region and skip it. So, for example, the data size of an unsafe int array may be less than 4 * numElements, and the memory copy may read past the end and crash.
Ideally I think we need to improve the unsafe array format to handle primitive arrays better.
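To make the hazard concrete, here is a hedged sketch (editor's illustration; names, fields, and layout are assumptions) contrasting the safe per-element copy with the raw bulk copy described above.

import org.apache.spark.unsafe.Platform;

// Sketch of the two read-back strategies for an int array with the assumed
// layout [numElements][4-byte offset per element][values].
class ToIntArraySketch {
  Object baseObject;
  long baseOffset;
  int numElements;

  // Safe: checks each element's null slot and writes 0 for nulls.
  int[] toIntArraySafe() {
    int[] values = new int[numElements];
    for (int i = 0; i < numElements; i++) {
      values[i] = isNullAt(i) ? 0 : getInt(i);
    }
    return values;
  }

  // Unchecked: one bulk copy of the values region. If any element is null it was
  // never written, so the values region is shorter than 4 * numElements and this
  // copy reads past the end of the array data.
  int[] toIntArrayUnchecked() {
    int[] values = new int[numElements];
    Platform.copyMemory(baseObject, baseOffset + 4 + 4L * numElements,
        values, Platform.INT_ARRAY_OFFSET, 4L * numElements);
    return values;
  }

  // Stubs standing in for the real UnsafeArrayData accessors.
  boolean isNullAt(int i) { return false; }
  int getInt(int i) { return 0; }
}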
It would be hard to tell the difference between toPrimitiveIntArray and toIntArray by name and signature because both return primitive arrays. How about toIntArrayUnchecked? Please add JavaDoc to explain the difference.
|
@cloud-fan, @mengxr, it would be worth adding |
|
|
  public int[] toPrimitiveIntArray() {
    int[] result = new int[numElements];
    Platform.copyMemory(baseObject, baseOffset + 4 + 4 * numElements,
4 * -> 4L * to avoid overflow. Please check other places as well.

I don't quite understand the offsetRegionSize. Is it reserved for marking null values or handling variable-length elements in the future? This is quite expensive for primitive arrays.

nvm, I saw L364.
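A small worked example of the overflow being pointed out (editor's illustration with made-up numbers):

// Why the 4L matters: with int arithmetic, 4 * numElements wraps around before
// the result is widened to long, producing a bogus negative offset.
public class OffsetOverflowSketch {
  public static void main(String[] args) {
    int numElements = 600_000_000;  // hypothetical very large array
    long baseOffset = 16;

    long wrong = baseOffset + 4 + 4 * numElements;   // int multiply overflows: -1894967276
    long right = baseOffset + 4 + 4L * numElements;  // long multiply: 2400000020

    System.out.println(wrong);
    System.out.println(right);
  }
}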
|
@cloud-fan Could you also update the benchmark? |
|
Test build #57230 has finished for PR 12640 at commit
|
  serialize      380 /  392          0.0      379730.0       1.0X
  deserialize    138 /  142          0.0      137816.6       2.8X
  */
  benchmark.run()
result on master:
VectorUDT de/serialization: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
serialize 1414 / 1462 0.0 1414104.1 1.0X
deserialize 169 / 178 0.0 169323.7 8.4X
The serialize path is much faster now, but the deserialize path isn't; investigating.
I did a micro benchmark; toDoubleArray and the new toDoubleArrayUnchecked don't have much of a difference (the new one is only 20% faster). Maybe the JVM can optimize a simple while loop well?
def toDoubleArray(): Array[Double] = {
  val size = numElements()
  val values = new Array[Double](size)
  var i = 0
  while (i < size) {
    values(i) = getDouble(i)
    i += 1
  }
  values
}
cc @davies
I think so. Could you run the benchmark with more iterations to make sure that the C2 compiler can kick in (especially on Java 8)?
I reran the benchmark with 5x the iterations, but the result shows no difference.
Because we run the test multiple times and pick the best one, that's fine.
|
Test build #57237 has finished for PR 12640 at commit
|
|
Test build #57241 has finished for PR 12640 at commit
|
|
Test build #57239 has finished for PR 12640 at commit
|
|
Test build #57253 has finished for PR 12640 at commit
|
  var sum = 0
  var i = 0
  while (i < numRows) {
    sum += encoder.toRow(vectors(i)).numFields
Can we call VectorUDT.serialize directly instead of encoder.toRow?
It's different: VectorUDT.serialize only turns the user object into catalyst data, but the real serialization should also include converting the catalyst data into the unsafe format.
|
Test build #57320 has finished for PR 12640 at commit
|
|
Test build #57387 has finished for PR 12640 at commit
|
|
retest this please. |
|
LGTM pending Jenkins |
|
Test build #57399 has finished for PR 12640 at commit
|
|
retest this please. |
|
Test build #57403 has finished for PR 12640 at commit
|
|
LGTM. Merged into master. Thanks! |
|
|
  public static UnsafeArrayData fromPrimitiveArray(int[] arr) {
    if (arr.length > (Integer.MAX_VALUE - 4) / 8) {
      throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " +
Include (Integer.MAX_VALUE - 4) / 8 in the message so that the user knows the limit?
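A hedged sketch of what the suggested message might look like (the wording is the editor's, not the PR's):

// Illustrative only: surface the actual limit in the error message.
class FromPrimitiveArrayMessageSketch {
  static void checkLength(int[] arr) {
    final int limit = (Integer.MAX_VALUE - 4) / 8;   // max supported number of elements
    if (arr.length > limit) {
      throw new UnsupportedOperationException(
          "Cannot convert this array to unsafe format as it is too big: " +
          arr.length + " elements, the limit is " + limit);
    }
  }
}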
What changes were proposed in this pull request?

This PR adds fromPrimitiveArray and toPrimitiveArray in UnsafeArrayData, so that we can do the conversion much faster in VectorUDT/MatrixUDT.

How was this patch tested?

Existing tests and a new test suite, UnsafeArraySuite.
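As a usage illustration (editor's sketch, assuming the fromPrimitiveArray signature shown in the diff above and the existing toIntArray accessor on ArrayData), the round trip looks roughly like this:

import java.util.Arrays;

import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData;

public class RoundTripSketch {
  public static void main(String[] args) {
    int[] original = {1, 2, 3, 4};
    // Bulk-copy the primitive array into the unsafe format...
    UnsafeArrayData unsafe = UnsafeArrayData.fromPrimitiveArray(original);
    // ...and read it back out as a primitive array.
    int[] back = unsafe.toIntArray();
    System.out.println(Arrays.equals(original, back));  // true
  }
}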