[SPARK-34817][SQL] Read parquet unsigned types that stored as int32 physical type in parquet #31921
```diff
@@ -565,6 +565,10 @@ private void readIntBatch(int rowId, int num, WritableColumnVector column) throw
         canReadAsIntDecimal(column.dataType())) {
       defColumn.readIntegers(
         num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
+    } else if (column.dataType() == DataTypes.LongType) {
+      // We use LongType to handle UINT32
+      defColumn.readIntegersAsUnsigned(
+        num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
     } else if (column.dataType() == DataTypes.ByteType) {
       defColumn.readBytes(
         num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn);
```

> **Contributor:** Shall we add an extra check to make sure we are reading unsigned values?
>
> **Author (Member):** This is deterministic and controlled by ourselves, so the extra check seems unnecessary. See https://github.com/apache/spark/pull/31921/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R137
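To illustrate why the new branch routes UINT32 columns into a LongType vector (a standalone sketch, not Spark code): Java's signed `int` cannot represent unsigned 32-bit values above `Integer.MAX_VALUE`, while `Integer.toUnsignedLong` widens them losslessly.

```java
public class Uint32WideningDemo {
    public static void main(String[] args) {
        // The UINT32 maximum (4294967295) has the same bit pattern as -1 in a signed int.
        int raw = 0xFFFFFFFF;
        // Widening to long recovers the unsigned value, which is why the column
        // vector must be a LongType vector rather than an IntegerType one.
        long widened = Integer.toUnsignedLong(raw);
        System.out.println(raw);     // -1
        System.out.println(widened); // 4294967295
    }
}
```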
```diff
@@ -83,6 +83,15 @@ public final void readIntegers(int total, WritableColumnVector c, int rowId) {
     }
   }

+  @Override
+  public final void readIntegersAsUnsigned(int total, WritableColumnVector c, int rowId) {
+    int requiredBytes = total * 4;
+    ByteBuffer buffer = getBuffer(requiredBytes);
+    for (int i = 0; i < total; i += 1) {
+      c.putLong(rowId + i, Integer.toUnsignedLong(buffer.getInt()));
+    }
+  }
+
   // A fork of `readIntegers` to rebase the date values. For performance reasons, this method
   // iterates the values twice: check if we need to rebase first, then go to the optimized branch
   // if rebase is not needed.
```

> **Author (Member):** Maybe we can improve here by converting the […]
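A self-contained sketch of what `readIntegersAsUnsigned` does (hypothetical standalone names, not the Spark API; Parquet's plain encoding stores INT32 values little-endian): each 4-byte value is read as a signed `int` and widened to a non-negative `long`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class UnsignedReadSketch {
    // Hypothetical standalone analogue of readIntegersAsUnsigned:
    // decode `total` little-endian INT32 values and widen each to a long.
    static long[] readAsUnsigned(ByteBuffer buffer, int total) {
        long[] out = new long[total];
        for (int i = 0; i < total; i += 1) {
            out[i] = Integer.toUnsignedLong(buffer.getInt());
        }
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(-1); // bit pattern of the UINT32 maximum
        buf.putInt(42);
        buf.flip();
        long[] values = readAsUnsigned(buf, 2);
        System.out.println(values[0]); // 4294967295
        System.out.println(values[1]); // 42
    }
}
```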
```diff
@@ -130,13 +130,11 @@ class ParquetToSparkSchemaConverter(
     case INT32 =>
       originalType match {
         case INT_8 => ByteType
-        case INT_16 => ShortType
-        case INT_32 | null => IntegerType
+        case INT_16 | UINT_8 => ShortType
+        case INT_32 | UINT_16 | null => IntegerType
         case DATE => DateType
         case DECIMAL => makeDecimalType(Decimal.MAX_INT_DIGITS)
-        case UINT_8 => typeNotSupported()
-        case UINT_16 => typeNotSupported()
-        case UINT_32 => typeNotSupported()
+        case UINT_32 => LongType
         case TIME_MILLIS => typeNotImplemented()
         case _ => illegalType()
       }
```

> **Member:** These were explicitly made unsupported in #9646, per @liancheng's advice (who's also a Parquet committer), so I'm less sure whether this is something we should support.
>
> **Member:** But that's very old, almost 6 years ago. @liancheng, do you have a different thought now?
>
> **Author (Member):** Thanks, @HyukjinKwon. IMO, for Spark it is worthwhile to support more storage-layer features without breaking our own rules.
>
> **Contributor:** My hunch is that Spark SQL didn't support unsigned integral types at all back then. As long as we support that now, it's OK to have.
>
> **Contributor:** It's mostly about compatibility. Spark won't have unsigned types, but Spark should be able to read existing Parquet files written by other systems that support unsigned types.
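A hedged illustration (standalone code, not the converter itself) of the widening the new mapping relies on: every UINT_8 value fits in a signed `short` and every UINT_16 value fits in a signed `int`, so reading them as the next wider signed Spark type loses no precision.

```java
public class SmallUnsignedWideningDemo {
    public static void main(String[] args) {
        // UINT_8 max (255) stored in a signed byte reads back as -1;
        // Byte.toUnsignedInt recovers 255, which fits in ShortType.
        byte u8 = (byte) 0xFF;
        short asShort = (short) Byte.toUnsignedInt(u8);
        // UINT_16 max (65535) stored in a signed short reads back as -1;
        // Short.toUnsignedInt recovers 65535, which fits in IntegerType.
        short u16 = (short) 0xFFFF;
        int asInt = Short.toUnsignedInt(u16);
        System.out.println(asShort); // 255
        System.out.println(asInt);   // 65535
    }
}
```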