[SPARK-25104][SQL] Avro: Validate user specified output schema #22094
Conversation
Force-pushed from 065229c to c8e98b1
Test build #94694 has finished for PR 22094 at commit
LGTM.
  case (FloatType, FLOAT) =>
    (getter, ordinal) => getter.getFloat(ordinal)
- case DoubleType =>
+ case (DoubleType, DOUBLE) =>
Do we want to allow users to do casting up from catalystType to avroType? For example, catalystType float to avroType double. If so, this can be done in a different PR.
Personally I would like to keep it simple, as this PR proposes.
If data type casting is needed, users can always do it in the DataFrame before writing Avro files.
But if the casting is important, we can work on it.
Yeah, if someone feels it's important, let's do it in a different PR.
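For reference, the workaround mentioned above (casting in the DataFrame before writing) could look roughly like the sketch below. It assumes a DataFrame with a FloatType column named "price"; the column name and paths are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val spark = SparkSession.builder().appName("AvroCastExample").getOrCreate()

// Hypothetical input with a FloatType column "price".
val df = spark.read.parquet("/tmp/input")

// Cast float -> double up front so the Catalyst schema matches an Avro
// schema that declares the field as "double".
val casted = df.withColumn("price", col("price").cast(DoubleType))

casted.write.format("avro").save("/tmp/output")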
  (NullType, NULL),
  (BooleanType, BOOLEAN),
  (ByteType, INT),
  (IntegerType, INT),
Could you add (ShortType, INT), too?
  (DoubleType, DOUBLE),
  (BinaryType, BYTES),
  (DateType, INT),
  (TimestampType, LONG)
If the intention is to be exhaustive, what about decimal types? And primitive to complex, and vice versa?
@dongjoon-hyun Thanks, I have updated the test.
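A validation test along these lines could look roughly like the sketch below; this is a hedged reconstruction, not the PR's actual test code. It assumes the built-in Avro data source and its avroSchema writer option; the path is made up:

import scala.util.Try
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AvroSchemaValidation").getOrCreate()
import spark.implicits._

// A non-nullable IntegerType column.
val df = Seq(1, 2, 3).toDF("id")

// User-specified Avro schema whose field type ("string") does not match
// the Catalyst type (int); the write should be rejected by validation.
val mismatched =
  """{"type": "record", "name": "topLevelRecord",
    | "fields": [{"name": "id", "type": "string"}]}""".stripMargin

val result = Try(df.write.format("avro").option("avroSchema", mismatched).save("/tmp/out"))
assert(result.isFailure)  // incompatible schema is rejected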
  private def resolveNullableType(avroType: Schema, nullable: Boolean): Schema = {
-   if (nullable) {
+   if (nullable && avroType.getType != NULL) {
This fixes a trivial bug when avroType is the NULL type.
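For context, this helper resolves nullability against the user-specified Avro schema: a nullable Catalyst field is expected to map to an Avro union like ["double", "null"], and the helper extracts the actual branch. Below is a condensed sketch of the fixed logic, as a hedged reconstruction rather than the PR's exact code:

import scala.collection.JavaConverters._
import org.apache.avro.Schema
import org.apache.avro.Schema.Type.NULL

// Without the "!= NULL" guard, a Catalyst NullType field (a plain Avro
// "null", not a union) would hit the union-handling branch and fail.
def resolveNullableType(avroType: Schema, nullable: Boolean): Schema = {
  if (nullable && avroType.getType != NULL) {
    // A nullable field is represented as a union; pick the non-null branch.
    val actualType = avroType.getTypes.asScala.filter(_.getType != NULL)
    assert(actualType.length == 1)
    actualType.head
  } else {
    avroType
  }
}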
Test build #94720 has finished for PR 22094 at commit
Merged into master. Thanks.
What changes were proposed in this pull request?
With the code changes in #21847, Spark can write out Avro files as per a user-provided output schema.
To make this more robust and user-friendly, we should validate the Avro schema before tasks are launched.
We should also support outputting the logical decimal type as BYTES (by default we output it as FIXED).
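For example, with a user-supplied output schema the decimal representation can be chosen at write time. A rough sketch assuming the avroSchema writer option; the column name, precision, and path are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AvroDecimalBytes").getOrCreate()
import spark.implicits._

// A nullable DecimalType(4, 2) column.
val df = Seq("12.34").toDF("amount")
  .select($"amount".cast("decimal(4,2)").as("amount"))

// Declare the decimal as Avro "bytes" (with a decimal logical type) instead
// of the default "fixed". The field is a union with "null" because the
// Catalyst column is nullable.
val avroSchema =
  """{
    |  "type": "record",
    |  "name": "topLevelRecord",
    |  "fields": [{
    |    "name": "amount",
    |    "type": [{"type": "bytes", "logicalType": "decimal",
    |              "precision": 4, "scale": 2}, "null"]
    |  }]
    |}""".stripMargin

df.write.format("avro").option("avroSchema", avroSchema).save("/tmp/decimal_out")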
How was this patch tested?
Unit test