
Conversation

@gengliangwang
Member

What changes were proposed in this pull request?

With the code changes in #21847, Spark can write out Avro files according to a user-provided output schema.

To make this more robust and user-friendly, we should validate the Avro schema before tasks are launched.

We should also support writing the logical decimal type as BYTES (by default it is written as FIXED).
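The validation idea can be sketched as a fail-fast compatibility check run before any task launches. The model below is illustrative only, not the actual Spark internals: types are represented as plain strings, and the mapping table mirrors the pairs discussed in this PR, with DecimalType allowed to map to either FIXED or BYTES.

```scala
// Simplified, hypothetical model of the proposed up-front validation.
// Each Catalyst type has a set of acceptable Avro types; anything else
// fails immediately instead of failing mid-job inside a task.
object SchemaValidation {
  val compatible: Map[String, Set[String]] = Map(
    "NullType"      -> Set("NULL"),
    "BooleanType"   -> Set("BOOLEAN"),
    "ByteType"      -> Set("INT"),
    "ShortType"     -> Set("INT"),
    "IntegerType"   -> Set("INT"),
    "FloatType"     -> Set("FLOAT"),
    "DoubleType"    -> Set("DOUBLE"),
    "BinaryType"    -> Set("BYTES"),
    "DecimalType"   -> Set("FIXED", "BYTES"), // decimal may be FIXED or BYTES
    "DateType"      -> Set("INT"),
    "TimestampType" -> Set("LONG")
  )

  def validate(catalystType: String, avroType: String): Unit =
    if (!compatible.getOrElse(catalystType, Set.empty).contains(avroType)) {
      throw new IllegalArgumentException(
        s"Cannot convert Catalyst type $catalystType to Avro type $avroType")
    }
}
```

With a table like this, a mismatch such as FloatType against DOUBLE is rejected at planning time rather than surfacing as a task failure.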

How was this patch tested?

Unit test

@gengliangwang gengliangwang changed the title Avro serializer match [SPARK-25104][SQL]Avro: Validate user specified output schema Aug 13, 2018
@SparkQA

SparkQA commented Aug 13, 2018

Test build #94694 has finished for PR 22094 at commit c8e98b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member

dbtsai commented Aug 13, 2018

LGTM.

  case (FloatType, FLOAT) =>
    (getter, ordinal) => getter.getFloat(ordinal)
- case DoubleType =>
+ case (DoubleType, DOUBLE) =>
Member

@dbtsai dbtsai Aug 13, 2018


Do we want to allow users to cast up from the Catalyst type to the Avro type, for example Catalyst float to Avro double? If so, this can be done in a different PR.

Member Author

@gengliangwang gengliangwang Aug 14, 2018


Personally I would like to keep it as simple as this PR proposes. If data type casting is needed, users can always do it in the DataFrame before writing the Avro files.

But if the casting turns out to be important, we can work on it.
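The suggested workaround could look roughly like the sketch below. This is illustrative Spark usage, not code from this PR: it assumes an existing DataFrame `df` with a float column named `value`, a schema JSON string `userProvidedSchemaJson` declaring that field as "double", and the spark-avro data source on the classpath.

```scala
// Hypothetical usage sketch: widen float to double in the DataFrame first,
// so the data matches a user-provided Avro schema that declares DOUBLE.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val widened = df.withColumn("value", col("value").cast(DoubleType))
widened.write
  .format("avro")
  .option("avroSchema", userProvidedSchemaJson) // field declared as "double"
  .save("/path/to/output")
```

This keeps the serializer's type matching strict while still letting users widen types explicitly when they need to.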


Yeah, if someone feels it's important, let's do it in a different PR.

(NullType, NULL),
(BooleanType, BOOLEAN),
(ByteType, INT),
(IntegerType, INT),

Could you add (ShortType, INT), too?

(DoubleType, DOUBLE),
(BinaryType, BYTES),
(DateType, INT),
(TimestampType, LONG)

If the intention is to be exhaustive, what about decimal types? And primitive-to-complex conversions, and vice versa?


@dongjoon-hyun Thanks, I have updated the test.


  private def resolveNullableType(avroType: Schema, nullable: Boolean): Schema = {
-   if (nullable) {
+   if (nullable && avroType.getType != NULL) {

This fixes a trivial bug when avroType is the NULL type.
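The fixed behavior can be modeled in a self-contained sketch. Avro schemas are represented here as plain strings rather than org.apache.avro.Schema, so this is illustrative only: a nullable field becomes a ["null", T] union, but a type that is already null must not be wrapped again.

```scala
// Simplified model of the fix: wrap a nullable type in a union with null,
// unless the type is itself "null" (wrapping would produce an invalid
// ["null", "null"] union in real Avro).
def resolveNullableType(avroType: String, nullable: Boolean): String =
  if (nullable && avroType != "null") s"""["null", "$avroType"]"""
  else avroType
```

Without the `avroType != "null"` guard, a NULL-typed field would be wrapped into a union containing null twice, which Avro rejects.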

@SparkQA

SparkQA commented Aug 14, 2018

Test build #94720 has finished for PR 22094 at commit d217dfc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member

dbtsai commented Aug 14, 2018

Merged into master. Thanks.

@asfgit asfgit closed this in ab19730 Aug 14, 2018