[SPARK-25104][SQL] Avro: Validate user specified output schema #22094
Conversation
Force-pushed from 065229c to c8e98b1
Test build #94694 has finished for PR 22094 at commit
LGTM.
  case (FloatType, FLOAT) =>
    (getter, ordinal) => getter.getFloat(ordinal)
- case DoubleType =>
+ case (DoubleType, DOUBLE) =>
Do we want to allow users to do casting up from catalystType to avroType? For example, catalystType float to avroType double. If so, this can be done in a different PR.
Personally I would like to keep it simple, as this PR proposes.
If data type casting is needed, users can always do it in the DataFrame before writing Avro files.
But if the casting is important, we can work on it.
Yeah, if someone feels it's important, let's do it in a different PR.
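For reference, the workaround mentioned above (casting in the DataFrame before writing) could look roughly like the sketch below. It assumes a DataFrame with a FloatType column named "price"; the column name and paths are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val spark = SparkSession.builder().appName("AvroCastExample").getOrCreate()

// Hypothetical input with a FloatType column "price".
val df = spark.read.parquet("/tmp/input")

// Cast float -> double up front so the Catalyst schema matches an Avro
// schema that declares the field as "double".
val casted = df.withColumn("price", col("price").cast(DoubleType))

casted.write.format("avro").save("/tmp/output")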
  (NullType, NULL),
  (BooleanType, BOOLEAN),
  (ByteType, INT),
  (IntegerType, INT),
Could you add (ShortType, INT), too?
  (DoubleType, DOUBLE),
  (BinaryType, BYTES),
  (DateType, INT),
  (TimestampType, LONG)
If the intention is to be exhaustive, what about decimal types? And primitive to complex, and vice versa?
@dongjoon-hyun Thanks, I have updated the test.
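A validation test along these lines could look roughly like the sketch below; this is a hedged reconstruction, not the PR's actual test code. It assumes the built-in Avro data source and its avroSchema writer option; the path is made up:

import scala.util.Try
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AvroSchemaValidation").getOrCreate()
import spark.implicits._

// A non-nullable IntegerType column.
val df = Seq(1, 2, 3).toDF("id")

// User-specified Avro schema whose field type ("string") does not match
// the Catalyst type (int); the write should be rejected by validation.
val mismatched =
  """{"type": "record", "name": "topLevelRecord",
    | "fields": [{"name": "id", "type": "string"}]}""".stripMargin

val result = Try(df.write.format("avro").option("avroSchema", mismatched).save("/tmp/out"))
assert(result.isFailure)  // incompatible schema is rejected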
  private def resolveNullableType(avroType: Schema, nullable: Boolean): Schema = {
-   if (nullable) {
+   if (nullable && avroType.getType != NULL) {
This fixes a trivial bug when avroType is the NULL type.
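For context, this helper resolves nullability against the user-specified Avro schema: a nullable Catalyst field is expected to map to an Avro union like ["double", "null"], and the helper extracts the actual branch. Below is a condensed sketch of the fixed logic, as a hedged reconstruction rather than the PR's exact code:

import scala.collection.JavaConverters._
import org.apache.avro.Schema
import org.apache.avro.Schema.Type.NULL

// Without the "!= NULL" guard, a Catalyst NullType field (a plain Avro
// "null", not a union) would hit the union-handling branch and fail.
def resolveNullableType(avroType: Schema, nullable: Boolean): Schema = {
  if (nullable && avroType.getType != NULL) {
    // A nullable field is represented as a union; pick the non-null branch.
    val actualType = avroType.getTypes.asScala.filter(_.getType != NULL)
    assert(actualType.length == 1)
    actualType.head
  } else {
    avroType
  }
}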
Test build #94720 has finished for PR 22094 at commit
Merged into master. Thanks.
What changes were proposed in this pull request?
With the code changes in #21847, Spark can write out Avro files as per a user-provided output schema.
To make this more robust and user-friendly, we should validate the Avro schema before tasks are launched.
We should also support outputting the logical decimal type as BYTES (by default we output it as FIXED).
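For example, with a user-supplied output schema the decimal representation can be chosen at write time. A rough sketch assuming the avroSchema writer option; the column name, precision, and path are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AvroDecimalBytes").getOrCreate()
import spark.implicits._

// A nullable DecimalType(4, 2) column.
val df = Seq("12.34").toDF("amount")
  .select($"amount".cast("decimal(4,2)").as("amount"))

// Declare the decimal as Avro "bytes" (with a decimal logical type) instead
// of the default "fixed". The field is a union with "null" because the
// Catalyst column is nullable.
val avroSchema =
  """{
    |  "type": "record",
    |  "name": "topLevelRecord",
    |  "fields": [{
    |    "name": "amount",
    |    "type": [{"type": "bytes", "logicalType": "decimal",
    |              "precision": 4, "scale": 2}, "null"]
    |  }]
    |}""".stripMargin

df.write.format("avro").option("avroSchema", avroSchema).save("/tmp/decimal_out")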
How was this patch tested?
Unit test