[SPARK-24855][SQL][EXTERNAL]: Built-in AVRO support should support specified schema on write #21847
Conversation
add to whitelist
Test build #93453 has finished for PR 21847 at commit
30fc1ae to 71dbc39
Test build #93457 has finished for PR 21847 at commit
71dbc39 to 12b3859
Test build #93464 has finished for PR 21847 at commit
12b3859 to 033f4dd
Test build #93466 has finished for PR 21847 at commit
033f4dd to f05e67e
Test build #93565 has finished for PR 21847 at commit
Is the change here related to specifying schema on write?
Yes. I need the original rootCatalystType so I can determine how those types map back to a user-specified schema. Simply having nullable as context is not sufficient :(
e31100d to 6f686d6
Test build #93604 has finished for PR 21847 at commit
6f686d6 to 7e44ca0
Test build #93606 has finished for PR 21847 at commit
+cc @MaxGekk and @gengliangwang who worked on this part of the codebase.
Can we save the foldLeft here?
catalystType.asInstanceOf[StructType].zip(avroFields.asScala).forall {
  case (f1, f2) => typeMatchesSchema(f1.dataType, f2.schema)
}
Not caused by this PR, but we'd better explain what this method does. Can you add a comment for it?
Is it possible nullable == false but avroType.getType == Type.UNION?
I've converted this into a match statement that covers the following cases (see the sketch after this list):
- nullable == false and Type.UNION => should "resolve" the union to the appropriate type
- nullable == true and Type.UNION => should "resolve" the union to the appropriate type
- nullable == Any and any other Type => just return the Type
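A minimal sketch of what such a union resolution could look like, using the Avro `Schema` API. The method name and body here are illustrative assumptions, not the PR's actual code:

```scala
import org.apache.avro.Schema
import org.apache.avro.Schema.Type
import scala.collection.JavaConverters._

// Hypothetical helper: strip the NULL branch from a UNION schema so the
// remaining non-null branch can be matched against the Catalyst type.
def resolveNullableType(avroType: Schema, nullable: Boolean): Schema = {
  avroType.getType match {
    case Type.UNION =>
      // For both nullable and non-nullable Catalyst fields, pick the
      // non-null branch of the union (assumes a simple ["null", X] union).
      val nonNull = avroType.getTypes.asScala.filter(_.getType != Type.NULL)
      if (nonNull.size == 1) nonNull.head else avroType
    case _ =>
      // Not a union: just return the type unchanged.
      avroType
  }
}
```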
spark is already available from SharedSQLContext. You don't need to pass it to the function
Fixed.
Do you actually check the schemas? checkAnswer does a collect of newDf and checks the content of the two dataframes. I am asking because the method name forceSchemaCheck is slightly confusing.
I changed forceSchemaCheck to checkSpecifySchemaOnWrite. I think that's a bit clearer.
Why don't you return false instead of crashing in the assert in the case of different sizes?
map{...}.foldLeft(true)(_ && _) -> forall{...}
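For illustration, the two forms the reviewer is contrasting are equivalent for a boolean predicate, and `forall` also short-circuits on the first false result, which the foldLeft version does not. This is a generic Scala sketch, not the PR's code:

```scala
// A toy predicate over pairs of values, standing in for typeMatchesSchema.
val pairs = Seq((1, 1), (2, 2), (3, 4))
def matches(a: Int, b: Int): Boolean = a == b

// Original style: build a Seq[Boolean], then AND the results together.
val viaFoldLeft = pairs.map { case (a, b) => matches(a, b) }.foldLeft(true)(_ && _)

// Suggested style: forall expresses the same thing directly and short-circuits.
val viaForall = pairs.forall { case (a, b) => matches(a, b) }

assert(viaFoldLeft == viaForall)
```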
Could you add a comment on why you pass Type.STRING for DecimalType?
Additional work to support some logical types would probably be in order for this to work correctly. Prior to my change, AvroSerializer always wrote out DecimalType as a string, so I'm just keeping the existing behavior. I added a comment to reflect this.
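For context, a converter that keeps the pre-existing write-decimal-as-string behavior might look roughly like the sketch below. It mirrors the style of the serializer snippets quoted elsewhere in this thread, but the exact code is an assumption, not the PR's implementation:

```scala
import org.apache.avro.util.Utf8
import org.apache.spark.sql.catalyst.expressions.SpecializedGetters
import org.apache.spark.sql.types.DecimalType

// Hypothetical fragment of a Catalyst-to-Avro value converter: decimals are
// rendered via their string representation, matching the behavior the
// serializer had before this PR (an Avro "string", not a logical decimal type).
def decimalConverter(dt: DecimalType): (SpecializedGetters, Int) => Any =
  (getter, ordinal) =>
    new Utf8(getter.getDecimal(ordinal, dt.precision, dt.scale).toString)
```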
I hope you didn't expose your internal info.
8d7c520 to d85242e
Test build #93619 has finished for PR 21847 at commit
Test build #93621 has finished for PR 21847 at commit
Test build #93603 has finished for PR 21847 at commit
621be3e to ed44c76
Test build #93625 has finished for PR 21847 at commit
ed44c76 to c76aadd
Test build #93626 has finished for PR 21847 at commit
FYI: I'm doing some additional testing around performance and have found a pretty gnarly regression with a particular type of schema. I'll try to track down what's causing it.
Test build #94460 has finished for PR 21847 at commit
new Schema.Parser().parse(expectedAvroSchema))
}

def getAvroSchemaStringFromFiles(filePath: String): String = {
@dbtsai Should we update the test L980 "Validate namespace in avro file that has nested records with the same name" to use this, too?
Good suggestion. I'll address it in the next push.
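As a rough sketch of what a helper like getAvroSchemaStringFromFiles could do, the body below reads the writer schema out of the first .avro part file in a directory. It is an assumption based on the Avro file API, not the actual test code:

```scala
import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// Hypothetical version of the test helper discussed above: open the first
// Avro part file produced by a write and return its embedded writer schema.
def getAvroSchemaStringFromFiles(filePath: String): String = {
  val avroFile = new File(filePath).listFiles()
    .filter(_.getName.endsWith(".avro"))
    .head
  val reader = DataFileReader.openReader(avroFile, new GenericDatumReader[GenericRecord]())
  try reader.getSchema.toString(false) finally reader.close()
}
```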
@cloud-fan @gatorsmile @gengliangwang We need this to write avro files from Spark for other applications to consume because we use the
Test build #94478 has finished for PR 21847 at commit
(getter, ordinal) => new Utf8(getter.getUTF8String(ordinal).getBytes)
}
case BinaryType => avroType.getType match {
  case Type.FIXED =>
FIXED has a "size" attribute; shall we consider it when preparing the bytes? E.g., shall we throw an exception if the bytes from Spark exceed the size, and shall we pad the bytes when their length is smaller than the size?
I think we should throw an exception if the size of the bytes to be written is different from the "size" attribute, without doing any padding.
BTW, Avro doesn't handle this well: if the size to be written is larger than the "size" attribute, Avro will silently cause data corruption.
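As a sketch of the check being discussed (not the PR's exact code), a writer for an Avro FIXED field could validate the byte length against the schema's declared size before wrapping the bytes:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData

// Hypothetical converter fragment: wrap Spark binary data as an Avro FIXED
// value, rejecting any byte array whose length differs from the declared size
// (padding is deliberately not attempted, per the discussion above).
def toAvroFixed(avroType: Schema, data: Array[Byte]): GenericData.Fixed = {
  val size = avroType.getFixedSize
  if (data.length != size) {
    throw new IllegalArgumentException(
      s"Cannot write ${data.length} byte(s) of binary data into FIXED type with size of $size byte(s)")
  }
  new GenericData.Fixed(avroType, data)
}
```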
checkAvroSchemaEquals(avroSchema, getAvroSchemaStringFromFiles(tempSaveDir))

// Writing df containing data not in the enum will throw an exception
intercept[SparkException] {
can we also check the error message?
I was thinking of doing a follow-up PR on this. What's happening is that Avro builds a map of keys to indices and looks up that table to get the index. Avro doesn't do any check, so the exception is a null pointer exception when looking up a non-existent key.
Since I think checking a message against a null pointer exception is fragile, I decided to leave it to a follow-up PR.
LGTM except one concern about error checking for fixed type. Thanks for working on it!
Test build #94522 has finished for PR 21847 at commit
Test build #94523 has finished for PR 21847 at commit
Test build #94524 has finished for PR 21847 at commit
if (!enumSymbols.contains(data)) {
  throw new IncompatibleSchemaException(
    "Cannot write \"" + data + "\" since it's not defined in enum \"" +
      enumSymbols.mkString("\", \""))
nit: enumSymbols.mkString("\", \"") + "\"")
Good catch.
if (data.length != size) {
  throw new IncompatibleSchemaException(
    s"Cannot write ${data.length} ${if (data.length > 1) "bytes" else "byte"} of " +
      s"binary data into FIXED Type with size of " +
nit: "binary data into FIXED Type with size of ".
addressed
val df = spark.createDataFrame(dfWithNull.na.drop().rdd,
  StructType(Seq(StructField("Suit", StringType, false))))

val tempSaveDir = s"$tempDir/save1/"
nit: it seems we can still use save?
addressed
A few minor comments. LGTM.
Test build #94531 has finished for PR 21847 at commit
LGTM

Thanks all. Merged into master.
The mapping of Spark schema to Avro schema is many-to-many. (See https://spark.apache.org/docs/latest/sql-data-sources-avro.html#supported-types-for-spark-sql---avro-conversion) The default schema mapping might not be exactly what users want. For example, by default, a "string" column is always written as "string" Avro type, but users might want to output the column as "enum" Avro type. With PR apache#21847, Spark supports user-specified schema in the batch writer. For the function `to_avro`, we should support user-specified output schema as well.

Unit test.

Closes apache#25419 from gengliangwang/to_avro.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 48adc91)
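For reference, a hedged sketch of the schema-aware `to_avro` usage described in that commit message. The overload taking a JSON-format Avro schema exists in newer Spark releases (with the spark-avro module on the classpath), but the schema string and column names here are purely illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.to_avro

val spark = SparkSession.builder().appName("to_avro-example").getOrCreate()
import spark.implicits._

// Illustrative output schema: encode the "suit" column as an Avro enum
// instead of the default "string" mapping.
val enumSchema =
  """{"type": "enum", "name": "Suit",
    | "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]}""".stripMargin

val df = Seq("SPADES", "HEARTS").toDF("suit")

// to_avro(data, jsonFormatSchema) encodes the column using the user-specified
// Avro schema rather than the default Catalyst-to-Avro mapping.
val encoded = df.select(to_avro($"suit", enumSchema).as("suit_avro"))
```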
When lindblombr at apple developed [SPARK-24855](apache#21847) to support specified schema on write, we found a performance regression in Avro writer for our dataset. With this PR, the performance is improved, but not as good as Spark 2.3 + the old avro writer. There must be something we miss which we need to investigate further.

Spark 2.4
```
spark git:(master) ./build/mvn -DskipTests clean package
spark git:(master) bin/spark-shell --jars external/avro/target/spark-avro_2.11-2.4.0-SNAPSHOT.jar
```

Spark 2.3 + databricks avro
```
spark git:(branch-2.3) ./build/mvn -DskipTests clean package
spark git:(branch-2.3) bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0
```

Current master:
```
+-------+--------------------+
|summary|          writeTimes|
+-------+--------------------+
|  count|                 100|
|   mean|             2.95621|
| stddev|0.030895815479469294|
|    min|               2.915|
|    max|               3.049|
+-------+--------------------+

+-------+--------------------+
|summary|           readTimes|
+-------+--------------------+
|  count|                 100|
|   mean| 0.31072999999999995|
| stddev|0.054139709842390006|
|    min|               0.259|
|    max|               0.692|
+-------+--------------------+
```

Current master with this PR:
```
+-------+--------------------+
|summary|          writeTimes|
+-------+--------------------+
|  count|                 100|
|   mean|  2.5804300000000002|
| stddev|0.011175600225672079|
|    min|               2.558|
|    max|                2.62|
+-------+--------------------+

+-------+--------------------+
|summary|           readTimes|
+-------+--------------------+
|  count|                 100|
|   mean| 0.29922000000000004|
| stddev|0.058261961532514166|
|    min|               0.251|
|    max|               0.732|
+-------+--------------------+
```

Spark 2.3 + databricks avro:
```
+-------+--------------------+
|summary|          writeTimes|
+-------+--------------------+
|  count|                 100|
|   mean|  1.7730500000000005|
| stddev|0.025199156230863575|
|    min|               1.729|
|    max|               1.833|
+-------+--------------------+

+-------+-------------------+
|summary|          readTimes|
+-------+-------------------+
|  count|                100|
|   mean|            0.29715|
| stddev|0.05685643358850465|
|    min|              0.258|
|    max|              0.718|
+-------+-------------------+
```

The following is the test code to reproduce the result.

```scala
spark.sqlContext.setConf("spark.sql.avro.compression.codec", "uncompressed")

val sparkSession = spark
import sparkSession.implicits._

val df = spark.sparkContext.range(1, 3000).repartition(1).map { uid =>
  val features = Array.fill(16000)(scala.math.random)
  (uid, scala.math.random, java.util.UUID.randomUUID().toString,
    java.util.UUID.randomUUID().toString, features)
}.toDF("uid", "random", "uuid1", "uuid2", "features").cache()
val size = df.count()

// Write into ramdisk to rule out the disk IO impact
val tempSaveDir = s"/Volumes/ramdisk/${java.util.UUID.randomUUID()}/"

val n = 150
val writeTimes = new Array[Double](n)
var i = 0
while (i < n) {
  val t1 = System.currentTimeMillis()
  df.write
    .format("com.databricks.spark.avro")
    .mode("overwrite")
    .save(tempSaveDir)
  val t2 = System.currentTimeMillis()
  writeTimes(i) = (t2 - t1) / 1000.0
  i += 1
}

df.unpersist()

// The first 50 runs are for warm-up
val readTimes = new Array[Double](n)
i = 0
while (i < n) {
  val t1 = System.currentTimeMillis()
  val readDF = spark.read.format("com.databricks.spark.avro").load(tempSaveDir)
  assert(readDF.count() == size)
  val t2 = System.currentTimeMillis()
  readTimes(i) = (t2 - t1) / 1000.0
  i += 1
}

spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
```

Existing tests.
Author: DB Tsai <[email protected]>
Author: Brian Lindblom <[email protected]>

Closes apache#21952 from dbtsai/avro-performance-fix.

(cherry picked from commit 273b284)

RB=1516361 R=fli,mshen,yezhou,edlu A=fli
…cified schema on write

Allows `avroSchema` option to be specified on write, allowing a user to specify a schema in cases where this is required. A trivial use case is reading in an avro dataset, making some small adjustment to a column or columns and writing out using the same schema. Implicit schema creation from SQL Struct results in a schema that while for the most part, is functionally similar, is not necessarily compatible.

Allows `fixed` Field type to be utilized for records of specified `avroSchema`

Unit tests in AvroSuite are extended to test this with enum and fixed types.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes apache#21847 from lindblombr/specify_schema_on_write.

Lead-authored-by: Brian Lindblom <[email protected]>
Co-authored-by: DB Tsai <[email protected]>
Signed-off-by: DB Tsai <[email protected]>

(cherry picked from commit 0cea9e3)

This is a partial backport of only the test cases in SPARK-24855, and the core part of SPARK-24855 has been merged in the previous commits.

RB=2119392 BUG=LIHADOOP-53602 G=spark-reviewers R=mshen,ekrogen A=ekrogen
What changes were proposed in this pull request?
Allows the `avroSchema` option to be specified on write, allowing a user to specify a schema in cases where this is required. A trivial use case is reading in an avro dataset, making some small adjustment to a column or columns, and writing out using the same schema. Implicit schema creation from a SQL Struct results in a schema that, while for the most part functionally similar, is not necessarily compatible.

Allows the `fixed` field type to be utilized for records of a specified `avroSchema`.

How was this patch tested?
Unit tests in AvroSuite are extended to test this with enum and fixed types.
Please review http://spark.apache.org/contributing.html before opening a pull request.
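As a usage sketch of the feature this PR adds (the paths and the schema body below are illustrative placeholders, not taken from the PR), specifying a schema on write looks roughly like this with the built-in Avro source:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-schema-on-write").getOrCreate()

// Placeholder paths for an existing Avro dataset and its write-back location.
val inputPath = "/tmp/cards.avro"
val outputPath = "/tmp/cards-out.avro"

// Illustrative Avro schema using an enum field, which the default
// Catalyst-to-Avro mapping would otherwise emit as a plain "string".
val avroSchema =
  """{
    |  "type": "record",
    |  "name": "Card",
    |  "fields": [
    |    {"name": "Suit", "type": {"type": "enum", "name": "Suit",
    |      "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]}}
    |  ]
    |}""".stripMargin

val df = spark.read.format("avro").load(inputPath)

// The avroSchema option forces the writer to use the user-specified schema
// instead of one derived implicitly from the DataFrame's StructType.
df.write
  .format("avro")
  .option("avroSchema", avroSchema)
  .save(outputPath)
```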