
Conversation

@gengliangwang (Member) commented Jul 31, 2018

What changes were proposed in this pull request?

Support reading/writing Avro logical timestamp type with different precisions
https://avro.apache.org/docs/1.8.2/spec.html#Timestamp+%28millisecond+precision%29

To specify the output timestamp type, use the DataFrame option outputTimestampType or the SQL config spark.sql.avro.outputTimestampType. The supported values are

  • TIMESTAMP_MICROS
  • TIMESTAMP_MILLIS

The default output type is TIMESTAMP_MICROS.
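As a usage illustration (this is not code from the PR itself — the paths and data are made up, and whether the per-write option survived review is discussed further down the thread), the two ways of picking the output type described above would look roughly like:

```scala
// Hypothetical usage sketch; assumes a running SparkSession and the
// spark-avro data source on the classpath.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

val df = Seq(java.sql.Timestamp.valueOf("2018-07-31 12:00:00.123456")).toDF("ts")

// Per-write DataFrame option...
df.write
  .format("avro")
  .option("outputTimestampType", "TIMESTAMP_MILLIS")
  .save("/tmp/ts_millis")

// ...or session-wide SQL config (TIMESTAMP_MICROS is the default).
spark.conf.set("spark.sql.avro.outputTimestampType", "TIMESTAMP_MICROS")
df.write.format("avro").save("/tmp/ts_micros")
```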

How was this patch tested?

Unit test

@holdensmagicalunicorn commented:

@gengliangwang, thanks! I am a bot who has found some folks who might be able to help with the review: @cloud-fan, @gatorsmile and @HyukjinKwon

@SparkQA commented Jul 31, 2018

Test build #93838 has finished for PR 21935 at commit 3a53f55.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case DateType => builder.longType()
case TimestampType => builder.longType()
case TimestampType =>
// To be consistent with the previous behavior of writing Timestamp type with Avro 1.7,
Contributor:

The previous behavior is that we can't write out timestamp data, isn't it?

Contributor:

also we should follow parquet and have a config spark.sql.avro.outputTimestampType to control it.

Member Author:

Previously we wrote timestamps as Long and divided the value by 1000 (millisecond precision).
Maybe I need to revise the comment.
+1 on the new config.
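The conversions under discussion are plain unit changes between Spark's internal microsecond Longs and Avro's two timestamp logical types. A minimal, self-contained sketch (the object and method names here are mine, not Spark's):

```scala
// Spark stores TimestampType internally as microseconds since the Unix epoch.
// Avro timestamp-millis stores milliseconds; timestamp-micros stores microseconds.
object TimestampPrecision {
  // Writing with millisecond precision truncates the microsecond remainder —
  // this is the old "divide by 1000" behavior mentioned above.
  def microsToMillis(micros: Long): Long = micros / 1000L

  // Reading a millisecond value back widens it to microseconds.
  def millisToMicros(millis: Long): Long = millis * 1000L
}
```

Note that the millisecond path is lossy: a value that passes through microsToMillis and back loses its sub-millisecond digits, which is why the doc comment later in the thread talks about Spark having to truncate the microsecond portion.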

Contributor:

For now I think writing out timestamp micros should be good

case TimestampType =>
// To be consistent with the previous behavior of writing Timestamp type with Avro 1.7,
// the default output Avro Timestamp type is with millisecond precision.
builder.longBuilder().prop(LogicalType.LOGICAL_TYPE_PROP, "timestamp-millis").endLong()
Contributor:

is there a better API for it? hardcoding a string is hacky.

@SparkQA commented Aug 1, 2018

Test build #93859 has finished for PR 21935 at commit fdc6c2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

catalystType: DataType,
path: List[String]): (CatalystDataUpdater, Int, Any) => Unit =
path: List[String]): (CatalystDataUpdater, Int, Any) => Unit = {
(avroType.getLogicalType, catalystType) match {
Member:

Can we do this like:

      case (LONG, TimestampType) => avroType.getLogicalType match {
        case _: TimestampMillis => (updater, ordinal, value) =>
          updater.setLong(ordinal, value.asInstanceOf[Long] * 1000)
        case _: TimestampMicros => (updater, ordinal, value) =>
          updater.setLong(ordinal, value.asInstanceOf[Long])
        case _ => (updater, ordinal, value) =>
          updater.setLong(ordinal, value.asInstanceOf[Long] * 1000)
      }

? Looks like they have the Avro long type anyway. I thought this is easier to read, and actually safer and more correct.

* This function takes an avro schema and returns a sql schema.
*/
def toSqlType(avroSchema: Schema): SchemaType = {
avroSchema.getLogicalType match {
Member:

ditto

case _: TimestampMicros => (updater, ordinal, value) =>
updater.setLong(ordinal, value.asInstanceOf[Long])
case _ => (updater, ordinal, value) =>
updater.setLong(ordinal, value.asInstanceOf[Long] * 1000)
Contributor:

Let's add a comment to say it's for backward compatibility reasons. Also we should only do it when logical type is null. For other logical types, we should fail here.

(getter, ordinal) => avroType.getLogicalType match {
case _: TimestampMillis => getter.getLong(ordinal) / 1000
case _: TimestampMicros => getter.getLong(ordinal)
case _ => getter.getLong(ordinal)
Contributor:

ditto

case LONG => SchemaType(LongType, nullable = false)
case LONG => avroSchema.getLogicalType match {
case _: TimestampMillis | _: TimestampMicros =>
return SchemaType(TimestampType, nullable = false)
Contributor:

why use return here?

case TimestampType => builder.longType()
case TimestampType =>
val timestampType = outputTimestampType match {
case "TIMESTAMP_MILLIS" => LogicalTypes.timestampMillis()
Contributor:

don't hardcode the strings, we can write

if (outputTimestampType == AvroOutputTimestampType.TIMESTAMP_MICROS.toString) ...
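The suggestion is to compare against an enumeration value rather than scatter raw string literals. A self-contained sketch of the pattern — the enum name mirrors the one in the PR, but this definition and the helper are standalone illustrations, not Spark's actual code:

```scala
// Standalone sketch of a Scala Enumeration used in place of hardcoded strings.
object AvroOutputTimestampType extends Enumeration {
  val TIMESTAMP_MICROS, TIMESTAMP_MILLIS = Value
}

// Comparing against the enum's toString keeps the literal defined in one place,
// and withName fails fast on unknown values instead of silently falling through.
def isMicros(outputTimestampType: String): Boolean =
  outputTimestampType == AvroOutputTimestampType.TIMESTAMP_MICROS.toString
```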

@SparkQA commented Aug 1, 2018

Test build #93884 has finished for PR 21935 at commit be0077a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

prevNameSpace: String = ""): Schema = {
prevNameSpace: String = "",
outputTimestampType: AvroOutputTimestampType.Value = AvroOutputTimestampType.TIMESTAMP_MICROS
): Schema = {
Member Author:

Not sure if the indent here is correct.

Member:

I believe

 outputTimestampType: AvroOutputTimestampType.Value = AvroOutputTimestampType.TIMESTAMP_MICROS)
: Schema = {

is more correct per https://github.com/databricks/scala-style-guide#spacing-and-indentation

@SparkQA commented Aug 1, 2018

Test build #93899 has finished for PR 21935 at commit 09ad6e9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

retest this please

* from the Unix epoch. TIMESTAMP_MILLIS is also logical, but with millisecond precision,
* which means Spark has to truncate the microsecond portion of its timestamp value.
*/
val outputTimestampType: AvroOutputTimestampType.Value = {
Member:

Hm, I wouldn't expose this as an option for now - that at least matches to Parquet's.

Contributor:

I'm ok with it, I think parquet should also follow this.

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils}
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.{StructType, _}
Member:

Import looks a bit odd :-)

@HyukjinKwon (Member) left a comment:

LGTM otherwise

// For backward compatibility, if the Avro type is Long and it is not logical type,
// the value is processed as timestamp type with millisecond precision.
updater.setLong(ordinal, value.asInstanceOf[Long] * 1000)
}
Contributor:

we should add a default case and throw IncompatibleSchemaException, in case avro add more logical types for long type in the future.
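The requested shape — handle the known logical types, keep the null case for backward compatibility, and fail on anything unrecognized — can be sketched without Avro's classes. The exception class and the logical-type strings below are stand-ins for illustration, not Spark's or Avro's actual definitions:

```scala
// Stand-in for Spark's IncompatibleSchemaException; this only illustrates the
// exhaustive-match shape asked for in review.
final case class IncompatibleSchemaException(msg: String) extends Exception(msg)

// Picks a long-to-microseconds converter based on the (stand-in) logical type.
def longConverter(logicalType: String): Long => Long = logicalType match {
  case "timestamp-millis" => millis => millis * 1000L // widen to micros
  case "timestamp-micros" => micros => micros         // already micros
  case null               => millis => millis * 1000L // legacy plain long: treat as millis
  case other =>
    throw IncompatibleSchemaException(s"Cannot convert long with logical type $other")
}
```

The default case matters because Avro may add more logical types backed by long in the future; failing loudly beats silently misinterpreting the value.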

(getter, ordinal) => getter.getInt(ordinal) * DateTimeUtils.MILLIS_PER_DAY
case TimestampType =>
(getter, ordinal) => getter.getLong(ordinal) / 1000
(getter, ordinal) => avroType.getLogicalType match {
Contributor:

do not do pattern match per record, we should

avroType.getLogicalType match {
  case _: TimestampMillis => (getter, ordinal) => ...

case _: TimestampMicros => getter.getLong(ordinal)
// For backward compatibility, if the Avro type is Long and it is not logical type,
// output the timestamp value as with millisecond precision.
case null => getter.getLong(ordinal) / 1000
Contributor:

ditto, add a default case.

recordName: String = "topLevelRecord",
prevNameSpace: String = ""): Schema = {
prevNameSpace: String = "",
outputTimestampType: AvroOutputTimestampType.Value = AvroOutputTimestampType.TIMESTAMP_MICROS
Contributor:

do we really need the default value? Seems only one call site excluding the recursive ones.

Member Author:

It is also used in CatalystDataToAvro

updater.setLong(ordinal, value.asInstanceOf[Long] * 1000)
case _: TimestampMicros => (updater, ordinal, value) =>
updater.setLong(ordinal, value.asInstanceOf[Long])
case null => (updater, ordinal, value) =>
Member:

ditto, add a default case.

@SparkQA commented Aug 2, 2018

Test build #93920 has finished for PR 21935 at commit 09ad6e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
val episodesAvro = testFile("episodes.avro")
val testAvro = testFile("test.avro")
val timestampAvro = testFile("timestamp.avro")
Contributor:

At least we should document how the binary file is generated, or just do a roundtrip test: Spark writes Avro files and then reads them back.
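A roundtrip test of the kind suggested would look roughly like the sketch below. This is hypothetical, not the PR's actual test code; it assumes the AvroSuite-style helpers (withTempPath, checkAnswer) visible elsewhere in this thread:

```scala
// Hypothetical roundtrip test sketch: write a timestamp through the Avro
// data source and assert the value read back is unchanged.
test("roundtrip: timestamp survives Avro write/read") {
  withTempPath { dir =>
    val df = Seq(java.sql.Timestamp.valueOf("2018-08-02 10:11:12.123456")).toDF("ts")
    df.write.format("avro").save(dir.getCanonicalPath)
    val readBack = spark.read.format("avro").load(dir.getCanonicalPath)
    checkAnswer(readBack, df)
  }
}
```

A roundtrip test avoids checked-in binary fixtures entirely, which is the transparency concern raised above.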

Member Author:

The schema and data are stated in https://github.com/apache/spark/pull/21935/files#diff-9364b0610f92b3cc35a4bc43a80751bfR397
They should be easy to reproduce from the test cases.
The other test file episodesAvro also doesn't document how it was generated.

@SparkQA commented Aug 2, 2018

Test build #93947 has finished for PR 21935 at commit 2b286cd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 2, 2018

Test build #93959 has finished for PR 21935 at commit 921e6cb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author):

retest this please.

@cloud-fan (Contributor):

LGTM

@SparkQA commented Aug 2, 2018

Test build #93985 has finished for PR 21935 at commit 921e6cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 2, 2018

Test build #93998 has finished for PR 21935 at commit 499fbf3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 2, 2018

Test build #94004 has finished for PR 21935 at commit fed8505.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

retest this please

@SparkQA commented Aug 2, 2018

Test build #94020 has finished for PR 21935 at commit fed8505.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member) commented Aug 2, 2018

retest this please

@SparkQA commented Aug 2, 2018

Test build #94052 has finished for PR 21935 at commit fed8505.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

Merged to master.

@asfgit closed this in 7cf16a7 on Aug 3, 2018
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
In PRs apache#21984 and apache#21935, the related test cases use binary files created by Python scripts.

Generate the binary files in the test suite instead, to make it more transparent. Also move the related test cases to a new file `AvroLogicalTypeSuite.scala`.

Unit test.

Closes apache#22091 from gengliangwang/logicalType_suite.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

RB=2651977
BUG=LIHADOOP-59243
G=spark-reviewers
R=ekrogen
A=ekrogen

7 participants