
Conversation

@cdmikechen
Contributor

@cdmikechen cdmikechen commented Aug 3, 2021

Change Logs

This pull request lets Hive read timestamp-type columns correctly.
The problem was originally reported in JIRA HUDI-83 and in issue #2544.

  • Change HoodieParquetInputFormat to use a custom ParquetInputFormat named HudiAvroParquetInputFormat
  • In HudiAvroParquetInputFormat, use a custom RecordReader named HudiAvroParquetReader, which relies on AvroReadSupport so that Hive reads Parquet data as an Avro GenericRecord
  • Use org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.avroToArrayWritable to transform the GenericRecord into an ArrayWritable; timestamp/date handling for the differing behavior of Hive 2 and Hive 3 is also added to this method (see the sketch after this list)
  • Change the default value of hoodie.datasource.hive_sync.support_timestamp from false to true
  • Add a supportAvroRead flag to stay compatible with how some older Hudi versions adapted Hive 3 timestamp/date types
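
A rough sketch of the read path described above, for illustration only (it uses the standalone AvroParquetReader in place of the PR's HudiAvroParquetReader; only avroToArrayWritable is the actual Hudi utility named in the list):

  import org.apache.avro.generic.GenericRecord;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.ArrayWritable;
  import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils;
  import org.apache.parquet.avro.AvroParquetReader;
  import org.apache.parquet.hadoop.ParquetReader;

  public class AvroReadPathSketch {
    public static void main(String[] args) throws Exception {
      // AvroParquetReader is backed by AvroReadSupport, so each Parquet row
      // comes back as an Avro GenericRecord.
      try (ParquetReader<GenericRecord> reader =
               AvroParquetReader.<GenericRecord>builder(new Path(args[0])).build()) {
        GenericRecord record;
        while ((record = reader.read()) != null) {
          // The Hudi utility maps Avro values (including timestamp/date logical
          // types) onto the Hive writables that the query engine expects.
          ArrayWritable writable = (ArrayWritable)
              HoodieRealtimeRecordReaderUtils.avroToArrayWritable(record, record.getSchema());
          System.out.println(writable);
        }
      }
    }
  }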

Impact

  • hudi-hadoop-mr
  • spark

Risk level

low

Documentation Update

The Javadoc has been updated; the website documentation will follow in a separate PR.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@cdmikechen cdmikechen force-pushed the HUDI-83 branch 2 times, most recently from 84b1840 to e19068f on August 8, 2021 14:53
@cdmikechen
Contributor Author

@hudi-bot run azure

@cdmikechen cdmikechen marked this pull request as draft August 8, 2021 15:40
Member

@vinothchandar vinothchandar left a comment


I have a concern around performance overhead, and I'm also wondering if we can just do this as part of the existing input format behind a flag, instead of switching over entirely to a new input format. Thoughts?

@Override
public ArrayWritable getCurrentValue() throws IOException, InterruptedException {
  GenericRecord record = parquetRecordReader.getCurrentValue();
  return (ArrayWritable) HoodieRealtimeRecordReaderUtils.avroToArrayWritable(record, record.getSchema());
}
Member


Will this extra Avro conversion cost performance? Wondering if we can avoid it.

Contributor Author

@cdmikechen cdmikechen Dec 29, 2021


@vinothchandar
I have been running this fork for several months, and so far it has not caused any notable problems. That may be because my Hive workload processes a relatively small amount of data, so the memory impact is limited.

Both Hudi's Spark path (org.apache.hudi.AvroConversionHelper) and Hive itself (org.apache.hadoop.hive.serde2.avro.AvroGenericRecordWritable) already wrap an Avro GenericRecord when parsing Avro data, so I don't think this data processing should cause serious overhead.

Meanwhile, I reworked part of the TimestampWritableV2 instantiation code to fix some of the original errors and problems.
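
For illustration, a minimal sketch of the kind of per-Hive-version conversion involved (an assumption based on this discussion, not the PR's actual code; the micros-to-millis handling is simplified):

  import org.apache.hadoop.hive.serde2.io.TimestampWritable;
  import org.apache.hadoop.hive.serde2.io.TimestampWritableV2;

  public class TimestampWritableSketch {
    public static void main(String[] args) {
      long micros = 1640736000000000L;  // example Parquet/Avro timestamp-micros value
      long millis = micros / 1000;
      // Hive 3.x: TimestampWritableV2 wraps org.apache.hadoop.hive.common.type.Timestamp
      TimestampWritableV2 hive3Value = new TimestampWritableV2(
          org.apache.hadoop.hive.common.type.Timestamp.ofEpochMilli(millis));
      // Hive 2.x: TimestampWritable wraps java.sql.Timestamp
      TimestampWritable hive2Value = new TimestampWritable(new java.sql.Timestamp(millis));
      System.out.println(hive3Value + " / " + hive2Value);
    }
  }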

@ghost

ghost commented Sep 2, 2021

Hey,
Not sure if I'm "allowed" to chime in here, but I don't know where else to post this.

I was trying to build this locally and it compiled fine.

However, when using it in my Spark 2.4.4 environment, I started hitting a NoSuchMethodError for getUseVectorizedInputFileFormat(). After investigating, this turned out to be because Spark 2.4.4 is built with hive-exec:1.2.1.spark2, and getUseVectorizedInputFileFormat() does not exist in Utilities in that version.
I fixed this by manually implementing the method within HoodieParquetInputFormat.java, and I was then able to test this out in a proper Spark 2.4.4 environment.

I'm not sure whether manually implementing this method is the right way to go, but I thought I'd share it here in case it helps; a rough sketch of what I mean is below.
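
A minimal sketch of that local fallback, assuming the vectorization config key and a simplified check (the real Hive 2/3 Utilities method does more than this):

  // Hypothetical fallback for a hive-exec (e.g. 1.2.1.spark2) whose Utilities class
  // lacks getUseVectorizedInputFileFormat(); uses org.apache.hadoop.conf.Configuration.
  private static boolean getUseVectorizedInputFileFormat(Configuration conf) {
    // Key name assumed to match HiveConf's vectorized-input-file-format flag.
    return conf.getBoolean("hive.vectorized.use.vectorized.input.format", true);
  }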

Let me know!

@vinothchandar vinothchandar added the area:schema (Schema evolution and data types) label on Sep 7, 2021
@cdmikechen cdmikechen marked this pull request as ready for review December 29, 2021 04:06
@cdmikechen
Contributor Author

I have a concern around performance overhead, and I'm also wondering if we can just do this as part of the existing input format behind a flag, instead of switching over entirely to a new input format. Thoughts?

This is for compatibility with com.twitter:parquet-hadoop-bundle, which is used for ParquetInputFormat in Spark 2. It only contains a parameterless constructor, while Hive 2 and Hive 3 add a constructor that takes a ParquetInputFormat:

https://github.com/apache/hive/blob/8e7f23f34b2ce7328c9d571a13c336f0c8cdecb6/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java#L48-L55

  public MapredParquetInputFormat() {
    this(new ParquetInputFormat<ArrayWritable>(DataWritableReadSupport.class));
  }

  protected MapredParquetInputFormat(final ParquetInputFormat<ArrayWritable> inputFormat) {
    this.realInput = inputFormat;
    vectorizedSelf = new VectorizedParquetInputFormat();
  }

Alternatively, we could consider refactoring it directly into:

  public HoodieParquetInputFormat() {
    super(new HudiAvroParquetInputFormat());
  }
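
Spelled out as a class, that refactoring might look like the sketch below (illustrative only; it relies on the protected Hive 2/3 constructor quoted above, which is exactly what Spark 2's bundle is missing):

  // Sketch only: delegate to the custom ParquetInputFormat through the protected
  // constructor that Hive 2/3's MapredParquetInputFormat exposes.
  public class HoodieParquetInputFormat
      extends org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat {
    public HoodieParquetInputFormat() {
      super(new HudiAvroParquetInputFormat());
    }
  }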

@lucasmo

lucasmo commented Jan 20, 2022

I was directed to add a comment here from hudi slack.

Our team is experimenting with MOR tables. Our write ecosystem is AWS Glue, and our query ecosystem for this use case is AWS Athena. Our writes work fine. However, when querying with Athena, we get the following error:

GENERIC_INTERNAL_ERROR: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable

This error occurs if we choose to select any timestamp field. Selecting only non-timestamp fields works correctly. We searched and found no working resolution.

Our table looks like this:

CREATE EXTERNAL TABLE `foo_table_mor`(
  `foo_id` bigint, 
  `foo_timestamp` timestamp, 
  `_hoodie_commit_time` string, 
  `_hoodie_commit_seqno` string, 
  `_hoodie_record_key` string, 
  `_hoodie_partition_path` string, 
  `_hoodie_file_name` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://foo-bucket/foo-prefix/'

@cdmikechen
Contributor Author

@lucasmo
You can try this PR, but it looks like there are some conflicts after my latest push. I will resolve them later.

@XuQianJin-Stars
Contributor

Hi @cdmikechen, can you rebase this PR?

@cdmikechen cdmikechen force-pushed the HUDI-83 branch 2 times, most recently from 9361bbe to f9b524a on May 4, 2022 04:14
@cdmikechen cdmikechen changed the title from "[HUDI-83] Fix Timestamp type read by Hive" to "[HUDI-83] Fix Timestamp/Date type read by Hive3" on May 10, 2022
@cdmikechen
Contributor Author

cdmikechen commented May 10, 2022

@XuQianJin-Stars
I have a question: should a Hive 3 pipeline task be added to deal with compatibility problems between Hive 2 and Hive 3 (and future Hive 4 support)?

@cdmikechen cdmikechen force-pushed the HUDI-83 branch 5 times, most recently from e41458b to 689164e on May 13, 2022 12:30
@cdmikechen
Contributor Author

@hudi-bot run azure

@cdmikechen
Contributor Author

@XuQianJin-Stars
Hi~ Please review the code when you have time.
There were some unexplained errors in recent CI runs, but after several rebases/merges the current CI has passed.

.defaultValue("false")
.defaultValue("true")
.withDocumentation("‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. "
+ "Disabled by default for backward compatibility.");
Contributor


Does the withDocumentation content need to be changed as well?

Contributor Author


@XuQianJin-Stars
Does this wording explain it clearly enough?

'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. Previous versions keep being disabled by default.

Contributor


@XuQianJin-Stars Does this wording explain it clearly enough?

'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. Previous versions keep being disabled by default.

Should we write version 0.12.0 in the deprecatedAfter method and change the withDocumentation content?

Contributor Author

@cdmikechen cdmikechen May 15, 2022


@XuQianJin-Stars Does this wording explain it clearly enough?
'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. Previous versions keep being disabled by default.

Should we write version 0.12.0 in the deprecatedAfter method and change the withDocumentation content?

@XuQianJin-Stars
Is this right?

  public static final ConfigProperty<String> HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty
      .key("hoodie.datasource.hive_sync.support_timestamp")
      .defaultValue("true")
      .deprecatedAfter("0.12.0")
      .withDocumentation("'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. "
          + "From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. "
          + "Previous versions keep being disabled by default.");

If there's no problem, I'll change all the other descriptions.

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@danny0405
Contributor

The PR seems good to land; can you resolve the conflicts?


import static org.apache.parquet.hadoop.ParquetInputFormat.getFilter;

public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
Contributor


@cdmikechen can you please help me understand why we need a custom ParquetReader?

@bschell
Contributor

bschell commented Feb 18, 2023

Hi, I am testing this fix backported to emr-6.9.0 with the reproduction steps linked by @lucasmo above, but I think something is not working right.

CREATE EXTERNAL TABLE `hudi_test`(
    `_hoodie_commit_time` string COMMENT '', 
    `_hoodie_commit_seqno` string COMMENT '', 
    `_hoodie_record_key` string COMMENT '', 
    `_hoodie_partition_path` string COMMENT '', 
    `_hoodie_file_name` string COMMENT '', 
    `id` string COMMENT '', 
    `tstamp` timestamp COMMENT '') 
ROW FORMAT SERDE 
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
    'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://{s3 path}'

I am still getting the same error:

java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritableV2

@veenaypatil
Contributor

We are still seeing this issue with version 0.12.2 for timestamp columns. Is there a workaround today? Setting hoodie.datasource.hive_sync.support_timestamp=true does not work either.
cc @codope @vinothchandar

@danny0405
Contributor

Should be fixed in #7173.

@cdmikechen cdmikechen closed this Jun 6, 2023
@cdmikechen
Contributor Author

@danny0405 @xicm
I'm sorry that I didn't make it to the end, but I'm glad to see that the problem was finally fixed.
Thank you all for your efforts and assistance~

@xicm
Contributor

xicm commented Jun 6, 2023

Thanks @cdmikechen , you did most of the work.

@splate

splate commented Jun 28, 2023

For the rest of us, can you clarify which jars are impacted (hudi-spark-bundle*?) and follow up on when this fix will be released in an official version? Sorry if this is a dumb question, but I am hitting this issue and this thread does not tell me how to solve my problem.

@danny0405
Contributor

It should be the hudi-hadoop-mr jar, which is used by Hive.

@splate

splate commented Jun 29, 2023

Would this bug also exist in the Spark Hudi libraries used in AWS Glue? I am trying to use Spark SQL to query a Hudi table into a Spark DataFrame, and I am getting a casting exception ("java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable"). Would this be related?

@danny0405
Contributor

It should be, but I think it is more related to how the timestamp type is synced: #8867


Labels

area:schema (Schema evolution and data types), priority:high (Significant impact; potential bugs)

Projects

Archived in project
