[HUDI-83] Fix Timestamp/Date type read by Hive3 #3391
Conversation
Force-pushed from 84b1840 to e19068f
@hudi-bot run azure
vinothchandar left a comment
I have a concern around performance overhead, and I'm also wondering if we can just do this as part of the existing input format with a flag, instead of switching over entirely to a new input format. Thoughts?
@Override
public ArrayWritable getCurrentValue() throws IOException, InterruptedException {
  GenericRecord record = parquetRecordReader.getCurrentValue();
  return (ArrayWritable) HoodieRealtimeRecordReaderUtils.avroToArrayWritable(record, record.getSchema());
}
Will this extra Avro conversion cost performance? Wondering if we can avoid this.
@vinothchandar
I have been running this fork for several months, and so far it has not caused many additional problems. That may be because my Hive jobs process only a small amount of data, so the memory impact is limited.
At present, both Hudi Spark's Avro parsing in org.apache.hudi.AvroConversionHelper and Hive's own org.apache.hadoop.hive.serde2.avro.AvroGenericRecordWritable wrap an Avro GenericRecord, so normally I don't think this data processing should cause serious overhead.
Meanwhile, in the part that instantiates TimestampWritableV2, I reworked some code to fix several of the original errors and problems.
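For readers following along, here is a minimal, hypothetical sketch of what converting an Avro TIMESTAMP_MICROS value into a Hive 3 writable can look like. TimestampWritableV2 and Timestamp are real Hive 3 classes, but the wrapper class and method below are illustrative only and are not the PR's actual code.

import org.apache.hadoop.hive.common.type.Timestamp;
import org.apache.hadoop.hive.serde2.io.TimestampWritableV2;

public class TimestampConversionSketch {
  // Convert an Avro long with logical type TIMESTAMP_MICROS into Hive 3's TimestampWritableV2.
  // For brevity this truncates to millisecond precision; a complete version would carry the
  // remaining microseconds into the nanosecond field of the Hive Timestamp.
  public static TimestampWritableV2 microsToWritable(long epochMicros) {
    long epochMillis = epochMicros / 1000L;
    return new TimestampWritableV2(Timestamp.ofEpochMilli(epochMillis));
  }
}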
Hey, I was trying to build this locally and it compiled fine. However, when using it on my Spark setup I'm not sure if manually implementing this method is the right way to go, but I thought I'd share my thoughts here hoping it would help. Let me know!
// For compatibility
public MapredParquetInputFormat() {
  this(new ParquetInputFormat<ArrayWritable>(DataWritableReadSupport.class));
}

protected MapredParquetInputFormat(final ParquetInputFormat<ArrayWritable> inputFormat) {
  this.realInput = inputFormat;
  vectorizedSelf = new VectorizedParquetInputFormat();
}

Otherwise, we can actually consider refactoring directly into:

public HoodieParquetInputFormat() {
  super(new HudiAvroParquetInputFormat());
}
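As a side note, wiring an input format so that the underlying reader produces Avro GenericRecords takes very little code. The following is a hedged sketch (the class name is made up for illustration), essentially mirroring what parquet-avro's own AvroParquetInputFormat does.

import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetInputFormat;

// Illustrative only: a ParquetInputFormat whose ReadSupport yields Avro GenericRecords,
// so a wrapping RecordReader can convert each record to an ArrayWritable for Hive.
public class AvroBackedParquetInputFormat extends ParquetInputFormat<GenericRecord> {
  public AvroBackedParquetInputFormat() {
    super(AvroReadSupport.class);
  }
}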
I was directed to add a comment here from the Hudi Slack. Our team is experimenting with MOR tables. Our write ecosystem is AWS Glue and our query ecosystem for this use case is AWS Athena. Our writes are working fine. However, when querying with Athena, we get the following error:
This error occurs if we select any timestamp field; selecting only non-timestamp fields works correctly. We searched and found no working resolution. Our table looks like this:
@lucasmo |
Hi @cdmikechen, can you rebase this PR?
Force-pushed from 9361bbe to f9b524a
@XuQianJin-Stars |
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java (outdated, resolved)
Force-pushed from e41458b to 689164e
@hudi-bot run azure
@XuQianJin-Stars |
-    .defaultValue("false")
+    .defaultValue("true")
     .withDocumentation("'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. "
         + "Disabled by default for backward compatibility.");
Does withDocumentation's content need to change as well?
@XuQianJin-Stars
Does this wording explain the PR clearly?
'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. Previous versions keep being disabled by default.
Regarding that proposed wording: should we write version 0.12.0 in the deprecatedAfter method and change withDocumentation's content accordingly?
@XuQianJin-Stars
Is this right?
public static final ConfigProperty<String> HIVE_SUPPORT_TIMESTAMP_TYPE = ConfigProperty
    .key("hoodie.datasource.hive_sync.support_timestamp")
    .defaultValue("true")
    .deprecatedAfter("0.12.0")
    .withDocumentation("'INT64' with original type TIMESTAMP_MICROS is converted to hive 'timestamp' type. "
        + "From 0.12.0, 'timestamp' type will be supported and also can be disabled by this variable. "
        + "Previous versions keep being disabled by default.");

If there's no problem, I'll change all the other descriptions.
Force-pushed from 2cd8e80 to 4712497
Seems the PR is good to land; can you resolve the conflicts?
import static org.apache.parquet.hadoop.ParquetInputFormat.getFilter;
public class HoodieAvroParquetReader extends RecordReader<Void, ArrayWritable> {
@cdmikechen can you please help me understand why we need a custom ParquetReader?
Hi, I am testing this fix backported to emr-6.9.0 with the reproduction steps linked by @lucasmo above, but I think something is not working right; I am still getting the same error.
We are still seeing this issue with version 0.12.2 for the timestamp column. Is there a workaround today? Setting
Should be fixed in #7173.
@danny0405 @xicm |
Thanks @cdmikechen, you did most of the work.
For the rest of us, can you clarify which jars are impacted (like hudi-spark-bundle*?) and follow up on when this fix will be released in an official version? Sorry if this is a dumb question, but I have this issue and this thread does not tell me how to solve my problem.
It should be the hudi-hadoop-mr jar, which is used by Hive.
Would this bug also exist in the Spark Hudi libraries used in AWS Glue? My issue is that I am trying to use Spark SQL to query a Hudi table and put it into a Spark dataframe, and I am getting a casting exception ("java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable"). Would this be related?
It should be, but I think it is more related to how the timestamp type is synced: #8867
Change Logs
This pull request lets Hive read timestamp type columns correctly.
The problem was initially tracked in JIRA HUDI-83 and in the related issue #2544. The main changes are:
- Change HoodieParquetInputFormat to use a custom ParquetInputFormat named HudiAvroParquetInputFormat.
- In HudiAvroParquetInputFormat, use a custom RecordReader named HudiAvroParquetReader. In this class we use AvroReadSupport so that Hive can get parquet data as an Avro GenericRecord.
- Use org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.avroToArrayWritable to transform the GenericRecord into an ArrayWritable. At the same time, timestamp/date type handling for the different situations of hive2 and hive3 is added to this method.
- Change the hoodie.datasource.hive_sync.support_timestamp default value from false to true.
- Add a supportAvroRead value to be compatible with the adaptation of some old hudi versions for hive3 timestamp/date types.

Impact
Risk level
low
Documentation Update
The javadoc has been modified; the website documentation will be updated in another PR later.
Contributor's checklist