Skip to content

Conversation

@xicm
Copy link
Contributor

@xicm xicm commented Nov 10, 2022

Change Logs

Compilation of HiveAvroSerializer fails with hive3.

Impact

none

Risk level (write none, low medium or high below)

low

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@xicm
Copy link
Contributor Author

xicm commented Nov 10, 2022

Hi @xiarixiaoyao , please review if you are available.
I tested with hive3 on my pc.

@xiarixiaoyao xiarixiaoyao self-assigned this Nov 11, 2022
@xiarixiaoyao
Copy link
Contributor

@hudi-bot run azure

@xiarixiaoyao
Copy link
Contributor

@xicm
Thank you for your contribution
Looks good, let me test it for presto

@cdmikechen
Copy link
Contributor

@xiarixiaoyao @xicm
There seems to be a PR at the moment that may overlap with this issue, should we deal with this issue in a unified way?
#3391

return new Utf8(string);
case DATE:
return DateWritable.dateToDays(((DateObjectInspector)fieldOI).getPrimitiveJavaObject(structFieldData));
return new DateWritable((DateWritable)structFieldData).getDays();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hive3 will return DateWritableV2 for date type, we cannot convert DateWritableV2 to DateWritable direclty.

Copy link
Contributor Author

@xicm xicm Nov 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh, I compile with hive 3.1.2 and run TestHiveAvroSerializer, the test is successful. Let me do more tests.

Timestamp timestamp =
((TimestampObjectInspector) fieldOI).getPrimitiveJavaObject(structFieldData);
return timestamp.getTime();
return new TimestampWritable((TimestampWritable) structFieldData).getTimestamp().getTime();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@xiarixiaoyao
Copy link
Contributor

@xiarixiaoyao @xicm There seems to be a PR at the moment that may overlap with this issue, should we deal with this issue in a unified way? #3391

yes, it will be better to deal with hive2/hive3 will a unifed way

@xiarixiaoyao xiarixiaoyao added the priority:high Significant impact; potential bugs label Nov 13, 2022
@xicm xicm changed the title [HUDI-5189] Make HiveAvroSerializer compatible with hive3 [WIP][HUDI-5189] Make HiveAvroSerializer compatible with hive3 Nov 14, 2022
@xicm
Copy link
Contributor Author

xicm commented Nov 14, 2022

@xiarixiaoyao @cdmikechen
#3391 should be resolved, if we want to support hive3.
@cdmikechen Could you fix the conflict in #3391?

@xicm
Copy link
Contributor Author

xicm commented Nov 16, 2022

Hi @cdmikechen , do you mind if I copy the code #3391 to this pr?

@xicm
Copy link
Contributor Author

xicm commented Nov 16, 2022

Hi @xiarixiaoyao, I merged #3391 and tested with the case you described in #6741.
a tricky result,
select * from tx_null -> The result is right.
select col7 from tx_null -> returns 5 records but the values are NULL.
I have checked the code but find nothing, , do you have any advice to solve this.
I will push the code later.

@cdmikechen
Copy link
Contributor

@xiarixiaoyao @xicm
I've had a lot of workday stuff going on lately and haven't rebase the code for a while.
But I wonder if we could find a time for us to have an online chat(like wechat, zoom or slack at weekend), in which case I think we should be able to work this out well.

@xicm
Copy link
Contributor Author

xicm commented Nov 16, 2022

@xiarixiaoyao @xicm I've had a lot of workday stuff going on lately and haven't rebase the code for a while. But I wonder if we could find a time for us to have an online chat(like wechat, zoom or slack at weekend), in which case I think we should be able to work this out well.

I have merged your code , but there is still a problem with Timestamp, I'm debugging.

@xiarixiaoyao
Copy link
Contributor

@xiarixiaoyao @xicm I've had a lot of workday stuff going on lately and haven't rebase the code for a while. But I wonder if we could find a time for us to have an online chat(like wechat, zoom or slack at weekend), in which case I think we should be able to work this out well.

my wechat 1037817390

@xicm xicm changed the title [WIP][HUDI-5189] Make HiveAvroSerializer compatible with hive3 [HUDI-5189] Make HiveAvroSerializer compatible with hive3 Nov 25, 2022
@xicm xicm requested a review from xiarixiaoyao December 2, 2022 07:47
@xiarixiaoyao
Copy link
Contributor

@cdmikechen could you pls help review this pr, thanks

@nsivabalan nsivabalan added the release-0.12.2 Patches targetted for 0.12.2 label Dec 6, 2022
@codope codope added engine:hive Hive integration and removed release-0.12.2 Patches targetted for 0.12.2 labels Dec 7, 2022
@xicm
Copy link
Contributor Author

xicm commented Dec 16, 2022

@hudi-bot run azure

@danny0405
Copy link
Contributor

@xiarixiaoyao Can we push-forward this PR again, seems s useful fix.

* @param realReader Parquet Reader
*/
public RealtimeUnmergedRecordReader(RealtimeSplit split, JobConf job,
RecordReader<NullWritable, ArrayWritable> realReader) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not change the original format, thanks

.withLogFilePaths(split.getDeltaLogPaths())
.withReaderSchema(getReaderSchema())
.withLatestInstantTime(split.getMaxCommitTime())
.withReadBlocksLazily(Boolean.parseBoolean(this.jobConf.get(HoodieRealtimeConfig.COMPACTION_LAZY_BLOCK_READ_ENABLED_PROP, HoodieRealtimeConfig.DEFAULT_COMPACTION_LAZY_BLOCK_READ_ENABLED)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

return DateWritable.dateToDays(((DateObjectInspector)fieldOI).getPrimitiveJavaObject(structFieldData));
try {
Class<?> clazz = structFieldData.getClass();
return clazz.getMethod("getDays").invoke(structFieldData);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has a significant impact on performance,
Let's avoid using reflection every iteration。
Let's do the same thing as HiveUtils

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@xicm xicm requested a review from xiarixiaoyao March 31, 2023 07:11
public static final String DATE_WRITEABLE_V2_CLASS = "org.apache.hadoop.hive.serde2.io.DateWritableV2";
public static final boolean SUPPORT_DATE_WRITEABLE_V2;
private static final Constructor DATE_WRITEABLE_V2_CONSTRUCTOR;
public static String DATE_WRITEABLE_V2_CLASS = "org.apache.hadoop.hive.serde2.io.DateWritableV2";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final ?

@xiarixiaoyao
Copy link
Contributor

LGTM

@xiarixiaoyao
Copy link
Contributor

@XuQianJin-Stars could you pls help review again

@danny0405 danny0405 added issue:version-compatibility Version compatibility issues release-0.14.0 priority:blocker Production down; release blocker labels Apr 24, 2023
@danny0405
Copy link
Contributor

@xicm Thanks for the contribution, can we squash the commits into one, it is hard for code reviewing because of the merge cmd, let's use the git rebase instead of git merge, the git merge would generate a new log on the git history.

@xicm
Copy link
Contributor Author

xicm commented Apr 28, 2023

@hudi-bot run azure

@codope codope removed the priority:blocker Production down; release blocker label May 1, 2023
@danny0405
Copy link
Contributor

@xicm Do you have plan to push forward this PR again?

@xicm
Copy link
Contributor Author

xicm commented May 25, 2023

@xicm Do you have plan to push forward this PR again?

Year, I resolved the conflict, review again when you are available.

@danny0405
Copy link
Contributor

Thanks for the contribution, overall looks good now, I have reviewed and created a patch:
5189.patch.zip

@danny0405 danny0405 self-assigned this May 26, 2023
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan
Copy link
Contributor

Awesome job on landing this. I guess this has been one of the longest pending gaps we had w/ hudi. timestamp col not readable w/ hive3. thank you folks.

if (Arrays.stream(constructors)
.anyMatch(c -> c.getParameterCount() > 0 && c.getParameterTypes()[0]
.getName().equals(ParquetInputFormat.class.getName()))) {
supportAvroRead = true;
Copy link
Contributor

@Zouxxyy Zouxxyy Jun 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when will supportAvroRead be false? I add a pr for it~

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Zouxxyy You can read the comments. This is to address compatibility issues with spark

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Zouxxyy You can read the comments. This is to address compatibility issues with spark

yeal, but it always be true now

Copy link
Contributor

@cdmikechen cdmikechen Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Zouxxyy
In my impression, hive2 and spark3 have the same processing method and constructor, but there is a difference in hive3.
You can try switching hive to version 3 for verification

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should revisit this method, the name ParquetInputFormat.class.getName() in hive2 is the same as hive3.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for compatibility with spark,do you mean hive on spark? @cdmikechen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

engine:hive Hive integration issue:version-compatibility Version compatibility issues priority:high Significant impact; potential bugs release-0.14.0

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

8 participants