
Conversation

@YannByron (Contributor) commented Oct 19, 2021:

What is the purpose of the pull request

Currently we can specify different values for the URL_ENCODE_PARTITIONING and HIVE_STYLE_PARTITIONING_ENABLE configs across multiple write operations without any warning or error. As a result, some partition paths end up hive-style and some do not, or some are url-encoded and some are not, which is inconsistent.

So I want to persist these configs to hoodie.properties on the first write (the first time data is written via a Spark DataFrame, or when a table is created via Spark SQL). After that, users no longer need to specify these configs. If they are specified with values that differ from the existing values in hoodie.properties, an exception is raised.

This is also useful for resolving some of the keyGenerator discrepancies between the DataFrame writer and Spark SQL.
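
Roughly, the intended first-write / validate-on-later-writes behavior could look like the sketch below. This is not the PR's actual code: the object and method names are hypothetical, the option keys are the datasource option names these configs map to as best I recall, and the real change works through hoodie.properties / table config rather than a bare java.util.Properties file.

```scala
import java.io.{File, FileInputStream, FileOutputStream}
import java.util.Properties

object PartitionConfigGuard {
  // Option keys for the two configs discussed above (from memory, not from this PR's diff).
  val persistedKeys: Seq[String] = Seq(
    "hoodie.datasource.write.hive_style_partitioning",
    "hoodie.datasource.write.partitionpath.urlencode")

  def writeOrValidate(propsFile: File, writeOptions: Map[String, String]): Unit = {
    if (!propsFile.exists()) {
      // First write: persist the effective values so later writers inherit them.
      val props = new Properties()
      persistedKeys.foreach(k => writeOptions.get(k).foreach(v => props.setProperty(k, v)))
      val out = new FileOutputStream(propsFile)
      try props.store(out, "persisted table configs") finally out.close()
    } else {
      // Later writes: any explicitly supplied value must match the persisted one.
      val props = new Properties()
      val in = new FileInputStream(propsFile)
      try props.load(in) finally in.close()
      persistedKeys.foreach { k =>
        (writeOptions.get(k), Option(props.getProperty(k))) match {
          case (Some(supplied), Some(persisted)) if supplied != persisted =>
            throw new IllegalArgumentException(
              s"Config $k=$supplied conflicts with persisted value $persisted in hoodie.properties")
          case _ => // not supplied, or consistent with hoodie.properties
        }
      }
    }
  }
}
```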

Brief change log

  • Persist URL_ENCODE_PARTITIONING and HIVE_STYLE_PARTITIONING_ENABLE to hoodie.properties on the first write (first DataFrame write or Spark SQL create table).
  • Validate later writes against the persisted values and raise an exception on conflict.


Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@hudi-bot (Collaborator) commented Oct 19, 2021:

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run travis: re-run the last Travis build
  • @hudi-bot run azure: re-run the last Azure build

@YannByron (Author) commented:

@leesf @vinothchandar can you help review this?

@YannByron force-pushed the master_persist_configs branch from 5ce2404 to c491ffd on October 26, 2021 12:04
@leesf (Contributor) commented Oct 26, 2021:

What is the difference made by case "org.apache.hudi.keygen.ComplexAvroKeyGenerator": return "org.apache.hudi.keygen.ComplexKeyGenerator";? I don't get the point here.

@YannByron (Author) replied:

KeyGenerators whose names don't include 'Avro' extend SparkKeyGeneratorInterface, which provides getRecordKey and getPartitionPath methods that work on a Row. I want to make sure the KeyGenerator instance works within Spark via convertToSparkKeyGenerator.
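
For illustration, the kind of class-name substitution being discussed might look like the sketch below. Only the ComplexAvroKeyGenerator/ComplexKeyGenerator pair is quoted from the comment above; the Simple* pair and the helper itself are assumptions, and the real logic lives in the PR's convertToSparkKeyGenerator.

```scala
object KeyGenMappingSketch {
  // Minimal sketch of the substitution discussed above, not the PR's code.
  def toSparkKeyGeneratorClass(className: String): String = className match {
    case "org.apache.hudi.keygen.ComplexAvroKeyGenerator" =>
      "org.apache.hudi.keygen.ComplexKeyGenerator" // Spark-aware counterpart
    case "org.apache.hudi.keygen.SimpleAvroKeyGenerator" => // assumed analogous pair
      "org.apache.hudi.keygen.SimpleKeyGenerator"
    case other => other // already Spark-capable (implements SparkKeyGeneratorInterface)
  }
}
```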

Contributor:

Why is the change needed here?

@YannByron (Author) replied:

I refactored the createKeyGenerator method; it no longer throws an exception with this message directly, but the message "Property hoodie.datasource.write.recordkey.field not found" still appears in the exception's stack trace.
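
A rough sketch of that behavior, using hypothetical helper names rather than the real createKeyGenerator code: the missing-property message is no longer the directly thrown exception's message, but it is preserved as the cause and therefore still shows up in the stack trace.

```scala
import java.util.Properties

object KeyGenExceptionSketch {
  // Hypothetical stand-ins for the refactored flow; not the PR's actual code.
  def requiredProperty(props: Properties, key: String): String =
    Option(props.getProperty(key)).getOrElse(
      throw new IllegalArgumentException(s"Property $key not found"))

  def createKeyGeneratorSketch(props: Properties): String =
    try {
      requiredProperty(props, "hoodie.datasource.write.recordkey.field")
    } catch {
      case e: IllegalArgumentException =>
        // The thrown exception has a different message, but the original
        // "Property ... not found" message survives as the cause in the stack trace.
        throw new RuntimeException("Failed to instantiate key generator", e)
    }
}
```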

Contributor:

ditto

Contributor:

Is this change not needed?

@YannByron (Author) replied:

It is needed. The parameters variable includes all of Hudi's default parameters and is used by the code that follows. The optParams variable is only used to validate whether there are conflicts between the user-supplied options and the TableConfig.
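
A small sketch of the split being described; the variable names follow the comment, while the default map and helper below are illustrative stand-ins, not Hudi's actual defaults or code.

```scala
object WriteParamsSketch {
  // Illustrative default only; not Hudi's real default set.
  val defaultParams: Map[String, String] = Map(
    "hoodie.datasource.write.hive_style_partitioning" -> "false")

  // `parameters` layers the user options over the defaults and drives the write path;
  // `optParams` keeps only what the user actually supplied, so it can be checked
  // against the persisted TableConfig for conflicts.
  def splitParams(optParams: Map[String, String]): (Map[String, String], Map[String, String]) = {
    val parameters = defaultParams ++ optParams
    (parameters, optParams)
  }
}
```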

@YannByron force-pushed the master_persist_configs branch 5 times, most recently from 791e312 to 8441914 on October 27, 2021 16:48
@YannByron (Author) commented:

@hudi-bot run azure

(1 similar comment)

Contributor:

Why is this change needed? TypedProperties should behave the same as Properties.

@YannByron (Author), Nov 1, 2021:

Because HoodieSparkUtils.getPartitionColumns uses Properties as its input parameter type.
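
For context, the shape of the call being described might look like the sketch below. TypedOptions and partitionColumnsFrom are stand-ins, not Hudi's real TypedProperties or HoodieSparkUtils.getPartitionColumns; the only point illustrated is that the helper is typed against plain java.util.Properties, so the caller hands it a Properties value.

```scala
import java.util.Properties

object PartitionColumnsSketch {
  // Stand-in for a typed options holder; not Hudi's TypedProperties.
  final case class TypedOptions(values: Map[String, String]) {
    def toProperties: Properties = {
      val props = new Properties()
      values.foreach { case (k, v) => props.setProperty(k, v) }
      props
    }
  }

  // Stand-in for a helper typed against plain Properties, like getPartitionColumns.
  def partitionColumnsFrom(props: Properties): Seq[String] =
    Option(props.getProperty("hoodie.datasource.write.partitionpath.field"))
      .map(_.split(",").map(_.trim).toSeq)
      .getOrElse(Seq.empty)

  // Usage: convert the typed options to Properties before the call.
  val cols: Seq[String] = partitionColumnsFrom(
    TypedOptions(Map("hoodie.datasource.write.partitionpath.field" -> "dt,hh")).toProperties)
}
```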

@YannByron force-pushed the master_persist_configs branch 2 times, most recently from 6c8875d to a9673c0 on November 1, 2021 14:52
@YannByron force-pushed the master_persist_configs branch from a9673c0 to 38d68b3 on November 2, 2021 10:07
@leesf merged commit 6351e5f into apache:master on Nov 3, 2021