# [HUDI-4142] [RFC-54] New Table APIs and streamline Hudi configs #5667
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# RFC-54: New Table APIs and Streamline Hudi Configs

## Proposers

- @codope

## Approvers

- @xushiyan
- @vinothchandar

## Status

JIRA: [HUDI-4141](https://issues.apache.org/jira/browse/HUDI-4141)
## Abstract

Users configure jobs to write Hudi tables and control the behaviour of their
jobs at different levels, such as table, write client, datasource, and record
payload. On one hand, this flexibility is a true strength of Hudi: it makes
Hudi suitable for many use cases and offers users a way to navigate the
tradeoffs encountered in data systems. On the other hand, it has made the
learning curve steeper for new users. In this RFC, we propose to streamline
some of these configurations. Additionally, we propose a few table-level APIs
to create or update Hudi tables programmatically. Together, they would provide
a smoother onboarding experience and increase the usability of Hudi. They
would also help existing users through better configuration maintenance.
## Background

Currently, users can create and update Hudi tables in three different
ways: [Spark datasource](https://hudi.apache.org/docs/writing_data),
[SQL](https://hudi.apache.org/docs/table_management),
and [DeltaStreamer](https://hudi.apache.org/docs/hoodie_deltastreamer).

> **Contributor:** But there is no DeltaStreamer anymore. It was renamed to
> just Streamer: https://hudi.apache.org/docs/hoodie_streaming_ingestion

Each of these paths is set up using a bunch of
[configurations](https://hudi.apache.org/docs/configurations), which have
grown over the years as new features have been added. Imagine yourself as a
data engineer who has been using Spark to write parquet tables. You want to
try out Hudi and land on the
[quickstart](https://hudi.apache.org/docs/quick-start-guide) page. You see a
bunch of configurations (precombine field, record key, partition path) that
must be set, and wonder why you can't just
do `spark.write.format("hudi").save()`. Apart from configurations, there is no
first-class support for table management APIs, such as creating or dropping a
table. The implementation section below presents proposals to fill such gaps.
## Implementation

The implementation can be split into two independent changes: streamlining
configuration and new table APIs.
### Streamline Configuration

#### Minimal set of quickstart configurations

* Users should be able to simply write a Hudi table
  using `spark.write.format("hudi")`. If no record key or precombine field is
  provided, then assume append-only writes and avoid index lookup and merging.
* Hudi should infer the partition field if users provide it
  as `spark.write.format("hudi").partitionBy(field)`.
* Users need not pass all the configurations in each write operation. Once the
  table has been created, most table configs do not change; e.g. the table
  name currently needs to be passed in every write, even though it is only
  needed the first time. Hudi should fetch options from table configs when
  they are not provided by the user. A minimal write under this proposal is
  sketched below.
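To make the first two bullets concrete, here is a minimal sketch of what the
proposed quickstart write could look like. The paths, table name, and column
name are hypothetical, and the append-only behaviour when no record key or
precombine field is set is this RFC's proposal, not current behaviour:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HudiQuickstartSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-quickstart")
        .master("local[*]")
        .getOrCreate();

    // any existing dataset; the source path is hypothetical
    Dataset<Row> df = spark.read().parquet("/tmp/source_data");

    // Proposed minimal write: no record key or precombine field is set, so
    // Hudi would assume append-only and skip index lookup and merging. The
    // partition field is inferred from partitionBy() rather than a
    // Hudi-specific option.
    df.write()
        .format("hudi")
        .option("hoodie.table.name", "quickstart_table") // under the proposal, needed only on the first write
        .partitionBy("dateCol")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/quickstart_table");
  }
}
```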
#### Better defaults

* Default values for configurations should be optimized for the simple bulk
  load scenario; e.g. with the default NONE sort mode, a bulk insert is as
  good as a parquet write, with some additional work for meta columns.
* Make reasonable assumptions, such as not relying on any external system
  (e.g. HBase) by default. As another example, enable schema reconciliation
  by default instead of failing writes (see the sketch below).
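As an illustration of the bullets above, the following sketch spells out the
proposed defaults using existing config keys; under this RFC a user would get
this behaviour without setting anything. It reuses `df` from the quickstart
sketch above:

```java
import java.util.HashMap;
import java.util.Map;

// Proposed defaults made explicit for illustration only; the keys exist
// today, but making these values the defaults is this RFC's proposal.
Map<String, String> proposedDefaults = new HashMap<>();
proposedDefaults.put("hoodie.bulkinsert.sort.mode", "NONE");              // bulk insert ~ parquet write plus meta columns
proposedDefaults.put("hoodie.index.type", "BLOOM");                       // in-process index; no external system such as HBase
proposedDefaults.put("hoodie.datasource.write.reconcile.schema", "true"); // reconcile schema instead of failing the write

df.write().format("hudi").options(proposedDefaults).save("/tmp/hudi/bulk_table");
```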
#### Consistency across write paths

* Keep configs for Spark SQL, Spark DataSource, and HoodieDeltaStreamer in
  sync as much as possible. Document exceptions; e.g. the default key
  generator for SQL is ComplexKeyGenerator, while for the datasource it is
  SimpleKeyGenerator. The sketch below illustrates this divergence.
* Rename/reuse existing datasource keys that are meant for the same purpose.
* In all these changes, we should maintain backward compatibility.
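A sketch of the divergence called out in the first bullet, using existing
config keys and tblproperties; `spark` and `df` are as in the quickstart
sketch, and the table name and columns are hypothetical:

```java
// Spark SQL path: key settings expressed as tblproperties; the default key
// generator here is ComplexKeyGenerator.
spark.sql(
    "CREATE TABLE hudi_tbl (id INT, name STRING, ts BIGINT) USING hudi "
        + "TBLPROPERTIES (primaryKey = 'id', preCombineField = 'ts')");

// Datasource path: the same intent expressed through different keys; the
// default key generator here is SimpleKeyGenerator.
df.write()
    .format("hudi")
    .option("hoodie.table.name", "hudi_tbl")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .save("/tmp/hudi/hudi_tbl");
```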
#### Refactor Meta Sync ([RFC-55](/rfc/rfc-55/rfc-55.md))

* Reduce the number of configs needed for Hive sync; e.g. the table name, once
  provided at the time of the first write, can be reused for the Hive sync
  table name config as well (see the sketch below).
* Refactor the class hierarchy and APIs.
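To make the duplication concrete, here is a sketch of today's Hive sync
options (all keys shown exist today; the table and database names are
hypothetical). Under the proposal, the sync table name would be derived
from `hoodie.table.name` instead of being repeated:

```java
df.write()
    .format("hudi")
    .option("hoodie.table.name", "orders")                       // provided at the first write
    .option("hoodie.datasource.hive_sync.enable", "true")
    .option("hoodie.datasource.hive_sync.database", "analytics")
    .option("hoodie.datasource.hive_sync.table", "orders")       // redundant under the proposal
    .save("/tmp/hudi/orders");
```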
#### Support `HoodieConfig` API

* Users should be able to use the config builders instead of specifying config
  keys, e.g.
  `spark.write.format("hudi").options(HoodieClusteringConfig.Builder().withXYZ().build())`.
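A sketch of how this could look. `HoodieClusteringConfig.newBuilder()` and
the `with...` setters exist in Hudi today, but passing a built config directly
into `options(...)` is what this RFC proposes; the map-based bridge below is
therefore an assumption about the interim wiring:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hudi.config.HoodieClusteringConfig;

HoodieClusteringConfig clusteringConfig = HoodieClusteringConfig.newBuilder()
    .withInlineClustering(true)         // cluster inline as part of writes
    .withInlineClusteringNumCommits(4)  // trigger clustering every 4 commits
    .build();

// Hypothetical bridge until options(HoodieConfig) exists: copy the built
// properties into the writer's options map.
Map<String, String> opts = new HashMap<>();
clusteringConfig.getProps().forEach((k, v) -> opts.put(k.toString(), v.toString()));

df.write().format("hudi").options(opts).save("/tmp/hudi/clustered_table");
```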
### Table APIs

> **Comment:** This is a great idea! Many people used to other frameworks
> (like Delta Lake) would onboard easily. As a user, I just have one concern:
> are you planning on creating SDKs for other languages supported by Spark,
> especially Python? Asking because at my company we use Hudi successfully
> with PySpark (even though the Hudi project doesn't have a single line of
> Python) because of the way it works through configuration. I believe many
> other users have successfully used Hudi with PySpark for the same reason, so
> I would think about that and maybe add that support to the roadmap.
>
> **Member:** This is a good call out. +1 for a Python client.
>
> **Comment:** Happy you liked it, @vinothchandar. Would love to help if you
> guys need it. I would just need some initial direction :)
These APIs are meant for programmatically interacting with Hudi tables. Users
should be able to create or update tables using static methods.

| Method Name | Description |
| ----------- | ----------- |
| bootstrap | Create a Hudi table from a given table in parquet or another supported format |
| create | Create a Hudi table with the given configs if it does not exist. Returns an instance of `HudiTable` for the newly created or an existing Hudi table. |
| update | Update rows in a Hudi table that match the given condition with the given update expression |
| drop | Drop the given Hudi table completely |
| truncate | Delete data from the given Hudi table without dropping the table itself |
| restoreTo | Restore a Hudi table to the given older commit time or a logical time |

> **Contributor:** For `update`, will we make use of Spark SQL expressions, or
> build a Hudi expression and transform the Spark SQL expression into a Hudi
> expression?
>
> **Author:** Spark SQL expression. Please check my comment below:
> #5667 (comment)

> **Contributor:** Is this going to be the preferred API rather than Spark
> datasource V2 APIs like writeTo(), toTable(), etc.? Will this support Spark
> structured streaming read/write?
Let's look at some examples:

```java
// create a Hudi table
HudiTable hudiTable = HudiTable.create(HoodieTableConfig.newBuilder()
    .withTableName("tableName")
    .withBasePath("basePath")
    .withTableType(TableType.MERGE_ON_READ)
    .build());

// update the Hudi table: add 1 to colA for all records of the current year (2022)
hudiTable.update(
    functions.col("dateCol").gt("2021-12-31"), // filter condition
    functions.col("colA").plus(1)              // update expression
);

// restore to a previous commit
hudiTable.restoreTo("0000000"); // previous commit time

// drop: deletes the whole data and the base path as well
hudiTable.drop();
```

> **Contributor:** What's the running engine behind these APIs? Do we have
> configurable/pluggable engine options?
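The example above covers create, update, restoreTo, and drop. For
completeness, here is a hypothetical sketch of the two remaining methods from
the table, bootstrap and truncate; the RFC does not pin down their signatures,
so the source-path parameter shown is an assumption:

```java
// bootstrap: create a Hudi table from an existing parquet table
HudiTable bootstrapped = HudiTable.bootstrap(
    "/data/warehouse/orders_parquet", // existing source table (hypothetical path and parameter)
    HoodieTableConfig.newBuilder()
        .withTableName("orders")
        .withBasePath("/data/hudi/orders")
        .build());

// truncate: delete all data but keep the table and its base path
bootstrapped.truncate();
```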
**Phase 1**

Spark will be the execution engine behind these APIs. We will use Spark SQL
functions for update expressions.

> **Contributor:** Are you planning to add merge into, insert, and upsert
> operations in future/next phases?

**Phase 2**

Support other engines such as Flink.
## Rollout/Adoption Plan

- What impact (if any) will there be on existing users?

  Minimal impact. The new APIs are intended to be used for new tables. Most of
  the configuration changes will be backward compatible.

- If we are changing behavior, how will we phase out the older behavior?

  Some behaviour changes will be handled during table upgrade. Any breaking
  changes will be called out in the release notes.

- If we need special migration tools, describe them here.

  For breaking changes that cannot be handled automatically, we will add
  commands to hudi-cli to support migration of existing tables to the newer
  version.

- When will we remove the existing behavior?

  Not required.
## Test Plan

Describe in a few sentences how the RFC will be tested. How will we know that
the implementation works as expected? How will we know nothing broke?

> **Comment:** How do we use the HudiTable in the write client?
>
> **Author:** HudiTable is an alternative to using the write client directly.
> It is not meant to be used within the write client.
>
> **Comment:** OK, sounds like HudiTable is a new concept and not meant to
> replace the current HoodieTable. Did I understand correctly?
>
> **Author:** Yes, you got it.
>
> **Comment:** I thought there is a naming convention in the community: the
> prefix "hudi" is for the project and its submodules, while "hoodie" is for
> classes. Maybe it is better not to break this rule and not use HudiTable as
> a class name?