[RFC-69] Hudi 1.X #8679
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# RFC-69: Hudi 1.X

## Proposers

* Vinoth Chandar

## Approvers

* Hudi PMC

## Status

Under Review

## Abstract

This RFC proposes an exciting and powerful re-imagination of the transactional database layer in Hudi to power continued innovation across the community in the coming years. We have [grown](https://git-contributor.com/?chart=contributorOverTime&repo=apache/hudi) our contributor base more than 6x in the past few years, and this RFC serves as the perfect opportunity to clarify and align the community around a core vision. This RFC aims to serve as a starting point for this discussion, solicit feedback, embrace new ideas and collaboratively build consensus towards an impactful Hudi 1.X vision, and then distill what constitutes the first release - Hudi 1.0.

## **State of the Project**

As many of you know, Hudi was originally created at Uber in 2016 to solve [large-scale data ingestion](https://www.uber.com/blog/uber-big-data-platform/) and [incremental data processing](https://www.uber.com/blog/ubers-lakehouse-architecture/) problems and later [donated](https://www.uber.com/blog/apache-hudi/) to the ASF.
Since its graduation as a top-level Apache project in 2020, the community has made impressive progress toward the [streaming data lake vision](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform) to make data lakes more real-time and efficient with incremental processing on top of a robust set of platform components.
The most recent 0.13 release brought together several notable features to empower incremental data pipelines, including [_RFC-51 Change Data Capture_](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md), more advanced indexing techniques like [_consistent hash indexes_](https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md) and
novel innovations like [_early conflict detection_](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) - to name a few.

Today, Hudi [users](https://hudi.apache.org/powered-by) are able to solve end-to-end use cases using Hudi as a data lake platform that delivers a significant amount of automation on top of an interoperable open storage format.
Users can ingest incrementally from files/streaming systems/databases and insert/update/delete that data into Hudi tables, with a wide selection of performant indexes.
Thanks to core design choices like record-level metadata and incremental/CDC queries, users are able to consistently chain the ingested data into downstream pipelines, aided by the strong stream processing support built up in
recent years in frameworks like Apache Spark, Apache Flink and Kafka Connect. Hudi's table services automatically kick in across this ingested and derived data to manage different aspects of table bookkeeping, metadata and storage layout.
Finally, Hudi's broad support for different catalogs and wide integration across various query engines mean Hudi tables can also be "batch" processed old-school style or accessed from interactive query engines.

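To ground the write/read flow described above, here is a minimal PySpark sketch of an upsert followed by an incremental query. The table name, paths and instant time are hypothetical placeholders; the option keys are standard Hudi datasource options.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base_path = "s3://my-bucket/warehouse/trips"  # hypothetical table location

# Upsert a batch of records keyed by `uuid`, de-duplicated on `ts`.
hudi_write_opts = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}
updates_df = spark.read.parquet("s3://my-bucket/staging/trips_updates")  # hypothetical input
updates_df.write.format("hudi").options(**hudi_write_opts).mode("append").save(base_path)

# Incrementally read only the records committed after a given instant time.
incr_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20230601000000")  # placeholder instant
    .load(base_path)
)
incr_df.createOrReplaceTempView("trips_incr")  # chain into a downstream pipeline
```
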
> **Contributor:** you started to call "batch" as old school 👏
>
> **Author:** Well, ingest is completely incremental now - across industry. Once upon a time, it was unthinkable. :)
>
> **Contributor:** It might be worth focusing on bringing the cost of streaming ingestion/processing down, I think multi-table deltastreamer concepts or similar will be very important for wider adoption.
>
> **Author:** Agree. Efforts like the record index should bring that down dramatically, I feel, for say random-write workloads. This is definitely sth to measure, baseline and set plans for, but not sure how to pull a concrete project around this yet. Multi delta streamer needs more love for sure; will make it 1.1 for now though, since we need to front-load other things before.
>
> **Author:** #6612 should help reduce costs as well.

## **Future Opportunities**

We're adding new capabilities in the 0.x release line, but we can also turn the core of Hudi into a more general-purpose database experience for the lake. As the first kid on the lakehouse block (we called it "transactional data lakes" or "streaming data lakes"
to speak the warehouse users' and data engineers' languages, respectively), we made some conservative choices based on the ecosystem at that time. However, revisiting those choices is important to see if they still hold up.

> **Contributor:** I'm all for this, make it a really good experience to use Hudi via SQL so that's a more pleasant experience for DWH turned DE folks.
>
> **Author:** +1.

* **Deep Query Engine Integrations:** Back then, query engines like Presto, Spark, Trino and Hive were getting good at queries on columnar data files but were painfully hard to integrate into. Over time, we expected clear API abstractions
around indexing/metadata/table snapshots in the parquet/orc read paths that a project like Hudi could tap into to easily leverage innovations like Velox/PrestoDB. However, most engines preferred a separate integration - leading to Hudi maintaining its own Spark Datasource,
Presto and Trino connectors. On the flip side, this now opens up the opportunity to fully leverage Hudi's multi-modal indexing capabilities during query planning and execution.

> **Contributor:** Agree. The direction seems to be that query-side smartness is increasingly being pushed down to Hudi's layer, rightfully.

* **Generalized Data Model:** While Hudi supported keys, we focused on updating Hudi tables as if they were a key-value store, while SQL queries ran on top, blissfully unchanged and unaware. Back then, generalizing the support for
keys felt premature based on where the ecosystem was, which was still doing large batch M/R jobs. Today, more performant, advanced engines like Apache Spark and Apache Flink have mature, extensible SQL support that can support a generalized,
relational data model for Hudi tables (see the SQL sketch after the discussion below).

> **Contributor:** I love that you mention SQL support multiple times in the RFC - to bring equal experience to DWH and RDBMS so that SQL-heavy users can use Hudi without friction. Do we plan to make SQL support a priority in all areas (read, write, indexing, e.g.,
>
> **Member:** @yihua +1. All operations on Hudi could be expressed and executed with SQL, which reduces users' usage and learning costs and decouples Hudi from any single engine's SQL dialect, like Flink SQL etc.
>
> **Contributor:** I cannot agree more, it should be super easy to start using Hudi with SQL (and Python, which is by far the most popular language I see in Hudi Slack?) as a first-class citizen.
>
> **Member:** @kazdy, Iceberg has supported PyIceberg. IMO, we could support PyHudi to let ML developers access Hudi data.
>
> **Author:** Yes, the single-node Python use-case is good. I wonder if we just add support to popular frameworks like polars directly instead. Thoughts? @kazdy @SteNicholas ?
>
> **Contributor:** @vinothchandar I think there are upsides to having e.g. a Rust client with Arrow underneath and maintaining bindings for Python. @SteNicholas PyIceberg, if I'm not wrong, is written fully in Python and uses PyArrow underneath. That's another option, but it feels like hudi-rs + Python bindings will create less overhead to maintain overall. Btw, once we have Python bindings/client I will be happy to add Hudi support to AWS SDK for pandas; a good chunk of the Hudi Slack community seems to use Hudi on AWS, so it makes sense to do so.
>
> **Author:** I am more inclined towards this. Just have two native implementations - Java, Rust - and wrap others. I just need to make sure Rust to C++ works great as well. Did not have a stellar experience here with Go.
>
> **Contributor:** If there are some changes to be done to Hudi Spark integration, then maybe it's the right time to
>
> **Author:** Yes, even on Trino. Something to consider. When we were writing Hudi originally, there were no Spark DataSet APIs even, FWIW :)
>
> **Author:** Tracking here: https://issues.apache.org/jira/browse/HUDI-6488 - feel free to grab it!

> **Contributor:** I really like this point, but what do you have in mind? What needs to be changed to support the relational data model better? Make the precombine field optional or something else? How does this relate to the point "Beyond structured Data" below? How to marry these two together elegantly?
>
> **Author:** @kazdy I have some concrete thoughts here. Will write up in the next round. By and large, to generalize keys, we need to just stop special-casing record keys in the indexing layer and have the metadata layer build indexes for any column, for e.g. Some of these are already there. Another aspect is introducing the key constraint keywords similar to RDBMS-es (
>
> **Author:** I think we can study the experience offered by json/document stores more here, and see if we can borrow some approaches, e.g. Mongo. wdyt?
>
> **Member:** @vinothchandar, the definition of a data lake is unstructured-data oriented, therefore the data model of Hudi could be upgraded to support unstructured data as well.
>
> **Contributor:** @vinothchandar On the other hand, Mongo is super flexible but is/was? not that great at joins, so that is not aligned with the relational model. I myself sometimes keep JSON data as strings in Hudi tables, but that's mostly because I was scared of schema evolution and compatibility issues and I cannot be sure what I will get from the data producer. At the same time, I wanted to use some of Hudi's capabilities. So for semi-structured data, a solution can be to introduce super flexibility with schemas in Hudi + maybe add another base format like BSON to support this flexibility more easily.
>
> **Author:** @kazdy @SteNicholas I owe you both a response here. Researching more. Will circle back here. Thanks for the great points raised.
>
> **Author:** Ok, my conclusions here @kazdy @SteNicholas after researching many systems that are used for ML/AI use-cases that involve unstructured data. I think both of these can be supported as long as the new storage format can
>
> **Author:** I have some concrete ideas to try once we have the storage format finalized more and some abstractions are in place. LMK if you want to flesh this out more and do some research ahead of it. cc @bhasudha who's looking into some of these.

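As a hedged illustration of the generalized, SQL-first data model discussed above: Hudi's Spark SQL integration already supports `MERGE INTO`, so a relational-style update can look like the sketch below. The table and column names are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session configured with Hudi's Spark SQL extension
# (spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension).
spark = SparkSession.builder.getOrCreate()

# Hypothetical target Hudi table `orders` and staged updates view `order_updates`.
spark.sql("""
    MERGE INTO orders AS t
    USING order_updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```
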
* **Serverful and Serverless:** Data lakes have historically been about jobs triggered periodically or on demand. Even though many metadata scaling challenges can be solved by a well-engineered metaserver
(similar to what modern cloud warehouses do anyway), the community has been hesitant towards a long-running service in addition to their data catalog or Hive metastore. In fact, our timeline server efforts were stalled
due to a lack of consensus in the community. However, as needs like concurrency control evolve, proprietary solutions emerge to solve these very problems around open formats. It's probably time to move towards a truly open
solution for the community by embracing a hybrid architecture where we employ server components for table metadata while remaining serverless for data.

> **Contributor:** Couldn't agree more.
>
> **Author:** @BruceKellan Actually got heavy pushback before - valid concerns on running yet another metastore. We could take a flexible approach to bridge the lack of consensus. I feel we can abstract the metadata fetching out such that metadata can be read from storage/metadata table (or) from the metaserver. I think the metaserver does sth similar today; it's not a required component to query the timeline, i.e. instead of .hoodie, you'd ask the metaserver.
>
> **Comment:** So I always thought the metadata server is kind of a requirement for the implementation of additional clients such as the Rust+Python bindings client. But do I understand right that you are planning on making the metaserver optional and still supporting all operations with the clients going directly to the metadata log files?
>
> **Author:** @moritzmeister Yes, the idea is the latter. The metaserver will be optional and will speed up query planning and bring other benefits like UI/RBAC and so on. Planning to track format changes here: https://issues.apache.org/jira/browse/HUDI-6242 - the main thing to figure out is how to move the metadata indexed in HFile into another format. Looking into the Lance format, which has readers already in Rust/C++ (but not Java, we'd need to add that). One way or the other.
>
> **Contributor:** +1 on this.
>
> **Author:** Yes, if we can embrace this model, management becomes much simpler.
>
> **Contributor:** Maybe it will get less pushback if it's easy to deploy in the cloud (e.g. from the marketplace on AWS). Being in a smaller company, having another thing to deploy and maintain can be a no-go, esp. since the rest of the stack can be and probably is serverless. Having a choice here would be the best from the user's perspective.
>
> **Author:** Yes, k8s support on EKS with a helm chart? All major cloud providers support some level of managed k8s. We can start from there and let the community add more things to it. It'll be hard for us to build any native integrations anyway. We can be OSS focussed.
>
> **Contributor:** Exciting for this part. For
>
> **Author:** @YannByron agree. Anything in the works, in terms of upstreaming? :) I have now pushed out the metaserver to 1.1.0 since it needs broad alignment across the community. So we have some time, but would love to get some alignment here and move on ahead.
>
> **Contributor:** @vinothchandar yep, we wanna work on it, and already have some research. Will propose an RFC (maybe an umbrella one) in the short run, then let's discuss deeper. cc @Zouxxyy

* **Beyond structured Data**: Even as we solved challenges around ingesting, storing, managing and transforming data in parquet/avro/orc, there is still a majority of other data that does not benefit from these capabilities.
Using Hudi's HFile tables for ML model serving is an emerging use case with users who want a lower-cost, lightweight means to serve computed data directly off the lake storage. Often, unstructured data like JSON and blobs
like images must be pseudo-modeled with some structure, leading to poor performance or manageability. With the meteoric rise of AI/ML in recent years, the lack of support for complex, unstructured, large blobs in a project like Hudi will only fragment data in lakes.
To this end, we need to support all the major image, video and ML/AI formats with the same depth of capabilities around indexing, mutating or capturing changes.

> **Member:** +1
>
> **Member:** Also, I think UDFs and materialized views (MVs) will be key for the ML ecosystem.
>
> **Author:** I want to understand better what supporting MVs means. See my comment on Flink MVs using dynamic tables above, and consolidate there?

> **Contributor:** Pretty excited to see how Hudi can help AI/ML use cases. I believe we're just scratching the surface here and going to provide efficient mutation solutions tailored for AI/ML on top of a generic framework.

* **Even greater self-management**: Hudi offers the most extensive set of capabilities today in open-source data lake management, from ingesting data to optimizing data and automating various bookkeeping activities to
automatically manage table data and metadata. Seeing how the community has used this management layer to up-level their data lake experience is impressive. However, we have plenty of capabilities to be added, e.g.,
reverse streaming data into other systems or [snapshot management](https://github.com/apache/hudi/pull/6576/files) or [diagnostic reporters](https://github.com/apache/hudi/pull/6600) or cross-region logical replication or
record-level [time-to-live management](https://github.com/apache/hudi/pull/8062), to name a few.

> **Contributor:** This part is amazing as it's mostly driven by the community, crowdsourcing ideas from various use cases in production.

## **Hudi 1.X**

Given that we have approached Hudi more like a database problem, it's unsurprising that Hudi has many building blocks that make up a database. Drawing a baseline from the
seminal [Architecture of a Database System](https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf) paper (see page 4), we can see how Hudi makes up the bottom half of a database optimized for the lake,
with multiple query engine layers - SQL, programmatic access, specialized for ML/AI, real-time analytics and other engines - sitting on top. The major areas below directly map to how we have tracked
the Hudi [roadmap](https://hudi.apache.org/roadmap). We will see how we have adapted these components specifically for the scale of data lakes and the characteristics of lake workloads.

_Reference diagram highlighting existing (green) and new (yellow) Hudi components, along with external components (blue)._

> **Member:** I see that the catalog manager is blue. The metaserver is actually positioned towards Hudi's own catalog service, so it should be yellow?
>
> **Author:** I feel Catalog is a little different, where we need access controls and other permissions added. I was thinking about just how to scale metadata for planning. Maybe the metaserver authors can help clarify? @minihippo ?
>
> **Contributor:** Agree. Maybe at least the index system and stronger transaction management (including cross-table transactions) should be counted in.

The log manager component in a database helps organize logs for recovery of the database during crashes, among other things. At the transactional layer, Hudi implements ways to organize data into file groups and
file slices and stores events that modify the table state in a timeline. Hudi also tracks inflight transactions using marker files for effective rollbacks. Since the lake stores way more data than typical operational
databases or data warehouses while needing much longer record version tracking, Hudi generates record-level metadata that compresses well to aid in features like change data capture or incremental queries, effectively
treating data itself as a log. In the future, we would want to continue improving the data organization in Hudi, to provide scalable, infinite timeline and data history, time-travel writes, storage federation and other features.

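As a small, hedged illustration of the record-level metadata and timeline described above, the PySpark sketch below inspects the `_hoodie_*` meta columns Hudi adds to every record and runs a time-travel read. The table path and instant time are hypothetical placeholders; the meta columns and the `as.of.instant` option are existing Hudi features.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base_path = "s3://my-bucket/warehouse/trips"  # hypothetical table location

# Every Hudi record carries metadata fields, effectively treating the data itself as a log.
snapshot_df = spark.read.format("hudi").load(base_path)
snapshot_df.select(
    "_hoodie_commit_time",  # instant on the timeline that produced this record version
    "_hoodie_record_key",   # record key used for updates/deletes
    "_hoodie_file_name",    # file group/slice the record currently lives in
).show(5)

# Time-travel read: query the table as of an earlier instant on the timeline.
as_of_df = (
    spark.read.format("hudi")
    .option("as.of.instant", "20230601000000")  # placeholder instant time
    .load(base_path)
)
```
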
> **Contributor:** Infinite timeline and data history will be gold 🏅
>
> **Contributor:** Time-travel writes look interesting to me.

The lock manager component helps implement concurrency control mechanisms in a database. Hudi ships with several external lock managers, although we would want to ultimately streamline this through
our [metaserver](https://github.com/apache/hudi/pull/4718) that serves only timeline metadata today. The paper (pg 81) describes the tradeoffs between the common concurrency control techniques in
databases: _2-Phase Locking_ (hard to implement without a central transaction manager), _OCC_ (works well w/o contention, fails very poorly with contention) and _MVCC_ (yields high throughputs, but relaxed
serializability in some cases). Hudi implements OCC between concurrent writers while providing MVCC-based concurrency for writers and table services to avoid any blocking between them. Taking a step back,
we need to ask ourselves if we are building an OLTP relational database, to avoid falling into the trap of blindly applying the same concurrency control techniques that apply to them to the high-throughput
pipelines/jobs writing to the lake. Hudi has a less enthusiastic view of OCC and encourages serializing updates/deletes/inserts through the input stream to avoid performance penalties with OCC for fast-mutating
tables or streaming workloads. Even as we implemented techniques like Early Conflict Detection to improve OCC, this RFC proposes Hudi should pursue a more general-purpose non-blocking MVCC-based concurrency control
while retaining OCC for simple and batch append-only use cases.

> **Member:** 100% with you!
>
> **Member:** @codope, could we build an HTAP database like TiDB etc.?
>
> **Author:** It will be hard to support any real, high-perf transactional workloads on the lake IMO. Our commit times will still be, say, 20-30 seconds. Those are long-running transactions in the transactional world (speaking from my Oracle work ex) :)

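For reference, this is roughly how OCC between concurrent writers is enabled today, shown as a hedged sketch of write options. The config keys and lock provider class are existing Hudi options; the ZooKeeper endpoint, lock key and paths are hypothetical placeholders.

```python
# Options added on top of the usual Hudi write options when multiple writers
# may commit to the same table concurrently (optimistic concurrency control).
occ_opts = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host",            # placeholder
    "hoodie.write.lock.zookeeper.port": "2181",
    "hoodie.write.lock.zookeeper.lock_key": "trips",         # placeholder table-level lock key
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",  # placeholder
}

# Reusing hudi_write_opts / base_path from the earlier sketch:
# df.write.format("hudi").options(**hudi_write_opts).options(**occ_opts).mode("append").save(base_path)
```
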
The access methods component encompasses indexes, metadata and storage layout organization techniques exposed to reads/writes on the database. Last year, we added
a new [multi-modal index](https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi) with support for [asynchronous index building](https://github.com/apache/hudi/blob/master/rfc/rfc-45/rfc-45.md) based
on MVCC to build indexes without blocking writers and still be consistent upon completion with the table data. Our focus has thus far been more narrowly aimed at using indexing techniques for write performance,
while queries benefit from files and column statistics metadata for planning. In the future, we want to generalize support for using various index types uniformly across writes and queries so that queries can be planned,
optimized and executed efficiently on top of Hudi's indices. This is now possible due to having Hudi's connectors for popular open-source engines like Presto, Spark and Trino.
New [secondary indexing schemes](https://github.com/apache/hudi/pull/5370) and a proposal for built-in index functions to index values derived from columns have already been added.

> **Member:** Vector databases are quite a hot topic recently, and I took a look trying to figure out what they are. Looks like it's a database with a special index providing similarity search. With the pluggable indexes and clustering, Hudi is in a sweet spot in this race!
>
> **Author:** @garyli1019 came across this: https://www.linkedin.com/pulse/text-based-search-from-elastic-vector-kaushik-muniandi - thoughts?
>
> **Member:** Very nice article; Elasticsearch just can't handle that much data. It would be really helpful if this user could elaborate more details about their use cases. A general-purpose LLM + a database that stores the customized data = a customized AI assistant. Search could be the next battleground for lakehouse tech.
>
> **Author:** @garyli1019 sg. Interested in trying https://github.com/lancedb/lance as a base file format and adding some capabilities through one of the existing engines? I can see Elastic, pg etc. all adding specialized indexes. I think vector search itself will be commoditized soon. It's just another query type.
>
> **Member:** Interesting, lots of new stuff coming out recently, let me take a look 👀

> **Contributor:** When improving Hudi's multi-modal index for query performance, should we also think about the user experience of creating, managing, and "tuning" the indexes? Say, an advanced user may create a specific index such as a B-tree for query speedup; PostgreSQL has this functionality: https://www.postgresql.org/docs/current/indexes.html. Also, query engines can pick and choose what they need in the integration. wdyt?
>
> **Author:** Yes, a DB advisor type thing would be awesome to have. It can recommend indexes and such. We can do it in 1.1 though.

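To make the multi-modal index discussion concrete, here is a hedged sketch of the existing configuration knobs for building metadata-table indexes on write and using them for data skipping on read. The config keys are existing Hudi options; the path and predicate are hypothetical placeholders, and `spark`/`base_path` are as in the earlier sketches.

```python
# Writer side: build column stats and bloom filter indexes in the metadata table.
metadata_index_opts = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.metadata.index.bloom.filter.enable": "true",
}

# Reader side: let the query use column stats for file pruning (data skipping).
pruned_df = (
    spark.read.format("hudi")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.enable.data.skipping", "true")
    .load(base_path)
    .where("rider_id = 'abc123'")  # hypothetical predicate that benefits from pruning
)
```
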
The buffer manager component manages dirtied blocks of storage and also caches data for faster query responses. In Hudi's context, we want to bring to life our now long-overdue
columnar [caching service](https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#lake-cache) that can sit transparently between lake storage and query engines while understanding
transaction boundaries and record mutations. The tradeoffs in designing systems that balance read, update and memory costs are detailed in the [RUM conjecture](https://stratos.seas.harvard.edu/files/stratos/files/rum.pdf).
Our basic idea here is to optimize for read (faster queries served out of cache) and update (amortizing MoR merge costs by continuously compacting in-memory) costs while adding the cost of cache/memory to the system.
Currently, there are potentially many candidate designs for this idea, and we would need a separate design/RFC to pursue them.

> **Contributor:** Are we also thinking about some AI component in the query planning phase which will automatically infer which type of index to use for what columns so as to get the best possible read performance? That would be super powerful.
>
> **Author:** IDK if we need AI, but yeah, index selection is a hard problem we'd solve.
>
> **Contributor:** Do we have a plan to introduce a query/relation/fine-grained SQL-semantics cache layer, so that these caches can be shared by common sub-clauses of different queries, and can even be served as pre-aggregations for online materialized views?
>
> **Contributor:** Maybe cache-aware query planning and execution with a distributed cache is sth relevant we can take on.
>
> **Contributor:** Pre-existing data on query patterns and performance is needed for training a model to optimize query planning if that is going to be anywhere near helpful.

Shared components include replication, loading, and various utilities, complete with a catalog or metadata server. Most databases hide the underlying format/storage complexities, providing users
with many data management tools. Hudi is no exception. Hudi has battle-hardened bulk and continuous data loading utilities (deltastreamer, flinkstreamer tools, along with the Kafka Connect Sink),
a comprehensive set of table services (cleaning, archival, compaction, clustering, indexing, ..), admin CLI and much much more. The community has been working on new server components
like a [metaserver](https://github.com/apache/hudi/pull/4718) that could expand to indexing the table metadata using advanced data structures like zone maps/interval trees or a [table management service](https://github.com/apache/hudi/pull/4309) to manage Hudi tables
centrally. We would love to evolve towards having a set of horizontally scalable, highly available metaservers that can provide both these functionalities as well as some of the lock management capabilities.
Another interesting direction to pursue would be a reverse loader/streamer utility that can also move data out of Hudi into other external storage systems.

> **Member:** I know that @XuQianJin-Stars is working on a UI component as part of platformization; an RFC is pending.
>
> **Author:** Great. A Hudi UI would be amazing.
>
> **Member:** @xushiyan, is the Hudi UI only for admin work? Could the Hudi UI display the metadata and data volume of a Hudi table and related metrics?
>
> **Contributor:** The UI would be a killer feature; @XuQianJin-Stars is pushing this forward and they put it into production at Tencent.
>
> **Author:** @xushiyan do we have an RFC or Epic for this already?
>
> **Member:** @vinothchandar just created this one: https://issues.apache.org/jira/browse/HUDI-6255
>
> **Author:** I have moved it to 1.1.0 in scoping, so that it's easier with newer APIs available. But @XuQianJin-Stars, please feel free to get started sooner if you think it a good idea.

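For readers unfamiliar with the continuous ingestion utility mentioned above, here is a hedged sketch of launching the Hudi DeltaStreamer in continuous mode. The class name and flags are existing DeltaStreamer options; the bundle jar location, target path and properties file are hypothetical placeholders.

```python
import subprocess

# Ingest a Kafka topic into a MERGE_ON_READ Hudi table, running continuously.
deltastreamer_cmd = [
    "spark-submit",
    "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
    "/opt/hudi/hudi-utilities-bundle.jar",                     # placeholder bundle location
    "--table-type", "MERGE_ON_READ",
    "--source-class", "org.apache.hudi.utilities.sources.JsonKafkaSource",
    "--source-ordering-field", "ts",
    "--target-base-path", "s3://my-bucket/warehouse/orders",   # placeholder
    "--target-table", "orders",
    "--op", "UPSERT",
    "--props", "/opt/hudi/kafka-source.properties",            # Kafka brokers, topic, schema config
    "--continuous",
]
subprocess.run(deltastreamer_cmd, check=True)
```
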
In all, we propose Hudi 1.x as a reimagination of Hudi, as the _transactional database for the lake_, with [polyglot persistence](https://en.wikipedia.org/wiki/Polyglot_persistence), raising the level of
abstraction and platformization even higher for Hudi data lakes.

## Hudi 1.0 Release

Rome was not built in a day, and neither can the Hudi 1.x vision be. This section outlines the goals for the first 1.0 release and the potentially must-have changes to be front-loaded.
This RFC solicits more feedback and contributions from the community for expanding the scope or delivering more value to the users in the 1.0 release.

In short, we propose Hudi 1.0 try and achieve the following.

> **Member:** Could integrations with other databases like Doris, StarRocks etc. be included? Meanwhile, is a C++ client which reads and writes Hudi data files proposed, to serve AI scenarios?
>
> **Author:** +1 on the first. If we can solidify the format changes, then a Java API and a Rust/C++ API for the metadata table, timeline/snapshots, and FileGroup reader/writer will be cool. Boils down to finding enough contributors to drive it.
>
> **Author:** Do you have some concrete ideas for AI scenarios?
>
> **Member:** @vinothchandar, I would like to contribute the Rust/C++ client and drive contributors in China to contribute. Meanwhile, for AI scenarios there are ideas around distributed and incremental reading of Hudi data via a C++/Python client, like the Kafka Python client that reads messages of a Kafka topic in a distributed fashion. This idea comes from AI cases at BiliBili.
>
> **Member:** @vinothchandar, another idea is to integrate with the OpenAI API to answer user questions about Hudi and provide the usage or SQL for Hudi features.
>
> **Contributor:** Sounds amazing, I'm overburdened with all kinds of issues and feature enquiries.
>
> **Member:** @yihua, most AI scenarios are focused on feature engineering, model training and prediction. Feature data could be stored in Hudi in JSON format, and may only use several features in the JSON, which need materialized columns. Meanwhile, models are unstructured data that Hudi doesn't support at present. From the user perspective, an ML engineer uses the C++/Python distributed client to access Hudi data and incrementally consume it. Therefore, the features including materialized columns, unstructured data support and a C++/Python distributed client are mainly required for AI scenarios.
>
> **Author:** @SteNicholas @danny0405 @yihua This is a good read on this topic of MLOps and ML workloads and their intersection with regular data engineering: https://www.cpard.xyz/posts/mlops_is_mostly_data_engineering/
>
> **Contributor:** @SteNicholas I am also interested in helping with Rust/C++.
>
> **Author:** @SteNicholas @jonvex filed https://issues.apache.org/jira/browse/HUDI-6486 under this Table Format API for now. We need to first finalize the 1.0 storage format draft and get busy on this. Please let me know how you want to engage.
>
> **Member:** BTW, could we consider some cloud-native capabilities like separation of hot and cold data, integration with a K8s operator etc.? IMO, the future trend of databases is closer to cloud native.
>
> **Author:** @SteNicholas Interesting. I was thinking that we make operators for components like the metaserver or cache server (depending on how we build it) or the table management server. Could you expand on the hot/cold data separation idea?
>
> **Member:** @vinothchandar, the metaserver or cache server or the table management server could be served as Kubernetes services. The hot/cold data separation idea refers to CloudJump: Optimizing Cloud Databases for Cloud Storages.
>
> **Author:** Thanks @SteNicholas! My thoughts are also to start with sth that can run these on k8s. I will read the paper and reflect back.
>
> **Author:** Tracking here: https://issues.apache.org/jira/browse/HUDI-6489 - happy to help firm up an RFC/design, we can evolve as the 1.0 format evolves.
>
> **Author:** cc @SteNicholas

1. Incorporate all/any changes to the format - timeline, log, metadata table...
2. Put up new APIs, if any, across - table metadata, snapshots, index/metadata table, key generation, record merger,...
3. Get internal code layering/abstractions in place - e.g. HoodieData, FileGroup Reader/Writer, Storage,...
4. Land all major, outstanding "needle mover" PRs, in a safe manner guarded by configs.
5. Integrate some/all of the existing indexes for Spark/Presto, and validate intended functionality and performance gains.

> **Contributor:** Do you want to discuss how the 0.x releases (e.g., 0.14) will work alongside 1.x development, to avoid conflicts and rebasing difficulties?
>
> **Author:** Good point. I am thinking we merge all the core code abstraction changes - HoodieSchema, HoodieData and such - into the 0.X line to make rebasing easier, then fork off the 1.0 feature branch.

All changes should be backward compatible and not require rewriting of base/parquet files in existing tables. However, full compaction of the logs or planned downtime to rewrite the
timeline or rebuilding the metadata table may be necessary when moving from the 0.x to the 1.0 release.

This RFC will be expanded with concrete changes to different parts of Hudi. Note that this RFC only serves to identify these areas, and separate RFCs should follow for any changes impacting the storage format, backward compatibility or new public APIs.

## Rollout/Adoption Plan

We propose 1.0 execution be done in a series of three releases below.

1. **Alpha (July 2023)**: All format changes are landed, internal code layering/abstraction work and major outstanding PRs are landed safely. 0.X tables can be upgraded seamlessly and core Hudi write/query flows are certified.
2. **Beta (August 2023)**: New APIs are added, code paths are changed to use the new APIs, index integrations with performance/functional qualifications.
3. **Generally available (September 2023)**: Scale testing, stress testing and production hardening by the community before general release.

## Test Plan

TBD