[HUDI-3225] [RFC-45] for async metadata indexing #4640
Conversation
rfc/rfc-45/rfc-45.md (Outdated)
> We introduce a new action `index` which will denote the index building process,
> the mechanics of which is as follows:
>
> 1. From an external process, users can issue a CREATE INDEX or similar statement
We should support this via regular writers as well, just the scheduling. It will make life easier.
+1
You mean inline scheduling, like we have for other table services?
Yes.
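To make the inline-scheduling idea concrete, here is a minimal self-contained sketch (plain Java, no Hudi APIs; the instant time and partition names are illustrative): scheduling only persists the `<t>.index.requested` plan on the timeline, and the actual index build runs later in the async indexer.

```java
import java.util.Arrays;
import java.util.List;

// Toy sketch: inline scheduling from a regular writer writes only the indexing
// plan to the timeline, mirroring how other table services are scheduled.
public class InlineIndexSchedulingSketch {
  public static void main(String[] args) {
    String instantTime = "20220115093000"; // would come from the active timeline
    List<String> partitionsToIndex = Arrays.asList("column_stats", "bloom_filters");

    // The writer's only job here: persist <t>.index.requested with the plan.
    String planFile = instantTime + ".index.requested";
    System.out.println("writing " + planFile + " with plan " + partitionsToIndex);
    // A separate indexer process later picks up this plan and builds the index.
  }
}
```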
> inflight writer, that is just about to commit concurrently, has a very
> high chance of seeing the indexing plan and aborting itself.
>
> We can just introduce a lock for adding events to the timeline and these races
In single-writer mode, users may not have configured any lock service, and we don't enforce one as of today. Something to keep in mind.
We have to clearly document these, along with other operations that cannot be performed without a lock provider configured. As a safety measure, should the indexer always error out if there is no lock provider configured?
I think we should error out. I tried to think of a way without taking any lock, but we need this minimal locking. We should call it out in the documentation.
+1 for requiring locking.
Having wrong or missing data in the MDT is very difficult to debug in the long run and can cause serious data quality issues. Also, anyone operating at enough scale to require async indexing should be able to choose one of the many locking options available.
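Since the thread settles on requiring a lock provider, here is a minimal sketch of the configs a user would set, using Hudi's existing ZooKeeper-based lock provider (key names follow the current lock-provider configs; the host, port, and paths are illustrative placeholders):

```java
import java.util.Properties;

// Sketch: lock-provider configs so the async indexer and regular writers can
// serialize their timeline updates. Values are placeholders.
public class LockProviderConfigSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
    props.setProperty("hoodie.write.lock.provider",
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider");
    props.setProperty("hoodie.write.lock.zookeeper.url", "zk1.example.com");
    props.setProperty("hoodie.write.lock.zookeeper.port", "2181");
    props.setProperty("hoodie.write.lock.zookeeper.lock_key", "my_table");
    props.setProperty("hoodie.write.lock.zookeeper.base_path", "/hudi/locks");
    props.forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```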
rfc/rfc-45/rfc-45.md (Outdated)
> that. We will correct this during indexing action completion. In the
> average case, this may not happen and the design has liveness.
>
> 3. When the indexing process is about to complete, it will check for all
It's not very clear what the indexer does here, apart from populating base files for all instants having commit time < t. For example, what does the indexer do with completed instants > t, up until now?
If I am not wrong, "check for all completed commit actions to ensure each of them added entries per its indexing plan" refers to all instants up until now. Would be good to call it out.
+1. At a high level, the indexer should only write the base files.
rfc/rfc-45/rfc-45.md (Outdated)
> 3. When the indexing process is about to complete, it will check for all
>    completed commit actions to ensure each of them added entries per its
>    indexing plan, otherwise simply abort after a configurable timeout. Let's
>    call this the **indexing check**.
Also, let's say this is the timeline when the indexer started:
C1, C2, ... C5 (inflight), C6, C7, C8. Start indexer.
The indexer will build the base file with all info up to C4.
What happens to the data pertaining to C6, C7, C8? These are already completed.
Essentially, the indexer will go through every completed commit up until now and ensure they are applied to the new partition. If it's <= C4, it goes into the base file; if it's > C4, it goes in as a delta commit. Is that right?
I think the indexer should work up to C8. Anything inflight between C4 and C8 has to do the "indexing check". This is a valid case that Siva is pointing out.
That's correct, and if the indexer times out before C5 completes, it will abort. Next time, indexing will start again with C4 as the base instant and run the indexing check again.
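To make the C1..C8 walkthrough concrete, here is a self-contained toy sketch (plain Java, no Hudi APIs) of how completed instants are routed: everything up to the base instant C4 folds into the bootstrapped base file, completed instants after C4 land as log entries, and the pending C5 blocks the indexing check until it completes or the indexer times out.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the indexing check: C5 is inflight, so the base instant is C4.
public class IndexingCheckSketch {
  public static void main(String[] args) {
    List<String> completed = Arrays.asList("C1", "C2", "C3", "C4", "C6", "C7", "C8");
    String baseInstant = "C4"; // latest completed instant with no pending instant before it

    for (String instant : completed) {
      if (instant.compareTo(baseInstant) <= 0) {
        System.out.println(instant + " -> folded into the index base file");
      } else {
        System.out.println(instant + " -> expected as a log entry in the metadata partition");
      }
    }
    // The check cannot pass while C5 is still inflight; the indexer aborts
    // after a configurable timeout and retries later with C4 as the base.
  }
}
```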
rfc/rfc-45/rfc-45.md (Outdated)
> **Case 2: Indexer fails while writer is inflight**
>
> Writer will commit adding log entries to the metadata partition. Indexer will
> fetch the last instant for which indexing was done from `.index.inflight` file.
Would like to understand this more. So the indexer will keep updating the `.index.inflight` meta file, is it? Like checkpointing?
No longer valid. To keep things simple, an MDT partition is either available or not available, and this will be known through table config. There is some value in checkpointing, especially for indexes that take time, but depending on the timeline adds more complexity and we would have to deal with more correctness issues.
> We can just introduce a lock for adding events to the timeline and these races
> would vanish completely, still providing great scalability and asynchrony for
> these processes.
I don't see details on when exactly regular writers will start to make synchronous updates. Also, when exactly can callers start using the new index that got built out? What's the source of truth? We can rely on the timeline's completed instant for the index, but what about after archival? Also, loading the timeline every time might be costly.
Once the indexing action has completed, any MDT partition that is not currently being indexed is considered ready for use.
Added some more details. Table config will be the source of truth.
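A sketch of what "table config as the source of truth" could look like for readers (plain Java; the config key name is illustrative, not final):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

// Sketch: readers consult table config, not the timeline, to decide which
// metadata partitions are ready for use. The key name is a placeholder.
public class MdtPartitionReadinessSketch {
  public static void main(String[] args) {
    Properties tableConfig = new Properties();
    tableConfig.setProperty("hoodie.table.metadata.partitions", "files,column_stats");

    Set<String> ready = new HashSet<>(Arrays.asList(
        tableConfig.getProperty("hoodie.table.metadata.partitions", "").split(",")));
    System.out.println("files ready?         " + ready.contains("files"));         // true
    System.out.println("column_stats ready?  " + ready.contains("column_stats"));  // true
    System.out.println("bloom_filters ready? " + ready.contains("bloom_filters")); // false
  }
}
```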
rfc/rfc-45/rfc-45.md (Outdated)
> b) Inflight writer about to commit, but indexing completed just before that.
>
> In this case, since the indexer completed before the writer, so it has already
On "will sync up directly": I don't get this fully.
vinothchandar left a comment:
We also need to cover the "reindexing" scenario, where somehow new file slices are created for the MDT index partition, and older slices are cleaned/deleted.
Need to carefully review the multi-writer scenarios still, but thus far this seems close to what the JIRA had.
I'd also ask for a review from @prashantwason
> provider and only one writer can access MDT in read-write mode. Hence, any write
> to MDT is guarded by the data table lock. This ensures only one write is
> committed to MDT at any point in time and thus guarantees serializability.
> However, locking overhead adversely affects the write throughput and will reach
Not sure how metadata indexing solves the multi-writer problem for MDT. Strictly speaking, we just need table service scheduling on MDT to be guarded by the lock.
The metadata table is unique in that each write to MDT involves multiple partitions being updated together in one transaction. So I do not see a truly parallel commit to MDT as possible.
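To illustrate the point about multi-partition commits, a toy sketch (plain Java; the record payloads are made up) of how one MDT delta commit spans several metadata partitions at a single instant time:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: a single MDT commit atomically carries records for several metadata
// partitions, which is why truly parallel MDT commits are hard to allow.
public class MdtMultiPartitionCommitSketch {
  public static void main(String[] args) {
    Map<String, List<String>> oneCommit = new LinkedHashMap<>();
    oneCommit.put("files", Arrays.asList("2022/01/15 -> [f1, f2]"));
    oneCommit.put("column_stats", Arrays.asList("f1.col_a -> {min: 1, max: 9}"));
    oneCommit.put("bloom_filters", Arrays.asList("f1 -> <bloom filter bytes>"));
    // All of the above must land together at one instant time on the MDT timeline.
    System.out.println("single MDT commit touches partitions: " + oneCommit.keySet());
  }
}
```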
rfc/rfc-45/rfc-45.md (Outdated)
> 1. From an external process, users can issue a CREATE INDEX or similar statement
>    to trigger indexing for an existing table.
> 1. This will add a `<instant_time>.index.requested` to the timeline, which
nit: `indexing.requested`? All actions are verbs.
`index` is a noun as well as a verb.
I'd prefer `index` for brevity, and none of our actions end with -ing. But let me know if you think `indexing` is more appropriate; I can change it.
Force-pushed 1cb4c54 to bd3b354.
@vinothchandar @nsivabalan This is ready for review again. Following has changed since the last review:
prashantwason left a comment:
Looks good.
> to trigger indexing for an existing table.
> 1. This will add a `<instant_time>.index.requested` to the timeline, which
>    contains the indexing plan.
> 2. From here on, the index building process will continue to build an index
Should this be reflected by choosing the index timestamp as t? E.g. `t.index.requested`?
Table service operations on the metadata table usually take the timestamp of the last op with a suffix: 001 for compaction, 002 for clean, etc.
So it may be good to have this as `t001.index.requested`.
Yes, we can do that, and it can avoid a little serde cost. It can also ease debugging. However, I should point out that the index action will be written on the data timeline, as it will be known to the user.
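The suffix convention referenced above, sketched for clarity (the 001/002 suffixes follow the MDT convention described in the comment; the timestamp value is illustrative):

```java
// Sketch of the instant-naming convention discussed above.
public class MdtInstantSuffixSketch {
  public static void main(String[] args) {
    String t = "20220115093000";               // data-table instant time
    System.out.println(t + "001");              // MDT compaction scheduled off t
    System.out.println(t + "002");              // MDT clean scheduled off t
    System.out.println(t + ".index.requested"); // proposed: index plan on the data timeline
  }
}
```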
> Writer will commit adding log entries to the metadata partition. However, table
> config will indicate that partition is not ready to use. When indexer is
> re-triggered, it will check the plan and table config to figure out which MDT
> partitions to index and start indexing for those partitions.
When the indexer starts the next time, it will choose a different instant time. Hence, the older log blocks written are no longer valid. So I think each time the indexer starts (either the first time or after a failure), it should clean out the older file groups and create new ones (with the newer instant time).
Yes, that's the plan. But it will start from scratch only for the partitions that were partially indexed, i.e. partitions for which table config was not updated in the last indexing run. The table config update always happens at the end of indexing for a partition.
We don't want to start all over again for all the partitions. So, let's say at some instant t the indexer was scheduled and wrote `t.index.requested` with a plan to index the files and column_stats partitions. It completed files but failed midway through column_stats. The table config will then show that only the files partition is available for reads/updates. When the indexer starts the next time, it will see a pending index action, read the plan as well as the table config, and figure out that only the column_stats index is pending. It will clean the older file groups for column_stats, choose the latest completed instant (without holes) on the data timeline, create a new file group, and so on.
If this sounds right, I can update this example in the RFC.
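The files/column_stats restart example above, as a runnable toy sketch (plain Java, no Hudi APIs; partition names taken from the example):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy sketch of indexer restart: only partitions missing from table config are
// re-indexed from scratch (stale file groups cleaned, new ones created).
public class IndexerRestartSketch {
  public static void main(String[] args) {
    List<String> plan = Arrays.asList("files", "column_stats"); // from t.index.requested
    Set<String> completedPerTableConfig = new HashSet<>(Collections.singleton("files"));

    for (String partition : plan) {
      if (completedPerTableConfig.contains(partition)) {
        System.out.println(partition + ": already indexed, skip");
      } else {
        System.out.println(partition + ": clean stale file groups, re-bootstrap from the"
            + " latest completed instant without holes");
      }
    }
  }
}
```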
> re-triggered, it will check the plan and table config to figure out which MDT
> partitions to index and start indexing for those partitions.
>
> **Case 3: Race conditions**
There is another race condition possible:
- Writer is inflight.
- Indexer is starting and creating the file groups. Suppose there are 100 file groups to be created.
- Writer just finished and tries to write log blocks; it sees only a subset of the file groups created so far (as the creation step above has not completed yet). This will cause the writer to incorrectly write updates to a smaller number of shards.

In essence:
- Locking is required.
- The indexer needs to hold the lock while creating the file groups too.
Good point! Initialization of file groups happens when the index is scheduled. While scheduling, we can take a lock. I'll update the RFC.
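A toy sketch of why this race is dangerous (plain Java; the 40/100 split is illustrative): a writer that lists file groups mid-initialization shards its updates across too few file groups, corrupting the key-to-file-group mapping.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the race: the indexer planned 100 file groups, but the writer
// lists them before initialization finishes.
public class FileGroupInitRaceSketch {
  public static void main(String[] args) {
    int plannedFileGroups = 100;
    List<String> visibleToWriter = new ArrayList<>();
    for (int i = 0; i < 40; i++) {      // indexer still mid-initialization
      visibleToWriter.add("file-group-" + i);
    }
    // The writer hashes record keys into whatever file groups it can see:
    System.out.println("writer would shard into " + visibleToWriter.size()
        + " file groups instead of " + plannedFileGroups);
    // Fix agreed above: create all file groups under the lock while scheduling,
    // so concurrent writers see either no plan or all file groups.
  }
}
```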
Force-pushed 7afccec to 62db921.
nsivabalan left a comment:
LGTM. A few minor clarifications.
> 2. From here on, the index building process will continue to build an index
>    up to instant time `t`, where `t` is the latest completed instant time on
>    the timeline without any
>    "holes" i.e. no pending async operations prior to it.
Not necessarily async; it could be regular writes too. In the case of multi-writers, there could be a failed commit waiting to be rolled back.
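A toy sketch of picking `t` as the latest completed instant without holes, covering the point above that a pending instant of any kind (async, or a failed regular write awaiting rollback) creates a hole:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch: base instant = latest completed instant with no pending instant before it.
public class LatestInstantWithoutHolesSketch {
  public static void main(String[] args) {
    // timeline in instant-time order: true = completed, false = pending
    Map<String, Boolean> timeline = new LinkedHashMap<>();
    timeline.put("C1", true);
    timeline.put("C2", true);
    timeline.put("C3", false); // e.g. failed multi-writer commit awaiting rollback
    timeline.put("C4", true);

    String base = null;
    for (Map.Entry<String, Boolean> e : timeline.entrySet()) {
      if (!e.getValue()) break; // hole found; stop here
      base = e.getKey();
    }
    System.out.println("base instant t = " + base); // C2, not C4
  }
}
```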
> create log files in the same filegroup for the metadata index update. This will
> happen within the existing data table lock.
>
> The indexer runs in a loop until the metadata for data upto `t0` plus the data
Would like to understand the loop here. I thought we would just go for one round and then time out. Will sync up f2f.
You're right, I'll word it better. What I meant is: run until timeout.
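A sketch of the "run until timeout" loop as clarified above (plain Java; the helper is a placeholder for the real catch-up logic):

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: the indexer keeps catching up on newly completed commits until it is
// fully caught up or the configurable timeout elapses, then aborts.
public class IndexerCatchUpLoopSketch {
  public static void main(String[] args) throws InterruptedException {
    Duration timeout = Duration.ofMinutes(30); // configurable
    Instant deadline = Instant.now().plus(timeout);

    while (Instant.now().isBefore(deadline)) {
      if (caughtUpWithTimeline()) {      // placeholder for the real check
        System.out.println("indexing check passed; completing the index action");
        return;
      }
      Thread.sleep(1000); // back off before re-reading the timeline
    }
    System.out.println("timed out; aborting, the next run resumes from the plan");
  }

  static boolean caughtUpWithTimeline() {
    return true; // stand-in: apply newly completed commits, report if caught up
  }
}
```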
> Further, suppose there were two inflight writers Writer1 and Writer2 (with
> inflight instants `t1` and `t2` respectively) while the indexing was requested
> or inflight. In this case, the writers will check for pending index action and
> find a pending instant `t3`. Now, if the metadata index creation is pending,
In the attached image, I see locks. It would be good to cover, for the entire design in general, what the critical sections are for which we acquire a lock. For example:
- regular writers when checking for pending indexing?
- regular writers checking for completed partitions in MDT (from table config)?
- the async indexer while updating the hoodie table config?
I am not claiming we need to acquire a lock for all of the above, but a list like this would be good to call out explicitly.
Good point, I'll update. Basically, we need a lock when:
- creating file groups while scheduling;
- writing to the MDT timeline.
(See the sketch below.)
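The two critical sections above, sketched with a plain Java lock standing in for the configured lock provider:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: both critical sections take the same table-level lock.
public class CriticalSectionsSketch {
  private static final ReentrantLock tableLock = new ReentrantLock(); // stand-in for the lock provider

  static void withLock(Runnable section) {
    tableLock.lock();
    try {
      section.run();
    } finally {
      tableLock.unlock();
    }
  }

  public static void main(String[] args) {
    withLock(() -> System.out.println("create MDT file groups while scheduling the index"));
    withLock(() -> System.out.println("commit a write to the MDT timeline"));
  }
}
```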
Force-pushed 62db921 to ee05ae2.