Conversation

@YuweiXiao
Contributor

@YuweiXiao YuweiXiao commented Dec 24, 2021

What is the purpose of the pull request

Restructure the bulk insert partitioner interface to include the handling of the fileIdPfx and the write handle factory.

With this improvement, one can implement a new bulk_insert partitioner that routes records to pre-defined fileIds using a customized write handle factory (e.g., different write factories for different partitions).

JIRA link

Brief change log

  • Modify the interface of BulkInsertPartitioner to include the logic of handling the fileIdPrefix and the writeHandleFactory. With this update, the partitioner of the bulk_insert path now has the ability to control the records' final file group location (similar to the partitioner of the upsert path).
  • Modify the bulk_insert write path (e.g., AbstractBulkInsertHelper and its subclasses) to make use of the new partitioner interface. The partitioner is now mandatory (no longer optional), similar to the standard upsert/insert path.
  • The Java bulk_insert write path is mostly left untouched because of its special behavior: it always writes to a single file group (i.e., parallelism = 1) and has a customized fileId generator, FileIdPrefixProvider.
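The headline capability can be illustrated with a minimal sketch. The class, method, and routing rule below are simplified stand-ins for Hudi's actual BulkInsertPartitioner and write handle APIs, not the real signatures:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Illustrative sketch (not the real Hudi API): a bulk-insert style
// partitioner that routes each record to a pre-defined file group.
class PredefinedFileGroupPartitioner<R> {
  private final List<String> fileIdPfx;  // partition i writes file group fileIdPfx.get(i)
  private final ToIntFunction<R> router; // record -> target partition

  PredefinedFileGroupPartitioner(List<String> fileIdPfx, ToIntFunction<R> router) {
    this.fileIdPfx = fileIdPfx;
    this.router = router;
  }

  String getFileIdPfx(int partitionId) {
    return fileIdPfx.get(partitionId);
  }

  // Local stand-in for repartitionRecords: bucket records by target partition,
  // so each bucket lands in a known, pre-defined file group.
  List<List<R>> repartitionRecords(List<R> records) {
    List<List<R>> buckets = new ArrayList<>();
    for (int i = 0; i < fileIdPfx.size(); i++) {
      buckets.add(new ArrayList<>());
    }
    for (R r : records) {
      buckets.get(router.applyAsInt(r)).add(r);
    }
    return buckets;
  }
}

public class RoutingDemo {
  public static void main(String[] args) {
    // Hypothetical routing rule: keys starting with "a" go to partition 0.
    PredefinedFileGroupPartitioner<String> p = new PredefinedFileGroupPartitioner<>(
        List.of("fg-0001", "fg-0002"),
        key -> key.startsWith("a") ? 0 : 1);
    List<List<String>> buckets = p.repartitionRecords(List.of("ant", "bee", "ape"));
    System.out.println(buckets.size());     // 2
    System.out.println(p.getFileIdPfx(0));  // fg-0001
  }
}
```

In the actual PR the routing happens inside the engine-specific repartition step; this sketch only shows the contract a custom partitioner would fulfill.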

Verify this pull request

Added a fileId generation check to existing tests; the other parts are already covered by existing tests such as TestBulkInsertInternalPartitioner.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@YuweiXiao YuweiXiao force-pushed the HUDI-3085 branch 2 times, most recently from eba225e to 10b433d on December 24, 2021 05:29
@yihua yihua self-assigned this Dec 27, 2021
@yihua
Contributor

yihua commented Jan 9, 2022

@YuweiXiao Could you rebase your PR on master to resolve the conflicts?

@YuweiXiao
Contributor Author

@YuweiXiao Could you rebase your PR on master to resolve the conflicts?

Thanks for the reminder. The conflict has been resolved.

@YuweiXiao
Contributor Author

@hudi-bot run azure

@YuweiXiao
Contributor Author

@hudi-bot run azure

@nsivabalan nsivabalan added and then removed the priority:critical (Production degraded; pipelines stalled) label Feb 8, 2022
@YuweiXiao
Contributor Author

@hudi-bot run azure

@YuweiXiao
Contributor Author

@hudi-bot run azure

* partitions should be almost equal to (#inputRecords / #outputSparkPartitions) to avoid possible skews.
*/
- public interface BulkInsertPartitioner<I> {
+ public abstract class BulkInsertPartitioner<I> implements Serializable {
Contributor

This interface is public and users may implement their own bulk insert partitioner as a plugin. The change from interface to abstract class is not backward compatible. Could you keep it as an interface and use default methods for new logic?
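The backward-compatible alternative suggested here can be sketched as follows. The type names and method set are simplified assumptions for illustration, not the real Hudi interface: the partitioner stays an interface, and the new logic is added as default methods so existing user plugins keep compiling unchanged.

```java
import java.util.Optional;
import java.util.UUID;

// Simplified stand-in types, for illustration only.
interface HandleFactory { }

interface PartitionerSketch<I> {
  // Pre-existing contract: user implementations already provide these.
  I repartitionRecords(I records, int outputSparkPartitions);

  boolean arePartitionRecordsSorted();

  // New logic added as default methods: old implementations inherit
  // this behavior without any source changes.
  default String getFileIdPfx(int partitionId) {
    return UUID.randomUUID().toString(); // a fresh file group by default
  }

  default Optional<HandleFactory> getWriteHandleFactory(int partitionId) {
    return Optional.empty(); // caller falls back to its own default factory
  }
}

public class DefaultMethodDemo {
  // A legacy-style implementation that predates the new methods still compiles.
  static class Legacy implements PartitionerSketch<String> {
    public String repartitionRecords(String records, int n) {
      return records;
    }

    public boolean arePartitionRecordsSorted() {
      return false;
    }
  }

  public static void main(String[] args) {
    Legacy legacy = new Legacy();
    System.out.println(legacy.getWriteHandleFactory(0).isPresent()); // false
    System.out.println(legacy.getFileIdPfx(0).isEmpty());            // false
  }
}
```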

return fileIdPfx;
}

public void setDefaultWriteHandleFactory(WriteHandleFactory defaultWriteHandleFactory) {
Contributor

Should setDefaultWriteHandleFactory() functionality be implemented through the constructor with the defaultWriteHandleFactory passed in? e.g.,

public GlobalSortPartitioner(WriteHandleFactory defaultWriteHandleFactory);

config, instantTime, table,
fileIdPrefixProvider.createFilePrefix(""), table.getTaskContextSupplier(),
- new CreateHandleFactory<>()).forEachRemaining(writeStatuses::addAll);
+ partitioner.getWriteHandleFactory(0)).forEachRemaining(writeStatuses::addAll);
Contributor

This looks hacky

Contributor Author

True... It is only for the Java engine, where bulk_insert routes all records to a single file group. Let me check if there is a better way to do the abstraction for the Java engine.

Comment on lines 36 to 37
private WriteHandleFactory defaultWriteHandleFactory;
private List<String> fileIdPfx;
Contributor

After looking at this PR as a whole, I'm thinking that it may be better to store generating functions of partitionId -> fileIdPrefix and partitionId -> writeHandleFactory, and have those functions passed in from the constructor.

Do you have any PoC of BulkInsertPartitioner implementation that provides partition-specific file ID and write handle factory? I'd like to understand how these are coupled with the repartition logic and how the interface design can accommodate the use case.
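The function-based design being discussed could look roughly like this sketch, with both mappings supplied through the constructor. All names here are illustrative stand-ins, not Hudi's API; the Optional return models the "fall back to the caller's default factory" behavior:

```java
import java.util.Optional;
import java.util.function.Function;

// Simplified stand-in for a write handle factory.
interface FactorySketch {
  String kind();
}

// Sketch: the partitioner stores partitionId -> fileIdPrefix and
// partitionId -> writeHandleFactory functions, injected via the constructor.
class FunctionBasedPartitioner {
  private final Function<Integer, String> fileIdPfxFn;
  private final Function<Integer, Optional<FactorySketch>> handleFactoryFn;

  FunctionBasedPartitioner(Function<Integer, String> fileIdPfxFn,
                           Function<Integer, Optional<FactorySketch>> handleFactoryFn) {
    this.fileIdPfxFn = fileIdPfxFn;
    this.handleFactoryFn = handleFactoryFn;
  }

  String getFileIdPfx(int partitionId) {
    return fileIdPfxFn.apply(partitionId);
  }

  // Empty means: the write path should use its default factory instead.
  Optional<FactorySketch> getWriteHandleFactory(int partitionId) {
    return handleFactoryFn.apply(partitionId);
  }
}

public class FunctionDemo {
  public static void main(String[] args) {
    FunctionBasedPartitioner p = new FunctionBasedPartitioner(
        id -> "fg-" + id,
        id -> id == 0 ? Optional.<FactorySketch>of(() -> "create") : Optional.empty());
    System.out.println(p.getFileIdPfx(7));                      // fg-7
    System.out.println(p.getWriteHandleFactory(1).isPresent()); // false
  }
}
```

A constructor-injected design like this keeps the fields final and avoids the mutable setter state discussed elsewhere in this review.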

Contributor Author

Yes, in order to enable concurrent clustering and upsert to the same file group, we have to control how records are routed to file groups in the clustering (which uses bulk_insert to write records). So in my case, a customized ClusteringExecutionStrategy and BulkInsertPartitioner are implemented.

Contributor Author

The overall design indeed is partitionId -> fileIdPrefix (fileIdPfxList) and partitionId -> writeHandleFactory (the getWriteHandleFactory interface). Of course, I will go with having those organized in the constructor, which should make the design clearer.

HoodieWriteConfig newConfig = HoodieWriteConfig.newBuilder().withProps(props).build();

BulkInsertPartitioner partitioner = getPartitioner(strategyParams, schema);
partitioner.setDefaultWriteHandleFactory(new CreateHandleFactory(preserveHoodieMetadata));
Contributor

This can be achieved through constructor.

@Override
public List<HoodieRecord<T>> repartitionRecords(
List<HoodieRecord<T>> records, int outputSparkPartitions) {
generateFileIdPfx(outputSparkPartitions);
Contributor

Wondering if this can be achieved by a function (func) passed to the constructor and logic something like IntStream.range(0, outputSparkPartitions).mapToObj(i -> func.apply(i))?
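The IntStream idea can be sketched as a small standalone method. The method name mirrors the one under review, but the surrounding class and the prefix function are hypothetical:

```java
import java.util.List;
import java.util.function.IntFunction;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PfxGenDemo {
  // Sketch of the suggestion: materialize the per-partition fileId prefixes
  // by applying a constructor-supplied function over the partition ids,
  // instead of mutating state inside repartitionRecords.
  static List<String> generateFileIdPfx(int outputSparkPartitions, IntFunction<String> func) {
    return IntStream.range(0, outputSparkPartitions)
        .mapToObj(func)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Hypothetical prefix function; real code might generate UUID-based prefixes.
    System.out.println(generateFileIdPfx(3, i -> "pfx-" + i)); // [pfx-0, pfx-1, pfx-2]
  }
}
```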

return new SparkLazyInsertIterable<>(recordItr, areRecordsSorted, config, instantTime, hoodieTable,
- fileIDPrefixes.get(partition), hoodieTable.getTaskContextSupplier(), useWriterSchema,
+ (String)partitioner.getFileIdPfx().get(partition), hoodieTable.getTaskContextSupplier(), useWriterSchema,
  writeHandleFactory);
Contributor

It's better to apply the partition ID -> file ID here if the partitioner just stores the function.

@YuweiXiao
Contributor Author

Thanks for the review, @yihua! I will try to use passed-in functions to manage the partition-fileId-writeHandle mapping, which I believe will give a better, clearer interface.

@YuweiXiao YuweiXiao force-pushed the HUDI-3085 branch 2 times, most recently from 2243884 to 58a02eb on February 26, 2022 14:01
@YuweiXiao YuweiXiao requested a review from yihua February 28, 2022 04:35
@YuweiXiao
Contributor Author

Hey @yihua, the PR is ready for another round of review :)

@vinothchandar vinothchandar self-assigned this Mar 9, 2022
@yihua yihua added the priority:medium Moderate impact; usability gaps label Mar 16, 2022
@YuweiXiao
Contributor Author

@hudi-bot run azure


/**
* Return write handle factory for the given partition.
* By default, return CreateHandleFactory which will always write to a new file group
Contributor

The description is not correct since it returns empty?

Contributor Author

Thanks! Fixed.


// write new files
- List<WriteStatus> writeStatuses = bulkInsert(inputRecords, instantTime, table, config, performDedupe, userDefinedBulkInsertPartitioner, false, config.getBulkInsertShuffleParallelism(), new CreateHandleFactory(false));
+ List<WriteStatus> writeStatuses = bulkInsert(inputRecords, instantTime, table, config, performDedupe, partitioner, false, config.getBulkInsertShuffleParallelism(), new CreateHandleFactory(false));
Contributor

this line can be split into two lines.

Contributor Author

Got it, fixed.

config, instantTime, table,
fileIdPrefixProvider.createFilePrefix(""), table.getTaskContextSupplier(),
- new CreateHandleFactory<>()).forEachRemaining(writeStatuses::addAll);
+ (WriteHandleFactory) partitioner.getWriteHandleFactory(0).orElse(writeHandleFactory)).forEachRemaining(writeStatuses::addAll);
Contributor

What's the meaning of passing 0 as the partitionId here?

Contributor Author

0 means getting the write handle factory for partition 0. The code is consistent with the previous behavior, as the Java engine always has only one data partition.

Contributor

we can add some comments here.

Contributor Author

Sure.


@Override
public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records, int outputSparkPartitions) {
SerializableSchema serializableSchema = new SerializableSchema(schema);
Contributor

nice improvement.

executor.getCommitActionType(), instantTime), Option.empty(),
config.shouldAllowMultiWriteOnSameInstant());

BulkInsertPartitioner partitioner = userDefinedBulkInsertPartitioner.isPresent()
Contributor

I see the duplicate code in JavaBulkInsertHelper, can we unify it?

Contributor Author

The BulkInsertPartitionerFactory is different for Spark and Java. I could extract an interface (e.g., GetBulkInsertPartitionerFactory) into the base class if we want to unify the code. But as @yihua said, changing a public interface may break existing users' code, requiring them to update their code too.

P.S. I have rewritten this part of the code to make it clearer.

Contributor

@leesf leesf left a comment

LGTM, @yihua do you have any other concern?

@leesf
Contributor

leesf commented Apr 25, 2022

@hudi-bot run azure

@YuweiXiao
Contributor Author

@hudi-bot run azure

@hudi-bot
Collaborator

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@leesf leesf merged commit f2ba0fe into apache:master Apr 25, 2022
leesf pushed a commit to leesf/hudi that referenced this pull request Nov 24, 2022

Labels

priority:medium Moderate impact; usability gaps

6 participants