Conversation

@YuweiXiao
Contributor

@YuweiXiao YuweiXiao commented Dec 24, 2021

What is the purpose of the pull request

Restructure the bulk insert partitioner interface to include the handling of the fileIdPfx and the write handle factory.

With this improvement, one can implement a new bulk_insert partitioner that routes records to pre-defined fileIds using a customized write handle factory (e.g., different write factories for different partitions).

JIRA link

Brief change log

  • Modify the interface of BulkInsertPartitioner to include the logic of handling the fileIdPrefix and the writeHandleFactory. With this update, the partitioner of the bulk_insert path now has the ability to control the records' final file group location (similar to the partitioner of the upsert path).
  • Modify the bulk_insert write path (e.g., AbstractBulkInsertHelper and its subclasses) to make use of the new partitioner interface. The partitioner is now mandatory (no longer optional), similar to the standard upsert/insert path.
  • The Java bulk_insert write path is mostly left untouched because of its special behavior: it always writes to a single file group (i.e., parallelism = 1) and has a customized fileId generator, FileIdPrefixProvider.
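The headline capability can be illustrated with a minimal sketch. The class, method, and routing rule below are simplified stand-ins for Hudi's actual BulkInsertPartitioner and write handle APIs, not the real signatures:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Illustrative sketch (not the real Hudi API): a bulk-insert style
// partitioner that routes each record to a pre-defined file group.
class PredefinedFileGroupPartitioner<R> {
  private final List<String> fileIdPfx;  // partition i writes file group fileIdPfx.get(i)
  private final ToIntFunction<R> router; // record -> target partition

  PredefinedFileGroupPartitioner(List<String> fileIdPfx, ToIntFunction<R> router) {
    this.fileIdPfx = fileIdPfx;
    this.router = router;
  }

  String getFileIdPfx(int partitionId) {
    return fileIdPfx.get(partitionId);
  }

  // Local stand-in for repartitionRecords: bucket records by target partition,
  // so each bucket lands in a known, pre-defined file group.
  List<List<R>> repartitionRecords(List<R> records) {
    List<List<R>> buckets = new ArrayList<>();
    for (int i = 0; i < fileIdPfx.size(); i++) {
      buckets.add(new ArrayList<>());
    }
    for (R r : records) {
      buckets.get(router.applyAsInt(r)).add(r);
    }
    return buckets;
  }
}

public class RoutingDemo {
  public static void main(String[] args) {
    // Hypothetical routing rule: keys starting with "a" go to partition 0.
    PredefinedFileGroupPartitioner<String> p = new PredefinedFileGroupPartitioner<>(
        List.of("fg-0001", "fg-0002"),
        key -> key.startsWith("a") ? 0 : 1);
    List<List<String>> buckets = p.repartitionRecords(List.of("ant", "bee", "ape"));
    System.out.println(buckets.size());     // 2
    System.out.println(p.getFileIdPfx(0));  // fg-0001
  }
}
```

In the actual PR the routing happens inside the engine-specific repartition step; this sketch only shows the contract a custom partitioner would fulfill.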

Verify this pull request

Added a fileId generation check to existing tests; the other parts are already covered by existing tests such as TestBulkInsertInternalPartitioner.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@YuweiXiao YuweiXiao force-pushed the HUDI-3085 branch 2 times, most recently from eba225e to 10b433d on December 24, 2021 05:29
@yihua yihua self-assigned this Dec 27, 2021
@yihua
Contributor

yihua commented Jan 9, 2022

@YuweiXiao Could you rebase your PR on master to resolve the conflicts?

@YuweiXiao
Contributor Author

@YuweiXiao Could you rebase your PR on master to resolve the conflicts?

Thanks for the reminder. The conflict has been resolved.

@YuweiXiao
Contributor Author

@hudi-bot run azure

@YuweiXiao
Contributor Author

@hudi-bot run azure

@nsivabalan nsivabalan added and then removed the priority:critical (Production degraded; pipelines stalled) label Feb 8, 2022
@YuweiXiao
Contributor Author

@hudi-bot run azure

@YuweiXiao
Contributor Author

@hudi-bot run azure

* partitions should be almost equal to (#inputRecords / #outputSparkPartitions) to avoid possible skews.
*/
- public interface BulkInsertPartitioner<I> {
+ public abstract class BulkInsertPartitioner<I> implements Serializable {
Contributor

This interface is public and users may implement their own bulk insert partitioner as a plugin. The change from interface to abstract class is not backward compatible. Could you keep it as an interface and use default methods for new logic?
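The backward-compatible alternative suggested here can be sketched as follows. The type names and method set are simplified assumptions for illustration, not the real Hudi interface: the partitioner stays an interface, and the new logic is added as default methods so existing user plugins keep compiling unchanged.

```java
import java.util.Optional;
import java.util.UUID;

// Simplified stand-in types, for illustration only.
interface HandleFactory { }

interface PartitionerSketch<I> {
  // Pre-existing contract: user implementations already provide these.
  I repartitionRecords(I records, int outputSparkPartitions);

  boolean arePartitionRecordsSorted();

  // New logic added as default methods: old implementations inherit
  // this behavior without any source changes.
  default String getFileIdPfx(int partitionId) {
    return UUID.randomUUID().toString(); // a fresh file group by default
  }

  default Optional<HandleFactory> getWriteHandleFactory(int partitionId) {
    return Optional.empty(); // caller falls back to its own default factory
  }
}

public class DefaultMethodDemo {
  // A legacy-style implementation that predates the new methods still compiles.
  static class Legacy implements PartitionerSketch<String> {
    public String repartitionRecords(String records, int n) {
      return records;
    }

    public boolean arePartitionRecordsSorted() {
      return false;
    }
  }

  public static void main(String[] args) {
    Legacy legacy = new Legacy();
    System.out.println(legacy.getWriteHandleFactory(0).isPresent()); // false
    System.out.println(legacy.getFileIdPfx(0).isEmpty());            // false
  }
}
```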

return fileIdPfx;
}

public void setDefaultWriteHandleFactory(WriteHandleFactory defaultWriteHandleFactory) {
Contributor

Should setDefaultWriteHandleFactory() functionality be implemented through the constructor with the defaultWriteHandleFactory passed in? e.g.,

public GlobalSortPartitioner(WriteHandleFactory defaultWriteHandleFactory);

config, instantTime, table,
fileIdPrefixProvider.createFilePrefix(""), table.getTaskContextSupplier(),
- new CreateHandleFactory<>()).forEachRemaining(writeStatuses::addAll);
+ partitioner.getWriteHandleFactory(0)).forEachRemaining(writeStatuses::addAll);
Contributor

This looks hacky

Contributor Author

True... It is only for the Java engine, where bulk_insert routes all records to a single file group. Let me check if there is a better way to do the abstraction for the Java engine.

Comment on lines 36 to 37
private WriteHandleFactory defaultWriteHandleFactory;
private List<String> fileIdPfx;
Contributor

After looking at this PR as a whole, I'm thinking that it may be better to store generating functions of partitionId -> fileIdPrefix and partitionId -> writeHandleFactory, and have those functions passed in from the constructor.

Do you have any PoC of BulkInsertPartitioner implementation that provides partition-specific file ID and write handle factory? I'd like to understand how these are coupled with the repartition logic and how the interface design can accommodate the use case.
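The function-based design being discussed could look roughly like this sketch, with both mappings supplied through the constructor. All names here are illustrative stand-ins, not Hudi's API; the Optional return models the "fall back to the caller's default factory" behavior:

```java
import java.util.Optional;
import java.util.function.Function;

// Simplified stand-in for a write handle factory.
interface FactorySketch {
  String kind();
}

// Sketch: the partitioner stores partitionId -> fileIdPrefix and
// partitionId -> writeHandleFactory functions, injected via the constructor.
class FunctionBasedPartitioner {
  private final Function<Integer, String> fileIdPfxFn;
  private final Function<Integer, Optional<FactorySketch>> handleFactoryFn;

  FunctionBasedPartitioner(Function<Integer, String> fileIdPfxFn,
                           Function<Integer, Optional<FactorySketch>> handleFactoryFn) {
    this.fileIdPfxFn = fileIdPfxFn;
    this.handleFactoryFn = handleFactoryFn;
  }

  String getFileIdPfx(int partitionId) {
    return fileIdPfxFn.apply(partitionId);
  }

  // Empty means: the write path should use its default factory instead.
  Optional<FactorySketch> getWriteHandleFactory(int partitionId) {
    return handleFactoryFn.apply(partitionId);
  }
}

public class FunctionDemo {
  public static void main(String[] args) {
    FunctionBasedPartitioner p = new FunctionBasedPartitioner(
        id -> "fg-" + id,
        id -> id == 0 ? Optional.<FactorySketch>of(() -> "create") : Optional.empty());
    System.out.println(p.getFileIdPfx(7));                      // fg-7
    System.out.println(p.getWriteHandleFactory(1).isPresent()); // false
  }
}
```

A constructor-injected design like this keeps the fields final and avoids the mutable setter state discussed elsewhere in this review.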

Contributor Author

Yes, in order to enable concurrent clustering and upsert to the same file group, we have to control how records are routed to file groups in the clustering (which uses bulk_insert to write records). So in my case, a customized ClusteringExecutionStrategy and BulkInsertPartitioner are implemented.

Contributor Author

The overall design indeed is partitionId -> fileIdPrefix (fileIdPfxList) and partitionId -> writeHandleFactory (the getWriteHandleFactory interface). Of course, I will go with having those organized in the constructor, which should make the design clearer.

HoodieWriteConfig newConfig = HoodieWriteConfig.newBuilder().withProps(props).build();

BulkInsertPartitioner partitioner = getPartitioner(strategyParams, schema);
partitioner.setDefaultWriteHandleFactory(new CreateHandleFactory(preserveHoodieMetadata));
Contributor

This can be achieved through constructor.

@Override
public List<HoodieRecord<T>> repartitionRecords(
List<HoodieRecord<T>> records, int outputSparkPartitions) {
generateFileIdPfx(outputSparkPartitions);
Contributor

Wondering if this can be achieved by a function (func) passed to the constructor and logic something like IntStream.range(0, outputSparkPartitions).mapToObj(i -> func.apply(i))?
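The IntStream idea can be sketched as a small standalone method. The method name mirrors the one under review, but the surrounding class and the prefix function are hypothetical:

```java
import java.util.List;
import java.util.function.IntFunction;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PfxGenDemo {
  // Sketch of the suggestion: materialize the per-partition fileId prefixes
  // by applying a constructor-supplied function over the partition ids,
  // instead of mutating state inside repartitionRecords.
  static List<String> generateFileIdPfx(int outputSparkPartitions, IntFunction<String> func) {
    return IntStream.range(0, outputSparkPartitions)
        .mapToObj(func)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Hypothetical prefix function; real code might generate UUID-based prefixes.
    System.out.println(generateFileIdPfx(3, i -> "pfx-" + i)); // [pfx-0, pfx-1, pfx-2]
  }
}
```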

return new SparkLazyInsertIterable<>(recordItr, areRecordsSorted, config, instantTime, hoodieTable,
- fileIDPrefixes.get(partition), hoodieTable.getTaskContextSupplier(), useWriterSchema,
+ (String)partitioner.getFileIdPfx().get(partition), hoodieTable.getTaskContextSupplier(), useWriterSchema,
  writeHandleFactory);
Contributor

It's better to apply the partition ID -> file ID here if the partitioner just stores the function.

@YuweiXiao
Contributor Author

Thanks for the review, @yihua! I will try to use passed-in functions to manage the partition-fileId-writeHandle mapping, which I believe will give a better, clearer interface.

@YuweiXiao YuweiXiao force-pushed the HUDI-3085 branch 2 times, most recently from 2243884 to 58a02eb on February 26, 2022 14:01
@YuweiXiao YuweiXiao requested a review from yihua February 28, 2022 04:35
@YuweiXiao
Contributor Author

Hey @yihua, the PR is ready for another round of review :)

@vinothchandar vinothchandar self-assigned this Mar 9, 2022
@yihua yihua added the priority:medium Moderate impact; usability gaps label Mar 16, 2022
@YuweiXiao
Contributor Author

@hudi-bot run azure


/**
* Return write handle factory for the given partition.
* By default, return CreateHandleFactory which will always write to a new file group
Contributor

The description is not correct since it returns empty?

Contributor Author

Thanks! Fixed.


// write new files
- List<WriteStatus> writeStatuses = bulkInsert(inputRecords, instantTime, table, config, performDedupe, userDefinedBulkInsertPartitioner, false, config.getBulkInsertShuffleParallelism(), new CreateHandleFactory(false));
+ List<WriteStatus> writeStatuses = bulkInsert(inputRecords, instantTime, table, config, performDedupe, partitioner, false, config.getBulkInsertShuffleParallelism(), new CreateHandleFactory(false));
Contributor

this line can be split into two lines.

Contributor Author

Got it, fixed.

config, instantTime, table,
fileIdPrefixProvider.createFilePrefix(""), table.getTaskContextSupplier(),
- new CreateHandleFactory<>()).forEachRemaining(writeStatuses::addAll);
+ (WriteHandleFactory) partitioner.getWriteHandleFactory(0).orElse(writeHandleFactory)).forEachRemaining(writeStatuses::addAll);
Contributor

What's the meaning of passing 0 as the partitionId here?

Contributor Author

0 means getting the write handle factory for partition 0. The code is consistent with the previous behavior, as the Java engine always has only one data partition.

Contributor

we can add some comments here.

Contributor Author

Sure.


@Override
public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records, int outputSparkPartitions) {
SerializableSchema serializableSchema = new SerializableSchema(schema);
Contributor

nice improvement.

executor.getCommitActionType(), instantTime), Option.empty(),
config.shouldAllowMultiWriteOnSameInstant());

BulkInsertPartitioner partitioner = userDefinedBulkInsertPartitioner.isPresent()
Contributor

I see the duplicate code in JavaBulkInsertHelper, can we unify it?

Contributor Author

The BulkInsertPartitionerFactory is different for Spark and Java. I could extract an interface (e.g., GetBulkInsertPartitionerFactory) into the base class if we want to unify the code. But as @yihua said, changing a public interface may break existing users' code, requiring them to update their code too.

P.S. I have rewritten this part of the code to make it clearer.

Contributor

@leesf leesf left a comment

LGTM, @yihua do you have any other concern?

@leesf
Contributor

leesf commented Apr 25, 2022

@hudi-bot run azure

@YuweiXiao
Contributor Author

@hudi-bot run azure

@hudi-bot
Collaborator

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@leesf leesf merged commit f2ba0fe into apache:master Apr 25, 2022
leesf pushed a commit to leesf/hudi that referenced this pull request Nov 24, 2022

Labels

priority:medium Moderate impact; usability gaps

6 participants