[FEATURE] Parallelize ChangelogWorker threads for improved throughput in iceberg-source #6723

@lawofcycles

Description

Is your feature request related to a problem? Please describe.

The iceberg-source currently runs a single ChangelogWorker thread per node, processing SHUFFLE_WRITE, SHUFFLE_READ, and CHANGELOG tasks sequentially. This limits throughput when processing large snapshots with many data files, as only one task is processed at a time on each node.

Describe the solution you'd like

Add a configurable `workers` parameter (similar to the S3 source's `workers` config) that controls the number of ChangelogWorker threads per node.
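A minimal sketch of what the parallelized worker loop could look like: a fixed-size thread pool whose size comes from the proposed `workers` setting, with each thread pulling tasks from a shared queue. The class name, the integer stand-in for tasks, and the `run` signature are all hypothetical, not the actual iceberg-source API.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: replace the single ChangelogWorker loop with a
// configurable pool. "workers" mirrors the S3 source's config name.
public class ChangelogWorkerPool {
    // Drains the shared task queue with `workers` threads; returns the
    // number of tasks processed. Integer tasks stand in for
    // SHUFFLE_WRITE / SHUFFLE_READ / CHANGELOG task objects.
    public static int run(int workers, BlockingQueue<Integer> tasks)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger processed = new AtomicInteger();
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                Integer task;
                while ((task = tasks.poll()) != null) {
                    processed.incrementAndGet(); // stand-in for processTask(task)
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
        for (int i = 0; i < 100; i++) tasks.add(i);
        System.out.println(run(4, tasks)); // all 100 tasks drained -> 100
    }
}
```

With `workers: 1` this degenerates to today's sequential behavior, so the default could stay backward compatible.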

When parallelizing SHUFFLE_WRITE tasks, the combined memory usage of LocalDiskShuffleWriter needs to be considered. Currently the writer buffers all records from a data file in memory before sorting them by partition number. With a single worker this is bounded by the size of one Iceberg data file (512 MB by default), but N parallel workers would multiply that usage up to N-fold. A configurable memory limit with spill to disk should be added as part of this work.
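One possible shape for the spill behavior, sketched below: the buffer tracks its in-memory byte count and flushes to a spill target once a configured budget is exceeded. `SpillableBuffer`, `memoryLimitBytes`, and the list standing in for a spill file are all illustrative assumptions, not the actual LocalDiskShuffleWriter internals (which would also sort by partition before writing).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a memory-bounded record buffer with spill-to-disk.
// A List<byte[]> stands in for the on-disk spill file.
public class SpillableBuffer {
    private final long memoryLimitBytes;   // per-writer (or shared) budget
    private long inMemoryBytes = 0;
    private final List<byte[]> inMemory = new ArrayList<>();
    private final List<byte[]> spilled = new ArrayList<>();

    public SpillableBuffer(long memoryLimitBytes) {
        this.memoryLimitBytes = memoryLimitBytes;
    }

    public void add(byte[] record) {
        inMemory.add(record);
        inMemoryBytes += record.length;
        if (inMemoryBytes > memoryLimitBytes) {
            spill();
        }
    }

    private void spill() {
        // In a real writer: sort buffered records by partition number,
        // then append them to a spill file on local disk.
        spilled.addAll(inMemory);
        inMemory.clear();
        inMemoryBytes = 0;
    }

    public int spilledCount()       { return spilled.size(); }
    public long inMemoryBytesUsed() { return inMemoryBytes; }

    public static void main(String[] args) {
        SpillableBuffer buf = new SpillableBuffer(1024);
        for (int i = 0; i < 10; i++) buf.add(new byte[200]); // 2000 bytes total
        // Crossing the 1024-byte budget at record 6 spills records 1-6;
        // records 7-10 (800 bytes) remain buffered.
        System.out.println(buf.spilledCount() + "," + buf.inMemoryBytesUsed());
    }
}
```

Under this scheme, total memory stays near `workers * memoryLimitBytes` regardless of data file size; a shared budget across workers would bound it even tighter.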

Describe alternatives you've considered (Optional)
N/A

Additional context

Related PR: #6682 (source-layer shuffle implementation)
Review comment: #6682 (comment)

Metadata

Labels: enhancement (New feature or request)
Status: Unplanned
