Is your feature request related to a problem? Please describe.
The iceberg-source currently runs a single ChangelogWorker thread per node, processing SHUFFLE_WRITE, SHUFFLE_READ, and CHANGELOG tasks sequentially. This limits throughput when processing large snapshots with many data files, as only one task is processed at a time on each node.
Describe the solution you'd like
Add a configurable workers parameter (similar to S3 source's workers config) that controls the number of ChangelogWorker threads per node.
When parallelizing SHUFFLE_WRITE tasks, the combined memory usage of LocalDiskShuffleWriter needs to be considered. Currently the writer buffers all records from a data file in memory before sorting by partition number. With a single worker this is bounded by one Iceberg data file (default 512MB), but parallelizing workers would multiply the memory usage proportionally. A configurable memory limit with spill to disk should be added as part of this work.
Describe alternatives you've considered (Optional)
N/A
Additional context
Related PR: #6682 (source-layer shuffle implementation)
Review comment: #6682 (comment)
Is your feature request related to a problem? Please describe.
The iceberg-source currently runs a single
ChangelogWorkerthread per node, processing SHUFFLE_WRITE, SHUFFLE_READ, and CHANGELOG tasks sequentially. This limits throughput when processing large snapshots with many data files, as only one task is processed at a time on each node.Describe the solution you'd like
Add a configurable
workersparameter (similar to S3 source'sworkersconfig) that controls the number ofChangelogWorkerthreads per node.When parallelizing SHUFFLE_WRITE tasks, the combined memory usage of
LocalDiskShuffleWriterneeds to be considered. Currently the writer buffers all records from a data file in memory before sorting by partition number. With a single worker this is bounded by one Iceberg data file (default 512MB), but parallelizing workers would multiply the memory usage proportionally. A configurable memory limit with spill to disk should be added as part of this work.Describe alternatives you've considered (Optional)
N/A
Additional context
Related PR: #6682 (source-layer shuffle implementation)
Review comment: #6682 (comment)