
[Feature][Connector-V2] Enable file split for S3File source #10450

Open

yzeng1618 wants to merge 4 commits into apache:dev from yzeng1618:dev-s3-split1

Conversation

@yzeng1618
Collaborator

#10129

Purpose of this pull request

Implements logical file splitting for the S3File source to improve read parallelism when ingesting large files from S3/MinIO.

  • Add enable_file_split and file_split_size options to the S3File source option rule.
  • Support splitting for text/csv/json (the split end is aligned to the next row_delimiter) and parquet (split by RowGroup; a RowGroup is never broken); the core idea is sketched below.
  • Ensure header skipping behaves correctly when enable_file_split=true but the split has no range (fallback to a non-splitting read).
  • Improve the parquet split failure message with the file path and root-cause context.
  • Update docs in both docs/en and docs/zh, and add/extend tests (unit + e2e).
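
For reviewers skimming the change, the core of the logical split is a byte-range plan of roughly the following shape (a minimal sketch with hypothetical names, not the exact code in this PR):

import java.util.ArrayList;
import java.util.List;

// Sketch of logical byte-range split planning (hypothetical names, not the
// actual classes in this PR). Each split covers [start, end); text-based
// readers later move `end` forward to the next row delimiter so no row is cut.
final class SplitPlanner {
    static List<long[]> planSplits(long fileSize, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileSize; start += splitSize) {
            long end = Math.min(start + splitSize, fileSize);
            splits.add(new long[] {start, end});
        }
        return splits;
    }
}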

Does this PR introduce any user-facing change?

Yes.

  • New source options for S3File: enable_file_split (boolean) and file_split_size (long).
  • Documentation updated to describe behavior, recommendations, and limitations (e.g., non-compressed formats only; otherwise fallback to non-splitting).
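
For reference, a minimal job config using the new options might look like the following (illustrative HOCON; only the two split keys are introduced by this PR, the remaining option names and values are assumptions):

source {
  S3File {
    bucket = "s3a://my-bucket"
    path = "/data/large-file.csv"
    file_format_type = "csv"
    enable_file_split = true
    file_split_size = 67108864   # 64 MB
  }
}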

How was this patch tested?

  • Added unit tests:
    • ReadStrategySplitFallbackTest (text/csv fallback behavior when split is enabled but no range provided)
    • S3FileFactoryTest (option rule contains split-related options)
  • Added E2E test:
    • S3FileWithFilterIT#testS3FileTextEnableSplitToAssert with MinIO + Assert sink

Check list

@DanielCarter-stack

Issue 1: file_split_size lacks input validation

Location: seatunnel-connectors-v2/connector-file/connector-file-s3/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/s3/source/S3FileSourceFactory.java

Related Context:

  • Define configuration option: S3FileSourceFactory.java:42-55
  • Use configuration option: CustomFileSplitGenerator (not modified in this PR)

Problem Description:
The file_split_size configuration option is not validated. Users may configure the following invalid values:

  • Negative numbers: lead to logic errors
  • Zero: may cause infinite loops or division-by-zero errors
  • Excessively large values: may cause a single split to exceed memory limits

Potential Risks:

  • Risk 1: When configured as 0 or a negative number, the splitEnd calculation may go wrong, producing abnormal split ranges
  • Risk 2: When configured as an extremely large value (e.g., Long.MAX_VALUE), reading degrades to a single split while users may mistakenly believe splitting is in effect
  • Risk 3: Without a clear error message, configuration issues are difficult to troubleshoot

Impact Scope:

  • Direct Impact: All S3File jobs using enable_file_split=true
  • Affected Area: Single Connector (S3File)

Severity: MAJOR

Improvement Suggestions:

// S3FileSourceFactory.java (SeaTunnel option API)
public static final Option<Long> FILE_SPLIT_SIZE =
        Options.key("file_split_size")
                .longType()
                .defaultValue(64 * 1024 * 1024L)
                .withDescription(
                        "The file split size (in bytes) when file split is enabled. "
                                + "Must be positive. Recommended values are between 1MB and 1GB. "
                                + "Note: actual split size may be larger due to row delimiter alignment.");

// Add validation where the options are resolved (hook shown is illustrative)
@Override
public void prepare(PrepareConfig config) throws Exception {
    // ... existing code ...

    Long fileSplitSize = config.get(FILE_SPLIT_SIZE);
    if (fileSplitSize != null && fileSplitSize <= 0) {
        throw new IllegalArgumentException(
                "file_split_size must be positive, but got: " + fileSplitSize);
    }

    // Optional: warn on very small values that produce excessive splits
    if (fileSplitSize != null && fileSplitSize < 1024 * 1024) {
        LOG.warn(
                "file_split_size is less than 1MB, which may cause too many splits. "
                        + "Recommended value: at least 1MB.");
    }
}

Rationale: Validating the configuration prevents runtime errors caused by invalid values, detects issues early, and gives users a clear error message.
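
A quick unit test can lock the validation in (a sketch assuming JUnit 5; the stand-in method mirrors the validation suggested above):

import static org.junit.jupiter.api.Assertions.assertThrows;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class FileSplitSizeValidationTest {

    // Stand-in for the validation logic suggested above (illustrative).
    private static void validateFileSplitSize(Long fileSplitSize) {
        if (fileSplitSize != null && fileSplitSize <= 0) {
            throw new IllegalArgumentException(
                    "file_split_size must be positive, but got: " + fileSplitSize);
        }
    }

    @Test
    void rejectsZeroAndNegativeSplitSize() {
        for (long bad : new long[] {0L, -1L, Long.MIN_VALUE}) {
            IllegalArgumentException e =
                    assertThrows(
                            IllegalArgumentException.class,
                            () -> validateFileSplitSize(bad));
            assertTrue(e.getMessage().contains("must be positive"));
        }
    }

    @Test
    void acceptsPositiveSplitSize() {
        validateFileSplitSize(64L * 1024 * 1024); // should not throw
    }
}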


Issue 2: TextReadStrategy hardcodes the line separator

Location: seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/TextReadStrategy.java

Related Context:

  • Parent class: AbstractReadStrategy
  • Child classes: CsvReadStrategy, JsonReadStrategy
  • Caller: FileSourceReader

Problem Description:
The adjustSplitEndToNextDelimiter method hardcodes \n as the line separator and does not support Windows-style \r\n. Although most modern systems and S3 storage use Unix-style line breaks, issues can arise in certain scenarios (e.g., files uploaded directly from Windows).

Code Snippet:

protected long adjustSplitEndToNextDelimiter(
        FileSourceSplit sourceSplit, long splitEnd, byte[] delimiter) {
    // ...
    int nextNewlinePos = findNextNewlinePosition(content, 0, content.length);
    // ...
}

private int findNextNewlinePosition(byte[] content, int start, int end) {
    for (int i = start; i < end; i++) {
        if (content[i] == '\n') { // Hardcoded \n
            return i;
        }
    }
    return -1;
}

Potential Risks:

  • Risk 1: If the file uses \r\n line breaks, a split may cut at \r, leaving a trailing \r on the last line of the split
  • Risk 2: Downstream processing may produce parsing errors due to the trailing \r (e.g., the last field value carries an extra invisible character)
  • Risk 3: Inconsistent with the expectations of some CSV parsers (most CSV parsers handle \r\n correctly)

Impact Scope:

  • Direct Impact: Using enable_file_split=true to read text/csv/json files
  • Indirect Impact: Downstream sinks may receive dirty data containing \r
  • Affected Area: Single Connector (S3File and other file connectors that may use this logic)

Severity: MINOR

Improvement Suggestions:

// NOTE: unlike the original, this returns the position *after* the delimiter,
// so the caller in adjustSplitEndToNextDelimiter must be updated to match.
private int findNextNewlinePosition(byte[] content, int start, int end) {
    for (int i = start; i < end; i++) {
        if (content[i] == '\n') {
            // Unix style: found \n, return position after \n
            return i + 1;
        }
        if (content[i] == '\r' && i + 1 < end && content[i + 1] == '\n') {
            // Windows style: found \r\n, return position after \n
            return i + 2;
        }
        if (content[i] == '\r') {
            // Old Mac style: lone \r, return position after \r
            // (caveat: a \r at the very end of the buffer is treated as lone
            // even if the matching \n starts the next buffer)
            return i + 1;
        }
    }
    return -1;
}

Rationale: Support mainstream line break formats (Unix \n, Windows \r\n, Old Mac \r) to improve system robustness. This aligns with the behavior of most text parsers and IDEs.
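
A table-driven unit test makes the three styles easy to pin down (a sketch assuming JUnit 5; the method is copied from the suggestion above for illustration):

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.nio.charset.StandardCharsets;
import org.junit.jupiter.api.Test;

class NewlinePositionTest {

    // Copy of the suggested implementation, for illustration only.
    private static int findNextNewlinePosition(byte[] content, int start, int end) {
        for (int i = start; i < end; i++) {
            if (content[i] == '\n') {
                return i + 1;
            }
            if (content[i] == '\r' && i + 1 < end && content[i + 1] == '\n') {
                return i + 2;
            }
            if (content[i] == '\r') {
                return i + 1;
            }
        }
        return -1;
    }

    @Test
    void handlesAllThreeNewlineStyles() {
        byte[] unix = "ab\ncd".getBytes(StandardCharsets.UTF_8);
        byte[] windows = "ab\r\ncd".getBytes(StandardCharsets.UTF_8);
        byte[] oldMac = "ab\rcd".getBytes(StandardCharsets.UTF_8);
        byte[] none = "abcd".getBytes(StandardCharsets.UTF_8);

        assertEquals(3, findNextNewlinePosition(unix, 0, unix.length));       // after \n
        assertEquals(4, findNextNewlinePosition(windows, 0, windows.length)); // after \r\n
        assertEquals(3, findNextNewlinePosition(oldMac, 0, oldMac.length));   // after lone \r
        assertEquals(-1, findNextNewlinePosition(none, 0, none.length));      // no delimiter
    }
}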


Issue 3: CSV split fallback logic is not explicit enough

Location: seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java

Related Context:

  • Parent class: TextReadStrategy
  • Caller: FileSourceReader
  • Related tests: ReadStrategySplitFallbackTest

Problem Description:
In CSV's prepareRead method, start >= end determines whether to skip the header. This implicitly encodes "fall back to non-split reading when the split has no valid range," but the behavior is not explicit and may confuse future maintainers.

Code Snippet:

public void prepareRead(...) {
    if (start >= end) {
        // fallback to non-splitting read
        skipHeader(in);
    }
    // ...
    adjustSplitEndToNextDelimiter(sourceSplit, end, rowDelimiter);
}

Potential Risks:

  • Risk 1: Reduced readability; maintainers may not understand why the header is skipped when start >= end
  • Risk 2: If the way prepareRead is called changes in the future, this implicit assumption may silently break
  • Risk 3: No logging, so there is no way to tell whether the fallback occurred

Impact Scope:

  • Direct Impact: CSV file split behavior
  • Affected Area: Single Connector

Severity: MINOR

Improvement Suggestions:

public void prepareRead(...) {
    // Explicit check: if split has no valid range, fallback to non-splitting read
    if (start >= end) {
        LOG.debug("Split {} has no valid range (start={}, end={}), "
                + "falling back to non-splitting read with header skip",
                sourceSplit.splitId(), start, end);
        skipHeader(in);
    }
    
    // ... rest of the code ...
}

Rationale: Add comments and logs to explicitly state fallback behavior, improving code readability and maintainability.


Issue 4: Parquet split error handling improvements not uniformly applied to other formats

Location: seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/split/ParquetFileSplitStrategy.java

Related Context:

  • Related classes: TextReadStrategy, CsvReadStrategy
  • Comparison: Parquet has enhanced error handling, but Text/CSV does not

Problem Description:
When a Parquet split fails, the error message is enhanced (adding the file path and root cause), but Text/CSV split failures receive no similar improvement. If a Text/CSV split fails (although less likely), users will struggle to troubleshoot.

Code Comparison:

Parquet (Improved):

throw new IOException(
    String.format("Failed to get split for file: %s", filePath), e);

Text/CSV (Not Improved):

// No explicit error handling, relies on framework default exception propagation

Potential Risks:

  • Risk 1: When Text/CSV split fails, error messages may not be detailed enough to locate the problem
  • Risk 2: Inconsistent error handling behavior between different formats increases user confusion
  • Risk 3: If adjustSplitEndToNextDelimiter throws an exception (e.g., array out of bounds), there is no context information

Impact Scope:

  • Direct Impact: Text/CSV/JSON file split error diagnosis
  • Affected Area: Single Connector

Severity: MINOR

Improvement Suggestions:

// TextReadStrategy.java
protected long adjustSplitEndToNextDelimiter(
        FileSourceSplit sourceSplit, long splitEnd, byte[] delimiter) throws IOException {
    try {
        // ... existing logic, returning the adjusted split end ...
        return splitEnd;
    } catch (Exception e) {
        throw new IOException(
                String.format(
                        "Failed to adjust split end for file: %s, splitId: %s, splitEnd: %d",
                        sourceSplit.path(), sourceSplit.splitId(), splitEnd),
                e);
    }
}

Rationale: Unify error handling style to improve error diagnostic capabilities for all formats.


Issue 5: Missing unit tests for Parquet split functionality

Location: seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/ReadStrategySplitFallbackTest.java

Related Context:

  • Existing tests: Only covers text and csv
  • Related classes: ParquetFileSplitStrategy, ParquetReadStrategy

Problem Description:
The unit test ReadStrategySplitFallbackTest only covers fallback behavior for text and csv; it does not test Parquet split behavior. Parquet's split logic differs from text/csv (it is RowGroup-based and needs no line alignment) and should be tested separately.

Potential Risks:

  • Risk 1: After Parquet split logic is modified, there may be no test coverage, leading to regression issues
  • Risk 2: Unable to verify correctness of Parquet split (e.g., whether it correctly splits by RowGroup)
  • Risk 3: Unable to verify error handling improvements when Parquet split fails

Impact Scope:

  • Direct Impact: Parquet file split functionality quality
  • Affected Area: Single Connector

Severity: MINOR

Improvement Suggestions:

// ReadStrategySplitFallbackTest.java
// (helper methods and constructor signatures below are illustrative)
@Test
public void testParquetSplitWithValidRange() throws Exception {
    // Create a test Parquet file with multiple RowGroups
    Path testParquetFile = createTestParquetFile(3); // 3 RowGroups
    
    ParquetReadStrategy readStrategy = new ParquetReadStrategy();
    FileSourceSplit split = new FileSourceSplit(0, testParquetFile, 0, 1024, null);
    
    // Prepare read should succeed
    readStrategy.prepareRead(/* params */);
    
    // Verify that split is handled correctly
    // ...
}

@Test
public void testParquetSplitFailureMessage() throws Exception {
    // Test the enhanced error message
    Path invalidParquetFile = createInvalidParquetFile();
    
    ParquetFileSplitStrategy splitStrategy = new ParquetFileSplitStrategy();
    
    IOException exception = assertThrows(IOException.class, () -> {
        splitStrategy.getSplits(invalidParquetFile, /* params */);
    });
    
    assertTrue(exception.getMessage().contains("Failed to get split for file:"));
    assertTrue(exception.getMessage().contains(invalidParquetFile.toString()));
}

Rationale: Improve test coverage to ensure correctness of Parquet split and effectiveness of error handling.


Issue 6: Missing split support for JSON format

Location: Documentation mentions JSON format split support, but no explicit verification in code

Related Context:

  • Documentation: docs/en/connectors/source/S3File.md mentions json format support
  • Related classes: JsonReadStrategy (not modified in this PR)
  • Inheritance: JsonReadStrategy may inherit from TextReadStrategy or other classes

Problem Description:
The PR description mentions split support for the json format, but the change list shows no modifications to JsonReadStrategy.java. The following needs confirmation:

  1. Does JsonReadStrategy inherit from TextReadStrategy? (If so, split is automatically supported)
  2. Does JSON split require special line alignment logic? (e.g., multi-line JSON objects)
  3. Has JSON split functionality been tested?

Potential Risks:

  • Risk 1: If JsonReadStrategy does not inherit from TextReadStrategy, the JSON format does not support split, and the documentation misleads users
  • Risk 2: If JSON files contain multi-line JSON objects (i.e., pretty-printed JSON rather than JSON Lines), the current logic may not handle them correctly
  • Risk 3: Without tests, there is no way to confirm that JSON split works properly

Impact Scope:

  • Direct Impact: JSON file split functionality
  • Affected Area: Single Connector

Severity: MAJOR (if JSON is in fact not supported) / MINOR (if JSON is supported automatically via inheritance)

Improvement Suggestions:
Further inspection of JsonReadStrategy implementation is needed:

  1. If JsonReadStrategy inherits from TextReadStrategy:

    • Add unit tests for JSON format
    • Document in docs that JSON only supports JSON Lines format (one JSON object per line)
  2. If JsonReadStrategy does not inherit from TextReadStrategy:

    • Implement similar split logic, or explicitly state JSON split is not supported
    • Update documentation to remove JSON split description

Rationale: Ensure documentation and implementation are consistent, avoiding user misunderstanding.
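
If confirmation shows that JsonReadStrategy does extend TextReadStrategy, a one-line guard test can lock that assumption in so a future refactor cannot silently drop JSON split support (a sketch assuming JUnit 5 and the class names discussed above):

import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class JsonSplitInheritanceGuardTest {

    @Test
    void jsonReadStrategyInheritsTextSplitLogic() {
        // If this breaks, JSON no longer inherits the split/alignment logic
        // from TextReadStrategy and the split docs must be revisited.
        assertTrue(
                TextReadStrategy.class.isAssignableFrom(JsonReadStrategy.class),
                "JsonReadStrategy is expected to inherit split support from TextReadStrategy");
    }
}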


Issue 7: E2E tests do not cover all scenarios

Location: seatunnel-e2e/seatunnel-connector-v2-e2e/connector-file-s3-e2e/src/test/java/org/apache/seatunnel/e2e/connector/file/s3/S3FileWithFilterIT.java

Related Context:

  • Existing tests: testS3FileTextEnableSplitToAssert (tests CSV + header)
  • Missing scenarios: Parquet, JSON, text without header, split edge cases

Problem Description:
E2E tests only cover the CSV + header scenario, missing tests for the following:

  1. Parquet file split (need to verify RowGroup splitting)
  2. JSON file split
  3. Text file split without header
  4. Split edge cases (file size is exactly a multiple of file_split_size)
  5. Performance comparison with different file_split_size configurations

Potential Risks:

  • Risk 1: Parquet and JSON split functionality not verified by E2E, may fail in real environments
  • Risk 2: Edge cases not tested, may cause issues under specific data distributions
  • Risk 3: Unable to evaluate actual performance improvement of split functionality

Impact Scope:

  • Direct Impact: Test coverage and functionality reliability
  • Affected Area: Single Connector

Severity: MINOR

Improvement Suggestions:

// S3FileWithFilterIT.java

@Test
public void testS3FileParquetEnableSplit() throws Exception {
    // Test Parquet file with multiple RowGroups
    // Verify split behavior and data correctness
}

@Test
public void testS3FileJsonEnableSplit() throws Exception {
    // Test JSON Lines file
    // Verify split behavior and data correctness
}

@Test
public void testS3FileTextNoHeaderEnableSplit() throws Exception {
    // Test text file without header
    // Verify no header is skipped
}

@Test
public void testS3FileSplitBoundaryCase() throws Exception {
    // Test file size exactly equals file_split_size
    // Verify no data duplication or loss at boundaries
}

Rationale: Improve test coverage to ensure all declared supported formats and scenarios are verified by E2E.
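
For edge case 4 in particular, the core boundary math can be asserted independently of the E2E harness (a minimal sketch with hypothetical expectations about the enumerator):

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class SplitBoundaryMathTest {

    @Test
    void exactMultipleProducesNoEmptyTrailingSplit() {
        long splitSize = 64L * 1024 * 1024;
        long fileSize = 4 * splitSize; // file size is exactly 4 splits

        long expectedSplits = (fileSize + splitSize - 1) / splitSize; // ceil division
        assertEquals(4, expectedSplits);

        // The last split must end exactly at fileSize; a fifth, empty split
        // (start == end) must never be emitted by the enumerator.
        long lastStart = (expectedSplits - 1) * splitSize;
        assertEquals(fileSize, lastStart + splitSize);
    }
}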


Issue 8: Documentation lacks detailed explanation of split limitations

Location: docs/en/connectors/source/S3File.md and docs/zh/connectors/source/S3File.md

Problem Description:
Documentation mentions that only "non-compressed formats" support split, but does not explain in detail:

  1. Which compression formats are not supported? (gz, zip, bz2, lz4, snappy, etc.)
  2. What happens if users enable split for compressed files? (Automatic fallback? An error?)
  3. Does Parquet's internal snappy compression support split? (It should, because the Parquet split is logical, by RowGroup)
  4. Does split affect data order? (It may, because of parallel reading)

Potential Risks:

  • Risk 1: Users may enable split for compressed files expecting a performance improvement, but see no effect
  • Risk 2: Users do not understand why certain files cannot be split, causing confusion
  • Risk 3: Order-dependent scenarios (e.g., log files) may see data order change due to splitting

Impact Scope:

  • Direct Impact: User understanding and configuration
  • Affected Area: Single Connector's user experience

Severity: MINOR

Improvement Suggestions:

Add detailed explanation in documentation:

### File Split Limitations

- **Supported formats**: 
  - Text (plain text files)
  - CSV (including with/without header)
  - JSON Lines (one JSON object per line)
  - Parquet (split by RowGroup)

- **Unsupported formats**:
  - Compressed text files (e.g., .gz, .bz2, .zip, .lz4) - split will be automatically disabled
  - Excel (.xlsx)
  - XML
  - Single-line JSON files (not JSON Lines)

- **Behavior with unsupported formats**:
  If you enable `enable_file_split` for an unsupported format, the system will 
  automatically fall back to non-splitting mode. A warning log will be emitted.

- **Data ordering**:
  When file split is enabled, data may be read out of order across splits. 
  If strict ordering is required, do not enable file split or use a single-split strategy.

- **Parquet compression**:
  Parquet files with internal compression (e.g., Snappy, Gzip) are fully supported,
  because Parquet split is based on RowGroup boundaries, not byte ranges.

Rationale: Provide clear and complete limitation descriptions to help users correctly understand and configure.


Issue 9: Missing Metrics for split performance monitoring

Location: The entire PR adds no Metrics-related code

Related Context:

  • SeaTunnel Metrics framework
  • Related classes: FileSourceReader, FileSourceSplitEnumerator

Problem Description:
The PR adds no Metrics to monitor split behavior and performance, such as:

  1. Actual number of splits generated
  2. Size distribution of each split
  3. Split read duration
  4. Split failure rate
  5. Frequency of split alignment adjustments

Potential Risks:

  • Risk 1: Users cannot monitor whether split functionality is effective
  • Risk 2: Unable to tune file_split_size parameter
  • Risk 3: Unable to diagnose split-related performance issues
  • Risk 4: Unable to evaluate actual effectiveness of split functionality

Impact Scope:

  • Direct Impact: Observability and operational capabilities
  • Affected Area: Single Connector

Severity: MINOR

Improvement Suggestions:

// FileSourceReader.java (Dropwizard-style pseudocode; adapt to the engine's actual metrics API)
private Counter splitCounter;
private Histogram splitSizeHistogram;
private Timer splitReadTimer;

public void open() {
    splitCounter = context.metricRegistry().counter("file.split.count");
    splitSizeHistogram = context.metricRegistry().histogram("file.split.size");
    splitReadTimer = context.metricRegistry().timer("file.split.read.time");
}

public void readNext() {
    Timer.Context timeContext = splitReadTimer.time();
    try {
        // ... reading logic ...
        splitCounter.inc();
        splitSizeHistogram.update(splitSize);
    } finally {
        timeContext.stop();
    }
}

Rationale: Provide observability to help users monitor and tune split functionality.


@yzeng1618
Collaborator Author

The issues described above have been addressed and fixed.

@corgy-w
Contributor

corgy-w commented Feb 6, 2026

@chl-wxp Did you implement data files in another format? If so, take a look at this.

3 participants