
[Feature][Connector-V2] Enable file split for S3File source #10450

Open

yzeng1618 wants to merge 4 commits into apache:dev from yzeng1618:dev-s3-split1

Conversation

@yzeng1618
Collaborator

#10129

Purpose of this pull request

Implements logical file splitting for the S3File source to improve read parallelism when ingesting large files from S3/MinIO.

  • Add enable_file_split and file_split_size options to the S3File source option rule.
  • Support splitting for text/csv/json (the split end is aligned to the next row_delimiter) and parquet (split by RowGroup; a RowGroup is never broken); the core idea is sketched below.
  • Ensure header skipping behaves correctly when enable_file_split=true but the split has no range (fallback to a non-splitting read).
  • Improve the parquet split failure message with the file path and root-cause context.
  • Update docs in both docs/en and docs/zh, and add/extend tests (unit + e2e).
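
For reviewers skimming the change, the core of the logical split is a byte-range plan of roughly the following shape (a minimal sketch with hypothetical names, not the exact code in this PR):

import java.util.ArrayList;
import java.util.List;

// Sketch of logical byte-range split planning (hypothetical names, not the
// actual classes in this PR). Each split covers [start, end); text-based
// readers later move `end` forward to the next row delimiter so no row is cut.
final class SplitPlanner {
    static List<long[]> planSplits(long fileSize, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileSize; start += splitSize) {
            long end = Math.min(start + splitSize, fileSize);
            splits.add(new long[] {start, end});
        }
        return splits;
    }
}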

Does this PR introduce any user-facing change?

Yes.

  • New source options for S3File: enable_file_split (boolean) and file_split_size (long).
  • Documentation updated to describe behavior, recommendations, and limitations (e.g., non-compressed formats only; otherwise fallback to non-splitting).
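
For reference, a minimal job config using the new options might look like the following (illustrative HOCON; only the two split keys are introduced by this PR, the remaining option names and values are assumptions):

source {
  S3File {
    bucket = "s3a://my-bucket"
    path = "/data/large-file.csv"
    file_format_type = "csv"
    enable_file_split = true
    file_split_size = 67108864   # 64 MB
  }
}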

How was this patch tested?

  • Added unit tests:
    • ReadStrategySplitFallbackTest (text/csv fallback behavior when split is enabled but no range provided)
    • S3FileFactoryTest (option rule contains split-related options)
  • Added E2E test:
    • S3FileWithFilterIT#testS3FileTextEnableSplitToAssert with MinIO + Assert sink

Check list

@DanielCarter-stack

Issue 1: file_split_size lacks input validation

Location: seatunnel-connectors-v2/connector-file/connector-file-s3/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/s3/source/S3FileSourceFactory.java

Related Context:

  • Define configuration option: S3FileSourceFactory.java:42-55
  • Use configuration option: CustomFileSplitGenerator (not modified in this PR)

Problem Description:
The file_split_size configuration option is not validated. Users may configure the following invalid values:

  • Negative numbers: lead to logic errors
  • Zero: may cause infinite loops or division-by-zero errors
  • Excessively large values: may cause a single split to exceed memory limits

Potential Risks:

  • Risk 1: When configured as 0 or a negative number, the splitEnd calculation may go wrong, producing abnormal split ranges
  • Risk 2: When configured as an extremely large value (e.g., Long.MAX_VALUE), reading degrades to a single split while users may mistakenly believe splitting is in effect
  • Risk 3: Without a clear error message, configuration issues are difficult to troubleshoot

Impact Scope:

  • Direct Impact: All S3File jobs using enable_file_split=true
  • Affected Area: Single Connector (S3File)

Severity: MAJOR

Improvement Suggestions:

// S3FileSourceFactory.java (SeaTunnel option API)
public static final Option<Long> FILE_SPLIT_SIZE =
        Options.key("file_split_size")
                .longType()
                .defaultValue(64 * 1024 * 1024L)
                .withDescription(
                        "The file split size (in bytes) when file split is enabled. "
                                + "Must be positive. Recommended values are between 1MB and 1GB. "
                                + "Note: actual split size may be larger due to row delimiter alignment.");

// Add validation where the options are resolved (hook shown is illustrative)
@Override
public void prepare(PrepareConfig config) throws Exception {
    // ... existing code ...

    Long fileSplitSize = config.get(FILE_SPLIT_SIZE);
    if (fileSplitSize != null && fileSplitSize <= 0) {
        throw new IllegalArgumentException(
                "file_split_size must be positive, but got: " + fileSplitSize);
    }

    // Optional: warn on very small values that produce excessive splits
    if (fileSplitSize != null && fileSplitSize < 1024 * 1024) {
        LOG.warn(
                "file_split_size is less than 1MB, which may cause too many splits. "
                        + "Recommended value: at least 1MB.");
    }
}

Rationale: Validating the configuration prevents runtime errors caused by invalid values, detects issues early, and gives users a clear error message.
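
A quick unit test can lock the validation in (a sketch assuming JUnit 5; the stand-in method mirrors the validation suggested above):

import static org.junit.jupiter.api.Assertions.assertThrows;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class FileSplitSizeValidationTest {

    // Stand-in for the validation logic suggested above (illustrative).
    private static void validateFileSplitSize(Long fileSplitSize) {
        if (fileSplitSize != null && fileSplitSize <= 0) {
            throw new IllegalArgumentException(
                    "file_split_size must be positive, but got: " + fileSplitSize);
        }
    }

    @Test
    void rejectsZeroAndNegativeSplitSize() {
        for (long bad : new long[] {0L, -1L, Long.MIN_VALUE}) {
            IllegalArgumentException e =
                    assertThrows(
                            IllegalArgumentException.class,
                            () -> validateFileSplitSize(bad));
            assertTrue(e.getMessage().contains("must be positive"));
        }
    }

    @Test
    void acceptsPositiveSplitSize() {
        validateFileSplitSize(64L * 1024 * 1024); // should not throw
    }
}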


Issue 2: TextReadStrategy hardcodes the line separator

Location: seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/TextReadStrategy.java

Related Context:

  • Parent class: AbstractReadStrategy
  • Child classes: CsvReadStrategy, JsonReadStrategy
  • Caller: FileSourceReader

Problem Description:
The adjustSplitEndToNextDelimiter method hardcodes \n as the line separator and does not support Windows-style \r\n. Although most modern systems and S3 storage use Unix-style line breaks, issues can arise in certain scenarios (e.g., files uploaded directly from Windows).

Code Snippet:

protected long adjustSplitEndToNextDelimiter(
        FileSourceSplit sourceSplit, long splitEnd, byte[] delimiter) {
    // ...
    int nextNewlinePos = findNextNewlinePosition(content, 0, content.length);
    // ...
}

private int findNextNewlinePosition(byte[] content, int start, int end) {
    for (int i = start; i < end; i++) {
        if (content[i] == '\n') { // Hardcoded \n
            return i;
        }
    }
    return -1;
}

Potential Risks:

  • Risk 1: If the file uses \r\n line breaks, a split may cut at \r, leaving a trailing \r on the last line of the split
  • Risk 2: Downstream processing may produce parsing errors due to the trailing \r (e.g., the last field value carries an extra invisible character)
  • Risk 3: Inconsistent with the expectations of some CSV parsers (most CSV parsers handle \r\n correctly)

Impact Scope:

  • Direct Impact: Using enable_file_split=true to read text/csv/json files
  • Indirect Impact: Downstream sinks may receive dirty data containing \r
  • Affected Area: Single Connector (S3File and other file connectors that may use this logic)

Severity: MINOR

Improvement Suggestions:

// NOTE: unlike the original, this returns the position *after* the delimiter,
// so the caller in adjustSplitEndToNextDelimiter must be updated to match.
private int findNextNewlinePosition(byte[] content, int start, int end) {
    for (int i = start; i < end; i++) {
        if (content[i] == '\n') {
            // Unix style: found \n, return position after \n
            return i + 1;
        }
        if (content[i] == '\r' && i + 1 < end && content[i + 1] == '\n') {
            // Windows style: found \r\n, return position after \n
            return i + 2;
        }
        if (content[i] == '\r') {
            // Old Mac style: lone \r, return position after \r
            // (caveat: a \r at the very end of the buffer is treated as lone
            // even if the matching \n starts the next buffer)
            return i + 1;
        }
    }
    return -1;
}

Rationale: Support mainstream line break formats (Unix \n, Windows \r\n, Old Mac \r) to improve system robustness. This aligns with the behavior of most text parsers and IDEs.
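
A table-driven unit test makes the three styles easy to pin down (a sketch assuming JUnit 5; the method is copied from the suggestion above for illustration):

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.nio.charset.StandardCharsets;
import org.junit.jupiter.api.Test;

class NewlinePositionTest {

    // Copy of the suggested implementation, for illustration only.
    private static int findNextNewlinePosition(byte[] content, int start, int end) {
        for (int i = start; i < end; i++) {
            if (content[i] == '\n') {
                return i + 1;
            }
            if (content[i] == '\r' && i + 1 < end && content[i + 1] == '\n') {
                return i + 2;
            }
            if (content[i] == '\r') {
                return i + 1;
            }
        }
        return -1;
    }

    @Test
    void handlesAllThreeNewlineStyles() {
        byte[] unix = "ab\ncd".getBytes(StandardCharsets.UTF_8);
        byte[] windows = "ab\r\ncd".getBytes(StandardCharsets.UTF_8);
        byte[] oldMac = "ab\rcd".getBytes(StandardCharsets.UTF_8);
        byte[] none = "abcd".getBytes(StandardCharsets.UTF_8);

        assertEquals(3, findNextNewlinePosition(unix, 0, unix.length));       // after \n
        assertEquals(4, findNextNewlinePosition(windows, 0, windows.length)); // after \r\n
        assertEquals(3, findNextNewlinePosition(oldMac, 0, oldMac.length));   // after lone \r
        assertEquals(-1, findNextNewlinePosition(none, 0, none.length));      // no delimiter
    }
}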


Issue 3: CSV split fallback logic is not explicit enough

Location: seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/CsvReadStrategy.java

Related Context:

  • Parent class: TextReadStrategy
  • Caller: FileSourceReader
  • Related tests: ReadStrategySplitFallbackTest

Problem Description:
In CSV's prepareRead method, start >= end determines whether to skip the header. This implicitly encodes "fall back to non-split reading when the split has no valid range," but the behavior is not explicit and may confuse future maintainers.

Code Snippet:

public void prepareRead(...) {
    if (start >= end) {
        // fallback to non-splitting read
        skipHeader(in);
    }
    // ...
    adjustSplitEndToNextDelimiter(sourceSplit, end, rowDelimiter);
}

Potential Risks:

  • Risk 1: Reduced readability; maintainers may not understand why the header is skipped when start >= end
  • Risk 2: If the way prepareRead is called changes in the future, this implicit assumption may silently break
  • Risk 3: No logging, so there is no way to tell whether the fallback occurred

Impact Scope:

  • Direct Impact: CSV file split behavior
  • Affected Area: Single Connector

Severity: MINOR

Improvement Suggestions:

public void prepareRead(...) {
    // Explicit check: if split has no valid range, fallback to non-splitting read
    if (start >= end) {
        LOG.debug("Split {} has no valid range (start={}, end={}), "
                + "falling back to non-splitting read with header skip",
                sourceSplit.splitId(), start, end);
        skipHeader(in);
    }
    
    // ... rest of the code ...
}

Rationale: Add comments and logs to explicitly state fallback behavior, improving code readability and maintainability.


Issue 4: Parquet split error handling improvements not uniformly applied to other formats

Location: seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/split/ParquetFileSplitStrategy.java

Related Context:

  • Related classes: TextReadStrategy, CsvReadStrategy
  • Comparison: Parquet has enhanced error handling, but Text/CSV does not

Problem Description:
When a Parquet split fails, the error message is enhanced (adding the file path and root cause), but Text/CSV split failures receive no similar improvement. If a Text/CSV split fails (although less likely), users will struggle to troubleshoot.

Code Comparison:

Parquet (Improved):

throw new IOException(
    String.format("Failed to get split for file: %s", filePath), e);

Text/CSV (Not Improved):

// No explicit error handling, relies on framework default exception propagation

Potential Risks:

  • Risk 1: When Text/CSV split fails, error messages may not be detailed enough to locate the problem
  • Risk 2: Inconsistent error handling behavior between different formats increases user confusion
  • Risk 3: If adjustSplitEndToNextDelimiter throws an exception (e.g., array out of bounds), there is no context information

Impact Scope:

  • Direct Impact: Text/CSV/JSON file split error diagnosis
  • Affected Area: Single Connector

Severity: MINOR

Improvement Suggestions:

// TextReadStrategy.java
protected long adjustSplitEndToNextDelimiter(
        FileSourceSplit sourceSplit, long splitEnd, byte[] delimiter) throws IOException {
    try {
        // ... existing logic, returning the adjusted split end ...
        return splitEnd;
    } catch (Exception e) {
        throw new IOException(
                String.format(
                        "Failed to adjust split end for file: %s, splitId: %s, splitEnd: %d",
                        sourceSplit.path(), sourceSplit.splitId(), splitEnd),
                e);
    }
}

Rationale: Unify error handling style to improve error diagnostic capabilities for all formats.


Issue 5: Missing unit tests for Parquet split functionality

Location: seatunnel-connectors-v2/connector-file/connector-file-base/src/test/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/ReadStrategySplitFallbackTest.java

Related Context:

  • Existing tests: Only covers text and csv
  • Related classes: ParquetFileSplitStrategy, ParquetReadStrategy

Problem Description:
The unit test ReadStrategySplitFallbackTest only covers fallback behavior for text and csv; it does not test Parquet split behavior. Parquet's split logic differs from text/csv (it is RowGroup-based and needs no line alignment) and should be tested separately.

Potential Risks:

  • Risk 1: After Parquet split logic is modified, there may be no test coverage, leading to regression issues
  • Risk 2: Unable to verify correctness of Parquet split (e.g., whether it correctly splits by RowGroup)
  • Risk 3: Unable to verify error handling improvements when Parquet split fails

Impact Scope:

  • Direct Impact: Parquet file split functionality quality
  • Affected Area: Single Connector

Severity: MINOR

Improvement Suggestions:

// ReadStrategySplitFallbackTest.java
// (helper methods and constructor signatures below are illustrative)
@Test
public void testParquetSplitWithValidRange() throws Exception {
    // Create a test Parquet file with multiple RowGroups
    Path testParquetFile = createTestParquetFile(3); // 3 RowGroups
    
    ParquetReadStrategy readStrategy = new ParquetReadStrategy();
    FileSourceSplit split = new FileSourceSplit(0, testParquetFile, 0, 1024, null);
    
    // Prepare read should succeed
    readStrategy.prepareRead(/* params */);
    
    // Verify that split is handled correctly
    // ...
}

@Test
public void testParquetSplitFailureMessage() throws Exception {
    // Test the enhanced error message
    Path invalidParquetFile = createInvalidParquetFile();
    
    ParquetFileSplitStrategy splitStrategy = new ParquetFileSplitStrategy();
    
    IOException exception = assertThrows(IOException.class, () -> {
        splitStrategy.getSplits(invalidParquetFile, /* params */);
    });
    
    assertTrue(exception.getMessage().contains("Failed to get split for file:"));
    assertTrue(exception.getMessage().contains(invalidParquetFile.toString()));
}

Rationale: Improve test coverage to ensure correctness of Parquet split and effectiveness of error handling.


Issue 6: Missing split support for JSON format

Location: Documentation mentions JSON format split support, but no explicit verification in code

Related Context:

  • Documentation: docs/en/connectors/source/S3File.md mentions json format support
  • Related classes: JsonReadStrategy (not modified in this PR)
  • Inheritance: JsonReadStrategy may inherit from TextReadStrategy or other classes

Problem Description:
The PR description mentions split support for the json format, but the change list shows no modifications to JsonReadStrategy.java. The following needs confirmation:

  1. Does JsonReadStrategy inherit from TextReadStrategy? (If so, split is automatically supported)
  2. Does JSON split require special line alignment logic? (e.g., multi-line JSON objects)
  3. Has JSON split functionality been tested?

Potential Risks:

  • Risk 1: If JsonReadStrategy does not inherit from TextReadStrategy, the JSON format does not support split, and the documentation misleads users
  • Risk 2: If JSON files contain multi-line JSON objects (i.e., pretty-printed JSON rather than JSON Lines), the current logic may not handle them correctly
  • Risk 3: Without tests, there is no way to confirm that JSON split works properly

Impact Scope:

  • Direct Impact: JSON file split functionality
  • Affected Area: Single Connector

Severity: MAJOR (if JSON is in fact not supported) / MINOR (if JSON is supported automatically via inheritance)

Improvement Suggestions:
Further inspection of JsonReadStrategy implementation is needed:

  1. If JsonReadStrategy inherits from TextReadStrategy:

    • Add unit tests for JSON format
    • Document in docs that JSON only supports JSON Lines format (one JSON object per line)
  2. If JsonReadStrategy does not inherit from TextReadStrategy:

    • Implement similar split logic, or explicitly state JSON split is not supported
    • Update documentation to remove JSON split description

Rationale: Ensure documentation and implementation are consistent, avoiding user misunderstanding.
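
If confirmation shows that JsonReadStrategy does extend TextReadStrategy, a one-line guard test can lock that assumption in so a future refactor cannot silently drop JSON split support (a sketch assuming JUnit 5 and the class names discussed above):

import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class JsonSplitInheritanceGuardTest {

    @Test
    void jsonReadStrategyInheritsTextSplitLogic() {
        // If this breaks, JSON no longer inherits the split/alignment logic
        // from TextReadStrategy and the split docs must be revisited.
        assertTrue(
                TextReadStrategy.class.isAssignableFrom(JsonReadStrategy.class),
                "JsonReadStrategy is expected to inherit split support from TextReadStrategy");
    }
}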


Issue 7: E2E tests do not cover all scenarios

Location: seatunnel-e2e/seatunnel-connector-v2-e2e/connector-file-s3-e2e/src/test/java/org/apache/seatunnel/e2e/connector/file/s3/S3FileWithFilterIT.java

Related Context:

  • Existing tests: testS3FileTextEnableSplitToAssert (tests CSV + header)
  • Missing scenarios: Parquet, JSON, text without header, split edge cases

Problem Description:
E2E tests only cover the CSV + header scenario, missing tests for the following:

  1. Parquet file split (need to verify RowGroup splitting)
  2. JSON file split
  3. Text file split without header
  4. Split edge cases (file size is exactly a multiple of file_split_size)
  5. Performance comparison with different file_split_size configurations

Potential Risks:

  • Risk 1: Parquet and JSON split functionality not verified by E2E, may fail in real environments
  • Risk 2: Edge cases not tested, may cause issues under specific data distributions
  • Risk 3: Unable to evaluate actual performance improvement of split functionality

Impact Scope:

  • Direct Impact: Test coverage and functionality reliability
  • Affected Area: Single Connector

Severity: MINOR

Improvement Suggestions:

// S3FileWithFilterIT.java

@Test
public void testS3FileParquetEnableSplit() throws Exception {
    // Test Parquet file with multiple RowGroups
    // Verify split behavior and data correctness
}

@Test
public void testS3FileJsonEnableSplit() throws Exception {
    // Test JSON Lines file
    // Verify split behavior and data correctness
}

@Test
public void testS3FileTextNoHeaderEnableSplit() throws Exception {
    // Test text file without header
    // Verify no header is skipped
}

@Test
public void testS3FileSplitBoundaryCase() throws Exception {
    // Test file size exactly equals file_split_size
    // Verify no data duplication or loss at boundaries
}

Rationale: Improve test coverage to ensure all declared supported formats and scenarios are verified by E2E.
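
For edge case 4 in particular, the core boundary math can be asserted independently of the E2E harness (a minimal sketch with hypothetical expectations about the enumerator):

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class SplitBoundaryMathTest {

    @Test
    void exactMultipleProducesNoEmptyTrailingSplit() {
        long splitSize = 64L * 1024 * 1024;
        long fileSize = 4 * splitSize; // file size is exactly 4 splits

        long expectedSplits = (fileSize + splitSize - 1) / splitSize; // ceil division
        assertEquals(4, expectedSplits);

        // The last split must end exactly at fileSize; a fifth, empty split
        // (start == end) must never be emitted by the enumerator.
        long lastStart = (expectedSplits - 1) * splitSize;
        assertEquals(fileSize, lastStart + splitSize);
    }
}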


Issue 8: Documentation lacks detailed explanation of split limitations

Location: docs/en/connectors/source/S3File.md and docs/zh/connectors/source/S3File.md

Problem Description:
Documentation mentions that only "non-compressed formats" support split, but does not explain in detail:

  1. Which compression formats are not supported? (gz, zip, bz2, lz4, snappy, etc.)
  2. What happens if users enable split for compressed files? (Automatic fallback? An error?)
  3. Does Parquet's internal snappy compression support split? (It should, because the Parquet split is logical, by RowGroup)
  4. Does split affect data order? (It may, because of parallel reading)

Potential Risks:

  • Risk 1: Users may enable split for compressed files expecting a performance improvement, but see no effect
  • Risk 2: Users do not understand why certain files cannot be split, causing confusion
  • Risk 3: Order-dependent scenarios (e.g., log files) may see data order change due to splitting

Impact Scope:

  • Direct Impact: User understanding and configuration
  • Affected Area: Single Connector's user experience

Severity: MINOR

Improvement Suggestions:

Add detailed explanation in documentation:

### File Split Limitations

- **Supported formats**: 
  - Text (plain text files)
  - CSV (including with/without header)
  - JSON Lines (one JSON object per line)
  - Parquet (split by RowGroup)

- **Unsupported formats**:
  - Compressed text files (e.g., .gz, .bz2, .zip, .lz4) - split will be automatically disabled
  - Excel (.xlsx)
  - XML
  - Single-line JSON files (not JSON Lines)

- **Behavior with unsupported formats**:
  If you enable `enable_file_split` for an unsupported format, the system will 
  automatically fall back to non-splitting mode. A warning log will be emitted.

- **Data ordering**:
  When file split is enabled, data may be read out of order across splits. 
  If strict ordering is required, do not enable file split or use a single-split strategy.

- **Parquet compression**:
  Parquet files with internal compression (e.g., Snappy, Gzip) are fully supported,
  because Parquet split is based on RowGroup boundaries, not byte ranges.

Rationale: Provide clear and complete limitation descriptions to help users correctly understand and configure.


Issue 9: Missing Metrics for split performance monitoring

Location: The entire PR adds no Metrics-related code

Related Context:

  • SeaTunnel Metrics framework
  • Related classes: FileSourceReader, FileSourceSplitEnumerator

Problem Description:
The PR adds no Metrics to monitor split behavior and performance, such as:

  1. Actual number of splits generated
  2. Size distribution of each split
  3. Split read duration
  4. Split failure rate
  5. Frequency of split alignment adjustments

Potential Risks:

  • Risk 1: Users cannot monitor whether split functionality is effective
  • Risk 2: Unable to tune file_split_size parameter
  • Risk 3: Unable to diagnose split-related performance issues
  • Risk 4: Unable to evaluate actual effectiveness of split functionality

Impact Scope:

  • Direct Impact: Observability and operational capabilities
  • Affected Area: Single Connector

Severity: MINOR

Improvement Suggestions:

// FileSourceReader.java (Dropwizard-style pseudocode; adapt to the engine's actual metrics API)
private Counter splitCounter;
private Histogram splitSizeHistogram;
private Timer splitReadTimer;

public void open() {
    splitCounter = context.metricRegistry().counter("file.split.count");
    splitSizeHistogram = context.metricRegistry().histogram("file.split.size");
    splitReadTimer = context.metricRegistry().timer("file.split.read.time");
}

public void readNext() {
    Timer.Context timeContext = splitReadTimer.time();
    try {
        // ... reading logic ...
        splitCounter.inc();
        splitSizeHistogram.update(splitSize);
    } finally {
        timeContext.stop();
    }
}

Rationale: Provide observability to help users monitor and tune split functionality.


@yzeng1618
Collaborator Author

The issues described above have been addressed and fixed.

@corgy-w
Contributor

corgy-w commented Feb 6, 2026

@chl-wxp Did you implement data files in another format? If so, take a look at this.

3 participants