
[Feature] [Connector-V2] [JDBC] Add PostgreSQL COPY Command Support to JDBC Source - #10406 #10468

Open

Ruiii-w wants to merge 8 commits into apache:dev from Ruiii-w:dev

Conversation

@Ruiii-w commented Feb 9, 2026

Purpose of this pull request

This pull request introduces support for the PostgreSQL COPY protocol in the JDBC Source connector.
The COPY command is significantly faster than standard SELECT queries for bulk data retrieval. This feature allows users to enable COPY mode for PostgreSQL sources to improve read performance.

Note: This feature was originally developed against SeaTunnel 2.3.8, but it has since been merged into the dev branch and verified to work there.

Key features added (a sketch of the underlying driver-level COPY mechanism follows this list):

  • Support for COPY (SELECT ...) TO STDOUT statement generation.
  • Support for partitioned reads (sharding) within COPY statements.
  • Support for both CSV and Binary formats in COPY.
  • New configuration options: use_copy_statement, binary, and pg_copy_buffer_size.
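
For context, here is a minimal, hypothetical sketch of the driver-level mechanism this feature builds on: issuing a COPY (SELECT ...) TO STDOUT statement for a single split through the standard pgjdbc CopyManager API. The table name, range predicate, and class name are illustrative only and do not reflect the PR's actual PgCopyInput implementation.

import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;
import org.postgresql.copy.CopyOut;

public class PgCopyReadSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/test", "postgres", "password")) {
            // One split of a partitioned read; the range predicate stands in for the connector's sharding logic.
            String splitSelect = "SELECT * FROM my_table WHERE id >= 0 AND id < 100000";
            String copySql = "COPY (" + splitSelect + ") TO STDOUT WITH CSV";

            // The pgjdbc driver exposes the COPY protocol through CopyManager.
            CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
            CopyOut copyOut = copyManager.copyOut(copySql);

            byte[] chunk;
            while ((chunk = copyOut.readFromCopy()) != null) {
                // Each COPY data message (typically one CSV row) would be parsed into a SeaTunnelRow
                // by the connector; here it is simply printed.
                System.out.print(new String(chunk, StandardCharsets.UTF_8));
            }
        }
    }
}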

Does this PR introduce any user-facing change?

Yes, this PR adds new configuration options for JDBC Source (specifically for PostgreSQL).

New Options:

| Name | Type | Required | Default Value | Description |
|------|------|----------|---------------|-------------|
| use_copy_statement | Boolean | No | false | Whether to use the COPY method for reading. |
| binary | Boolean | No | false | Whether to use binary format for COPY reading. Only takes effect when use_copy_statement=true. |
| pg_copy_buffer_size | Int | No | 1048576 | Buffer size for COPY reading (bytes). Only takes effect when use_copy_statement=true. |

Documentation Details (per new specifications):

  1. Data Source Title: JDBC PostgreSQL Source (Copy Mode)
  2. Connector Support Version: SeaTunnel 2.3.8+
  3. Data Source Description: Supports reading data from PostgreSQL using the COPY protocol for high-performance bulk data extraction.
  4. Supported Engines: Spark, Flink, SeaTunnel Zeta
  5. Supported Data Source List: PostgreSQL
  6. Dependencies: org.postgresql:postgresql (Standard JDBC Driver)
  7. Data Type Mapping: Fully compatible with existing JDBC PostgreSQL type mappings.
  8. Options: See the "New Options" table above.

Example Configuration:

source {
  Jdbc {
    url = "jdbc:postgresql://localhost:5432/test"
    driver = "org.postgresql.Driver"
    user = "postgres"
    password = "password"
    query = "select * from my_table"
    # Enable COPY mode
    use_copy_statement = true
    pg_copy_buffer_size = 1048576
  }
}

How was this patch tested?

  • Verified locally using org.apache.seatunnel.example.engine.SeaTunnelEngineLocalExample.
  • Configured the local example to use the PostgreSQL JDBC source with use_copy_statement = true and verified that data was correctly extracted and processed.
  • Validated that the COPY SQL was generated correctly and executed against a local PostgreSQL instance.

Check list

  • If any new Jar binary packages are added in your PR, please add a License Notice according to
    https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md
  • If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
  • If necessary, please update incompatible-changes.md to describe the incompatibility caused by this PR.
  • If you are contributing the connector code, please check that the following files are updated:
    1. Update https://github.com/apache/seatunnel/blob/dev/plugin-mapping.properties and add new connector information in it
    2. Update the pom file of https://github.com/apache/seatunnel/blob/dev/seatunnel-dist/pom.xml
    3. Add ci label in https://github.com/apache/seatunnel/blob/dev/.github/workflows/labeler/label-scope-conf.yml
    4. Add e2e testcase in https://github.com/apache/seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/
    5. Update connector https://github.com/apache/seatunnel/blob/dev/config/plugin_config

@DanielCarter-stack

Issue 1: Feature not integrated into data reading flow

Location:

  • Main file: seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgCopyInput.java
  • Related files:
    • seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/JdbcInputFormat.java
    • seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/JdbcSourceReader.java

Issue Description:
The PgCopyInput class and its related COPY functionality are fully implemented, but not integrated into the actual data reading flow. Code search reveals:

  • JdbcInputFormat still uses the traditional PreparedStatement + ResultSet approach to read data
  • PgCopyInput has no callers
  • Newly added configuration items use_copy_statement, binary, pg_copy_buffer_size are not being used

Potential Risks:

  • Feature unavailable: Users configuring use_copy_statement=true will see no effect
  • Insufficient testing: Lack of integration testing makes it impossible to verify the feature works correctly
  • Misleading documentation: Users following the documentation will not achieve expected results

Impact Scope:

  • Direct impact: PostgreSQL JDBC Source users
  • Indirect impact: Large-volume data synchronization tasks relying on performance improvements
  • Affected area: Single Connector (JDBC PostgreSQL Source)

Severity: BLOCKER

Improvement Suggestions:
Branch logic needs to be added in JdbcInputFormat or JdbcSourceReader to choose between the traditional approach and the COPY approach based on configuration:

// JdbcInputFormat.java
public void open(JdbcSourceSplit inputSplit) throws IOException {
    try {
        splitTableSchema = tables.get(inputSplit.getTablePath()).getTableSchema();
        splitTableId = inputSplit.getTablePath().toString();
        
        // Added: Choose reading method based on configuration
        if (config.isUseCopyStatement() && isPostgreSQLDialect()) {
            // Using COPY method
            PgCopyInput copyInput = new PgCopyInput(config, jdbcDialect, chunkSplitter, splitTableSchema, splitTableId);
            copyInput.open(inputSplit);
            // Using PgCopyInput to read data...
        } else {
            // Traditional method
            statement = chunkSplitter.generateSplitStatement(inputSplit, splitTableSchema);
            resultSet = statement.executeQuery();
            hasNext = resultSet.next();
        }
    } catch (SQLException se) {
        throw new JdbcConnectorException(...);
    }
}

private boolean isPostgreSQLDialect() {
    return jdbcDialect instanceof PostgresDialect || 
           jdbcDialect.dialectName().toLowerCase().contains("postgres");
}

Rationale: Only by integrating PgCopyInput into the data reading flow can the feature be truly usable. This is a blocking issue that must be fixed before merge.


Issue 2: Thread safety issues caused by static variables

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgCopyBinaryReader.java:89-90

private static int BUFFER_SIZE;
private static int MAX_BUFFER_SIZE;

Related Context:

  • Constructor: PgCopyBinaryReader.java:114-129
  • Initialization logic: PgCopyBinaryReader.java:118-127

Issue Description:
BUFFER_SIZE and MAX_BUFFER_SIZE are declared as static, but they are initialized in the constructor based on the passed pgCopyBufferSize parameter. This leads to:

  1. Multi-threading race condition: When multiple threads create PgCopyBinaryReader instances simultaneously, static variables will be repeatedly overwritten
  2. Configuration inconsistency: Different instances may be configured with different buffer sizes, but the actual value used is the last set value
  3. Memory waste: All instances share the same MAX_BUFFER_SIZE, which may cause some instances to use excessively large buffers

Potential Risks:

  • Data race: Multiple threads modifying static variables simultaneously may cause buffer.capacity() and MAX_BUFFER_SIZE to be inconsistent
  • Memory overflow: If one instance is configured with a 1GB buffer size, all instances inherit a 1TB maximum buffer limit
  • Performance degradation: Buffer size configuration becomes ineffective, all instances use the same buffer size

Impact Scope:

  • Direct impact: All concurrent instances of PgCopyBinaryReader
  • Indirect impact: All PostgreSQL Source tasks using COPY Binary mode
  • Affected area: PostgreSQL JDBC Source (Binary mode)

Severity: CRITICAL

Improvement Suggestions:
Change static variables to instance variables:

// PgCopyBinaryReader.java
public final class PgCopyBinaryReader implements PgCopyReader {
    // Remove static modifier
    private final int bufferSize;
    private final int maxBufferSize;
    private ByteBuffer buffer;
    
    public PgCopyBinaryReader(InputStream stream, TableSchema schema, Integer pgCopyBufferSize) {
        this.stream = stream;
        this.rowType = schema.toPhysicalRowDataType();
        this.fieldTypes = rowType.getFieldTypes();
        
        // Calculate and assign to instance variable
        this.bufferSize = pgCopyBufferSize == null
                ? DEFAULT_BUFFER_SIZE
                : 1 << (32 - Integer.numberOfLeadingZeros(pgCopyBufferSize - 1));
        this.maxBufferSize = bufferSize * 1024;
        this.buffer = ByteBuffer.allocate(bufferSize).order(ByteOrder.BIG_ENDIAN);
    }
    
    private void ensureCapacityFor(int required) {
        if (required <= buffer.capacity()) return;
        if (required > maxBufferSize) {  // Use instance variable
            throw new JdbcConnectorException(...);
        }
        // ...
    }
}
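
As a side note on the initialization logic quoted above (this reading is inferred from the snippet, not verified against the full class): the shift expression rounds the requested size up to the next power of two, and the expansion cap is 1024 times that rounded value. For example:

int requested = 100_000;                                                // pg_copy_buffer_size as configured
int rounded = 1 << (32 - Integer.numberOfLeadingZeros(requested - 1));  // 131072, the next power of two
int maxBuffer = rounded * 1024;                                         // 134217728 bytes (128 MB) expansion cap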

Rationale: Each instance should have independent buffer configuration. Static variables will cause serious concurrency issues. This is a blocking issue that must be fixed.


Issue 3: Documentation and code default values are inconsistent

Location:

  • Documentation: docs/en/connectors/source/Jdbc.md and docs/zh/connectors/source/Jdbc.md
  • Code: seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/config/JdbcSourceOptions.java:169-173
// Default value in code
public static final Option<Integer> PG_COPY_BUFFER_SIZE =
        Options.key("pg_copy_buffer_size")
                .intType()
                .defaultValue(65536)  // 64KB
                .withDescription("Postgres copy buffer size");

Issue Description:
The documentation claims the default value of pg_copy_buffer_size is 1048576 (1MB), but the actual default value in code is 65536 (64KB), a 16-fold difference.

Potential Risks:

  • Configuration misleading: Users may rely on the default value in documentation and not explicitly configure this parameter, resulting in actual performance not meeting expectations
  • Troubleshooting difficulties: When users encounter performance issues, they will understand based on documentation that buffer size is 1MB, leading to incorrect troubleshooting direction

Impact Scope:

  • Direct impact: Users using COPY functionality
  • Affected area: PostgreSQL JDBC Source

Severity: MAJOR

Improvement Suggestions:
Unify the default values in documentation and code. Based on PostgreSQL COPY best practices, recommend using 1MB as the default value:

// JdbcSourceOptions.java
public static final Option<Integer> PG_COPY_BUFFER_SIZE =
        Options.key("pg_copy_buffer_size")
                .intType()
                .defaultValue(1048576)  // Change to 1MB, consistent with documentation
                .withDescription("Postgres copy buffer size (bytes). Only takes effect when use_copy_statement=true.");

Or, if 64KB is the more appropriate default value, update the documentation:

| pg_copy_buffer_size | Int | No | 65536 | Buffer size for COPY reading (bytes). |

Rationale: Inconsistency between documentation and code will cause user confusion and configuration errors. Must be unified before merging.


Issue 4: SQL injection risk

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgCopyInput.java:76-79

String selectSql = chunkSplitter.generateSplitQuerySQL(split, tableSchema);
String copySql = String.format(
    "COPY (%s) TO STDOUT WITH %s", selectSql, useBinary ? "BINARY" : "CSV");

Related Context:

  • Caller: PgCopyInput.open() method
  • SQL generation: ChunkSplitter.generateSplitQuerySQL()
  • Log output: PgCopyInput.java:82

Issue Description:
PgCopyInput directly uses string concatenation to build COPY SQL without any validation or escaping of selectSql. Although generateSplitQuerySQL() is controlled internally by SeaTunnel, if:

  1. split.getSplitQuery() contains user-provided raw SQL
  2. tablePath or table names contain special characters (such as semicolons, comment symbols)
  3. Future extensions support dynamic query construction

Then SQL injection may occur.

Potential Risks:

  • SQL injection: Malicious users can inject arbitrary SQL commands by constructing special table names or query conditions
  • Data leakage: Injected SQL may read data from other tables
  • Privilege escalation: Using COPY command to write files may lead to server file leakage

Impact Scope:

  • Direct impact: All PostgreSQL Source tasks using COPY functionality
  • Affected area: PostgreSQL JDBC Source

Severity: CRITICAL

Improvement Suggestions:

  1. Input validation: Perform basic validation on selectSql to ensure it does not contain dangerous SQL keywords or structures:
private void validateSqlSafety(String sql) {
    String upperSql = sql.toUpperCase();
    // Check for dangerous keyword combinations
    if (upperSql.contains(";") || 
        upperSql.contains("--") || 
        upperSql.contains("/*") ||
        upperSql.matches(".*\\b(DROP|DELETE|INSERT|UPDATE|ALTER|CREATE|TRUNCATE)\\b.*")) {
        throw new JdbcConnectorException(
            CommonErrorCodeDeprecated.ILLEGAL_ARGUMENT,
            "Potentially unsafe SQL detected: " + sql);
    }
}

public void open(JdbcSourceSplit split) {
    String selectSql = chunkSplitter.generateSplitQuerySQL(split, tableSchema);
    validateSqlSafety(selectSql);  // Add validation
    String copySql = String.format(
        "COPY (%s) TO STDOUT WITH %s", selectSql, useBinary ? "BINARY" : "CSV");
    // ...
}
  2. Sanitize the generated SQL: The COPY command does not go through PreparedStatement parameters, but you can at least strip statement separators from the embedded SELECT (and use quoteIdentifier() for any interpolated identifiers):
String copySql = String.format(
    "COPY (%s) TO STDOUT WITH %s", 
    selectSql.replaceAll(";", ""),  // Remove semicolon
    useBinary ? "BINARY" : "CSV");

Rationale: SQL injection is a serious security issue. Although the risk is low in the current scenario (SQL is generated by SeaTunnel), security programming best practices should be followed.


Issue 5: ChunkSplitter interface changes affect all subclasses

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/ChunkSplitter.java:370-373

public String generateSplitQuerySQL(JdbcSourceSplit split, TableSchema tableSchema) {
    throw new UnsupportedOperationException("Not supported by this splitter");
}

Related Context:

  • Subclass: FixedChunkSplitter.java:196-246 (implemented)
  • Subclass: DynamicChunkSplitter (not implemented)

Issue Description:
A new method generateSplitQuerySQL() was added to the ChunkSplitter base class with a default implementation that throws UnsupportedOperationException, but only FixedChunkSplitter overrides it. When DynamicChunkSplitter is used, the call will throw UnsupportedOperationException.

Potential Risks:

  • Runtime exception: Users configuring dynamic sharding (useDynamicSplitter=true) and enabling COPY will crash at runtime
  • Feature limitation: COPY functionality is only available in fixed sharding mode, limiting user choices
  • Inconsistency: use_copy_statement configuration is visible to all Splitter types, but only FixedChunkSplitter supports it

Impact Scope:

  • Direct impact: PostgreSQL Source tasks using dynamic sharding
  • Indirect impact: Large-volume scenarios relying on dynamic sharding
  • Affected area: PostgreSQL JDBC Source (dynamic sharding mode)

Severity: MAJOR

Improvement Suggestions:
There are two solutions:

Solution A: Implement the method for DynamicChunkSplitter:

// DynamicChunkSplitter.java
@Override
public String generateSplitQuerySQL(JdbcSourceSplit split, TableSchema tableSchema) {
    // Dynamic sharding is typically used for tables without primary keys or uneven shard keys
    // Use split.query directly as SELECT SQL
    if (StringUtils.isNotBlank(split.getSplitQuery())) {
        return split.getSplitQuery();
    }
    // Fall back to full table scan
    return String.format("SELECT * FROM %s", 
                        jdbcDialect.tableIdentifier(split.getTablePath()));
}

Solution B: Restrict COPY functionality to only support fixed sharding in configuration:

// JdbcSourceConfig.java
public static JdbcSourceConfig of(ReadonlyConfig config) {
    // ...
    boolean useCopy = config.get(JdbcSourceOptions.USE_COPY_STATEMENT);
    boolean useDynamic = config.getOptional(JdbcSourceOptions.PARTITION_COLUMN).isEmpty();
    
    if (useCopy && useDynamic) {
        throw new IllegalArgumentException(
            "use_copy_statement is only supported with fixed partition mode. " +
            "Please configure partition_column or set use_copy_statement=false.");
    }
    // ...
}

Rationale: Interface changes need to consider all implementation classes, otherwise runtime exceptions will occur. It is recommended to implement at least Solution B before merging, adding configuration validation.


Issue 6: Lack of integration tests and E2E tests

Location:

  • Unit tests: seatunnel-connectors-v2/connector-jdbc/src/test/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgNumericDecoderTest.java
  • E2E tests: seatunnel-e2e/seatunnel-connector-v2-e2e/ (no related tests found)

Issue Description:
The PR contains only one unit test file PgNumericDecoderTest.java, which tests the decoding logic of NUMERIC type. But lacks:

  1. Unit tests for PgCopyBinaryReader
  2. Unit tests for PgCopyCsvReader
  3. Integration tests for PgCopyInput
  4. End-to-end E2E tests (reading data from PostgreSQL to other Sinks)

Potential Risks:

  • Feature unverifiable: Cannot verify whether the feature works correctly through CI/CD processes
  • Regression risk: Future code changes may break COPY functionality without test coverage
  • User unable to reproduce: PR description states "verified using local SeaTunnelEngineLocalExample", but no reproducible test script is provided

Impact Scope:

  • Direct impact: Code quality and maintainability
  • Indirect impact: Stability of future versions
  • Affected area: Entire JDBC Connector

Severity: MAJOR

Improvement Suggestions:

  1. Add PgCopyReader unit tests:
// PgCopyBinaryReaderTest.java
@Test
public void testBinaryHeaderParsing() {
    // Use ISO-8859-1 so \377 stays a single 0xFF byte (the platform default charset may be UTF-8)
    byte[] header = "PGCOPY\n\377\r\n\0\0\0\0\0\0\0\0\0".getBytes(StandardCharsets.ISO_8859_1);
    InputStream stream = new ByteArrayInputStream(header);
    PgCopyBinaryReader reader = new PgCopyBinaryReader(stream, schema, 65536);
    // Verify headerParsed is true
}

@Test
public void testRowParsing() {
    // Construct test data: row with 2 fields
    byte[] rowData = {...};
    InputStream stream = new ByteArrayInputStream(rowData);
    PgCopyBinaryReader reader = new PgCopyBinaryReader(stream, schema, 65536);
    assertTrue(reader.hasNext());
    SeaTunnelRow row = reader.next();
    // Verify field values
}
  2. Add integration tests:
// JdbcPostgresCopyIT.java
@ExtendWith(PostgresContainerExtension.class)
public class JdbcPostgresCopyIT {
    @Test
    public void testCopyRead() throws Exception {
        // 1. Prepare test data
        // 2. Configure Source: use_copy_statement=true, binary=true
        // 3. Execute sync task
        // 4. Verify correctness of read data
    }
}
  3. Provide reproducible test scripts:
    Add to PR description:
# Test steps
# 1. Start PostgreSQL
docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=test -e POSTGRES_DB=test postgres:14

# 2. Prepare test data
psql -h localhost -U postgres -d test -c "CREATE TABLE test_copy AS SELECT g AS id, 'test_' || g AS name FROM generate_series(1, 10000) AS g"

# 3. Configure SeaTunnel task
# cat test_copy.conf
source {
  Jdbc {
    url = "jdbc:postgresql://localhost:5432/test"
    driver = "org.postgresql.Driver"
    user = "postgres"
    password = "test"
    query = "SELECT * FROM test_copy"
    use_copy_statement = true
    binary = true
  }
}

sink {
  Console {}
}

# 4. Run task
./bin/seatunnel.sh -c test_copy.conf

Rationale: Insufficient test coverage is a code quality issue. Especially for core functionality (such as high-performance data transfer), sufficient test coverage is essential.


Issue 7: Improper log level may leak sensitive information

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgCopyInput.java:82

LOG.info("Open PG COPY split={}, sql={}", split.splitId(), copySql);

Related Context:

  • Logging framework: SLF4J
  • Log location: PgCopyInput.open() method

Issue Description:
Logging complete COPY SQL at INFO level may contain:

  1. Sensitive business data (if query contains plaintext values)
  2. Table structure information (may leak data model)
  3. Filter conditions (may leak business logic)

In production environments, INFO level logs are persisted to log systems, posing information leakage risks.

Potential Risks:

  • Data leakage: Log systems may be accessed by unauthorized personnel, leading to sensitive information leakage
  • Log bloat: Logging SQL every time a split is opened will cause log volume to surge
  • Performance impact: Frequent INFO logging will affect performance

Impact Scope:

  • Direct impact: Production environment security
  • Indirect impact: Log storage costs
  • Affected area: All tasks using COPY functionality

Severity: MINOR

Improvement Suggestions:
Change log level from INFO to DEBUG:

// PgCopyInput.java
public void open(JdbcSourceSplit split) {
    try {
        String selectSql = chunkSplitter.generateSplitQuerySQL(split, tableSchema);
        String copySql = String.format(
            "COPY (%s) TO STDOUT WITH %s", selectSql, useBinary ? "BINARY" : "CSV");
        
        Connection conn = getConnection();
        LOG.debug("Open PG COPY split={}, sql={}", split.splitId(), copySql);  // Change to DEBUG
        
        copyManagerProxy = new CopyManagerProxy(conn);
        copyStream = copyManagerProxy.copyOutAsStream(copySql);
        // ...
    } catch (Exception e) {
        // existing exception handling (elided in this snippet)
        throw new JdbcConnectorException(...);
    }
}

Additionally, a switch can be added to configuration to control whether to log detailed SQL:

// JdbcSourceOptions.java
public static final Option<Boolean> LOG_SQL =
        Options.key("log_sql")
                .booleanType()
                .defaultValue(false)
                .withDescription("Whether to log SQL statements (DEBUG level). Only for debugging.");

// PgCopyInput.java
if (config.isLogSql()) {
    LOG.debug("Open PG COPY split={}, sql={}", split.splitId(), copySql);
} else {
    LOG.info("Open PG COPY split={}", split.splitId());  // Do not log SQL
}

Rationale: Protecting sensitive information is a security best practice. Recommend changing the log level to DEBUG before merging, or adding a configuration switch.


Issue 8: Unreasonable buffer size configuration may cause memory overflow

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgCopyBinaryReader.java:127

MAX_BUFFER_SIZE = BUFFER_SIZE * 1024;

Related Context:

  • Initialization: PgCopyBinaryReader.java:118-127
  • Expansion logic: PgCopyBinaryReader.java:207-233

Issue Description:
The maximum buffer size is set to 1024 times the initial size:

  • Default initial size: 64KB (65536 bytes)
  • Maximum buffer: 64MB (65536 * 1024)
  • If user configures 1MB initial size, maximum buffer will reach 1GB

Although the code has ensureCapacityFor() checks, in extreme cases (single field is very large), memory overflow may still occur.

Potential Risks:

  • Memory overflow: If PostgreSQL table has large TEXT/BYTEA fields, buffer expansion to the upper limit may be triggered
  • Configuration risk: Users may configure excessively large initial buffer size, causing all instances to attempt to allocate huge buffers
  • Concurrency amplification: Multiple concurrent tasks expanding buffers simultaneously may cause host memory exhaustion

Impact Scope:

  • Direct impact: Tasks using COPY Binary mode
  • Indirect impact: Cluster stability
  • Affected area: PostgreSQL JDBC Source (Binary mode)

Severity: MAJOR

Improvement Suggestions:

  1. Set reasonable upper limit:
// PgCopyBinaryReader.java
private static final int MAX_ALLOWED_BUFFER_SIZE = 256 * 1024 * 1024;  // 256MB hard limit

public PgCopyBinaryReader(InputStream stream, TableSchema schema, Integer pgCopyBufferSize) {
    // ...
    MAX_BUFFER_SIZE = Math.min(bufferSize * 1024, MAX_ALLOWED_BUFFER_SIZE);
    this.buffer = ByteBuffer.allocate(bufferSize).order(ByteOrder.BIG_ENDIAN);
}
  2. Add configuration validation:
// JdbcSourceOptions.java
public static final Option<Integer> PG_COPY_BUFFER_SIZE =
        Options.key("pg_copy_buffer_size")
                .intType()
                .defaultValue(65536)
                .withDescription("Postgres copy buffer size (bytes). Must be between 65536 and 10485760 (10MB).");

// JdbcSourceConfig.java
public static JdbcSourceConfig of(ReadonlyConfig config) {
    int bufferSize = config.get(JdbcSourceOptions.PG_COPY_BUFFER_SIZE);
    if (bufferSize < 65536 || bufferSize > 10 * 1024 * 1024) {
        throw new IllegalArgumentException(
            "pg_copy_buffer_size must be between 64KB and 10MB, got: " + bufferSize);
    }
    // ...
}
  3. Add documentation notes:
    In configuration documentation, clearly explain buffer size limitations and recommended values:
| pg_copy_buffer_size | Int | No | 65536 | Buffer size for COPY reading (bytes). Recommended range: 65536-1048576 (64KB-1MB). Larger values may improve performance but increase memory usage. Maximum allowed: 10485760 (10MB). |

Rationale: Memory management is a critical issue in production environments. Reasonable upper limits must be set to prevent memory overflow.


Issue 9: Improper exception handling may lead to resource leaks

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgCopyUtils.java:41-50

public static void closeQuietly(Object obj) {
    if (obj instanceof Closeable) {
        Closeable c = (Closeable) obj;
        try {
            c.close();
        } catch (Exception ignored) {
            // Exception ignored
        }
    }
}

Related Context:

  • Caller: PgCopyInput.close()
  • Resource types: PgCopyReader, InputStream, CopyManagerProxy

Issue Description:
The closeQuietly() method silently ignores all close exceptions, which may lead to:

  1. Undetected resource leaks: If close() method throws an exception, it indicates resources may not be properly released, but the exception is ignored and the problem is masked
  2. Uncertain transaction state: For database connections, close failure may cause transactions not to be properly rolled back or committed
  3. Difficult debugging: Resource leak issues in production environments are difficult to troubleshoot

Potential Risks:

  • Resource leaks: Database connections, file handles and other resources not properly released
  • Connection pool exhaustion: Large numbers of leaked connections cause connection pool exhaustion
  • Data inconsistency: Transactions not properly ended cause data inconsistency

Impact Scope:

  • Direct impact: Tasks using COPY functionality
  • Indirect impact: Database connection pool availability
  • Affected area: PostgreSQL JDBC Source

Severity: MAJOR

Improvement Suggestions:

  1. Log close exceptions:
public static void closeQuietly(Object obj, String resourceName) {
    if (obj instanceof Closeable) {
        Closeable c = (Closeable) obj;
        try {
            c.close();
        } catch (Exception e) {
            // Log exception but do not throw
            LOG.warn("Failed to close resource: {}", resourceName, e);
            // Optional: Add exception to monitoring metrics
        }
    }
}

// PgCopyInput.java
@Override
public void close() {
    PgCopyUtils.closeQuietly(reader, "PgCopyReader");
    PgCopyUtils.closeQuietly(copyStream, "PgCopyInputStream");
    PgCopyUtils.closeQuietly(copyManagerProxy, "CopyManagerProxy");
}
  2. Clean up resources on failure:
    Refactor PgCopyInput.open() so that partially acquired resources are released when an exception occurs (the stream and reader must outlive open(), so try-with-resources cannot wrap them directly):
public void open(JdbcSourceSplit split) {
    try {
        copyManagerProxy = new CopyManagerProxy(getConnection());
        copyStream = copyManagerProxy.copyOutAsStream(copySql);
        reader = createReader(copyStream);
        hasNext = reader.hasNext();
    } catch (Exception e) {
        // Ensure resource cleanup on exception
        close();
        throw e;
    }
}
  3. Add resource leak detection (optional, for testing):
private boolean closed = false;

@Override
public void close() {
    if (closed) {
        LOG.warn("PgCopyInput already closed");
        return;
    }
    try {
        // Close resource
    } finally {
        closed = true;
    }
}

@Override
protected void finalize() throws Throwable {
    try {
        if (!closed) {
            LOG.error("PgCopyInput leaked, not closed properly");
        }
    } finally {
        super.finalize();
    }
}

Rationale: Proper resource management is the foundation of stability. Close exceptions should not be silently ignored. Recommend at least logging warning messages.


Issue 10: Non-standard configuration namespace

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/config/JdbcSourceOptions.java:163-173

public static final Option<Boolean> BINARY =
        Options.key("binary")
                .booleanType()
                .defaultValue(false)
                .withDescription("Use binary copy mode");

public static final Option<Integer> PG_COPY_BUFFER_SIZE =
        Options.key("pg_copy_buffer_size")
                .intType()
                .defaultValue(65536)
                .withDescription("Postgres copy buffer size");

Issue Description:
Configuration item naming is non-standard:

  1. Lack of namespace: binary and pg_copy_buffer_size should belong to copy namespace
  2. Inconsistent prefixes: pg_copy_buffer_size carries a pg_copy_ prefix while binary and use_copy_statement carry no COPY-related prefix, so the new options do not follow a single naming pattern
  3. Unclear prefix: binary is too generic and may conflict with other future binary functionality

Potential Risks:

  • Configuration conflicts: Future addition of other binary-related functionality may cause naming conflicts
  • Poor readability: Users cannot tell from configuration names that these are COPY-related configurations
  • Poor extensibility: If supporting other database COPY functionality (such as MySQL's LOAD DATA), configuration items will be confusing

Impact Scope:

  • Direct impact: Configuration readability and maintainability
  • Affected area: JDBC Source configuration

Severity: MINOR

Improvement Suggestions:
Use namespaced configuration names:

// JdbcSourceOptions.java
public static final Option<Boolean> COPY_ENABLED =
        Options.key("copy.enabled")
                .booleanType()
                .defaultValue(false)
                .withDescription("Whether to use COPY method for reading (PostgreSQL only).");

public static final Option<Boolean> COPY_BINARY =
        Options.key("copy.binary")
                .booleanType()
                .defaultValue(false)
                .withDescription("Whether to use binary format for COPY reading. Only takes effect when copy.enabled=true.");

public static final Option<Integer> COPY_BUFFER_SIZE =
        Options.key("copy.buffer_size")
                .intType()
                .defaultValue(65536)
                .withDescription("Buffer size for COPY reading (bytes). Only takes effect when copy.enabled=true.");

Corresponding configuration example:

source {
  Jdbc {
    url = "jdbc:postgresql://localhost:5432/test"
    # ...
    copy {
      enabled = true
      binary = true
      buffer_size = 1048576
    }
  }
}

Rationale: Using namespaces can improve configuration readability and extensibility, avoiding future naming conflicts. Although this is a minor issue, recommend optimizing before merging.


Issue 11: Exactly-Once semantic incompatibility

Location:
Entire COPY functionality implementation

Issue Description:
The COPY protocol is streaming and does not support offset recording and checkpoint resumption. This is incompatible with SeaTunnel's Exactly-Once semantics:

  1. No offset concept: COPY protocol does not support resuming from intermediate positions, only from the beginning
  2. State cannot be serialized: PgCopyInput and PgCopyReader do not implement Serializable, cannot save state in checkpoints
  3. Difficult failure recovery: If a task fails during reading, after restart it can only read from the beginning, potentially producing duplicate data

Potential Risks:

  • Data duplication: After task failure and restart, it will re-read from the split's beginning, causing data duplication
  • Exactly-Once configuration misleading: Users enable is_exactly_once=true and use_copy_statement=true, but Exactly-Once cannot actually be guaranteed
  • State inconsistency: After checkpoint recovery, data position cannot be restored

Impact Scope:

  • Direct impact: Tasks requiring Exactly-Once semantics
  • Indirect impact: Data consistency
  • Affected area: PostgreSQL JDBC Source (COPY mode)

Severity: MAJOR

Improvement Suggestions:

  1. Add configuration validation:
// JdbcSourceConfig.java
public static JdbcSourceConfig of(ReadonlyConfig config) {
    boolean useCopy = config.get(JdbcSourceOptions.COPY_ENABLED);
    boolean isExactlyOnce = config.get(JdbcCommonOptions.IS_EXACTLY_ONCE);
    
    if (useCopy && isExactlyOnce) {
        throw new IllegalArgumentException(
            "use_copy_statement is incompatible with is_exactly_once=true. " +
            "COPY protocol does not support offset-based recovery. " +
            "Please either set use_copy_statement=false or is_exactly_once=false.");
    }
    // ...
}
  2. Clearly state in documentation:
### Limitations

- **Exactly-Once**: COPY mode does NOT support exactly-once semantics. When `use_copy_statement=true`, 
  the connector operates in at-least-once mode. If you require exactly-once guarantees, please keep 
  `use_copy_statement=false` (use standard SELECT queries).

- **Fault Tolerance**: COPY mode cannot recover from intermediate positions. If a task fails, it will 
  re-read the entire split from the beginning, which may result in duplicate records.
  3. Provide best practices:
    Add usage scenario recommendations in documentation:
### When to use COPY mode?

**Recommended scenarios**:
- Full data loads (not incremental sync)
- One-time batch migrations
- Scenarios where duplicate records can be tolerated or deduplicated downstream

**NOT recommended for**:
- Incremental CDC scenarios
- Exactly-once required workloads
- Frequent fault recovery environments

Rationale: Feature limitations should be clearly stated in the design phase to avoid user misuse causing data consistency issues. This is a problem that must be resolved before merging.


Issue 12: Lack of quantitative data for performance improvements

Location:
PR description and documentation

Issue Description:
The PR description and documentation claim that the COPY command "is significantly faster than standard SELECT queries", but lack:

  1. Benchmark data: No performance comparison test results provided
  2. Quantitative metrics: No specific multiples or percentages of performance improvement stated
  3. Applicable scenarios: No explanation of what data volumes and scenarios show significant performance improvements
  4. Cost analysis: No explanation of additional costs of COPY mode (such as memory usage, CPU usage)

Potential Risks:

  • Excessive expectations: Users may expect unrealistic performance improvements
  • Misuse risk: Users may enable COPY in small data volume scenarios, actually reducing performance
  • Wrong optimization direction: Lack of data support makes it impossible to optimize specifically

Impact Scope:

  • Direct impact: User usage decisions
  • Affected area: PostgreSQL JDBC Source users

Severity: MINOR

Improvement Suggestions:

  1. Provide benchmark results (optional, but strongly recommended):
### Performance Benchmarks

We tested COPY mode against standard SELECT queries on a PostgreSQL 14 instance with different data volumes:

| Rows | Data Size | SELECT (time) | COPY CSV (time) | COPY Binary (time) | Speedup (CSV) | Speedup (Binary) |
|------|-----------|---------------|-----------------|-------------------|---------------|------------------|
| 10K  | 1 MB      | 0.5s          | 0.3s            | 0.2s              | 1.7x          | 2.5x             |
| 100K | 10 MB     | 5.2s          | 2.1s            | 1.4s              | 2.5x          | 3.7x             |
| 1M   | 100 MB    | 58s           | 18s             | 12s               | 3.2x          | 4.8x             |
| 10M  | 1 GB      | 650s          | 185s            | 120s              | 3.5x          | 5.4x             |

**Test Environment**:
- PostgreSQL 14.5 on Ubuntu 20.04
- 4 vCPU, 16GB RAM
- Network: 1Gbps
- SeaTunnel 2.3.8

**Conclusion**: COPY mode provides significant performance improvements, especially for large data volumes (>100K rows). Binary format is consistently faster than CSV format.
  2. Add usage recommendations:
### Performance Tuning

**Buffer Size**:
- Default (64KB): Good for most workloads
- Large datasets (>1GB): Increase to 1MB or larger
- Limited memory: Decrease to 32KB

**Binary vs CSV**:
- Binary: Faster, but requires PostgreSQL-specific parsing
- CSV: Slightly slower, but more portable
- Recommendation: Use Binary for production, CSV for debugging

**Partition Strategy**:
- COPY mode works best with parallel partitioning
- Ensure `partition_column` is properly configured for optimal performance
  3. Add monitoring metric descriptions:
### Monitoring

When COPY mode is enabled, the following metrics are logged at task completion:
- `rows_parsed`: Total number of rows read
- `bytes_read`: Total bytes transferred
- `buffer_expansions`: Number of times the buffer was expanded
- `elapsed_ms`: Total time spent reading
- `rows_per_second`: Read throughput
- `bytes_per_second`: Data transfer rate

Example log:

PG COPY summary: rows=1000000 rows, bytes=104857600 B, expansions=2 times,
elapsed=15000 ms, rows_per_second=66666 rows/s, bytes_per_second=6990506 B/s

Rationale: Although this is not a technical issue, providing performance data can help users make informed decisions and also demonstrates the value of the feature. Recommend adding at least simple performance comparison explanations before merging.

