
[Feature] [Connector-V2] [JDBC] Add PostgreSQL COPY Command Support to JDBC Source - #10406 #10468

Open

Ruiii-w wants to merge 8 commits into apache:dev from Ruiii-w:dev

Conversation

@Ruiii-w commented Feb 9, 2026

Purpose of this pull request

This pull request introduces support for the PostgreSQL COPY protocol in the JDBC Source connector.
The COPY command is significantly faster than standard SELECT queries for bulk data retrieval. This feature allows users to enable COPY mode for PostgreSQL sources to improve read performance.

Note: This feature was originally developed against SeaTunnel 2.3.8, but it has since been merged into the dev branch and verified to work there.

Key features added (a sketch of the underlying driver-level COPY mechanism follows this list):

  • Support for COPY (SELECT ...) TO STDOUT statement generation.
  • Support for partitioned reads (sharding) within COPY statements.
  • Support for both CSV and Binary formats in COPY.
  • New configuration options: use_copy_statement, binary, and pg_copy_buffer_size.
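
For context, here is a minimal, hypothetical sketch of the driver-level mechanism this feature builds on: issuing a COPY (SELECT ...) TO STDOUT statement for a single split through the standard pgjdbc CopyManager API. The table name, range predicate, and class name are illustrative only and do not reflect the PR's actual PgCopyInput implementation.

import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;
import org.postgresql.copy.CopyOut;

public class PgCopyReadSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/test", "postgres", "password")) {
            // One split of a partitioned read; the range predicate stands in for the connector's sharding logic.
            String splitSelect = "SELECT * FROM my_table WHERE id >= 0 AND id < 100000";
            String copySql = "COPY (" + splitSelect + ") TO STDOUT WITH CSV";

            // The pgjdbc driver exposes the COPY protocol through CopyManager.
            CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
            CopyOut copyOut = copyManager.copyOut(copySql);

            byte[] chunk;
            while ((chunk = copyOut.readFromCopy()) != null) {
                // Each COPY data message (typically one CSV row) would be parsed into a SeaTunnelRow
                // by the connector; here it is simply printed.
                System.out.print(new String(chunk, StandardCharsets.UTF_8));
            }
        }
    }
}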

Does this PR introduce any user-facing change?

Yes, this PR adds new configuration options for JDBC Source (specifically for PostgreSQL).

New Options:

| Name | Type | Required | Default Value | Description |
|------|------|----------|---------------|-------------|
| use_copy_statement | Boolean | No | false | Whether to use the COPY method for reading. |
| binary | Boolean | No | false | Whether to use binary format for COPY reading. Only takes effect when use_copy_statement=true. |
| pg_copy_buffer_size | Int | No | 1048576 | Buffer size for COPY reading (bytes). Only takes effect when use_copy_statement=true. |

Documentation Details (per new specifications):

  1. Data Source Title: JDBC PostgreSQL Source (Copy Mode)
  2. Connector Support Version: SeaTunnel 2.3.8+
  3. Data Source Description: Supports reading data from PostgreSQL using the COPY protocol for high-performance bulk data extraction.
  4. Supported Engines: Spark, Flink, SeaTunnel Zeta
  5. Supported Data Source List: PostgreSQL
  6. Dependencies: org.postgresql:postgresql (Standard JDBC Driver)
  7. Data Type Mapping: Fully compatible with existing JDBC PostgreSQL type mappings.
  8. Options: See the "New Options" table above.

Example Configuration:

source {
  Jdbc {
    url = "jdbc:postgresql://localhost:5432/test"
    driver = "org.postgresql.Driver"
    user = "postgres"
    password = "password"
    query = "select * from my_table"
    # Enable COPY mode
    use_copy_statement = true
    pg_copy_buffer_size = 1048576
  }
}

How was this patch tested?

  • Verified locally using org.apache.seatunnel.example.engine.SeaTunnelEngineLocalExample.
  • Configured the local example to use the PostgreSQL JDBC source with use_copy_statement = true and verified that data was correctly extracted and processed.
  • Validated that the COPY SQL was generated correctly and executed against a local PostgreSQL instance.

Check list

  • If any new Jar binary packages are added in your PR, please add a License Notice according to
    https://github.com/apache/seatunnel/blob/dev/docs/en/contribution/new-license.md
  • If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
  • If necessary, please update incompatible-changes.md to describe the incompatibility caused by this PR.
  • If you are contributing the connector code, please check that the following files are updated:
    1. Update https://github.com/apache/seatunnel/blob/dev/plugin-mapping.properties and add new connector information in it
    2. Update the pom file of https://github.com/apache/seatunnel/blob/dev/seatunnel-dist/pom.xml
    3. Add ci label in https://github.com/apache/seatunnel/blob/dev/.github/workflows/labeler/label-scope-conf.yml
    4. Add e2e testcase in https://github.com/apache/seatunnel/tree/dev/seatunnel-e2e/seatunnel-connector-v2-e2e/
    5. Update connector https://github.com/apache/seatunnel/blob/dev/config/plugin_config

@DanielCarter-stack

Issue 1: Feature not integrated into data reading flow

Location:

  • Main file: seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgCopyInput.java
  • Related files:
    • seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/JdbcInputFormat.java
    • seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/JdbcSourceReader.java

Issue Description:
The PgCopyInput class and its related COPY functionality are fully implemented, but not integrated into the actual data reading flow. Code search reveals:

  • JdbcInputFormat still uses the traditional PreparedStatement + ResultSet approach to read data
  • PgCopyInput has no callers
  • Newly added configuration items use_copy_statement, binary, pg_copy_buffer_size are not being used

Potential Risks:

  • Feature unavailable: Users configuring use_copy_statement=true will see no effect
  • Insufficient testing: Lack of integration testing makes it impossible to verify the feature works correctly
  • Misleading documentation: Users following the documentation will not achieve expected results

Impact Scope:

  • Direct impact: PostgreSQL JDBC Source users
  • Indirect impact: Large-volume data synchronization tasks relying on performance improvements
  • Affected area: Single Connector (JDBC PostgreSQL Source)

Severity: BLOCKER

Improvement Suggestions:
Branch logic needs to be added in JdbcInputFormat or JdbcSourceReader to choose between the traditional approach and the COPY approach based on configuration:

// JdbcInputFormat.java
public void open(JdbcSourceSplit inputSplit) throws IOException {
    try {
        splitTableSchema = tables.get(inputSplit.getTablePath()).getTableSchema();
        splitTableId = inputSplit.getTablePath().toString();
        
        // Added: Choose reading method based on configuration
        if (config.isUseCopyStatement() && isPostgreSQLDialect()) {
            // Using COPY method
            PgCopyInput copyInput = new PgCopyInput(config, jdbcDialect, chunkSplitter, splitTableSchema, splitTableId);
            copyInput.open(inputSplit);
            // Using PgCopyInput to read data...
        } else {
            // Traditional method
            statement = chunkSplitter.generateSplitStatement(inputSplit, splitTableSchema);
            resultSet = statement.executeQuery();
            hasNext = resultSet.next();
        }
    } catch (SQLException se) {
        throw new JdbcConnectorException(...);
    }
}

private boolean isPostgreSQLDialect() {
    return jdbcDialect instanceof PostgresDialect || 
           jdbcDialect.dialectName().toLowerCase().contains("postgres");
}

Rationale: Only by integrating PgCopyInput into the data reading flow can the feature be truly usable. This is a blocking issue that must be fixed before merge.


Issue 2: Thread safety issues caused by static variables

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgCopyBinaryReader.java:89-90

private static int BUFFER_SIZE;
private static int MAX_BUFFER_SIZE;

Related Context:

  • Constructor: PgCopyBinaryReader.java:114-129
  • Initialization logic: PgCopyBinaryReader.java:118-127

Issue Description:
BUFFER_SIZE and MAX_BUFFER_SIZE are declared as static, but they are initialized in the constructor based on the passed pgCopyBufferSize parameter. This leads to:

  1. Multi-threading race condition: When multiple threads create PgCopyBinaryReader instances simultaneously, static variables will be repeatedly overwritten
  2. Configuration inconsistency: Different instances may be configured with different buffer sizes, but the actual value used is the last set value
  3. Memory waste: All instances share the same MAX_BUFFER_SIZE, which may cause some instances to use excessively large buffers

Potential Risks:

  • Data race: Multiple threads modifying static variables simultaneously may cause buffer.capacity() and MAX_BUFFER_SIZE to be inconsistent
  • Memory overflow: If one instance is configured with a 1GB buffer size, all instances inherit a 1TB maximum buffer limit
  • Performance degradation: Buffer size configuration becomes ineffective, all instances use the same buffer size

Impact Scope:

  • Direct impact: All concurrent instances of PgCopyBinaryReader
  • Indirect impact: All PostgreSQL Source tasks using COPY Binary mode
  • Affected area: PostgreSQL JDBC Source (Binary mode)

Severity: CRITICAL

Improvement Suggestions:
Change static variables to instance variables:

// PgCopyBinaryReader.java
public final class PgCopyBinaryReader implements PgCopyReader {
    // Remove static modifier
    private final int bufferSize;
    private final int maxBufferSize;
    private ByteBuffer buffer;
    
    public PgCopyBinaryReader(InputStream stream, TableSchema schema, Integer pgCopyBufferSize) {
        this.stream = stream;
        this.rowType = schema.toPhysicalRowDataType();
        this.fieldTypes = rowType.getFieldTypes();
        
        // Calculate and assign to instance variable
        this.bufferSize = pgCopyBufferSize == null
                ? DEFAULT_BUFFER_SIZE
                : 1 << (32 - Integer.numberOfLeadingZeros(pgCopyBufferSize - 1));
        this.maxBufferSize = bufferSize * 1024;
        this.buffer = ByteBuffer.allocate(bufferSize).order(ByteOrder.BIG_ENDIAN);
    }
    
    private void ensureCapacityFor(int required) {
        if (required <= buffer.capacity()) return;
        if (required > maxBufferSize) {  // Use instance variable
            throw new JdbcConnectorException(...);
        }
        // ...
    }
}
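
As a side note on the initialization logic quoted above (this reading is inferred from the snippet, not verified against the full class): the shift expression rounds the requested size up to the next power of two, and the expansion cap is 1024 times that rounded value. For example:

int requested = 100_000;                                                // pg_copy_buffer_size as configured
int rounded = 1 << (32 - Integer.numberOfLeadingZeros(requested - 1));  // 131072, the next power of two
int maxBuffer = rounded * 1024;                                         // 134217728 bytes (128 MB) expansion cap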

Rationale: Each instance should have independent buffer configuration. Static variables will cause serious concurrency issues. This is a blocking issue that must be fixed.


Issue 3: Documentation and code default values are inconsistent

Location:

  • Documentation: docs/en/connectors/source/Jdbc.md and docs/zh/connectors/source/Jdbc.md
  • Code: seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/config/JdbcSourceOptions.java:169-173
// Default value in code
public static final Option<Integer> PG_COPY_BUFFER_SIZE =
        Options.key("pg_copy_buffer_size")
                .intType()
                .defaultValue(65536)  // 64KB
                .withDescription("Postgres copy buffer size");

Issue Description:
The documentation claims the default value of pg_copy_buffer_size is 1048576 (1MB), but the actual default value in code is 65536 (64KB), a 16-fold difference.

Potential Risks:

  • Configuration misleading: Users may rely on the default value in documentation and not explicitly configure this parameter, resulting in actual performance not meeting expectations
  • Troubleshooting difficulties: When users encounter performance issues, they will understand based on documentation that buffer size is 1MB, leading to incorrect troubleshooting direction

Impact Scope:

  • Direct impact: Users using COPY functionality
  • Affected area: PostgreSQL JDBC Source

Severity: MAJOR

Improvement Suggestions:
Unify the default values in documentation and code. Based on PostgreSQL COPY best practices, recommend using 1MB as the default value:

// JdbcSourceOptions.java
public static final Option<Integer> PG_COPY_BUFFER_SIZE =
        Options.key("pg_copy_buffer_size")
                .intType()
                .defaultValue(1048576)  // Change to 1MB, consistent with documentation
                .withDescription("Postgres copy buffer size (bytes). Only takes effect when use_copy_statement=true.");

Or, if 64KB is the more appropriate default value, update the documentation:

| pg_copy_buffer_size | Int | No | 65536 | Buffer size for COPY reading (bytes). |

Rationale: Inconsistency between documentation and code will cause user confusion and configuration errors. Must be unified before merging.


Issue 4: SQL injection risk

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgCopyInput.java:76-79

String selectSql = chunkSplitter.generateSplitQuerySQL(split, tableSchema);
String copySql = String.format(
    "COPY (%s) TO STDOUT WITH %s", selectSql, useBinary ? "BINARY" : "CSV");

Related Context:

  • Caller: PgCopyInput.open() method
  • SQL generation: ChunkSplitter.generateSplitQuerySQL()
  • Log output: PgCopyInput.java:82

Issue Description:
PgCopyInput directly uses string concatenation to build COPY SQL without any validation or escaping of selectSql. Although generateSplitQuerySQL() is controlled internally by SeaTunnel, if:

  1. split.getSplitQuery() contains user-provided raw SQL
  2. tablePath or table names contain special characters (such as semicolons, comment symbols)
  3. Future extensions support dynamic query construction

Then SQL injection may occur.

Potential Risks:

  • SQL injection: Malicious users can inject arbitrary SQL commands by constructing special table names or query conditions
  • Data leakage: Injected SQL may read data from other tables
  • Privilege escalation: Using COPY command to write files may lead to server file leakage

Impact Scope:

  • Direct impact: All PostgreSQL Source tasks using COPY functionality
  • Affected area: PostgreSQL JDBC Source

Severity: CRITICAL

Improvement Suggestions:

  1. Input validation: Perform basic validation on selectSql to ensure it does not contain dangerous SQL keywords or structures:
private void validateSqlSafety(String sql) {
    String upperSql = sql.toUpperCase();
    // Check for dangerous keyword combinations
    if (upperSql.contains(";") || 
        upperSql.contains("--") || 
        upperSql.contains("/*") ||
        upperSql.matches(".*\\b(DROP|DELETE|INSERT|UPDATE|ALTER|CREATE|TRUNCATE)\\b.*")) {
        throw new JdbcConnectorException(
            CommonErrorCodeDeprecated.ILLEGAL_ARGUMENT,
            "Potentially unsafe SQL detected: " + sql);
    }
}

public void open(JdbcSourceSplit split) {
    String selectSql = chunkSplitter.generateSplitQuerySQL(split, tableSchema);
    validateSqlSafety(selectSql);  // Add validation
    String copySql = String.format(
        "COPY (%s) TO STDOUT WITH %s", selectSql, useBinary ? "BINARY" : "CSV");
    // ...
}
  2. Sanitize the generated SQL: The COPY command does not go through PreparedStatement parameters, but you can at least strip statement separators from the embedded SELECT (and use quoteIdentifier() for any interpolated identifiers):
String copySql = String.format(
    "COPY (%s) TO STDOUT WITH %s", 
    selectSql.replaceAll(";", ""),  // Remove semicolon
    useBinary ? "BINARY" : "CSV");

Rationale: SQL injection is a serious security issue. Although the risk is low in the current scenario (SQL is generated by SeaTunnel), security programming best practices should be followed.


Issue 5: ChunkSplitter interface changes affect all subclasses

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/source/ChunkSplitter.java:370-373

public String generateSplitQuerySQL(JdbcSourceSplit split, TableSchema tableSchema) {
    throw new UnsupportedOperationException("Not supported by this splitter");
}

Related Context:

  • Subclass: FixedChunkSplitter.java:196-246 (implemented)
  • Subclass: DynamicChunkSplitter (not implemented)

Issue Description:
A new method generateSplitQuerySQL() was added to the ChunkSplitter base class with a default implementation that throws UnsupportedOperationException, but only FixedChunkSplitter overrides it. When DynamicChunkSplitter is used, the call will throw UnsupportedOperationException.

Potential Risks:

  • Runtime exception: Users configuring dynamic sharding (useDynamicSplitter=true) and enabling COPY will crash at runtime
  • Feature limitation: COPY functionality is only available in fixed sharding mode, limiting user choices
  • Inconsistency: use_copy_statement configuration is visible to all Splitter types, but only FixedChunkSplitter supports it

Impact Scope:

  • Direct impact: PostgreSQL Source tasks using dynamic sharding
  • Indirect impact: Large-volume scenarios relying on dynamic sharding
  • Affected area: PostgreSQL JDBC Source (dynamic sharding mode)

Severity: MAJOR

Improvement Suggestions:
There are two solutions:

Solution A: Implement the method for DynamicChunkSplitter:

// DynamicChunkSplitter.java
@Override
public String generateSplitQuerySQL(JdbcSourceSplit split, TableSchema tableSchema) {
    // Dynamic sharding is typically used for tables without primary keys or uneven shard keys
    // Use split.query directly as SELECT SQL
    if (StringUtils.isNotBlank(split.getSplitQuery())) {
        return split.getSplitQuery();
    }
    // Fall back to full table scan
    return String.format("SELECT * FROM %s", 
                        jdbcDialect.tableIdentifier(split.getTablePath()));
}

Solution B: Restrict COPY functionality to only support fixed sharding in configuration:

// JdbcSourceConfig.java
public static JdbcSourceConfig of(ReadonlyConfig config) {
    // ...
    boolean useCopy = config.get(JdbcSourceOptions.USE_COPY_STATEMENT);
    boolean useDynamic = config.getOptional(JdbcSourceOptions.PARTITION_COLUMN).isEmpty();
    
    if (useCopy && useDynamic) {
        throw new IllegalArgumentException(
            "use_copy_statement is only supported with fixed partition mode. " +
            "Please configure partition_column or set use_copy_statement=false.");
    }
    // ...
}

Rationale: Interface changes need to consider all implementation classes, otherwise runtime exceptions will occur. It is recommended to implement at least Solution B before merging, adding configuration validation.


Issue 6: Lack of integration tests and E2E tests

Location:

  • Unit tests: seatunnel-connectors-v2/connector-jdbc/src/test/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgNumericDecoderTest.java
  • E2E tests: seatunnel-e2e/seatunnel-connector-v2-e2e/ (no related tests found)

Issue Description:
The PR contains only one unit test file PgNumericDecoderTest.java, which tests the decoding logic of NUMERIC type. But lacks:

  1. Unit tests for PgCopyBinaryReader
  2. Unit tests for PgCopyCsvReader
  3. Integration tests for PgCopyInput
  4. End-to-end E2E tests (reading data from PostgreSQL to other Sinks)

Potential Risks:

  • Feature unverifiable: Cannot verify whether the feature works correctly through CI/CD processes
  • Regression risk: Future code changes may break COPY functionality without test coverage
  • User unable to reproduce: PR description states "verified using local SeaTunnelEngineLocalExample", but no reproducible test script is provided

Impact Scope:

  • Direct impact: Code quality and maintainability
  • Indirect impact: Stability of future versions
  • Affected area: Entire JDBC Connector

Severity: MAJOR

Improvement Suggestions:

  1. Add PgCopyReader unit tests:
// PgCopyBinaryReaderTest.java
@Test
public void testBinaryHeaderParsing() {
    // Use ISO-8859-1 so \377 stays a single 0xFF byte (the platform default charset may be UTF-8)
    byte[] header = "PGCOPY\n\377\r\n\0\0\0\0\0\0\0\0\0".getBytes(StandardCharsets.ISO_8859_1);
    InputStream stream = new ByteArrayInputStream(header);
    PgCopyBinaryReader reader = new PgCopyBinaryReader(stream, schema, 65536);
    // Verify headerParsed is true
}

@Test
public void testRowParsing() {
    // Construct test data: row with 2 fields
    byte[] rowData = {...};
    InputStream stream = new ByteArrayInputStream(rowData);
    PgCopyBinaryReader reader = new PgCopyBinaryReader(stream, schema, 65536);
    assertTrue(reader.hasNext());
    SeaTunnelRow row = reader.next();
    // Verify field values
}
  2. Add integration tests:
// JdbcPostgresCopyIT.java
@ExtendWith(PostgresContainerExtension.class)
public class JdbcPostgresCopyIT {
    @Test
    public void testCopyRead() throws Exception {
        // 1. Prepare test data
        // 2. Configure Source: use_copy_statement=true, binary=true
        // 3. Execute sync task
        // 4. Verify correctness of read data
    }
}
  3. Provide reproducible test scripts:
    Add to PR description:
# Test steps
# 1. Start PostgreSQL
docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=test -e POSTGRES_DB=test postgres:14

# 2. Prepare test data
psql -h localhost -U postgres -d test -c "CREATE TABLE test_copy AS SELECT g AS id, 'test_' || g AS name FROM generate_series(1, 10000) AS g"

# 3. Configure SeaTunnel task
# cat test_copy.conf
source {
  Jdbc {
    url = "jdbc:postgresql://localhost:5432/test"
    driver = "org.postgresql.Driver"
    user = "postgres"
    password = "test"
    query = "SELECT * FROM test_copy"
    use_copy_statement = true
    binary = true
  }
}

sink {
  Console {}
}

# 4. Run task
./bin/seatunnel.sh -c test_copy.conf

Rationale: Insufficient test coverage is a code quality issue. Especially for core functionality (such as high-performance data transfer), sufficient test coverage is essential.


Issue 7: Improper log level may leak sensitive information

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgCopyInput.java:82

LOG.info("Open PG COPY split={}, sql={}", split.splitId(), copySql);

Related Context:

  • Logging framework: SLF4J
  • Log location: PgCopyInput.open() method

Issue Description:
Logging complete COPY SQL at INFO level may contain:

  1. Sensitive business data (if query contains plaintext values)
  2. Table structure information (may leak data model)
  3. Filter conditions (may leak business logic)

In production environments, INFO level logs are persisted to log systems, posing information leakage risks.

Potential Risks:

  • Data leakage: Log systems may be accessed by unauthorized personnel, leading to sensitive information leakage
  • Log bloat: Logging SQL every time a split is opened will cause log volume to surge
  • Performance impact: Frequent INFO logging will affect performance

Impact Scope:

  • Direct impact: Production environment security
  • Indirect impact: Log storage costs
  • Affected area: All tasks using COPY functionality

Severity: MINOR

Improvement Suggestions:
Change log level from INFO to DEBUG:

// PgCopyInput.java
public void open(JdbcSourceSplit split) {
    try {
        String selectSql = chunkSplitter.generateSplitQuerySQL(split, tableSchema);
        String copySql = String.format(
            "COPY (%s) TO STDOUT WITH %s", selectSql, useBinary ? "BINARY" : "CSV");
        
        Connection conn = getConnection();
        LOG.debug("Open PG COPY split={}, sql={}", split.splitId(), copySql);  // Change to DEBUG
        
        copyManagerProxy = new CopyManagerProxy(conn);
        copyStream = copyManagerProxy.copyOutAsStream(copySql);
        // ...
    } catch (Exception e) {
        // existing exception handling (elided in this snippet)
        throw new JdbcConnectorException(...);
    }
}

Additionally, a switch can be added to configuration to control whether to log detailed SQL:

// JdbcSourceOptions.java
public static final Option<Boolean> LOG_SQL =
        Options.key("log_sql")
                .booleanType()
                .defaultValue(false)
                .withDescription("Whether to log SQL statements (DEBUG level). Only for debugging.");

// PgCopyInput.java
if (config.isLogSql()) {
    LOG.debug("Open PG COPY split={}, sql={}", split.splitId(), copySql);
} else {
    LOG.info("Open PG COPY split={}", split.splitId());  // Do not log SQL
}

Rationale: Protecting sensitive information is a security best practice. Recommend changing the log level to DEBUG before merging, or adding a configuration switch.


Issue 8: Unreasonable buffer size configuration may cause memory overflow

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgCopyBinaryReader.java:127

MAX_BUFFER_SIZE = BUFFER_SIZE * 1024;

Related Context:

  • Initialization: PgCopyBinaryReader.java:118-127
  • Expansion logic: PgCopyBinaryReader.java:207-233

Issue Description:
The maximum buffer size is set to 1024 times the initial size:

  • Default initial size: 64KB (65536 bytes)
  • Maximum buffer: 64MB (65536 * 1024)
  • If user configures 1MB initial size, maximum buffer will reach 1GB

Although the code has ensureCapacityFor() checks, in extreme cases (single field is very large), memory overflow may still occur.

Potential Risks:

  • Memory overflow: If PostgreSQL table has large TEXT/BYTEA fields, buffer expansion to the upper limit may be triggered
  • Configuration risk: Users may configure excessively large initial buffer size, causing all instances to attempt to allocate huge buffers
  • Concurrency amplification: Multiple concurrent tasks expanding buffers simultaneously may cause host memory exhaustion

Impact Scope:

  • Direct impact: Tasks using COPY Binary mode
  • Indirect impact: Cluster stability
  • Affected area: PostgreSQL JDBC Source (Binary mode)

Severity: MAJOR

Improvement Suggestions:

  1. Set reasonable upper limit:
// PgCopyBinaryReader.java
private static final int MAX_ALLOWED_BUFFER_SIZE = 256 * 1024 * 1024;  // 256MB hard limit

public PgCopyBinaryReader(InputStream stream, TableSchema schema, Integer pgCopyBufferSize) {
    // ...
    MAX_BUFFER_SIZE = Math.min(bufferSize * 1024, MAX_ALLOWED_BUFFER_SIZE);
    this.buffer = ByteBuffer.allocate(bufferSize).order(ByteOrder.BIG_ENDIAN);
}
  2. Add configuration validation:
// JdbcSourceOptions.java
public static final Option<Integer> PG_COPY_BUFFER_SIZE =
        Options.key("pg_copy_buffer_size")
                .intType()
                .defaultValue(65536)
                .withDescription("Postgres copy buffer size (bytes). Must be between 65536 and 10485760 (10MB).");

// JdbcSourceConfig.java
public static JdbcSourceConfig of(ReadonlyConfig config) {
    int bufferSize = config.get(JdbcSourceOptions.PG_COPY_BUFFER_SIZE);
    if (bufferSize < 65536 || bufferSize > 10 * 1024 * 1024) {
        throw new IllegalArgumentException(
            "pg_copy_buffer_size must be between 64KB and 10MB, got: " + bufferSize);
    }
    // ...
}
  3. Add documentation notes:
    In configuration documentation, clearly explain buffer size limitations and recommended values:
| pg_copy_buffer_size | Int | No | 65536 | Buffer size for COPY reading (bytes). Recommended range: 65536-1048576 (64KB-1MB). Larger values may improve performance but increase memory usage. Maximum allowed: 10485760 (10MB). |

Rationale: Memory management is a critical issue in production environments. Reasonable upper limits must be set to prevent memory overflow.


Issue 9: Improper exception handling may lead to resource leaks

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/copy/PgCopyUtils.java:41-50

public static void closeQuietly(Object obj) {
    if (obj instanceof Closeable) {
        Closeable c = (Closeable) obj;
        try {
            c.close();
        } catch (Exception ignored) {
            // Exception ignored
        }
    }
}

Related Context:

  • Caller: PgCopyInput.close()
  • Resource types: PgCopyReader, InputStream, CopyManagerProxy

Issue Description:
The closeQuietly() method silently ignores all close exceptions, which may lead to:

  1. Undetected resource leaks: If close() method throws an exception, it indicates resources may not be properly released, but the exception is ignored and the problem is masked
  2. Uncertain transaction state: For database connections, close failure may cause transactions not to be properly rolled back or committed
  3. Difficult debugging: Resource leak issues in production environments are difficult to troubleshoot

Potential Risks:

  • Resource leaks: Database connections, file handles and other resources not properly released
  • Connection pool exhaustion: Large numbers of leaked connections cause connection pool exhaustion
  • Data inconsistency: Transactions not properly ended cause data inconsistency

Impact Scope:

  • Direct impact: Tasks using COPY functionality
  • Indirect impact: Database connection pool availability
  • Affected area: PostgreSQL JDBC Source

Severity: MAJOR

Improvement Suggestions:

  1. Log close exceptions:
public static void closeQuietly(Object obj, String resourceName) {
    if (obj instanceof Closeable) {
        Closeable c = (Closeable) obj;
        try {
            c.close();
        } catch (Exception e) {
            // Log exception but do not throw
            LOG.warn("Failed to close resource: {}", resourceName, e);
            // Optional: Add exception to monitoring metrics
        }
    }
}

// PgCopyInput.java
@Override
public void close() {
    PgCopyUtils.closeQuietly(reader, "PgCopyReader");
    PgCopyUtils.closeQuietly(copyStream, "PgCopyInputStream");
    PgCopyUtils.closeQuietly(copyManagerProxy, "CopyManagerProxy");
}
  2. Clean up resources on failure:
    Refactor PgCopyInput.open() so that partially acquired resources are released when an exception occurs (the stream and reader must outlive open(), so try-with-resources cannot wrap them directly):
public void open(JdbcSourceSplit split) {
    try {
        copyManagerProxy = new CopyManagerProxy(getConnection());
        copyStream = copyManagerProxy.copyOutAsStream(copySql);
        reader = createReader(copyStream);
        hasNext = reader.hasNext();
    } catch (Exception e) {
        // Ensure resource cleanup on exception
        close();
        throw e;
    }
}
  3. Add resource leak detection (optional, for testing):
private boolean closed = false;

@Override
public void close() {
    if (closed) {
        LOG.warn("PgCopyInput already closed");
        return;
    }
    try {
        // Close resource
    } finally {
        closed = true;
    }
}

@Override
protected void finalize() throws Throwable {
    try {
        if (!closed) {
            LOG.error("PgCopyInput leaked, not closed properly");
        }
    } finally {
        super.finalize();
    }
}

Rationale: Proper resource management is the foundation of stability. Close exceptions should not be silently ignored. Recommend at least logging warning messages.


Issue 10: Non-standard configuration namespace

Location:
seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/config/JdbcSourceOptions.java:163-173

public static final Option<Boolean> BINARY =
        Options.key("binary")
                .booleanType()
                .defaultValue(false)
                .withDescription("Use binary copy mode");

public static final Option<Integer> PG_COPY_BUFFER_SIZE =
        Options.key("pg_copy_buffer_size")
                .intType()
                .defaultValue(65536)
                .withDescription("Postgres copy buffer size");

Issue Description:
Configuration item naming is non-standard:

  1. Lack of namespace: binary and pg_copy_buffer_size should belong to copy namespace
  2. Inconsistent prefixes: pg_copy_buffer_size carries a pg_copy_ prefix while binary and use_copy_statement carry no COPY-related prefix, so the new options do not follow a single naming pattern
  3. Unclear prefix: binary is too generic and may conflict with other future binary functionality

Potential Risks:

  • Configuration conflicts: Future addition of other binary-related functionality may cause naming conflicts
  • Poor readability: Users cannot tell from configuration names that these are COPY-related configurations
  • Poor extensibility: If supporting other database COPY functionality (such as MySQL's LOAD DATA), configuration items will be confusing

Impact Scope:

  • Direct impact: Configuration readability and maintainability
  • Affected area: JDBC Source configuration

Severity: MINOR

Improvement Suggestions:
Use namespaced configuration names:

// JdbcSourceOptions.java
public static final Option<Boolean> COPY_ENABLED =
        Options.key("copy.enabled")
                .booleanType()
                .defaultValue(false)
                .withDescription("Whether to use COPY method for reading (PostgreSQL only).");

public static final Option<Boolean> COPY_BINARY =
        Options.key("copy.binary")
                .booleanType()
                .defaultValue(false)
                .withDescription("Whether to use binary format for COPY reading. Only takes effect when copy.enabled=true.");

public static final Option<Integer> COPY_BUFFER_SIZE =
        Options.key("copy.buffer_size")
                .intType()
                .defaultValue(65536)
                .withDescription("Buffer size for COPY reading (bytes). Only takes effect when copy.enabled=true.");

Corresponding configuration example:

source {
  Jdbc {
    url = "jdbc:postgresql://localhost:5432/test"
    # ...
    copy {
      enabled = true
      binary = true
      buffer_size = 1048576
    }
  }
}

Rationale: Using namespaces can improve configuration readability and extensibility, avoiding future naming conflicts. Although this is a minor issue, recommend optimizing before merging.


Issue 11: Exactly-Once semantic incompatibility

Location:
Entire COPY functionality implementation

Issue Description:
The COPY protocol is streaming and does not support offset recording and checkpoint resumption. This is incompatible with SeaTunnel's Exactly-Once semantics:

  1. No offset concept: COPY protocol does not support resuming from intermediate positions, only from the beginning
  2. State cannot be serialized: PgCopyInput and PgCopyReader do not implement Serializable, cannot save state in checkpoints
  3. Difficult failure recovery: If a task fails during reading, after restart it can only read from the beginning, potentially producing duplicate data

Potential Risks:

  • Data duplication: After task failure and restart, it will re-read from the split's beginning, causing data duplication
  • Exactly-Once configuration misleading: Users enable is_exactly_once=true and use_copy_statement=true, but Exactly-Once cannot actually be guaranteed
  • State inconsistency: After checkpoint recovery, data position cannot be restored

Impact Scope:

  • Direct impact: Tasks requiring Exactly-Once semantics
  • Indirect impact: Data consistency
  • Affected area: PostgreSQL JDBC Source (COPY mode)

Severity: MAJOR

Improvement Suggestions:

  1. Add configuration validation:
// JdbcSourceConfig.java
public static JdbcSourceConfig of(ReadonlyConfig config) {
    boolean useCopy = config.get(JdbcSourceOptions.COPY_ENABLED);
    boolean isExactlyOnce = config.get(JdbcCommonOptions.IS_EXACTLY_ONCE);
    
    if (useCopy && isExactlyOnce) {
        throw new IllegalArgumentException(
            "use_copy_statement is incompatible with is_exactly_once=true. " +
            "COPY protocol does not support offset-based recovery. " +
            "Please either set use_copy_statement=false or is_exactly_once=false.");
    }
    // ...
}
  2. Clearly state in documentation:
### Limitations

- **Exactly-Once**: COPY mode does NOT support exactly-once semantics. When `use_copy_statement=true`, 
  the connector operates in at-least-once mode. If you require exactly-once guarantees, please keep 
  `use_copy_statement=false` (use standard SELECT queries).

- **Fault Tolerance**: COPY mode cannot recover from intermediate positions. If a task fails, it will 
  re-read the entire split from the beginning, which may result in duplicate records.
  3. Provide best practices:
    Add usage scenario recommendations in documentation:
### When to use COPY mode?

**Recommended scenarios**:
- Full data loads (not incremental sync)
- One-time batch migrations
- Scenarios where duplicate records can be tolerated or deduplicated downstream

**NOT recommended for**:
- Incremental CDC scenarios
- Exactly-once required workloads
- Frequent fault recovery environments

Rationale: Feature limitations should be clearly stated in the design phase to avoid user misuse causing data consistency issues. This is a problem that must be resolved before merging.


Issue 12: Lack of quantitative data for performance improvements

Location:
PR description and documentation

Issue Description:
The PR description and documentation claim that the COPY command "is significantly faster than standard SELECT queries", but lack:

  1. Benchmark data: No performance comparison test results provided
  2. Quantitative metrics: No specific multiples or percentages of performance improvement stated
  3. Applicable scenarios: No explanation of what data volumes and scenarios show significant performance improvements
  4. Cost analysis: No explanation of additional costs of COPY mode (such as memory usage, CPU usage)

Potential Risks:

  • Excessive expectations: Users may expect unrealistic performance improvements
  • Misuse risk: Users may enable COPY in small data volume scenarios, actually reducing performance
  • Wrong optimization direction: Lack of data support makes it impossible to optimize specifically

Impact Scope:

  • Direct impact: User usage decisions
  • Affected area: PostgreSQL JDBC Source users

Severity: MINOR

Improvement Suggestions:

  1. Provide benchmark results (optional, but strongly recommended):
### Performance Benchmarks

We tested COPY mode against standard SELECT queries on a PostgreSQL 14 instance with different data volumes:

| Rows | Data Size | SELECT (time) | COPY CSV (time) | COPY Binary (time) | Speedup (CSV) | Speedup (Binary) |
|------|-----------|---------------|-----------------|-------------------|---------------|------------------|
| 10K  | 1 MB      | 0.5s          | 0.3s            | 0.2s              | 1.7x          | 2.5x             |
| 100K | 10 MB     | 5.2s          | 2.1s            | 1.4s              | 2.5x          | 3.7x             |
| 1M   | 100 MB    | 58s           | 18s             | 12s               | 3.2x          | 4.8x             |
| 10M  | 1 GB      | 650s          | 185s            | 120s              | 3.5x          | 5.4x             |

**Test Environment**:
- PostgreSQL 14.5 on Ubuntu 20.04
- 4 vCPU, 16GB RAM
- Network: 1Gbps
- SeaTunnel 2.3.8

**Conclusion**: COPY mode provides significant performance improvements, especially for large data volumes (>100K rows). Binary format is consistently faster than CSV format.
  2. Add usage recommendations:
### Performance Tuning

**Buffer Size**:
- Default (64KB): Good for most workloads
- Large datasets (>1GB): Increase to 1MB or larger
- Limited memory: Decrease to 32KB

**Binary vs CSV**:
- Binary: Faster, but requires PostgreSQL-specific parsing
- CSV: Slightly slower, but more portable
- Recommendation: Use Binary for production, CSV for debugging

**Partition Strategy**:
- COPY mode works best with parallel partitioning
- Ensure `partition_column` is properly configured for optimal performance
  3. Add monitoring metric descriptions:
### Monitoring

When COPY mode is enabled, the following metrics are logged at task completion:
- `rows_parsed`: Total number of rows read
- `bytes_read`: Total bytes transferred
- `buffer_expansions`: Number of times the buffer was expanded
- `elapsed_ms`: Total time spent reading
- `rows_per_second`: Read throughput
- `bytes_per_second`: Data transfer rate

Example log:

PG COPY summary: rows=1000000 rows, bytes=104857600 B, expansions=2 times,
elapsed=15000 ms, rows_per_second=66666 rows/s, bytes_per_second=6990506 B/s

Rationale: Although this is not a technical issue, providing performance data can help users make informed decisions and also demonstrates the value of the feature. Recommend adding at least simple performance comparison explanations before merging.

