Skip to content

[Improve][connector-elasticsearchv2]Optimize Elasticsearch source che…#10446

Open
CosmosNi wants to merge 2 commits intoapache:devfrom
CosmosNi:improve-es-code
Open

[Improve][connector-elasticsearchv2]Optimize Elasticsearch source che…#10446
CosmosNi wants to merge 2 commits intoapache:devfrom
CosmosNi:improve-es-code

Conversation

@CosmosNi
Copy link
Contributor

@CosmosNi CosmosNi commented Feb 4, 2026

This change narrows the checkpoint lock scope in the Elasticsearch source reader so ES I/O, parsing, and sleeps happen outside the lock, reducing checkpoint delay while keeping collect atomic. It also adds a best‑effort SQL cursor close to release ES server‑side resources when SQL pagination is interrupted or completes.

Purpose of this pull request

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

@DanielCarter-stack
Copy link

Issue 1: PIT hasMore logic change may cause infinite loop or additional queries

Location: EsRestClient.java:1072

// Before modification
boolean hasMore = docs.size() > 0 && totalHits > 0 && docs.size() < totalHits;

// After modification
boolean hasMore = !docs.isEmpty();

Related Context:

  • Caller: ElasticsearchSourceReader.java:205 (while (pitResult.isHasMore()))
  • ES official documentation: PIT pagination should determine if there is more data through docs.size() < totalHits

Problem Description:
The original logic considered pagination semantics: when the number of returned documents is less than the total, it indicates there is more data. The new logic simplifies to "continue as long as there is data," which will lead to additional invalid queries when processing the last batch of data.

Potential Risks:

  • Risk 1: When querying the last batch of data, !docs.isEmpty() returns true, causing the loop to execute one more time, and ES returns empty results
  • Risk 2: Increases unnecessary network overhead and latency
  • Risk 3: If ES throws an exception instead of returning empty results when there is no data, it may cause task failure

Scope of Impact:

  • Direct Impact: All Source tasks using PIT API
  • Impact Area: Single Connector (Elasticsearch)

Severity: MAJOR

Improvement Suggestions:

// Restore the original logic, or if simplification is indeed necessary, explain the reason in the PR description
boolean hasMore = !docs.isEmpty() && docs.size() < totalHits;
// If totalHits is unreliable, other judgment methods can be used, but thorough testing is required

Rationale:

  • The original logic is the correct pagination judgment
  • If the original logic has problems, it should be explained in the PR description with test evidence
  • There is currently no evidence that the original logic has a bug

Issue 2: PR description does not match actual changes

Location: PR description

"This change narrows the checkpoint lock scope in the Elasticsearch source reader"

Related Context:

  • ElasticsearchSourceReader.java:78-91: The synchronized block scope of the pollNext() method was not changed
  • All changes in the PR are inside the scrollSearchResult() method

Problem Description:
The PR title and description claim to "optimize checkpoint lock scope," but the actual code does not modify the scope of the checkpoint lock at all. All operations (including ES I/O, parsing, sleep) are still inside the synchronized (output.getCheckpointLock()) block.

Potential Risks:

  • Risk 1: Misleads reviewers and users into thinking performance issues have been resolved
  • Risk 2: Actual checkpoint performance issues remain unfixed
  • Risk 3: If users do not report performance issues because of this PR, the problem will be hidden

Scope of Impact:

  • Direct Impact: All tasks using Elasticsearch Source
  • Indirect Impact: User experience and community trust

Severity: MAJOR

Improvement Suggestions:

  1. Correct PR description: Change the title to "[Improve][connector-elasticsearchv2] Add SQL cursor cleanup for Elasticsearch source"
  2. Or truly optimize checkpoint lock: If you really want to optimize the lock scope, you should modify it like this:
@Override
public void pollNext(Collector<SeaTunnelRow> output) throws Exception {
    ElasticsearchSourceSplit split;
    synchronized (output.getCheckpointLock()) {
        split = splits.poll();  // Only perform state reads inside the lock
    }
    
    if (split != null) {
        // Execute I/O outside the lock
        SeaTunnelRowType seaTunnelRowType = split.getSeaTunnelRowType();
        ElasticsearchConfig sourceIndexInfo = split.getElasticsearchConfig();
        scrollSearchResult(seaTunnelRowType, sourceIndexInfo, output);
    } else if (noMoreSplit) {
        synchronized (output.getCheckpointLock()) {
            log.info("Closed the bounded ELasticsearch source");
            context.signalNoMoreElement();
        }
    } else {
        Thread.sleep(pollNextWaitTime);  // Sleep outside the lock
    }
}

However, this change requires more careful concurrent safety review, because output.collect() needs to be called inside the lock

Rationale:

  • Code review must be based on actual changes, not PR descriptions
  • Mismatch between description and implementation can cause serious issues to be overlooked

Issue 3: Timing of SQL cursor closure may cause resources not to be released

Location: ElasticsearchSourceReader.java:119-127

} finally {
    if (StringUtils.isNotEmpty(cursor)) {
        try {
            esRestClient.closeSqlCursor(cursor);
        } catch (Exception e) {
            log.warn("Failed to close SQL cursor: " + cursor, e);
        }
    }
}

Related Context:

  • ES SQL API documentation: cursor indicates completion when the last query returns an empty string
  • EsRestClient.java:421-449: In the getDocsFromSqlResponse() method, scrollId is only set when the response contains the "cursor" field

Problem Description:
Under normal circumstances, ES SQL pagination will set the cursor to an empty string (or not return the cursor field) on the last return, at which point the while loop exits. The finally block will attempt to close the last non-empty cursor, which is correct.

However, there is an edge case: if searchBySql() returns empty results on the first call (for example, the queried table is empty), the cursor is null at this time and the close operation will not be executed. But if the first query returns a cursor but no rows (unlikely, but theoretically possible), the cursor will be closed, which is the correct behavior.

Potential Risks:

  • Risk 1: If ES returns a cursor but no data under abnormal circumstances, the finally block will correctly close it (this is good)
  • Risk 2: If outputFromScrollResult() throws an exception, the cursor will be correctly closed (this is good)
  • Actual Verification: This logic is correct and there is no problem

Scope of Impact:

  • Direct Impact: SQL query scenarios
  • Actual Impact: Positive (fixed resource leak)

Severity: MINOR (not a problem, but needs verification)

Improvement Suggestions:
It is recommended to add test cases to verify the following scenarios:

  1. Cursor closure upon normal completion
  2. Cursor closure upon abnormal interruption
  3. First query returns empty results

Rationale:

  • Although the logic appears correct, it needs to be verified through tests to verify the actual behavior of the ES API

Issue 4: Missing unit tests and integration tests

Location: Entire PR

Problem Description:
The PR does not contain any test code, but this is an important modification involving resource management and concurrent safety.

Potential Risks:

  • Risk 1: PIT hasMore logic changes are not covered by tests and may fail in certain edge cases
  • Risk 2: SQL cursor closure logic is not tested and cannot be verified whether it is correct under various abnormal circumstances
  • Risk 3: Future refactoring may break this logic

Scope of Impact:

  • Direct Impact: Code quality and maintainability
  • Indirect Impact: Users may encounter problems in production environments

Severity: MAJOR

Improvement Suggestions:
Add the following tests:

@Test
public void testSqlCursorClosedOnNormalCompletion() {
    // Test that cursor is closed when test completes normally
}

@Test
public void testSqlCursorClosedOnException() {
    // Test that cursor is closed when test throws exception
}

@Test
public void testPitHasMoreLogic() {
    // Test PIT hasMore return values in various scenarios
    // - Empty results
    // - Intermediate batch
    // - Last batch
    // - Total count is 0
}

Rationale:

  • Resource management logic must have test coverage
  • Behavior changes (PIT hasMore) must have test verification

*/
public boolean closeSqlCursor(String cursor) {
if (StringUtils.isEmpty(cursor)) {
log.warn("Attempted to close SQL cursor with empty cursor");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution. I don't quite understand the log printing here. You can say that an empty cursor doesn't need to be closed, What do you think?

return false;
}
} catch (Exception ex) {
log.warn("Failed to close SQL cursor: " + cursor, ex);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When printing logs, it is best to standardize the format and use placeholders {}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GET

try {
esRestClient.closeSqlCursor(cursor);
} catch (Exception e) {
log.warn("Failed to close SQL cursor: " + cursor, e);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@zhangshenghang
Copy link
Member

Please add some tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants