[Improve][connector-elasticsearch-v2] Add slicing support and e2e coverage for Elasticsearch source#10454
CosmosNi wants to merge 7 commits into apache:dev
…erage for Elasticsearch source
# Conflicts:
#   docs/en/connectors/source/Elasticsearch.md
#   docs/zh/connectors/source/Elasticsearch.md
#   seatunnel-connectors-v2/connector-elasticsearch/src/main/java/org/apache/seatunnel/connectors/seatunnel/elasticsearch/client/EsRestClient.java
#   seatunnel-connectors-v2/connector-elasticsearch/src/main/java/org/apache/seatunnel/connectors/seatunnel/elasticsearch/config/ElasticsearchConfig.java
#   seatunnel-connectors-v2/connector-elasticsearch/src/main/java/org/apache/seatunnel/connectors/seatunnel/elasticsearch/config/ElasticsearchSourceOptions.java
#   seatunnel-connectors-v2/connector-elasticsearch/src/main/java/org/apache/seatunnel/connectors/seatunnel/elasticsearch/source/ElasticsearchSourceReader.java
#   seatunnel-e2e/seatunnel-connector-v2-e2e/connector-elasticsearch-e2e/src/test/java/org/apache/seatunnel/e2e/connector/elasticsearch/ElasticsearchIT.java
Issue 1: Shared PIT Resource Leak After Checkpoint Recovery
Location:
Related Context:
Problem Description:
Potential Risks:
Impact Scope:
Severity: MAJOR
Improvement Suggestion:
// Option 1: Extract PIT ID from restored split
public ElasticsearchSourceSplitEnumerator(
        SourceSplitEnumerator.Context<ElasticsearchSourceSplit> context,
        ElasticsearchSourceState sourceState,
        ReadonlyConfig connConfig,
        List<ElasticsearchConfig> elasticsearchConfigs) {
    this.context = context;
    this.connConfig = connConfig;
    this.pendingSplit = new HashMap<>();
    this.sharedPitIds = new HashMap<>();
    this.shouldEnumerate = sourceState == null;
    if (sourceState != null) {
        this.shouldEnumerate = sourceState.isShouldEnumerate();
        this.pendingSplit.putAll(sourceState.getPendingSplit());
        // Restore sharedPitIds
        for (List<ElasticsearchSourceSplit> splits : sourceState.getPendingSplit().values()) {
            for (ElasticsearchSourceSplit split : splits) {
                String pitId = split.getElasticsearchConfig().getPitId();
                if (StringUtils.isNotEmpty(pitId)) {
                    String indexName = split.getElasticsearchConfig().getIndex();
                    sharedPitIds.putIfAbsent(indexName, pitId);
                }
            }
        }
    }
    this.elasticsearchConfigs = elasticsearchConfigs;
}
// Option 2: Explicitly save sharedPitIds in ElasticsearchSourceState
// Need to modify ElasticsearchSourceState class
Rationale: Ensure that shared PIT resources can be properly tracked and cleaned up after checkpoint recovery.

Issue 2: ElasticsearchConfig Missing serialVersionUID
Location:
Related Context:
Problem Description:
Potential Risks:
Impact Scope:
Severity: MAJOR
Improvement Suggestion:
@Getter
@Setter
public class ElasticsearchConfig implements Serializable {
    private static final long serialVersionUID = 1L; // Add serialVersionUID
    private String index;
    // ... other fields
}
Rationale: Explicitly declaring serialVersionUID ensures serialization compatibility and avoids cross-version recovery failures.

Issue 3: Duplicate SQL Mode Validation Logic
Locations:
Related Context:
Problem Description:
Potential Risks:
Impact Scope:
Severity: MINOR
Improvement Suggestion:
// Option 1: Validate only once in ElasticsearchSource (recommended)
// Keep validation in ElasticsearchSource.java
// Remove validation in ElasticsearchSourceSplitEnumerator.java, directly use elasticsearchConfig.getSliceMax()
// Option 2: Extract as static utility method
public static int validateSliceMaxForSearchType(SearchTypeEnum searchType, int sliceMax) {
    if (SearchTypeEnum.SQL.equals(searchType) && sliceMax > 1) {
        log.warn("SQL search_type does not support slicing. slice_max will be ignored.");
        return 1;
    }
    return Math.max(1, sliceMax);
}
Rationale: Eliminate code redundancy and improve maintainability.

Issue 4: E2E Test Data Insufficient to Verify Data Correctness
Locations:
Related Context:
Problem Description:
Potential Risks:
Impact Scope:
Severity: MAJOR
Improvement Suggestion:
@TestTemplate
public void testElasticsearchWithPITSlice(TestContainer container)
        throws IOException, InterruptedException {
    Container.ExecResult execResult =
            container.executeJob("/elasticsearch/elasticsearch_source_with_pit_slice.conf");
    Assertions.assertEquals(0, execResult.getExitCode());
    List<String> sinkData = readSinkDataWithSchema("st_index_pit_slice");
    // 1. Verify data uniqueness (newly added)
    Set<String> uniqueData = new HashSet<>(sinkData);
    Assertions.assertEquals(sinkData.size(), uniqueData.size(),
            "Data should not have duplicates");
    // 2. Verify data integrity (existing)
    Assertions.assertIterableEquals(mapTestDatasetForDSL(), sinkData);
    // 3. Verify data volume (newly added)
    int expectedCount = mapTestDatasetForDSL().size();
    Assertions.assertEquals(expectedCount, sinkData.size(),
            "Data count should match expected");
}
// Increase test data volume (in generateTestDataSet1)
for (int i = 0; i < 1000; i++) { // Increase from 100 to 1000
    // ...
}
Rationale: Ensure the correctness of slice functionality and avoid data duplication or loss issues in production environments.

Issue 5: Missing Fallback Handling for PIT Creation Failure
Location:
Related Context:
sharedPitId = sharedPitIds.computeIfAbsent(
        indexName,
        key -> esRestClient.createPointInTime(
                key, elasticsearchConfig.getPitKeepAlive()));
Problem Description:
Potential Risks:
Impact Scope:
Severity: MINOR
Improvement Suggestion:
if (useSharedPit) {
    try {
        sharedPitId = sharedPitIds.computeIfAbsent(
                indexName,
                key -> {
                    try {
                        return esRestClient.createPointInTime(
                                key, elasticsearchConfig.getPitKeepAlive());
                    } catch (Exception e) {
                        log.warn("Failed to create shared PIT for index: {}, fallback to sliceMax=1. Error: {}",
                                key, e.getMessage());
                        return null;
                    }
                });
        // If PIT creation fails, fall back to not using slices
        if (sharedPitId == null) {
            sliceMax = 1;
            useSharedPit = false;
        }
    } catch (Exception e) {
        log.warn("Exception during PIT creation for index: {}, fallback to sliceMax=1",
                indexName, e);
        sliceMax = 1;
        useSharedPit = false;
    }
}
Rationale: Improve system fault tolerance and availability.
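The fallback suggested for Issue 5 relies on a documented property of Map.computeIfAbsent: when the mapping function returns null, no entry is recorded, so a failed PIT creation leaves the map clean and a later call can try again. The sketch below isolates that behaviour outside the connector. It is a minimal illustration only: SharedPitFallbackSketch, getOrCreatePit, effectiveSliceMax, and the pitCreator function are hypothetical stand-ins for the enumerator logic and esRestClient.createPointInTime, not code from this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the enumerator's shared-PIT bookkeeping.
class SharedPitFallbackSketch {

    // Returns the cached PIT id for the index, creating it on first use.
    // If creation throws, the lambda returns null, and computeIfAbsent
    // records no mapping for the key in that case.
    static String getOrCreatePit(
            Map<String, String> sharedPitIds,
            String indexName,
            Function<String, String> pitCreator) {
        return sharedPitIds.computeIfAbsent(
                indexName,
                key -> {
                    try {
                        return pitCreator.apply(key);
                    } catch (RuntimeException e) {
                        return null; // signal "no shared PIT available"
                    }
                });
    }

    // Without a shared PIT, fall back to a single unsliced read.
    static int effectiveSliceMax(String sharedPitId, int requestedSliceMax) {
        return sharedPitId == null ? 1 : Math.max(1, requestedSliceMax);
    }

    public static void main(String[] args) {
        Map<String, String> sharedPitIds = new HashMap<>();

        // Assumed failure mode: the creator throws on the first attempt.
        String first = getOrCreatePit(sharedPitIds, "st_index",
                key -> { throw new IllegalStateException("PIT unavailable"); });
        System.out.println("first attempt -> sliceMax " + effectiveSliceMax(first, 4));

        // Because nothing was cached, a later attempt can still succeed.
        String second = getOrCreatePit(sharedPitIds, "st_index", key -> "pit-" + key);
        System.out.println("second attempt -> sliceMax " + effectiveSliceMax(second, 4));
    }
}
```

One consequence worth checking in review: because the failure is not cached, every split for the same index retries PIT creation. Whether that is desirable depends on how transient the failure is expected to be.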
This change adds configurable slicing to the Elasticsearch source (slice_max), enabling parallel reads for Scroll and PIT while keeping SQL mode unchanged. It propagates slice metadata through splits, injects slice parameters into Scroll/PIT requests, and logs active slice info at runtime. E2E coverage is extended with PIT/Scroll slicing scenarios, and docs are updated to describe the new option and examples.
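To make "injects slice parameters into Scroll/PIT requests" concrete, the following sketch builds the body of a PIT search with an optional slice clause. The "pit" and "slice" objects follow the Elasticsearch search API, where each slice carries an "id" and a "max" and "max" must be greater than 1. Everything else is assumed for illustration: SliceRequestSketch and buildPitBody are hypothetical names and do not reflect how EsRestClient assembles requests in this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of the connector.
class SliceRequestSketch {

    // Builds a PIT search body. The slice clause is added only when
    // sliceMax > 1, since a single slice is just an ordinary search.
    static Map<String, Object> buildPitBody(
            String pitId, String keepAlive, int sliceId, int sliceMax) {
        Map<String, Object> body = new LinkedHashMap<>();

        Map<String, Object> pit = new LinkedHashMap<>();
        pit.put("id", pitId);
        pit.put("keep_alive", keepAlive);
        body.put("pit", pit);

        if (sliceMax > 1) {
            Map<String, Object> slice = new LinkedHashMap<>();
            slice.put("id", sliceId);   // 0-based index of this slice
            slice.put("max", sliceMax); // total number of slices
            body.put("slice", slice);
        }
        return body;
    }

    public static void main(String[] args) {
        // With slice_max = 2, two readers would share one PIT id and
        // differ only in the slice id they send.
        for (int sliceId = 0; sliceId < 2; sliceId++) {
            System.out.println(buildPitBody("example-pit-id", "1m", sliceId, 2));
        }
    }
}
```

Sharing one PIT id across all slices of an index is what lets the slices see the same snapshot, which is why the review's Issue 1 and Issue 5 both centre on how that shared id is created and tracked.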
Purpose of this pull request
Does this PR introduce any user-facing change?
How was this patch tested?
Check list
New License Guide
incompatible-changes.md to describe the incompatibility caused by this PR.