[Fix][Core] Fix Memory leak #10418

Open

zhangshenghang wants to merge 5 commits into apache:dev from zhangshenghang:fix-oom-26

Conversation

@zhangshenghang
Member

Purpose of this pull request

In extreme cases (for example, when network communication is unstable), resources in the IMap may not be released.

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

@github-actions bot added the Zeta label Jan 29, 2026
@DanielCarter-stack

Issue 1: Missing Maximum Retry Limit

Location:

  • CoordinatorService.java:496-562 (processPendingPipelineCleanup method)
  • PipelineCleanupRecord.java:54 (attemptCount field)

Related Context:

  • Definition: PipelineCleanupRecord.java:54
  • Update: CoordinatorService.java:514 - updated.setAttemptCount(record.getAttemptCount() + 1);
  • Usage: None (recorded but not used)

Problem Description:
The attemptCount field records the number of cleanup attempts, but nothing in the code uses it to limit retries. This means that if a Pipeline continuously fails to be cleaned up (for example, because a Worker node is permanently offline), it will remain in the cleanup queue indefinitely, retrying every 60 seconds, and will never be removed.

Potential Risks:

  • Risk 1: If a large number of Pipelines cannot be cleaned up (for example, after a large-scale Worker failure), pendingPipelineCleanupIMap will keep growing and occupy Hazelcast memory
  • Risk 2: Each cleanup attempt generates logs and Hazelcast operations, wasting resources
  • Risk 3: Operations personnel cannot distinguish between "temporary failures" and "permanent failures"

Impact Scope:

  • Direct Impact: CoordinatorService.cleanupPendingPipelines()
  • Indirect Impact: Hazelcast IMap memory usage
  • Affected Area: Core framework, all users using SeaTunnel

Severity: MAJOR

Improvement Suggestions:

// CoordinatorService.java
private static final int MAX_CLEANUP_ATTEMPTS = 10; // Maximum 10 attempts (10 minutes)

private void processPendingPipelineCleanup(
        PipelineLocation pipelineLocation, PipelineCleanupRecord record) {
    // ... existing logic ...
    
    // Add retry count check at the end of the method
    if (updated.getAttemptCount() >= MAX_CLEANUP_ATTEMPTS && !updated.isCleaned()) {
        logger.severe(
            String.format(
                "Pipeline %s cleanup failed after %d attempts. Giving up. " +
                "Metrics cleaned: %s, TaskGroups cleaned: %s/%s",
                pipelineLocation,
                updated.getAttemptCount(),
                updated.isMetricsImapCleaned(),
                updated.getCleanedTaskGroups().size(),
                updated.getTaskGroups() != null ? updated.getTaskGroups().size() : 0));
        
        // Delete record even if giving up, to avoid infinite accumulation
        pendingPipelineCleanupIMap.remove(pipelineLocation, record);
        return;
    }
    
    // ... existing logic ...
}

Rationale:

  • Prevent infinite retries
  • Release Hazelcast memory
  • Alert operations personnel that manual intervention is needed
  • 10-minute retry window is sufficient to handle temporary network fluctuations

Issue 2: Missing Record Expiration Time (TTL)

Location:

  • PipelineCleanupRecord.java:52-53 (createTimeMillis, lastAttemptTimeMillis)
  • Constant.java:62 (IMAP_PENDING_PIPELINE_CLEANUP definition)

Related Context:

  • Creation: JobMaster.java:929 - now (createTimeMillis)
  • Update: CoordinatorService.java:513 - updated.setLastAttemptTimeMillis(now)

Problem Description:
The cleanup records do not have TTL (Time To Live) set. Although records are deleted after successful cleanup, in certain abnormal situations (such as code bugs, Hazelcast failures), records may remain in the IMap forever. In addition, even if records have been in the queue for a long time (for example, several days), there is no forced expiration mechanism.

Potential Risks:

  • Risk 1: Orphaned records occupy Hazelcast memory
  • Risk 2: If Hazelcast IMap has no TTL configured, records will exist permanently
  • Risk 3: Long-running clusters may accumulate a large number of expired records

Impact Scope:

  • Direct Impact: IMAP_PENDING_PIPELINE_CLEANUP IMap
  • Indirect Impact: Hazelcast cluster memory usage
  • Affected Area: Core framework, production environment

Severity: MAJOR

Improvement Suggestions:

// Option 1: Set TTL in IMap configuration (recommended)
// Constant.java
public static final String IMAP_PENDING_PIPELINE_CLEANUP = "engine_pendingPipelineCleanup";

// In Hazelcast configuration (requires documentation or setting in initialization code):
config.getMapConfig("engine_pendingPipelineCleanup")
    .setMaxIdleSeconds(86400); // Expires after 24 hours of inactivity

// Option 2: Check expiration time in code
private static final long CLEANUP_RECORD_TTL_MILLIS = TimeUnit.HOURS.toMillis(24); // 24 hours

private void processPendingPipelineCleanup(
        PipelineLocation pipelineLocation, PipelineCleanupRecord record) {
    long now = System.currentTimeMillis();
    
    // Check if the record has expired
    if (now - record.getCreateTimeMillis() > CLEANUP_RECORD_TTL_MILLIS) {
        logger.warning(
            String.format(
                "Pipeline cleanup record %s expired after %d ms. Removing.",
                pipelineLocation,
                now - record.getCreateTimeMillis()));
        pendingPipelineCleanupIMap.remove(pipelineLocation, record);
        return;
    }
    
    // ... existing logic ...
}
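
If per-entry expiry is preferred, Hazelcast's TTL-aware IMap overloads could also be applied at enqueue time (a sketch only; the 24-hour value mirrors Option 2, and the enqueue variables are assumed from JobMaster):

// Option 3 (sketch): per-entry TTL when the record is first enqueued.
// IMap#putIfAbsent(key, value, ttl, timeUnit) lets Hazelcast evict the
// entry automatically once the TTL elapses, even if cleanup never succeeds.
PipelineCleanupRecord prev =
        pendingCleanupIMap.putIfAbsent(
                pipelineLocation, newRecord, 24, TimeUnit.HOURS);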

Rationale:

  • Prevent unlimited accumulation of records
  • Automatic cleanup even in abnormal situations
  • 24-hour TTL provides sufficient retry window

Issue 3: PipelineCleanupRecord's isCleaned() Method Has NPE Risk

Location: PipelineCleanupRecord.java:132-137

public boolean isCleaned() {
    return metricsImapCleaned
            && taskGroups != null
            && cleanedTaskGroups != null
            && cleanedTaskGroups.containsAll(taskGroups.keySet());
}

Related Context:

  • Caller: CoordinatorService.java:559 - if (updated.isCleaned())
  • Data Source: Deserialization (readData) or newly created

Problem Description:
Although taskGroups and cleanedTaskGroups are initialized to empty collections during construction, during deserialization (readData() method), if size is -1, they will be set to null:

// PipelineCleanupRecord.java:104-114
int taskGroupsSize = in.readInt();
if (taskGroupsSize >= 0) {
    taskGroups = new HashMap<>(taskGroupsSize);
    // ...
} else {
    taskGroups = null; // May be null
}

Although the current code only writes -1 when the collection is null in writeData(), this is a potential NPE risk point, especially during cross-version serialization/deserialization.

Potential Risks:

  • Risk 1: If the serialization logic is modified in the future, an NPE may be introduced
  • Risk 2: Errors may occur when interacting with records from old versions
  • Risk 3: Any call to taskGroups.keySet() that bypasses the null guard will throw an NPE when taskGroups is null

Impact Scope:

  • Direct Impact: PipelineCleanupRecord.isCleaned()
  • Caller: CoordinatorService.processPendingPipelineCleanup()
  • Affected Area: Core cleanup logic

Severity: MINOR (the current code will not trigger it, but defensive programming suggests fixing it)

Improvement Suggestions:

public boolean isCleaned() {
    if (!metricsImapCleaned) {
        return false;
    }
    if (taskGroups == null || taskGroups.isEmpty()) {
        // If there are no taskGroups, only check if metrics are cleaned up
        return metricsImapCleaned;
    }
    if (cleanedTaskGroups == null) {
        return false;
    }
    return cleanedTaskGroups.containsAll(taskGroups.keySet());
}

Or a more concise version:

public boolean isCleaned() {
    return metricsImapCleaned
            && (taskGroups == null || taskGroups.isEmpty()
                || (cleanedTaskGroups != null 
                    && cleanedTaskGroups.containsAll(taskGroups.keySet())));
}

Rationale:

  • Defensive programming, avoid NPE
  • Explicitly handle empty collection edge cases
  • Improve code robustness

Issue 4: shouldCleanup Logic Duplication

Location:

  • JobMaster.java:902-904
  • CoordinatorService.java:589-595

Problem Description:
The same cleanup condition check is duplicated in two classes:

// JobMaster.java:902-904
boolean shouldCleanup =
        PipelineStatus.CANCELED.equals(pipelineStatus)
                || (PipelineStatus.FINISHED.equals(pipelineStatus) && !savepointEnd);

// CoordinatorService.java:589-595
private boolean shouldCleanup(PipelineCleanupRecord record) {
    if (record == null || record.getFinalStatus() == null) {
        return false;
    }
    if (record.isSavepointEnd()) {
        return false;
    }
    return PipelineStatus.CANCELED.equals(record.getFinalStatus())
            || PipelineStatus.FINISHED.equals(record.getFinalStatus());
}

This violates the DRY (Don't Repeat Yourself) principle. If cleanup conditions need to be adjusted in the future, both places must be modified simultaneously.

Potential Risks:

  • Risk 1: Future modifications may miss one location, causing inconsistency
  • Risk 2: Increased code maintenance cost

Impact Scope:

  • Direct Impact: JobMaster.enqueuePipelineCleanupIfNeeded() and CoordinatorService.shouldCleanup()
  • Affected Area: Code maintainability

Severity: MINOR

Improvement Suggestions:

// Add static utility method in PipelineCleanupRecord
public static boolean shouldCleanup(
        PipelineStatus finalStatus, 
        boolean isSavepointEnd) {
    if (finalStatus == null) {
        return false;
    }
    if (isSavepointEnd) {
        return false;
    }
    return PipelineStatus.CANCELED.equals(finalStatus)
            || PipelineStatus.FINISHED.equals(finalStatus);
}

// Add instance method in PipelineCleanupRecord
public boolean shouldCleanup() {
    return shouldCleanup(this.finalStatus, this.savepointEnd);
}

// JobMaster.java usage
boolean shouldCleanup = PipelineCleanupRecord.shouldCleanup(
        pipelineStatus, 
        savepointEnd);

// CoordinatorService.java usage
private boolean shouldCleanup(PipelineCleanupRecord record) {
    return record != null && record.shouldCleanup();
}

Rationale:

  • Eliminate code duplication
  • Improve maintainability
  • Concentrate business logic in the data model

Issue 5: Cleanup Interval Hardcoded

Location: CoordinatorService.java:113

private static final int PIPELINE_CLEANUP_INTERVAL_SECONDS = 60;

Problem Description:
The cleanup interval is hardcoded to 60 seconds and cannot be adjusted according to actual scenarios. For production environments, 60 seconds may be too long (resource release delay) or too short (frequent cleanup tasks).

Potential Risks:

  • Risk 1: In resource-sensitive scenarios, a 60-second delay may cause memory pressure
  • Risk 2: Unable to dynamically adjust based on cluster scale

Impact Scope:

  • Direct Impact: CoordinatorService.pipelineCleanupScheduler scheduling frequency
  • Affected Area: Production environment tuning

Severity: MINOR

Improvement Suggestions:

// Option 1: Read from configuration file
private final int pipelineCleanupIntervalSeconds;

public CoordinatorService(
        @NonNull NodeEngineImpl nodeEngine,
        @NonNull SeaTunnelServer seaTunnelServer,
        EngineConfig engineConfig) {
    // ...
    this.pipelineCleanupIntervalSeconds = 
        engineConfig.getCoordinatorServiceConfig()
            .getPipelineCleanupIntervalSeconds() != null
            ? engineConfig.getCoordinatorServiceConfig().getPipelineCleanupIntervalSeconds()
            : 60; // Default 60 seconds
    
    pipelineCleanupScheduler.scheduleAtFixedRate(
            this::cleanupPendingPipelines,
            this.pipelineCleanupIntervalSeconds,
            this.pipelineCleanupIntervalSeconds,
            TimeUnit.SECONDS);
}

// Option 2: Use dynamic interval (exponential backoff)
// Calculate next cleanup time in PipelineCleanupRecord based on attemptCount
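
To make Option 2 concrete, a minimal backoff sketch (the isDue helper is hypothetical; getAttemptCount() and getLastAttemptTimeMillis() come from the record's existing fields):

private static final long BASE_BACKOFF_MILLIS = 60_000L;  // first retry after 60s
private static final long MAX_BACKOFF_MILLIS = 600_000L;  // cap the delay at 10 minutes

// Skip a record until its backoff window has elapsed; the delay doubles
// with each attempt: 60s, 120s, 240s, ... up to the cap.
private boolean isDue(PipelineCleanupRecord record, long now) {
    long delay = Math.min(
            MAX_BACKOFF_MILLIS,
            BASE_BACKOFF_MILLIS << Math.min(record.getAttemptCount(), 10));
    return now - record.getLastAttemptTimeMillis() >= delay;
}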

Rationale:

  • Improve flexibility
  • Support tuning for different scenarios
  • Leave room for future exponential backoff strategy

Issue 6: FAILED Status Pipelines Not Cleaned Up

Location:

  • JobMaster.java:902-904 (shouldCleanup logic)
  • SubPlan.java:242-288 (getPipelineEndState method)

Problem Description:
The logic of shouldCleanup only handles CANCELED and FINISHED statuses, and does not handle FAILED status:

boolean shouldCleanup =
        PipelineStatus.CANCELED.equals(pipelineStatus)
                || (PipelineStatus.FINISHED.equals(pipelineStatus) && !savepointEnd);

But from the definition of PipelineStatus.isEndState(), FAILED is also an end state:

// PipelineStatus.java:76-78
public boolean isEndState() {
    return this == FINISHED || this == CANCELED || this == FAILED;
}

This means FAILED Pipelines will not be added to the cleanup queue, and their resources (metrics and TaskGroupContext) may not be cleaned up.

Related Context:
Looking at the getPipelineEndState() method of SubPlan.java, Pipelines may enter FAILED status in the following situations:

  • Task execution failure (failedTaskNum > 0)
  • Checkpoint failure
  • Checkpoint failure during Cancel process

Potential Risks:

  • Risk 1: Metrics of FAILED status Pipelines will not be cleaned up (metricsImap leak)
  • Risk 2: TaskGroupContext of FAILED status Pipelines will not be cleaned up (Worker memory leak)
  • Risk 3: Frequently failing tasks will accelerate memory leaks

Impact Scope:

  • Direct Impact: Resource release for all failed tasks
  • Indirect Impact: Hazelcast IMap and Worker node memory
  • Affected Area: All users using SeaTunnel, especially frequent failure scenarios

Severity: CRITICAL (this is a serious omission)

Improvement Suggestions:

// JobMaster.java:902-904
boolean shouldCleanup =
        PipelineStatus.CANCELED.equals(pipelineStatus)
                || PipelineStatus.FAILED.equals(pipelineStatus)
                || (PipelineStatus.FINISHED.equals(pipelineStatus) && !savepointEnd);

// Or use isEndState() but exclude savepoint
boolean shouldCleanup = pipelineStatus.isEndState() && !savepointEnd;

// CoordinatorService.java:589-595
private boolean shouldCleanup(PipelineCleanupRecord record) {
    if (record == null || record.getFinalStatus() == null) {
        return false;
    }
    if (record.isSavepointEnd()) {
        return false;
    }
    // Modify to support all end states
    return record.getFinalStatus().isEndState();
}

Rationale:

  • FAILED status Pipelines also need to release resources
  • This may be a serious memory leak point
  • After fixing, test cases should be added to verify cleanup of FAILED status (see the sketch below)
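
A test-case sketch along those lines (assuming JUnit 5; the test name is hypothetical, while PipelineStatus.isEndState() is the existing method quoted above):

@Test
void endStateConditionShouldCoverFailedPipelines() {
    // FAILED is an end state, so the proposed condition enqueues it for cleanup.
    Assertions.assertTrue(PipelineStatus.FAILED.isEndState());
    // A FINISHED pipeline that ended with a savepoint must still be skipped.
    boolean savepointEnd = true;
    Assertions.assertFalse(PipelineStatus.FINISHED.isEndState() && !savepointEnd);
}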

Issue 7: Missing JavaDoc Documentation

Location: PipelineCleanupRecord.java:39-42

@Data
@NoArgsConstructor
@AllArgsConstructor
public class PipelineCleanupRecord implements IdentifiedDataSerializable {
    // No class-level JavaDoc
}

Problem Description:
The newly added PipelineCleanupRecord class lacks JavaDoc documentation, which makes it harder for other developers to understand its purpose and design intent.

Impact Scope:

  • Direct Impact: Code readability and maintainability
  • Affected Area: Future developers maintaining this code

Severity: MINOR

Improvement Suggestions:

/**
 * Record tracking the cleanup state of a finished pipeline.
 * 
 * <p>This record is persisted in Hazelcast IMap (IMAP_PENDING_PIPELINE_CLEANUP) 
 * and used by the background cleanup task in {@link CoordinatorService} 
 * to asynchronously release resources when the synchronous cleanup fails.
 * 
 * <p><b>Resources tracked:</b>
 * <ul>
 *   <li>{@link #metricsImapCleaned} - Metrics in {@code IMAP_RUNNING_JOB_METRICS}</li>
 *   <li>{@link #taskGroups} - TaskGroups with their worker addresses</li>
 *   <li>{@link #cleanedTaskGroups} - TaskGroups whose contexts have been cleaned</li>
 * </ul>
 * 
 * <p><b>Cleanup conditions:</b>
 * <ul>
 *   <li>CANCELED pipelines are always cleaned</li>
 *   <li>FINISHED pipelines are cleaned unless they ended with savepoint</li>
 *   <li>FAILED pipelines are cleaned (note: original code may have missed this)</li>
 * </ul>
 * 
 * <p><b>Lifecycle:</b>
 * <ol>
 *   <li>Created by {@link JobMaster#enqueuePipelineCleanupIfNeeded}</li>
 *   <li>Updated by {@link CoordinatorService#processPendingPipelineCleanup}</li>
 *   <li>Removed when {@link #isCleaned()} returns true</li>
 * </ol>
 * 
 * @see PipelineLocation
 * @see PipelineStatus
 * @see org.apache.seatunnel.engine.server.task.operation.CleanTaskGroupContextOperation
 */
@Data
@NoArgsConstructor
@AllArgsConstructor
public class PipelineCleanupRecord implements IdentifiedDataSerializable {
    // ...
}

Rationale:

  • Apache top-level projects require good documentation
  • Help other developers quickly understand design intent
  • Record key architectural decisions

@davidzollo
Contributor

Review comment on the diff context below:

&& checkpointManager != null
&& checkpointManager.isPipelineSavePointEnd(pipelineLocation);
boolean shouldCleanup =
        PipelineStatus.CANCELED.equals(pipelineStatus)

Should we add cleanup for failed pipelines here?

@davidzollo
Contributor

Good job.
Here is a review from claudecode. You can check it first.

Issue 1: Missing Retry Limit - Could Cause IMAP Memory Leak

Severity: High
File: CoordinatorService.java
Location: CoordinatorService.java:497

Problem Description:

  • The code tracks attemptCount but does not enforce a maximum retry limit
  • If a worker node permanently fails, TaskGroupContext cleanup will never succeed
  • Cleanup records will accumulate indefinitely in the IMAP, causing the IMAP itself to leak memory

Impact:

  • In a large-scale cluster with frequent job failures, thousands of cleanup records could accumulate
  • Each PipelineCleanupRecord occupies ~500 bytes
  • 10,000 records = ~5MB memory, multiplied by Hazelcast replication factor

Risk Scenario:

Day 1: 100 failed cleanup records
Day 7: 700 failed cleanup records
Day 30: 3,000 failed cleanup records
→ IMAP grows indefinitely, defeating the purpose of this PR

Recommended Fix:

// Add this constant to CoordinatorService
private static final int MAX_CLEANUP_ATTEMPTS = 100; // ~100 minutes of retries at 60s intervals

private void processPendingPipelineCleanup(
        PipelineLocation pipelineLocation, PipelineCleanupRecord record) {

    // Add this check at the beginning
    if (record.getAttemptCount() > MAX_CLEANUP_ATTEMPTS) {
        logger.warning(String.format(
            "Pipeline %s cleanup exceeded max attempts (%d), giving up and removing record. " +
            "Metrics cleaned: %s, TaskGroups cleaned: %d/%d",
            pipelineLocation, MAX_CLEANUP_ATTEMPTS,
            record.isMetricsImapCleaned(),
            record.getCleanedTaskGroups() != null ? record.getCleanedTaskGroups().size() : 0,
            record.getTaskGroups() != null ? record.getTaskGroups().size() : 0));
        removePendingCleanupRecord(pipelineLocation, record);
        return;
    }

    // Continue with existing cleanup logic...
}

Issue 2: Concurrent IMAP Traversal - Risk of ConcurrentModificationException

Severity: High
File: CoordinatorService.java
Location: CoordinatorService.java:477-480

Problem Description:

  • cleanupPendingPipelines() directly iterates over pendingCleanupIMap.entrySet()
  • If another thread (e.g., JobMaster) modifies the IMAP during iteration, ConcurrentModificationException will be thrown
  • This causes the entire cleanup task to fail, requiring a 60-second wait until the next retry

Current Code:

for (Map.Entry<PipelineLocation, PipelineCleanupRecord> entry :
        pendingCleanupIMap.entrySet()) {  // ← Unsafe: direct iteration
    processPendingPipelineCleanup(entry.getKey(), entry.getValue());
}

Impact:

  • Cleanup task crashes silently
  • Resources remain uncleaned until the next scheduled run
  • In high-throughput clusters, this could happen frequently

Risk Scenario:

Thread 1 (Cleanup): Iterating pendingCleanupIMap.entrySet()
Thread 2 (JobMaster): Adds new cleanup record to IMAP
→ Thread 1 throws ConcurrentModificationException
→ Cleanup fails for all pending pipelines
→ Must wait 60 seconds for next attempt

Recommended Fix:

private void cleanupPendingPipelines() {
    if (!isActive) {
        return;
    }

    IMap<PipelineLocation, PipelineCleanupRecord> pendingCleanupIMap =
        this.pendingPipelineCleanupIMap;
    if (pendingCleanupIMap == null || pendingCleanupIMap.isEmpty()) {
        return;
    }

    // Copy the key set first to avoid concurrent modification
    Set<PipelineLocation> keys;
    try {
        keys = new HashSet<>(pendingCleanupIMap.keySet());
    } catch (Exception e) {
        logger.warning("Failed to get pending cleanup keys: " + e.getMessage());
        return;
    }

    // Now iterate over the copied key set
    for (PipelineLocation key : keys) {
        try {
            PipelineCleanupRecord record = pendingCleanupIMap.get(key);
            if (record != null) {
                processPendingPipelineCleanup(key, record);
            }
        } catch (HazelcastInstanceNotActiveException e) {
            logger.warning("Skip cleanup: hazelcast not active");
            break;
        } catch (Throwable t) {
            logger.warning(String.format(
                "Failed to cleanup pipeline %s: %s", key, t.getMessage()), t);
            // Continue with next pipeline instead of crashing
        }
    }
}

Issue 3: Infinite Loop - Risk of Thread Starvation and CPU Spike

Severity: High
File: JobMaster.java
Location: JobMaster.java:934-951

Problem Description:

  • enqueuePipelineCleanupIfNeeded() uses while(true) for CAS operations without an exit condition
  • In extreme high-concurrency scenarios, multiple threads competing for the same pipeline could cause indefinite spinning
  • This consumes CPU resources and may block the calling thread indefinitely

Current Code:

while (true) {  // ← No exit condition!
    PipelineCleanupRecord existing = pendingCleanupIMap.get(pipelineLocation);
    if (existing == null) {
        PipelineCleanupRecord prev = pendingCleanupIMap.putIfAbsent(pipelineLocation, newRecord);
        if (prev == null) return;
        existing = prev;
    }
    PipelineCleanupRecord merged = existing.mergeFrom(newRecord);
    if (merged.equals(existing)) return;
    if (pendingCleanupIMap.replace(pipelineLocation, existing, merged)) {
        return;
    }
    // ← If replace fails, loop continues indefinitely
}

Impact:

  • Thread starvation: calling thread may spin indefinitely
  • CPU spike: 100% CPU usage on one core
  • System instability: may affect job submission and scheduling

Risk Scenario:

10 threads simultaneously try to enqueue cleanup for Pipeline X
Each thread:
  1. Reads current record
  2. Merges with new data
  3. Attempts CAS replace
  4. Fails due to another thread's update
  5. Retries indefinitely

Result: All 10 threads spinning, consuming 1000% CPU

Recommended Fix:

// Add this constant to JobMaster
private static final int MAX_ENQUEUE_RETRIES = 100;

public void enqueuePipelineCleanupIfNeeded(
        PipelineLocation pipelineLocation, PipelineStatus pipelineStatus) {

    // ... existing validation logic ...

    int retryCount = 0;
    while (retryCount < MAX_ENQUEUE_RETRIES) {
        PipelineCleanupRecord existing = pendingCleanupIMap.get(pipelineLocation);

        if (existing == null) {
            PipelineCleanupRecord prev = pendingCleanupIMap.putIfAbsent(pipelineLocation, newRecord);
            if (prev == null) {
                return; // Success
            }
            existing = prev;
        }

        PipelineCleanupRecord merged = existing.mergeFrom(newRecord);
        if (merged.equals(existing)) {
            return; // No changes needed
        }

        if (pendingCleanupIMap.replace(pipelineLocation, existing, merged)) {
            return; // Success
        }

        retryCount++;

        // Optional: Add backoff to reduce contention
        if (retryCount % 10 == 0) {
            try {
                Thread.sleep(1); // 1ms backoff every 10 retries
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                logger.warning("Enqueue cleanup interrupted for pipeline: " + pipelineLocation);
                return;
            }
        }
    }

    // Failed after max retries - log error but don't throw exception
    logger.error(String.format(
        "Failed to enqueue pipeline cleanup for %s after %d retries due to high contention. " +
        "Cleanup may be delayed until next scheduled run.",
        pipelineLocation, MAX_ENQUEUE_RETRIES));
}

@zhangshenghang
Member Author

Good job. Here is a review from claudecode. You can check it first. […]

Issue 1: A maximum retry count must not be set, as giving up can easily lead to data loss and program exceptions. Records must be removed through the normal cleanup logic.

Issue 2: It will not throw a ConcurrentModificationException. The result returned by IMap.entrySet() is a cloned snapshot, not a live view of the distributed data.

Issue 3: The enqueue is actually triggered at the end of SubPlan, so it is not a frequent call. For more stability, I will add a sleep between retries instead of setting a maximum number of attempts (see the sketch below).
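
A minimal sketch of that sleep-based approach (hypothetical; the actual change is in the PR commits, and the 50 ms pause is an assumed value):

while (true) {
    PipelineCleanupRecord existing = pendingCleanupIMap.get(pipelineLocation);
    if (existing == null) {
        PipelineCleanupRecord prev =
                pendingCleanupIMap.putIfAbsent(pipelineLocation, newRecord);
        if (prev == null) {
            return;
        }
        existing = prev;
    }
    PipelineCleanupRecord merged = existing.mergeFrom(newRecord);
    if (merged.equals(existing)) {
        return;
    }
    if (pendingCleanupIMap.replace(pipelineLocation, existing, merged)) {
        return;
    }
    try {
        // Back off briefly before re-reading, instead of capping attempts.
        Thread.sleep(50);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
    }
}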
