
[Fix][Zeta] make the job failed when triggering checkpoint fails (apache#10442)#10448

Open
Sephiroth1024 wants to merge 10 commits into apache:dev from Sephiroth1024:fix-checkpoint-error

Conversation

@Sephiroth1024

Purpose of this pull request

Fix #10442

Does this PR introduce any user-facing change?

How was this patch tested?

Add unit test CheckpointBarrierTriggerErrorTest#testCheckpointBarrierTriggerError

Check list

@github-actions github-actions bot added the Zeta label Feb 4, 2026
@dybyte
Collaborator

dybyte commented Feb 4, 2026

Please enable CI following the instructions.

@DanielCarter-stack

Issue 1: InterruptedException Handling Does Not Follow Best Practices

Location: CheckpointCoordinator.java:685-690

} catch (InterruptedException e) {
    handleCoordinatorError(
            "triggering checkpoint barrier has been interrupted",
            e,
            CheckpointCloseReason.CHECKPOINT_INSIDE_ERROR);
    throw new RuntimeException(e);  // ⚠️ Problem here
}

Problem Description:
The established Java concurrency practice is to restore the interrupt status (Thread.currentThread().interrupt()) after catching InterruptedException rather than only wrapping it in a RuntimeException. The current approach means:

  1. Callers cannot distinguish between a genuine interruption and a business exception
  2. The interrupt status is lost, so upper-level code cannot respond to the interruption properly
  3. It violates standard Java concurrency guidance

Related Context:

  • There is no comparable InterruptedException handling elsewhere in this class to use as a reference
  • Hazelcast's InvocationFuture.get() will throw InterruptedException
  • Upper-level caller: startTriggerPendingCheckpoint() executes asynchronously in executorService

Potential Risks:

  • Risk 1: If the thread pool is being shut down, the lost interrupt status can prevent a graceful shutdown
  • Risk 2: May lead to "phantom interruption" issues (the exception is thrown but the interrupt status is not restored)

Scope of Impact:

  • Direct impact: startTriggerPendingCheckpoint() method
  • Indirect impact: Thread pool management of the entire checkpoint coordinator
  • Impact area: Core framework (all jobs using checkpoint)

Severity: MAJOR

Improvement Suggestions:

} catch (InterruptedException e) {
    handleCoordinatorError(
            "triggering checkpoint barrier has been interrupted",
            e,
            CheckpointCloseReason.CHECKPOINT_INSIDE_ERROR);
    Thread.currentThread().interrupt();  // Restore interrupted state
    throw new RuntimeException(e);
}
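
For illustration only, here is a minimal standalone sketch (plain JDK code, not SeaTunnel classes; the InterruptDemo name is made up) of why the restored flag matters during shutdown: a task that restores the interrupt exits its loop and lets the pool terminate, whereas swallowing it would keep the task spinning after shutdownNow().

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo, not SeaTunnel code: restoring the interrupt flag lets a
// pool shut down gracefully via shutdownNow().
public class InterruptDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    TimeUnit.MILLISECONDS.sleep(100);  // stand-in for a blocking call
                } catch (InterruptedException e) {
                    // Restoring the flag lets the loop condition observe the interrupt;
                    // swallowing it would keep this task running after shutdownNow().
                    Thread.currentThread().interrupt();
                }
            }
        });
        pool.shutdownNow();  // interrupts the running task
        System.out.println("terminated cleanly: " + pool.awaitTermination(1, TimeUnit.SECONDS));
    }
}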

Rationale:

  1. Follows Java concurrent programming best practices (see "Java Concurrency in Practice")
  2. Ensures interrupt status is not lost
  3. Allows upper-level code to properly handle interruption
  4. Consistent with the handling pattern in TaskCallTimer.java:123-125

Issue 2: InterruptedException Branch May Cause Duplicate Error Handling

Location: CheckpointCoordinator.java:685-690

} catch (InterruptedException e) {
    handleCoordinatorError(...);  // First processing
    throw new RuntimeException(e);  // Throw exception
}

Problem Description:
In the InterruptedException branch, handleCoordinatorError() is called first (which sets the status to FAILED and cleans up resources), and then a RuntimeException is thrown. This leads to:

  1. The exception is caught by whenCompleteAsync() in startTriggerPendingCheckpoint() (line 645)
  2. This may trigger a second handleCoordinatorError() call (line 648)
  3. Although an isDone() check prevents duplicate processing, the code logic is unclear

Related Context:

// Lines 645-651
completableFuture.whenCompleteAsync(
    (completedCheckpoint, error) -> {
        if (error != null) {
            handleCoordinatorError(  // Possibly second call
                    "trigger checkpoint failed",
                    error,
                    CheckpointCloseReason.CHECKPOINT_INSIDE_ERROR);

Potential Risks:

  • Risk 1: The control flow is confusing and difficult for maintainers to follow
  • Risk 2: If the isDone() check is ever bypassed, resources may be cleaned up twice
  • Risk 3: Error messages may be inconsistent ("interrupted" vs "trigger checkpoint failed")

Scope of Impact:

  • Direct impact: Exception handling chain of startTriggerPendingCheckpoint() method
  • Indirect impact: Correctness of Checkpoint state machine
  • Impact area: Core framework

Severity: MINOR

Improvement Suggestions:

} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    // Do not call handleCoordinatorError in the catch block;
    // let the exception propagate to whenCompleteAsync for unified handling
    // (note: rethrowing requires the enclosing method to declare InterruptedException)
    throw e;
} catch (Exception e) {
    handleCoordinatorError(
            "triggering checkpoint barrier failed",
            e,
            CheckpointCloseReason.CHECKPOINT_INSIDE_ERROR);
    return;
}

Or:

} catch (InterruptedException e) {
    handleCoordinatorError(
            "triggering checkpoint barrier has been interrupted",
            e,
            CheckpointCloseReason.CHECKPOINT_INSIDE_ERROR);
    Thread.currentThread().interrupt();
    return;  // Do not throw exception
} catch (Exception e) {
    handleCoordinatorError(
            "triggering checkpoint barrier failed",
            e,
            CheckpointCloseReason.CHECKPOINT_INSIDE_ERROR);
    return;
}

Rationale:

  1. Avoids duplicate error handling
  2. Makes the control flow clearer
  3. Keeps error messages consistent
  4. Reduces the dependency on the isDone() check

Issue 3: Direct Failure for Transient Network Faults May Be Too Aggressive

Location: CheckpointCoordinator.java:691-696

} catch (Exception e) {
    handleCoordinatorError(
            "triggering checkpoint barrier failed",
            e,
            CheckpointCloseReason.CHECKPOINT_INSIDE_ERROR);
    return;  // Fail immediately, no retry
}

Problem Description:
The current implementation causes the job to fail immediately upon any checkpoint barrier trigger failure. However, in distributed systems, network jitter or transient failures are common:

  1. May be a temporary network partition that recovers in a few seconds
  2. May be a brief timeout caused by GC
  3. May be a Hazelcast transient failure

Direct failure leads to:

  • Users need to manually restart the job
  • May lose data being processed
  • Reduces system availability

Related Context:

  • Checkpoint configuration has checkpoint.timeout (default 60 seconds)
  • Checkpoint configuration has no retry-related configuration
  • Other checkpoint errors (such as timeout) also have no retry mechanism

Potential Risks:

  • Risk 1: Reduces availability in production environments
  • Risk 2: Users may mistakenly think this is system instability
  • Risk 3: Inconsistent with the semantics of checkpoint.timeout (a timeout waits; this fails immediately)

Scope of Impact:

  • Direct impact: All streaming jobs with checkpoint enabled
  • Indirect impact: Operations complexity in production environments
  • Impact area: All users using Zeta engine

Severity: MINOR (but may be MAJOR if production environment network is unstable)

Improvement Suggestions:
This is a design trade-off issue with several possible approaches:

Option A: Add retry mechanism (recommended for long-term improvement)

// Add in CheckpointConfig
private int checkpointBarrierTriggerRetryTimes = 3;  // Retry 3 times by default

// Add retry logic in triggerCheckpoint
public InvocationFuture<?>[] triggerCheckpoint(CheckpointBarrier checkpointBarrier) {
    int retry = 0;
    while (true) {
        try {
            return plan.getStartingSubtasks().stream()
                .map(taskLocation -> new CheckpointBarrierTriggerOperation(checkpointBarrier, taskLocation))
                .map(checkpointManager::sendOperationToMemberNode)
                .toArray(InvocationFuture[]::new);
        } catch (Exception e) {
            retry++;
            if (retry >= coordinatorConfig.getCheckpointBarrierTriggerRetryTimes()) {
                throw e;
            }
            LOG.warn("Retry triggering checkpoint barrier, attempt {}", retry);
            try {
                Thread.sleep(1000L * retry);  // Linear backoff between attempts
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(ie);
            }
        }
    }
}
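
Side note, not part of the reviewed diff: if true exponential backoff is preferred over the linear delay above, a hypothetical helper (name made up) could compute the sleep like this:

// Hypothetical helper: exponential backoff of 1s, 2s, 4s, ... capped at 30 seconds
static long backoffMillis(int attempt) {
    int exponent = Math.max(0, Math.min(attempt - 1, 5));
    return Math.min(30_000L, 1000L * (1L << exponent));
}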

Option B: Configure whether to fail immediately

// Add in CheckpointConfig
private boolean failImmediatelyOnBarrierTriggerError = false;  // Do not fail immediately by default

// Modify catch block
} catch (Exception e) {
    if (coordinatorConfig.isFailImmediatelyOnBarrierTriggerError()) {
        handleCoordinatorError(...);
        return;
    } else {
        LOG.error("triggering checkpoint barrier failed, but job will continue", e);
        return;  // Only return, do not call handleCoordinatorError
    }
}

Option C: Keep current implementation (most conservative)

// No modifications, but document in notes:
// "If checkpoint barrier trigger fails, the job will fail immediately.
//  This is to ensure data consistency. Consider increasing checkpoint timeout
//  if you encounter transient network issues."

Rationale:

  1. Improve system availability
  2. Consistent with best practices of other distributed systems (e.g., Flink's checkpoint retry mechanism)
  3. Give users more control
  4. Balance data consistency and availability

Note: This is a larger improvement, recommended for a follow-up PR. The current PR can be merged first to fix the "silent failure" bug.


Issue 4: Error Messages Not Specific Enough, Hard to Diagnose Problems

Location: CheckpointCoordinator.java:686-687, 693-694

handleCoordinatorError(
        "triggering checkpoint barrier has been interrupted",  // ⚠️ Missing key information
        e,
        CheckpointCloseReason.CHECKPOINT_INSIDE_ERROR);

Problem Description:
Current error messages lack key diagnostic information, such as:

  1. What is the Checkpoint ID?
  2. Which Task node failed?
  3. How many Tasks succeeded/failed?
  4. What is the specific reason for failure?

This forces users and operators to dig through full stack traces and logs when troubleshooting, which reduces diagnosability.

Related Context:

  • pendingCheckpoint.getInfo() can get detailed checkpoint information
  • plan.getStartingSubtasks() can get all Task information
  • InvocationFuture can get execution result of each Task

Potential Risks:

  • Risk 1: Increases troubleshooting time
  • Risk 2: Users may not be able to locate the problem quickly
  • Risk 3: Monitoring alerts cannot surface useful information

Scope of Impact:

  • Direct impact: Quality of error logs
  • Indirect impact: Troubleshooting efficiency
  • Impact area: All users using checkpoint

Severity: MINOR

Improvement Suggestions:

} catch (InterruptedException e) {
    handleCoordinatorError(
            String.format(
                    "Triggering checkpoint barrier %s has been interrupted. Pending tasks: %d",
                    pendingCheckpoint.getInfo(),
                    plan.getStartingSubtasks().size()),
            e,
            CheckpointCloseReason.CHECKPOINT_INSIDE_ERROR);
    Thread.currentThread().interrupt();
    throw new RuntimeException(e);
} catch (Exception e) {
    handleCoordinatorError(
            String.format(
                    "Failed to trigger checkpoint barrier %s. Checkpoint type: %s, Pending tasks: %d",
                    pendingCheckpoint.getInfo(),
                    pendingCheckpoint.getCheckpointType(),
                    plan.getStartingSubtasks().size()),
            e,
            CheckpointCloseReason.CHECKPOINT_INSIDE_ERROR);
    return;
}
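
If per-task detail is also wanted, a generic sketch of summarizing task outcomes for the message could look like the following (this uses plain CompletableFuture and a made-up helper class, since the exact InvocationFuture API is not shown in this PR):

import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

// Hypothetical helper, not existing SeaTunnel code: counts how many task futures
// completed, failed, or are still pending, for inclusion in error logs.
final class TaskFutureSummary {
    static String summarize(CompletableFuture<?>[] futures) {
        long failed = Arrays.stream(futures)
                .filter(CompletableFuture::isCompletedExceptionally)
                .count();
        long succeeded = Arrays.stream(futures)
                .filter(f -> f.isDone() && !f.isCompletedExceptionally())
                .count();
        return String.format("tasks: %d succeeded, %d failed, %d pending",
                succeeded, failed, futures.length - succeeded - failed);
    }
}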

Rationale:

  1. Provides more detailed context information
  2. Makes it easier to locate problems quickly
  3. Consistent with the style of other error messages (see lines 649, 657)
  4. Does not affect performance (strings are only constructed on the exception path)

Issue 6: Test Code Uses Incorrect jobId, Test Cannot Properly Validate

Location: CheckpointBarrierTriggerErrorTest.java:33-34, 40

long jobId = System.currentTimeMillis();  // Line 33: Define variable
startJob(System.currentTimeMillis(), CONF_PATH);  // Line 34: Pass new timestamp!

// ...

Assertions.assertEquals(
        server.getCoordinatorService().getJobStatus(jobId),  // Line 40: Use jobId variable
        JobStatus.RUNNING);

Problem Description:
The test defines the jobId variable at line 33, but the startJob() call at line 34 passes a fresh System.currentTimeMillis(). This leads to:

  1. The jobId that is actually submitted differs from the value stored in the jobId variable
  2. Lines 40 and 54 query the job status with jobId, so they query a non-existent job
  3. The test can never pass (it will time out)

Confidence: 100% (this is a clear bug)

Related Context:

  • The startJob() method submits the job with the jobId that is passed in
  • getJobStatus(jobId) queries status of the specified jobId
  • Other tests (such as CheckpointErrorRestoreEndTest.java:42) correctly use the same jobId

Potential Risks:

  • Risk 1: The test cannot pass, so CI will fail
  • Risk 2: If the test were to pass spuriously, it would give the team a false sense of security
  • Risk 3: Wastes CI resources and developer time

Scope of Impact:

  • Direct impact: CheckpointBarrierTriggerErrorTest test
  • Indirect impact: CI/CD process
  • Impact area: Single test case

Severity: BLOCKER (test must be fixed before merging)

Improvement Suggestions:

@Test
public void testCheckpointBarrierTriggerError() throws NoSuchFieldException, IllegalAccessException {
    long jobId = System.currentTimeMillis();
    startJob(jobId, CONF_PATH);  // Fix: Use jobId variable instead of getting timestamp again

    await().atMost(120000, TimeUnit.MILLISECONDS)
            .untilAsserted(
                    () ->
                            Assertions.assertEquals(
                                    server.getCoordinatorService().getJobStatus(jobId),
                                    JobStatus.RUNNING));

    // ... rest of code unchanged
}

Rationale:

  1. Fixes an obvious bug so the test can run correctly
  2. Consistent with the practice of other tests
  3. Ensures the test actually validates the fix in this PR

Issue 8: Checkpoint Configuration in Test Configuration File May Cause Test Instability

Location: stream_fake_to_console_checkpoint_barrier_trigger_error.conf:24-25

checkpoint.interval = 1000  # 1 second
checkpoint.timeout = 60000  # 60 seconds

Problem Description:
checkpoint.interval = 1000 (1 second) in the test configuration means:

  1. After the job starts, checkpoints trigger frequently (once per second)
  2. The test makes the first trigger fail via Mockito
  3. But checkpoints continue to trigger afterward
  4. If the job does not fail promptly, multiple checkpoints may be in flight

This may lead to:

  • Test instability (timing issues)
  • Test timeouts (360 seconds may not be enough)
  • Multiple threads operating simultaneously, making the behavior hard to predict

Related Context:

  • Other tests (such as CheckpointErrorRestoreEndTest) use longer intervals
  • Test timeout is set to 360 seconds (6 minutes), already quite long

Potential Risks:

  • Risk 1: The test may be flaky (sometimes passes, sometimes fails)
  • Risk 2: Extends CI time
  • Risk 3: May mask other concurrency bugs

Scope of Impact:

  • Direct impact: CheckpointBarrierTriggerErrorTest test
  • Indirect impact: CI/CD process
  • Impact area: Single test case

Severity: MINOR

Improvement Suggestions:

env {
  parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 10000  # Change: increase to 10 seconds to reduce trigger frequency
  checkpoint.timeout = 60000
}

Rationale:

  1. Reduce test timing sensitivity
  2. Reduce unnecessary checkpoint triggers
  3. Improve test stability
  4. Shorten test execution time

@Sephiroth1024
Author

Please enable CI following the instructions.

@dybyte thx for reminding me

@dybyte
Collaborator

dybyte commented Feb 5, 2026

Option C: Keep current implementation (most conservative)

Note: This is a larger improvement, recommended for a follow-up PR. The current PR can be merged first to fix the "silent failure" bug.

+1. There is already an open PR for the retry mechanism here: #10223

@Sephiroth1024
Author

+1. There is already an open PR for the retry mechanism here: #10223

got it

so do i need to close this PR or just keep it open and wait to be reviewed

@dybyte
Collaborator

dybyte commented Feb 5, 2026

so do i need to close this PR or just keep it open and wait to be reviewed

The suggestion was to keep this PR open and proceed with it as a bug fix. The retry/tolerance improvement can be handled separately in a follow-up PR.

@Sephiroth1024
Author

The suggestion was to keep this PR open and proceed with it as a bug fix. The retry/tolerance improvement can be handled separately in a follow-up PR.

ok, got it, ty
and could you help review this PR

@Sephiroth1024 Sephiroth1024 requested a review from dybyte February 5, 2026 13:24
@Sephiroth1024 Sephiroth1024 requested a review from dybyte February 6, 2026 08:44
@dybyte
Collaborator

dybyte commented Feb 6, 2026

Could you please retry the CI?

Collaborator

@dybyte dybyte left a comment

LGTM. Thanks @Sephiroth1024



Successfully merging this pull request may close these issues.

[Bug] [seatunnel-engine-server] seatunnel will never perform a checkpoint again once a previous checkpoint fails
