[Bug] [seatunnel-engine-server] seatunnel will never perform a checkpoint again once a previous checkpoint fails #10442

@Sephiroth1024

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

When a checkpoint is triggered, a counter called pendingCounter is incremented by one at the end of CheckpointCoordinator#startTriggerPendingCheckpoint.

Image

But suppose the checkpoint fails, for example because of an operation timeout. The call path where this can happen is shown below.

CheckpointCoordinator#startTriggerPendingCheckpoint
-> CheckpointCoordinator#triggerCheckpoint
     -> CheckpointManager#sendOperationToMemberNode
          -> JobMaster#queryTaskGroupAddress (This accesses an IMap and can be slow, for example when the GetOperation is routed to a partition whose OperationQueue already holds many operations.)

After such a failure, no checkpoint is ever performed again: pendingCounter stays at 1, so the next time a checkpoint is attempted the condition in the red box evaluates to true, and it remains true from then on.

Image

So I was wondering if we need to reset pendingCounter to 0 when an exception occurs while performing a checkpoint.
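
To make the proposal concrete, here is a minimal, self-contained sketch of the behaviour. It is not SeaTunnel's actual code: the class PendingCounterSketch, the method tryTrigger, the resetOnFailure flag, and the exact shape of the guard (skip while pendingCounter is greater than 0) are all assumptions made for illustration, and the ordering of the increment is simplified.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the coordinator logic; not SeaTunnel's real implementation. */
final class PendingCounterSketch {
    private final AtomicInteger pendingCounter = new AtomicInteger(0);
    private final boolean resetOnFailure;

    PendingCounterSketch(boolean resetOnFailure) {
        this.resetOnFailure = resetOnFailure;
    }

    /** Returns true if the trigger got past the guard, false if it was skipped. */
    boolean tryTrigger(Runnable sendToMembers) {
        // Assumed shape of the guard: skip while an earlier checkpoint is still counted as pending.
        if (pendingCounter.get() > 0) {
            return false;
        }
        pendingCounter.incrementAndGet();
        try {
            // Stands in for the call path that may time out.
            sendToMembers.run();
            // Simplification: a successful checkpoint releases the counter right away.
            pendingCounter.decrementAndGet();
        } catch (RuntimeException e) {
            // Proposed change: release the counter on failure so later triggers are not blocked.
            if (resetOnFailure) {
                pendingCounter.set(0);
            }
        }
        return true;
    }

    int pending() {
        return pendingCounter.get();
    }

    public static void main(String[] args) {
        Runnable timeout = () -> {
            throw new IllegalStateException("simulated operation timeout");
        };
        Runnable ok = () -> { };

        PendingCounterSketch stuck = new PendingCounterSketch(false);
        stuck.tryTrigger(timeout);
        System.out.println("without reset, next trigger runs: " + stuck.tryTrigger(ok));

        PendingCounterSketch fixed = new PendingCounterSketch(true);
        fixed.tryTrigger(timeout);
        System.out.println("with reset, next trigger runs: " + fixed.tryTrigger(ok));
    }
}
```

In this sketch the instance created with resetOnFailure set to false skips every trigger after the first failure, while the instance that resets the counter accepts the next trigger. Whether the real fix should set the counter to 0 or decrement it depends on how many checkpoints may be pending concurrently, which this sketch does not model.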

SeaTunnel Version

2.3.12

SeaTunnel Config

//

Running Command

//

Error Exception

//

Zeta or Flink or Spark Version

No response

Java or Scala Version

No response

Screenshots

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
