Search before asking
What happened
When we perform a checkpoint, a counter called pendingCounter is incremented by one at the end of CheckpointCoordinator#startTriggerPendingCheckpoint.
However, triggering the checkpoint can fail, for example with an operation timeout. The call stack is shown below:
CheckpointCoordinator#startTriggerPendingCheckpoint
-> CheckpointCoordinator#triggerCheckpoint
-> CheckpointManager#sendOperationToMemberNode
-> JobMaster#queryTaskGroupAddress (this accesses an IMap and can be slow, for example when the GetOperation is routed to a partition whose OperationQueue already holds many operations)
After such a failure, no further checkpoint is ever performed: pendingCounter stays at 1, so the next time a checkpoint is attempted, the condition in the red box evaluates to true, and it remains true from then on.
So I was wondering if we need to reset pendingCounter to 0 when an exception occurs while performing a checkpoint.
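To make the proposal concrete, here is a minimal, self-contained sketch of the behavior. It is not the actual SeaTunnel code: the class PendingCounterSketch, the method tryTrigger, and the resetOnFailure flag are hypothetical, and the gate condition pendingCounter.get() > 0 is only my assumption about what the check in the red box looks like.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the coordinator; not the real CheckpointCoordinator. */
class PendingCounterSketch {
    private final AtomicInteger pendingCounter = new AtomicInteger();

    /**
     * Tries to run one checkpoint trigger.
     * Returns true only if the trigger ran and completed without an exception.
     */
    boolean tryTrigger(Runnable trigger, boolean resetOnFailure) {
        // Assumed gate: skip while a previous checkpoint is still counted as pending.
        if (pendingCounter.get() > 0) {
            return false;
        }
        pendingCounter.incrementAndGet();
        try {
            trigger.run(); // may throw, e.g. on an operation timeout
            pendingCounter.decrementAndGet();
            return true;
        } catch (RuntimeException e) {
            // Proposed fix: clear the counter so later checkpoints are not blocked.
            if (resetOnFailure) {
                pendingCounter.set(0);
            }
            return false;
        }
    }

    int pending() {
        return pendingCounter.get();
    }
}
```

With resetOnFailure set to false, a single failing trigger leaves pending() at 1 and every later call to tryTrigger returns false, which matches what we observe. With resetOnFailure set to true, the next call can proceed normally.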
SeaTunnel Version
2.3.12
SeaTunnel Config
Running Command
Error Exception
Zeta or Flink or Spark Version
No response
Java or Scala Version
No response
Screenshots
No response
Are you willing to submit PR?
Code of Conduct