[FLINK-22266] Fix stop-with-savepoint operation in AdaptiveScheduler #15884

rmetzger · 2021-05-10T15:13:53Z

Brief change log

When the creation of the savepoint fails, we go back into the Executing state (because the job is still running). However, this doesn't work, because the Executing state deploys the job when it is entered.
In this change, we move the job deployment out of the Executing state and assert in that state that the job is running.
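
To illustrate the problem described above, here is a minimal toy model (not Flink code; `ToyJob`, `enterExecutingAndDeploy`, and `enterExecutingExpectRunning` are hypothetical names, and it assumes that a second deploy call is rejected). It contrasts a state that deploys on entry with one that only asserts the job is already running:

```java
/** Toy model, not Flink code: every name in here is hypothetical. */
final class ReentrySketch {

    /** Stand-in for a job whose tasks may only be deployed once (an assumption). */
    static final class ToyJob {
        private boolean deployed;

        void deploy() {
            if (deployed) {
                throw new IllegalStateException("job is already deployed");
            }
            deployed = true;
        }

        boolean isDeployed() {
            return deployed;
        }
    }

    /** Old shape: entering the state always deploys. */
    static void enterExecutingAndDeploy(ToyJob job) {
        job.deploy();
    }

    /** Shape described in this change: deployment happened earlier, the state only asserts it. */
    static void enterExecutingExpectRunning(ToyJob job) {
        if (!job.isDeployed()) {
            throw new IllegalStateException("expected a running job");
        }
    }

    /** Returns true if re-entering the state after a failed savepoint fails. */
    static boolean reentryFails(boolean deployOnEnter) {
        ToyJob job = new ToyJob();
        job.deploy(); // initial deployment, the job is now running
        try {
            // The savepoint failed, the job still runs, so we go back to the executing state.
            if (deployOnEnter) {
                enterExecutingAndDeploy(job);
            } else {
                enterExecutingExpectRunning(job);
            }
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }
}
```

In this toy model, `ReentrySketch.reentryFails(true)` reports a failure on re-entry, while `ReentrySketch.reentryFails(false)` does not.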

I didn't choose the option of restarting the job, as this is potentially an expensive operation for the user.

Once #15882 has been merged, I'll adjust this PR.

flinkbot · 2021-05-10T15:17:15Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit ddabee5 (Sat Aug 28 12:10:46 UTC 2021)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2021-05-10T15:43:56Z

CI report:

8f0b36f Azure: CANCELED
ddabee5 Azure: PENDING

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

tillrohrmann

Thanks for creating this PR @rmetzger. I think your solution does not work because it breaks with the contract that only StateWithExecutionGraph states can process updateTaskExecutionState messages. Concretely, with this change I think that we will ignore deployment failures.

Independent of this, we seem to be lacking test coverage for deployment failures on the unit test level as far as I can tell.

tillrohrmann · 2021-05-11T13:39:03Z

...untime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/CreatingExecutionGraph.java

+    private void handleDeploymentFailure(ExecutionVertex executionVertex, JobException e) {
+        executionVertex.markFailed(e);
+    }


ExecutionVertex.markFailed will trigger an updateTaskExecutionState which will only be processed by a StateWithExecutionGraph. Hence, I think this will now simply be ignored.
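
The concern above can be pictured with a small toy model of type-based message dispatch. This is only a sketch under assumptions: `State`, `WithGraph`, `CreatingGraph`, `Running`, and `deliver` are hypothetical stand-ins, not the adaptive scheduler's actual classes or signatures.

```java
import java.util.Optional;
import java.util.function.Function;

/** Toy model of type-based message dispatch; names are hypothetical, not Flink's API. */
final class DispatchSketch {

    interface State {
        /** Runs the action only if this state is of the requested type. */
        default <T, V> Optional<V> tryCall(Class<T> type, Function<T, V> action) {
            return type.isInstance(this)
                    ? Optional.of(action.apply(type.cast(this)))
                    : Optional.empty();
        }
    }

    /** Only states that own an execution graph can handle task state updates. */
    interface WithGraph extends State {
        boolean updateTaskState(String update);
    }

    /** Stand-in for a state without an execution graph. */
    static final class CreatingGraph implements State {}

    /** Stand-in for a state with an execution graph. */
    static final class Running implements WithGraph {
        @Override
        public boolean updateTaskState(String update) {
            return true; // the update is seen and can be handled
        }
    }

    /** Scheduler-side entry point: updates sent to any other state are dropped. */
    static boolean deliver(State current, String update) {
        return current.tryCall(WithGraph.class, s -> s.updateTaskState(update)).orElse(false);
    }
}
```

Here `deliver(new DispatchSketch.CreatingGraph(), "FAILED")` returns false, which is the "simply ignored" case: a failure raised while the scheduler is in a state without an execution graph never reaches any handler.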

tillrohrmann · 2021-05-11T13:42:52Z

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StopWithSavepoint.java

+                // creating the savepoint has failed but job is still running
+                Preconditions.checkState(getExecutionGraph().getState() == JobStatus.RUNNING);


Why did you introduce this checkState? Wouldn't this be caught by the Executing state?

tillrohrmann · 2021-05-11T13:43:24Z

...me/src/test/java/org/apache/flink/runtime/scheduler/adaptive/CreatingExecutionGraphTest.java

                    CreatingExecutionGraph.ExecutionGraphWithVertexParallelism.create(
                            executionGraph, new TestingVertexParallelism()));
+
+            assertThat(mockExecutionJobVertex.isExecutionDeployed(), is(true));


There is no test which ensures the proper behavior if deploying fails in state CreatingExecutionGraph.

rmetzger · 2021-05-12T18:48:32Z

Thanks a lot for your feedback. I agree with your findings, and I've reworked the change accordingly.

Independent of this, we seem to be lacking test coverage for deployment failures on the unit test level as far as I can tell.

I added a test for this.

rmetzger · 2021-05-13T15:21:50Z

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Executing.java

+        } else {
+            throw new IllegalStateException(
+                    "Unexpected executing state behavior " + executingStateBehavior);
+        }


Having slept on this for a night, I'm not so sure anymore that this is the right approach. We can probably always assume the execution graph to be in state RUNNING, and on Behavior.EXPECT_RUNNING we can go through all ExecutionVertex instances and check that their state is RUNNING. I'll try to look into this by Monday morning at the latest.

I changed it.

tillrohrmann

Thanks for updating this PR @rmetzger. I think the fix looks good. I had one comment concerning a possible simplification and another comment concerning the test coverage of the failed deploy call. Please take a look.

tillrohrmann · 2021-05-18T09:11:36Z

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Executing.java

+        if (executingStateBehavior == Behavior.DEPLOY_ON_ENTER) {
+            onAllExecutionVertexes(this::deploySafely);
+        } else if (executingStateBehavior == Behavior.EXPECT_RUNNING) {
+            onAllExecutionVertexes(this::expectRunning);
+        } else {
+            throw new IllegalStateException(
+                    "Unexpected executing state behavior " + executingStateBehavior);
+        }


Couldn't we say that we deploy all not currently running Executions when entering this state?

Indeed, that's a good simplification. I pushed a commit addressing this item.

tillrohrmann · 2021-05-18T10:49:54Z

flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/ExecutingTest.java

+                    ((FailOnDeployMockExecutionVertex) mejv.getMockExecutionVertex())
+                            .getMarkedFailure(),
+                    is(instanceOf(JobException.class)));


Do we have a test which ensures that Execution.markFailed will result in the proper exception handling in the Executing state?

The error handling of markFailed is difficult to test, because so many components are involved. But in my opinion, we have good test coverage:

markFailed will (through the DefaultExecutionGraph) notify the InternalFailuresListener about the task failure. The UpdateSchedulerNgOnInternalFailuresListener implementation used by the adaptive scheduler will call updateTaskExecutionState on the scheduler. This chain of calls is exercised, for example, by the failure in the AdaptiveSchedulerITCase.testGlobalFailoverCanRecoverState() test.

For the Executing state, we have tests verifying that exceptions during deployment lead to a markFailed call (testExecutionVertexMarkedAsFailedOnDeploymentFailure), and that failures reported via updateTaskExecutionState lead to appropriate error handling (testFailureReportedViaUpdateTaskExecutionStateCausesFailingOnNoRestart, testFailureReportedViaUpdateTaskExecutionStateCausesRestart, testFalseReportsViaUpdateTaskExecutionStateAreIgnored).

Adding a test that a markFailed call will notify the InternalFailuresListener is out of the scope of the ExecutingTest (because it would be testing the ExecutionVertex and Execution classes).
Adding a test that a markFailed call will call updateTaskExecutionState would need to go through a test-specific InternalFailuresListener. Since all the relevant calls on ExecutingState are already covered, this would only test the test-specific InternalFailuresListener.
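
For readers following the call chain described in this reply (markFailed, then the failure listener, then updateTaskExecutionState on the scheduler), a toy sketch may help. All types below (`FailureListener`, `ToyScheduler`, `ToyVertex`) are hypothetical simplifications; the real Flink interfaces take different arguments.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the failure notification chain; not Flink's actual classes. */
final class FailureChainSketch {

    /** Stand-in for a listener that is told about internal task failures. */
    interface FailureListener {
        void notifyTaskFailure(String taskId, Throwable cause);
    }

    /** Stand-in for the scheduler that receives task state updates. */
    static final class ToyScheduler {
        final List<String> failedTasks = new ArrayList<>();

        boolean updateTaskExecutionState(String taskId, String newState) {
            if ("FAILED".equals(newState)) {
                failedTasks.add(taskId);
                return true;
            }
            return false;
        }
    }

    /** Stand-in for a vertex whose markFailed call notifies the listener. */
    static final class ToyVertex {
        private final String id;
        private final FailureListener listener;

        ToyVertex(String id, FailureListener listener) {
            this.id = id;
            this.listener = listener;
        }

        void markFailed(Throwable cause) {
            listener.notifyTaskFailure(id, cause);
        }
    }

    /** Wires the chain together and returns the task ids the scheduler saw as failed. */
    static List<String> run() {
        ToyScheduler scheduler = new ToyScheduler();
        // The listener forwards every failure to the scheduler as a state update.
        FailureListener listener =
                (taskId, cause) -> scheduler.updateTaskExecutionState(taskId, "FAILED");
        new ToyVertex("task-1", listener).markFailed(new RuntimeException("deploy failed"));
        return scheduler.failedTasks;
    }
}
```

In this sketch, the only piece that a dedicated test in the state's own test class would exercise beyond the already covered calls is the lambda standing in for the listener, which matches the argument made above.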

Ok, sounds good. Thanks for all the details.

tillrohrmann

Thanks for updating this PR @rmetzger. LGTM. +1 for merging after resolving my last comment.

tillrohrmann · 2021-05-20T10:04:31Z

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Executing.java

+                if (executionVertex.getExecutionState() != ExecutionState.RUNNING) {
+                    deploySafely(executionVertex);
+                }


Should we say that we call deploy if the ExecutionState is CREATED or SCHEDULED?
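
To make the difference between the two conditions concrete, here is a small sketch. The `ToyState` enum is a hypothetical, reduced stand-in for the real ExecutionState values, and the two predicates are illustrations of the check in the diff and of the suggested alternative, not code from the PR.

```java
import java.util.EnumSet;
import java.util.Set;

/** Toy comparison of the two deploy conditions; the enum is a reduced, hypothetical stand-in. */
final class DeployFilterSketch {

    enum ToyState { CREATED, SCHEDULED, DEPLOYING, RUNNING, FINISHED, FAILED }

    private static final Set<ToyState> DEPLOYABLE =
            EnumSet.of(ToyState.CREATED, ToyState.SCHEDULED);

    /** Condition from the diff: deploy everything that is not RUNNING. */
    static boolean deployIfNotRunning(ToyState state) {
        return state != ToyState.RUNNING;
    }

    /** Suggested condition: deploy only what has not been deployed yet. */
    static boolean deployIfNotYetDeployed(ToyState state) {
        return DEPLOYABLE.contains(state);
    }

    /** States for which the two conditions give different answers. */
    static Set<ToyState> disagreements() {
        Set<ToyState> result = EnumSet.noneOf(ToyState.class);
        for (ToyState state : ToyState.values()) {
            if (deployIfNotRunning(state) != deployIfNotYetDeployed(state)) {
                result.add(state);
            }
        }
        return result;
    }
}
```

In this toy model the two conditions disagree on DEPLOYING, FINISHED, and FAILED: the broader check would try to deploy tasks in those states again, while the narrower one leaves them alone.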

rmetzger · 2021-05-20T10:08:45Z

Thanks a lot for your review. I'll address your last comment & merge the change.

This closes #15884

rmetzger added the review=description? label May 10, 2021

rmetzger added component=Runtime/Coordination component=Runtime/Checkpointing labels May 10, 2021

tillrohrmann requested changes May 11, 2021


rmetzger force-pushed the FLINK-22266-stopwithsavepoint branch 2 times, most recently from 7b64aa1 to 4c6fcf4 Compare May 12, 2021 18:47

rmetzger commented May 13, 2021


rmetzger force-pushed the FLINK-22266-stopwithsavepoint branch from 4c6fcf4 to 94a973c Compare May 17, 2021 12:03

tillrohrmann reviewed May 18, 2021


rmetzger force-pushed the FLINK-22266-stopwithsavepoint branch from 94a973c to 8f0b36f Compare May 20, 2021 07:07

tillrohrmann approved these changes May 20, 2021


[FLINK-22266] Fix stop-with-savepoint operation in AdaptiveScheduler

ddabee5

rmetzger force-pushed the FLINK-22266-stopwithsavepoint branch from 8f0b36f to ddabee5 Compare May 20, 2021 10:39

rmetzger closed this in 1106542 May 20, 2021

rmetzger added a commit that referenced this pull request May 20, 2021

[FLINK-22266] Fix stop-with-savepoint operation in AdaptiveScheduler

040bd81

This closes #15884

