[SPARK-25451][SPARK-26100][CORE]Aggregated metrics table doesn't show the right number of the total tasks #23038
Conversation
Test build #98840 has finished for PR 23038 at commit
I'll update the PR for the test failures.
ed98958 to b7a47c2
retest this please
Test build #98849 has finished for PR 23038 at commit
Retest this please
Test build #98850 has finished for PR 23038 at commit
Test build #98856 has finished for PR 23038 at commit
retest this please
Test build #98863 has finished for PR 23038 at commit
class ExecutorStageSummary private[spark](
    val taskTime : Long,
    val activeTasks: Int,
You don't need to expose this in the public API to fix the bug, do you?
Thank you @vanzin for the review.
Actually, my objective is to detect the last task of a particular executorId in the stage: if the corresponding activeTasks == 0, then force an update in the KVStore.
Stages, jobs and executors all have "activeTasks", and using that field the code forces an update on the last task.
conditionalLiveUpdate(exec, now, exec.activeTasks == 0)
stage.activeTasks == 0 &&
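For reference, a minimal sketch of the bookkeeping described above (illustrative only; the class and method names are made up for the example and do not match the real AppStatusListener internals):

```scala
import scala.collection.mutable.HashMap

// Sketch: track how many tasks of a stage are still running on each executor, and
// signal when an executor's last task for the stage has finished so its summary
// can be force-written to the store instead of waiting for stage end.
class StageExecutorTaskTracker {
  private val activeTasksPerExecutor = new HashMap[String, Int]().withDefaultValue(0)

  def onTaskStart(executorId: String): Unit =
    activeTasksPerExecutor(executorId) += 1

  // Returns true when the finished task was that executor's last active task,
  // i.e. the caller should force-update the executor stage summary now.
  def onTaskEnd(executorId: String): Boolean = {
    activeTasksPerExecutor(executorId) -= 1
    activeTasksPerExecutor(executorId) == 0
  }
}
```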
You didn't answer my question.
Okay. I will try without exposing it in the public API.
Hi @vanzin, I have modified it based on your comment. Kindly review.
0d92185 to 805ebb8
Test build #98893 has finished for PR 23038 at commit

Test build #98895 has finished for PR 23038 at commit

Test build #98894 has finished for PR 23038 at commit
vanzin left a comment:
What you had before was fine and probably faster than doing multiple hash lookups like this.
You just did not need to change the public API at all.
Also, is there a unit test that can be written? IIRC the unit tests disable the conditional updates so it may be hard to add one.
val executorSummaries = new HashMap[String, LiveExecutorStageSummary]()

val activeTaskPerExecutor = new HashMap[String, Int]().withDefaultValue(0)
activeTasksPerExecutor
I will add one UT
Hi @vanzin, I have added a UT. Kindly review.
93181aa to 5b13b77
8109396 to 50cc762
| test("SPARK-25451: total tasks in the executor summary should match total stage tasks") { | ||
| val testConf = conf.clone() | ||
| .set("spark.ui.liveUpdate.period", s"${Int.MaxValue}s") |
Use the config constant, like the existing code.
Done.
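A sketch of what the suggestion amounts to, assuming the LIVE_ENTITY_UPDATE_PERIOD constant from org.apache.spark.internal.config.Status (the entry behind "spark.ui.liveUpdate.period"); `conf` is the suite's SparkConf as in the quoted lines above:

```scala
import org.apache.spark.internal.config.Status.LIVE_ENTITY_UPDATE_PERIOD

// Use the config constant instead of the raw string key; a very large period
// effectively disables the periodic live updates in the test. This relies on the
// test living in the Spark test suite, where conf is the suite's SparkConf.
val testConf = conf.clone().set(LIVE_ENTITY_UPDATE_PERIOD, Long.MaxValue)
```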
}

tasks.filter(_.index < 2).foreach { task =>
  time += 1
Whole block is indented too far.
Updated the code
listener.onJobEnd(SparkListenerJobEnd(1, time, JobFailed(new RuntimeException("Bad Executor"))))

tasks.filter(_.index >= 2).foreach { task =>
  time += 1
Same here.
Updated
val esummary = store.view(classOf[ExecutorStageSummaryWrapper]).asScala.map(_.info)
esummary.foreach {
  execSummary => assert(execSummary.failedTasks == 2)
Keep execSummary => on the previous line.
Done
conditionalLiveUpdate(esummary, now, removeStage)

val isLastTask = (stage.activeTasksPerExecutor(event.taskInfo.executorId) == 0) &&
  ((stage.status == v1.StageStatus.COMPLETE) || (stage.status == v1.StageStatus.FAILED))
Not sure why this extra condition is needed?
This issue occurs when the taskEnd event comes after stageEnd, because during the onStageCompleted event we write all the esummary entries to the store. So in onTaskEnd we only need to force a write if the stageCompleted event has already happened.
Yes, the stageCompleted check isn't really required, since here we only update on the last task of each executor. I updated the code.
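A tiny runnable illustration of the ordering case discussed here (a sketch with made-up names, not the AppStatusListener code): even if the stage-completed event is processed first, the per-executor counter alone identifies the straggling last task when its end event arrives.

```scala
import scala.collection.mutable.HashMap

// Sketch: executor "exec-1" has one task still running when the stage-completed
// event is processed. When the late task-end event arrives, the counter dropping
// to zero is enough to know this was the executor's last task, so its stage
// summary must be force-written; no stage-status check is needed.
object OutOfOrderTaskEndDemo extends App {
  val activeTasksPerExecutor = HashMap("exec-1" -> 1).withDefaultValue(0)

  // ... stage-completed event already handled at this point ...

  activeTasksPerExecutor("exec-1") -= 1    // the straggling task-end event
  val isLastTask = activeTasksPerExecutor("exec-1") == 0
  println(s"force-write executor summary for exec-1: $isLastTask")  // true
}
```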
Test build #98931 has finished for PR 23038 at commit
Test build #98935 has finished for PR 23038 at commit
Test build #98956 has finished for PR 23038 at commit
Hi @vanzin, in the History Server both the 'Jobs' table and the 'Aggregated Metrics' table show an incorrect number of total tasks. I have modified the code so that the issue does not happen in either the History or the live UI.
ad30c36 to dca941d
Test build #98968 has finished for PR 23038 at commit
It is a random failure.
Test build #98967 has finished for PR 23038 at commit
Test build #98966 has finished for PR 23038 at commit
val esummary = store.view(classOf[ExecutorStageSummaryWrapper]).asScala.map(_.info)
esummary.foreach { execSummary =>
  assert(execSummary.failedTasks == 2)
Nit: also check succeededTasks and killedTasks
Thanks @gengliangwang, I updated the test.
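A sketch of the extra assertions after the update, extending the quoted test fragment above. Only failedTasks == 2 is quoted in the review; the succeededTasks and killedTasks values shown here are assumptions about this test's scenario, where all tasks fail:

```scala
esummary.foreach { execSummary =>
  assert(execSummary.failedTasks == 2)
  // Assumed expected counts: no tasks succeed or get killed per executor here.
  assert(execSummary.succeededTasks == 0)
  assert(execSummary.killedTasks == 0)
}
```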
Test build #99012 has finished for PR 23038 at commit
It is a random failure. Jenkins, retest this please.
Test build #99025 has finished for PR 23038 at commit
@vanzin, could you please check the updated changes? Thanks.
vanzin left a comment:
Looks ok. conditionalLiveUpdate has a single call left after your changes, so it feels like it should be cleaned up, since apparently it has issues that might also affect the remaining call. Anyway, that can be a separate bug.
Merging to master (will fix the indentation before pushing).
// If the last task of the executor finished, then update the esummary
// for both live and history events.
if (isLastTask) {
  update(esummary, now)
Indentation is off.
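Putting the pieces from this review together, a sketch of how the tail of the task-end handling reads with the indentation fixed. Names follow the quoted diffs; the decrement line, the else branch, and the maybeUpdate helper are assumptions made for the illustration, not quoted from the PR:

```scala
// Per-executor bookkeeping discussed earlier in the review (assumed placement).
stage.activeTasksPerExecutor(event.taskInfo.executorId) -= 1
val isLastTask = stage.activeTasksPerExecutor(event.taskInfo.executorId) == 0

// If the last task of the executor finished, then update the esummary
// for both live and history events.
if (isLastTask) {
  update(esummary, now)
} else {
  maybeUpdate(esummary, now)  // assumed: throttled write for non-final tasks
}
```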
…w the right number of the total tasks
Total tasks in the aggregated table and the tasks table sometimes do not match in the Web UI.
We need to force-update the executor summary of a particular executorId whenever the last task of that executor finishes. Currently the force update happens only based on the last task at stage end, so tasks for some executorIds can be missed at stage end.
Tests to reproduce:
```
bin/spark-shell --master yarn --conf spark.executor.instances=3
sc.parallelize(1 to 10000, 10).map{ x => throw new RuntimeException("Bad executor")}.collect()
```
Before patch: (screenshot)
After patch: (screenshot)
Closes #23038 from shahidki31/SPARK-25451.
Authored-by: Shahid <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit fbf62b7)
Signed-off-by: Marcelo Vanzin <[email protected]>
(Also merged to 2.4.)

Thank you @vanzin

This seems to have broken the master build, probably some other change that happened since this was last tested. Will send a follow-up.

Oh, I could have rebased and tested locally. Thanks @vanzin for the fix.
…obs in the history server UI

The root cause of the problem is that whenever the taskEnd event comes after the stageCompleted event, execSummary is updated only for the live UI; we need to update it for the history UI too. For the previous discussion, refer to PR apache#23038 and https://issues.apache.org/jira/browse/SPARK-26100.

Added a UT. Manually verified.

Test step to reproduce:
```
bin/spark-shell --master yarn --conf spark.executor.instances=3
sc.parallelize(1 to 10000, 10).map{ x => throw new RuntimeException("Bad executor")}.collect()
```
Open the Executors page from the History UI.

Before patch: (screenshot)
After patch: (screenshot)

Closes apache#23181 from shahidki31/executorUpdate.
Authored-by: Shahid <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>


What changes were proposed in this pull request?
Total tasks in the aggregated table and the tasks table sometimes do not match in the Web UI.
We need to force-update the executor summary of a particular executorId whenever the last task of that executor finishes. Currently the force update happens only based on the last task at stage end, so tasks for some executorIds can be missed at stage end.
How was this patch tested?
Tests to reproduce:
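(Same reproduction snippet as quoted in the commit message above.)
```
bin/spark-shell --master yarn --conf spark.executor.instances=3
sc.parallelize(1 to 10000, 10).map{ x => throw new RuntimeException("Bad executor")}.collect()
```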
Before patch: (screenshot)
After patch: (screenshot)