
Conversation

@shahidki31
Contributor

@shahidki31 shahidki31 commented Nov 14, 2018

What changes were proposed in this pull request?

The total task counts in the aggregated metrics table and the tasks table sometimes do not match in the Web UI.
We need to force-update the executor summary for a given executorId whenever the last task of that executor finishes. Currently the force update happens only for the last task at stage end, so tasks of some executors can be missed at stage end.
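To illustrate the bookkeeping idea, here is a minimal, self-contained sketch (illustrative only; the object and helper names below are placeholders, not the actual AppStatusListener code):

```scala
import scala.collection.mutable

// Illustrative model only: track active tasks per executor for a stage and flush that
// executor's summary as soon as its last task finishes, instead of waiting for stage end.
object ExecutorSummaryFlushModel {
  private val activeTasksPerExecutor = new mutable.HashMap[String, Int]().withDefaultValue(0)

  def onTaskStart(execId: String): Unit = {
    activeTasksPerExecutor(execId) += 1
  }

  def onTaskEnd(execId: String): Unit = {
    activeTasksPerExecutor(execId) -= 1
    if (activeTasksPerExecutor(execId) == 0) {
      // In the real listener this would be a forced write of the executor's
      // stage summary to the KVStore, so the aggregated table stays in sync.
      flushExecutorSummary(execId)
    }
  }

  private def flushExecutorSummary(execId: String): Unit =
    println(s"force-updating stage summary for executor $execId")
}
```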

How was this patch tested?

Tests to reproduce:

bin/spark-shell --master yarn --conf spark.executor.instances=3
sc.parallelize(1 to 10000, 10).map{ x => throw new RuntimeException("Bad executor")}.collect() 

Before patch:
screenshot from 2018-11-15 02-24-05

After patch:
screenshot from 2018-11-15 02-32-38

@shahidki31
Contributor Author

cc @vanzin @srowen, kindly review.

@SparkQA

SparkQA commented Nov 15, 2018

Test build #98840 has finished for PR 23038 at commit ed98958.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shahidki31
Contributor Author

I'll update the PR to fix the test failures.

@shahidki31
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 15, 2018

Test build #98849 has finished for PR 23038 at commit b7a47c2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shahidki31
Contributor Author

Retest this please

@SparkQA

SparkQA commented Nov 15, 2018

Test build #98850 has finished for PR 23038 at commit c53ca48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2018

Test build #98856 has finished for PR 23038 at commit c53ca48.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shahidki31
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 15, 2018

Test build #98863 has finished for PR 23038 at commit c53ca48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


class ExecutorStageSummary private[spark](
    val taskTime : Long,
    val activeTasks: Int,
Contributor

You don't need to expose this in the public API to fix the bug, do you?

Contributor Author

Thank you @vanzin for the review.
Actually my objective is to detect the last task of a particular executorId within the stage: if the corresponding activeTasks == 0, then force-update the entry in the KVStore.

Stages, jobs, and executors already have an "activeTasks" field, and that parameter is used to force an update on the last task:

conditionalLiveUpdate(exec, now, exec.activeTasks == 0)
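For illustration only (not the actual patch), the analogous per-executor check in onTaskEnd might look roughly like this, assuming a per-stage activeTasksPerExecutor counter and an executorSummary lookup:

```scala
// Sketch of the idea, reusing the conditionalLiveUpdate pattern shown above for executors.
// stage.activeTasksPerExecutor and stage.executorSummary are assumed helpers here.
val execId = event.taskInfo.executorId
val esummary = stage.executorSummary(execId)
conditionalLiveUpdate(esummary, now, stage.activeTasksPerExecutor(execId) == 0)
```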

Contributor

You didn't answer my question.

Contributor Author

Okay. I will try without exposing it in the public API.

Contributor Author

Hi @vanzin, I have modified the code based on your comment. Kindly review.

@SparkQA

SparkQA commented Nov 16, 2018

Test build #98893 has finished for PR 23038 at commit 0d92185.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 16, 2018

Test build #98895 has finished for PR 23038 at commit 7c3a80b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 16, 2018

Test build #98894 has finished for PR 23038 at commit 805ebb8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@vanzin vanzin left a comment


What you had before was fine and probably faster than doing multiple hash lookups like this.

You just did not need to change the public API at all.

Also, is there a unit test that can be written? IIRC the unit tests disable the conditional updates so it may be hard to add one.


val executorSummaries = new HashMap[String, LiveExecutorStageSummary]()

val activeTaskPerExecutor = new HashMap[String, Int]().withDefaultValue(0)
Contributor

activeTasksPerExecutor

Contributor Author

I will add one UT

Contributor Author

Hi @vanzin, I have added a UT. Kindly review.


test("SPARK-25451: total tasks in the executor summary should match total stage tasks") {
val testConf = conf.clone()
.set("spark.ui.liveUpdate.period", s"${Int.MaxValue}s")
Contributor

Use the config constant, like the existing code.
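Something along these lines (the constant's exact name and import path are assumptions based on how the rest of the suite is written, and the value only needs to be large enough that live updates never fire on their own during the test):

```scala
// Hedged sketch: use the existing config constant instead of the raw string key.
// The constant name/location below is assumed, not verified against this branch.
import org.apache.spark.internal.config.Status.LIVE_ENTITY_UPDATE_PERIOD

val testConf = conf.clone().set(LIVE_ENTITY_UPDATE_PERIOD, Long.MaxValue)
```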

Contributor Author

Done.

}

tasks.filter(_.index < 2).foreach { task =>
time += 1
Contributor

Whole block is indented too far.

Contributor Author

Updated the code

listener.onJobEnd(SparkListenerJobEnd(1, time, JobFailed(new RuntimeException("Bad Executor"))))

tasks.filter(_.index >= 2).foreach { task =>
time += 1
Contributor

Same here.

Contributor Author

Updated


val esummary = store.view(classOf[ExecutorStageSummaryWrapper]).asScala.map(_.info)
esummary.foreach {
execSummary => assert(execSummary.failedTasks == 2)
Contributor

keep execSummary => in the previous line.

Contributor Author

Done

conditionalLiveUpdate(esummary, now, removeStage)

val isLastTask = (stage.activeTasksPerExecutor(event.taskInfo.executorId) == 0) &&
  ((stage.status == v1.StageStatus.COMPLETE) || (stage.status == v1.StageStatus.FAILED))
Contributor

Not sure why this extra condition is needed?

Contributor Author

@shahidki31 shahidki31 Nov 17, 2018

This issue occurs when the task-end event comes after the stage end: during the onStageCompleted event we already write all the executor summaries to the store, so in onTaskEnd we only need to force a write if the stage-completed event has already happened.

Yes, the stageCompleted check isn't really required, since here we only update on the last task of each executor. I have updated the code.
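For illustration, the simplified check then reduces to something like this (sketch only; names as used in the diff above):

```scala
// Sketch: the per-executor active-task count alone decides when to force the write.
// Whether the stage has already completed doesn't matter, because an executor's last
// task can finish either before or after the stage-end event.
val isLastTask = stage.activeTasksPerExecutor(event.taskInfo.executorId) == 0
```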

@SparkQA

SparkQA commented Nov 17, 2018

Test build #98931 has finished for PR 23038 at commit cbd885a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2018

Test build #98935 has finished for PR 23038 at commit 93181aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2018

Test build #98956 has finished for PR 23038 at commit ecac386.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shahidki31
Contributor Author

shahidki31 commented Nov 17, 2018

Hi @vanzin,
The same issue also happens in the history server (see https://issues.apache.org/jira/browse/SPARK-26100), and the current fix alone is not sufficient for that case.

In the history server, both the 'Jobs' table and the 'Aggregated Metrics' table show incorrect total task counts.
The reason is that, when the condition is satisfied, we force-update only the live store.

I have modified the code so that the issue no longer happens in either the history or the live UI.
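Roughly, the difference is to write the executor summary unconditionally on that executor's last task, instead of going only through the live-update path (sketch; helper names assumed from the surrounding listener code):

```scala
// Sketch: on the executor's last task for the stage, persist the summary with an
// unconditional update so it reaches the store both for the live UI and when the
// history server replays the event log; otherwise keep the rate-limited live update.
if (isLastTask) {
  update(esummary, now)
} else {
  maybeUpdate(esummary, now)
}
```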

@shahidki31
Contributor Author

App UI from History server after the patch:
screenshot from 2018-11-17 18-59-28

screenshot from 2018-11-17 18-59-45

@shahidki31 shahidki31 changed the title [SPARK-25451][CORE][WEBUI]Aggregated metrics table doesn't show the right number of the total tasks [SPARK-25451][SPARK-26100][CORE][WEBUI]Aggregated metrics table doesn't show the right number of the total tasks Nov 17, 2018
@shahidki31 shahidki31 changed the title [SPARK-25451][SPARK-26100][CORE][WEBUI]Aggregated metrics table doesn't show the right number of the total tasks [SPARK-25451][SPARK-26100][CORE]Aggregated metrics table doesn't show the right number of the total tasks Nov 17, 2018
@SparkQA

SparkQA commented Nov 17, 2018

Test build #98968 has finished for PR 23038 at commit a21bc0c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shahidki31
Contributor Author

It is a random failure.

@SparkQA

SparkQA commented Nov 17, 2018

Test build #98967 has finished for PR 23038 at commit dca941d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2018

Test build #98966 has finished for PR 23038 at commit ad30c36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val esummary = store.view(classOf[ExecutorStageSummaryWrapper]).asScala.map(_.info)
esummary.foreach { execSummary =>
assert(execSummary.failedTasks == 2)
Member

Nit: also check succeededTasks and killedTasks
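For illustration (the exact expected counts depend on how the test ends each task, so the values below are placeholders):

```scala
// Sketch: assert all three per-executor counters, not just the failure count.
// The expected values are placeholders and must match the test's task outcomes.
esummary.foreach { execSummary =>
  assert(execSummary.failedTasks === 2)
  assert(execSummary.succeededTasks === 0)
  assert(execSummary.killedTasks === 0)
}
```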

Contributor Author

Thanks @gengliangwang , I updated the test.

@SparkQA

SparkQA commented Nov 19, 2018

Test build #99012 has finished for PR 23038 at commit ed85016.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shahidki31
Contributor Author

It is a random failure. Jenkins, retest this please

@SparkQA

SparkQA commented Nov 20, 2018

Test build #99025 has finished for PR 23038 at commit ed85016.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shahidki31
Contributor Author

@vanzin, could you please check the updated changes? Thanks.

Contributor

@vanzin vanzin left a comment

Looks ok. conditionalLiveUpdate has a single call after your changes, so it feels like it should be cleaned up since apparently it has issues that might also affect the remaining call. Anyway, that can be a separate bug.

Merging to master (will fix the indentation before pushing).

// If the last task of the executor finished, then update the esummary
// for both live and history events.
if (isLastTask) {
update(esummary, now)
Contributor

indentation is off

asfgit pushed a commit that referenced this pull request Nov 26, 2018
…w the right number of the total tasks

The total task counts in the aggregated metrics table and the tasks table sometimes do not match in the Web UI.
We need to force-update the executor summary for a given executorId whenever the last task of that executor finishes. Currently the force update happens only for the last task at stage end, so tasks of some executors can be missed at stage end.

Tests to reproduce:
```
bin/spark-shell --master yarn --conf spark.executor.instances=3
sc.parallelize(1 to 10000, 10).map{ x => throw new RuntimeException("Bad executor")}.collect()
```
Before patch:
![screenshot from 2018-11-15 02-24-05](https://user-images.githubusercontent.com/23054875/48511776-b0d36480-e87d-11e8-89a8-ab97216e2c21.png)

After patch:
![screenshot from 2018-11-15 02-32-38](https://user-images.githubusercontent.com/23054875/48512141-c39a6900-e87e-11e8-8535-903e1d11d13e.png)

Closes #23038 from shahidki31/SPARK-25451.

Authored-by: Shahid <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit fbf62b7)
Signed-off-by: Marcelo Vanzin <[email protected]>
@vanzin
Contributor

vanzin commented Nov 26, 2018

(Also merged to 2.4.)

@asfgit asfgit closed this in fbf62b7 Nov 26, 2018
@shahidki31
Contributor Author

Thank you @vanzin

@vanzin
Contributor

vanzin commented Nov 26, 2018

This seems to have broken the master build, probably because of some other change that happened since this was last tested. Will send a follow-up.

@shahidki31
Contributor Author

Oh, I could have rebased and tested locally. Thanks @vanzin for the fix.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…obs in the history server UI

The root cause of the problem is that whenever a taskEnd event comes after the stageCompleted event, the execSummary is updated only for the live UI; we need to update it for the history UI too.

To see the previous discussion, refer to PR apache#23038 and https://issues.apache.org/jira/browse/SPARK-26100.

Added a UT. Manually verified.

Test step to reproduce:

```
bin/spark-shell --master yarn --conf spark.executor.instances=3
sc.parallelize(1 to 10000, 10).map{ x => throw new RuntimeException("Bad executor")}.collect()
```

Open Executors page from the History UI

Before patch:
![screenshot from 2018-11-29 22-13-34](https://user-images.githubusercontent.com/23054875/49246338-a21ead00-f43a-11e8-8214-f1020420be52.png)

After patch:
![screenshot from 2018-11-30 00-54-49](https://user-images.githubusercontent.com/23054875/49246353-aa76e800-f43a-11e8-98ef-7faecaa7a50e.png)

Closes apache#23181 from shahidki31/executorUpdate.

Authored-by: Shahid <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019