Skip to content

Conversation

@titsuki
Copy link
Contributor

@titsuki titsuki commented Jul 24, 2020

…istently print the metrics on driver's stdout

What changes were proposed in this pull request?

Call collect on RDD before calling foreach so that it sends the result to the driver node and print it on this node's stdout.

Why are the changes needed?

Some RDDs in this example (e.g., precision, recall) call println without calling collect.
If the job is under local mode, it sends the data to the driver node and prints the metrics on the driver's stdout.
However if the job is under cluster mode, the job prints the metrics on the executor's stdout.
It seems inconsistent compared to the other metrics nothing to do with RDD (e.g., auPRC, auROC) since these metrics always output the result on the driver's stdout.
All of the metrics should output its result on the driver's stdout.

Does this PR introduce any user-facing change?

No

How was this patch tested?

This is example code. It doesn't have any tests.

…istently print the metrics on driver's stdout
@huaxingao
Copy link
Contributor

The change looks reasonable to me. I checked the examples, most of the examples have RDD.collect().foreach, but a few of the examples have RDD.foreach. For example, in ElementwiseProductExample

    println("transformedData: ")
    transformedData.foreach(x => println(x))

    println("transformedData2: ")
    transformedData2.foreach(x => println(x))

I think we probably also want to change these to make all the examples to output the result on the driver's stdout.

@huaxingao
Copy link
Contributor

cc @srowen

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep let's fix all such occurrences if possible. Thanks!

How to did it:

+ 1. Replace all occurences of `.collect` with `foreach.collect`:

```
$ find examples/src/ -type f | xargs grep foreach | grep -v foreachRDD | grep -P -v "(collect.foreach|collect\(\).foreach)" | awk '{ print $1 }' | sed -e 's/:$//' | uniq | grep scala | xargs sed -i -e 's/foreach/collect.foreach/g'
```

+ 2. For each file, check if the modification was correct or not by `mvn compile` and call `checkout --` if the modification was incorrect:

```
$ mvn compile | grep Error | awk '{ print $3 }' | perl -plne 's/:(\d+):$//' | xargs -i git checkout -- {}
```

+ 3. Manually call `checkout --` if the modification seems superfluous:

We removed AccumulatorMetricsTest.scala and ExceptionHandlingTest.scala from the target.
@titsuki
Copy link
Contributor Author

titsuki commented Jul 25, 2020

@huaxingao @srowen
Thanks for your reviews! I've also fixed the other examples.

@HyukjinKwon
Copy link
Member

+1

@srowen
Copy link
Member

srowen commented Jul 25, 2020

Jenkins test this please

@SparkQA
Copy link

SparkQA commented Jul 25, 2020

Test build #126548 has finished for PR 29222 at commit 77a00dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen closed this in 86ead04 Jul 26, 2020
srowen pushed a commit that referenced this pull request Jul 26, 2020
…istently print the metrics on driver's stdout

### What changes were proposed in this pull request?

Call collect on RDD before calling foreach so that it sends the result to the driver node and print it on this node's stdout.

### Why are the changes needed?

Some RDDs in this example (e.g., precision, recall) call println without calling collect.
If the job is under local mode, it sends the data to the driver node and prints the metrics on the driver's stdout.
However if the job is under cluster mode, the job prints the metrics on the executor's stdout.
It seems inconsistent compared to the other metrics nothing to do with RDD (e.g., auPRC, auROC) since these metrics always output the result on the driver's stdout.
All of the metrics should output its result on the driver's stdout.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is example code. It doesn't have any tests.

Closes #29222 from titsuki/SPARK-32428.

Authored-by: Itsuki Toyota <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 86ead04)
Signed-off-by: Sean Owen <[email protected]>
srowen pushed a commit that referenced this pull request Jul 26, 2020
…istently print the metrics on driver's stdout

### What changes were proposed in this pull request?

Call collect on RDD before calling foreach so that it sends the result to the driver node and print it on this node's stdout.

### Why are the changes needed?

Some RDDs in this example (e.g., precision, recall) call println without calling collect.
If the job is under local mode, it sends the data to the driver node and prints the metrics on the driver's stdout.
However if the job is under cluster mode, the job prints the metrics on the executor's stdout.
It seems inconsistent compared to the other metrics nothing to do with RDD (e.g., auPRC, auROC) since these metrics always output the result on the driver's stdout.
All of the metrics should output its result on the driver's stdout.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is example code. It doesn't have any tests.

Closes #29222 from titsuki/SPARK-32428.

Authored-by: Itsuki Toyota <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 86ead04)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Copy link
Member

srowen commented Jul 26, 2020

Merged to master/3.0/2.4

@titsuki titsuki deleted the SPARK-32428 branch July 26, 2020 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants