[SPARK-32428] [EXAMPLES] Make BinaryClassificationMetricsExample cons… #29222
…istently print the metrics on driver's stdout
|
The change looks reasonable to me. I checked the examples: most of them use RDD.collect().foreach, but a few use RDD.foreach. For example, in I think we probably also want to change these so that all the examples print their results on the driver's stdout. |
|
cc @srowen |
srowen
left a comment
Yep let's fix all such occurrences if possible. Thanks!
How I did it:
+ 1. Replace all occurrences of `foreach` with `collect.foreach`, skipping `foreachRDD` and lines that already call `collect`:
```
$ find examples/src/ -type f | xargs grep foreach | grep -v foreachRDD | grep -P -v "(collect.foreach|collect\(\).foreach)" | awk '{ print $1 }' | sed -e 's/:$//' | uniq | grep scala | xargs sed -i -e 's/foreach/collect.foreach/g'
```
+ 2. For each file, check whether the modification was correct by running `mvn compile`, and call `git checkout --` on the file if it was not:
```
$ mvn compile | grep Error | awk '{ print $3 }' | perl -plne 's/:(\d+):$//' | xargs -i git checkout -- {}
```
+ 3. Manually call `git checkout --` if the modification seems superfluous:
We excluded AccumulatorMetricsTest.scala and ExceptionHandlingTest.scala from the change.
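The selection logic of step 1 can be tried on a scratch directory before running it against a real checkout. The following is a minimal sketch, not the command used in this PR: the file names and their contents are hypothetical, `-name '*.scala'` stands in for the `grep scala` filter, `grep -E` stands in for `grep -P`, and the rewrite goes through a temporary file instead of `sed -i` so that it behaves the same with GNU and BSD tools.

```shell
set -eu

# Scratch tree with hypothetical files; nothing here touches a real checkout.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/examples/src"
cd "$work"

# A.scala: bare foreach, should be selected and rewritten.
printf 'precision.foreach { case (t, p) => println(p) }\n' > examples/src/A.scala
# B.scala: already collects, should be left alone.
printf 'recall.collect().foreach(println)\n' > examples/src/B.scala
# C.scala: foreachRDD, should be left alone.
printf 'stream.foreachRDD(rdd => rdd.count())\n' > examples/src/C.scala
# D.py: not a Scala file, should be left alone.
printf 'rdd.foreach(print)\n' > examples/src/D.py

# Select Scala files with a foreach line that is neither foreachRDD
# nor already preceded by collect / collect().
# /dev/null forces grep to prefix every match with its file name.
targets=$(find examples/src -type f -name '*.scala' \
  | xargs grep foreach /dev/null \
  | grep -v foreachRDD \
  | grep -E -v 'collect(\(\))?\.foreach' \
  | cut -d: -f1 | sort -u)

# Rewrite every foreach in the selected files (same substitution as step 1).
for f in $targets; do
  sed 's/foreach/collect.foreach/g' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

echo "rewritten: $targets"
cat examples/src/A.scala
```

In this sketch only `A.scala` is selected. The substitution itself is indiscriminate within a selected file, which is why the compile check in step 2 and the manual review in step 3 are still needed.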
|
@huaxingao @srowen |
|
+1 |
|
Jenkins test this please |
|
Test build #126548 has finished for PR 29222 at commit
|
…istently print the metrics on driver's stdout ### What changes were proposed in this pull request? Call collect on RDD before calling foreach so that it sends the result to the driver node and print it on this node's stdout. ### Why are the changes needed? Some RDDs in this example (e.g., precision, recall) call println without calling collect. If the job is under local mode, it sends the data to the driver node and prints the metrics on the driver's stdout. However if the job is under cluster mode, the job prints the metrics on the executor's stdout. It seems inconsistent compared to the other metrics nothing to do with RDD (e.g., auPRC, auROC) since these metrics always output the result on the driver's stdout. All of the metrics should output its result on the driver's stdout. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is example code. It doesn't have any tests. Closes #29222 from titsuki/SPARK-32428. Authored-by: Itsuki Toyota <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit 86ead04) Signed-off-by: Sean Owen <[email protected]>
|
Merged to master/3.0/2.4 |
…istently print the metrics on driver's stdout
What changes were proposed in this pull request?
Call collect on each RDD before calling foreach, so that the results are sent to the driver node and printed on that node's stdout.
Why are the changes needed?
Some RDDs in this example (e.g., precision, recall) are printed with println inside foreach, without calling collect first.
If the job runs in local mode, the executor shares a JVM with the driver, so the metrics appear on the driver's stdout.
However, if the job runs in cluster mode, the metrics are printed on the executors' stdout.
This is inconsistent with the metrics that are not RDDs (e.g., auPRC, auROC), which are always printed on the driver's stdout.
All of the metrics should be printed on the driver's stdout.
Does this PR introduce any user-facing change?
No
How was this patch tested?
This is example code. It doesn't have any tests.