[SPARK-25710][SQL] range should report metrics correctly #22698

cloud-fan · 2018-10-11T14:57:43Z

What changes were proposed in this pull request?

Currently, Range reports metrics at batch granularity. This is acceptable, but it would be better to report them at row granularity if that can be done without a performance penalty.

Before this PR, the metrics were updated when preparing the batch, which is before the data is actually consumed. In this PR, the metrics are updated after the data is consumed. There are two cases:

1. The data processing loop has a stop check. The metrics are updated when we need to stop.
2. The loop has no stop check. The metrics are updated after the loop.
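
The two cases above can be sketched in plain Java. This is a minimal illustration, not the code Spark actually generates: `RangeMetricsSketch`, `RowSink`, `Metric`, `stopAfter`, and `processBatch` are hypothetical names introduced only for this example.

```java
// Hypothetical sketch: all names here are illustrative, not Spark's generated code.
final class RangeMetricsSketch {
    /** Stand-in for whatever consumes the rows produced by the range loop. */
    interface RowSink {
        void consume(long value);
        boolean shouldStop();
    }

    /** Stand-in for the operator's output-row metric. */
    static final class Metric {
        private long value;
        void add(long n) { value += n; }
        long get() { return value; }
    }

    /** Helper for the example: a sink that asks to stop once it has seen n rows. */
    static RowSink stopAfter(final long n) {
        return new RowSink() {
            private long seen = 0;
            public void consume(long value) { seen++; }
            public boolean shouldStop() { return seen >= n; }
        };
    }

    /** Processes one batch and returns the number of rows actually consumed. */
    static long processBatch(long start, long batchSize, RowSink sink,
                             boolean hasStopCheck, Metric numOutput) {
        for (long localIdx = 0; localIdx < batchSize; localIdx++) {
            sink.consume(start + localIdx);
            if (hasStopCheck && sink.shouldStop()) {
                // Case 1: the loop is interrupted, so count only the consumed rows.
                long consumed = localIdx + 1;
                numOutput.add(consumed);
                return consumed;
            }
        }
        // Case 2 (or the stop check never fired): update once, after the loop.
        numOutput.add(batchSize);
        return batchSize;
    }
}
```

In both paths the metric is touched once per call, after rows have been consumed, so no per-row metric update is added to the hot loop.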

How was this patch tested?

Existing tests and a new benchmark.

cloud-fan · 2018-10-11T15:00:20Z

sql/core/benchmarks/RangeBenchmark-results.txt

+Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.13.6
+Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
+
+range:                                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative


I also ran the benchmark on the commit before this PR; the results are almost the same.

cloud-fan · 2018-10-11T15:01:51Z

sql/core/benchmarks/RangeBenchmark-results.txt

+limit after range                               33 /   37      15900.2           0.1     384.4X
+filter after range                             969 /  985        541.0           1.8      13.1X
+count after range                               42 /   42      12510.5           0.1     302.4X
+count after limit after range                   32 /   33      16337.0           0.1     394.9X


Several learnings:

Limit does help.

Performance is bad if we interrupt the data processing loop too often. A full scan is the worst case, because we interrupt the loop for every record.

cloud-fan · 2018-10-11T15:02:08Z

cc @kiszk @viirya @mgaido91

mgaido91 · 2018-10-11T15:03:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala

+      |     $stopCheck
      |   }
      |   $nextIndex = $batchEnd;
+      |   $numOutput.add($localEnd);


This means the metrics are now updated only once the whole batch has been processed, right?

Partially yes, when there is no stop check.

If there is a stop check, we return directly and won't hit this code path.

Also, in that case we update the metrics after processing the rows, right?

I am just wondering whether we could keep updating the metrics as before, but have shouldStop() "remove" the rows that were not processed. This would let the metrics be updated as early as before, but it could also cause the metrics to decrease, which is not expected. Not sure which is worse.

> whether we could keep updating the metrics as before, but have shouldStop() "remove" the rows that were not processed.

Is it to keep the code diff small? Otherwise I think it's always better to update the metrics only once, instead of add-then-remove.
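
To make the trade-off in this thread concrete, here is a small Java sketch that records every value an observer of the metric could see under each strategy. It is an illustration only: `MetricUpdateOrderSketch`, `ObservedMetric`, `addThenRemove`, and `updateOnce` are hypothetical names, and whether Spark's metric type should ever be decremented is exactly the open question above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names are illustrative and not part of Spark.
final class MetricUpdateOrderSketch {
    /** A metric that remembers each intermediate value it has held. */
    static final class ObservedMetric {
        private long value;
        final List<Long> history = new ArrayList<>();
        void add(long delta) {
            value += delta;
            history.add(value);
        }
    }

    /** Add the whole batch up front, then subtract the unprocessed rows on stop. */
    static List<Long> addThenRemove(long batchSize, long consumedBeforeStop) {
        ObservedMetric m = new ObservedMetric();
        m.add(batchSize);
        if (consumedBeforeStop < batchSize) {
            m.add(-(batchSize - consumedBeforeStop));
        }
        return m.history;
    }

    /** Update once, after the rows have been consumed. */
    static List<Long> updateOnce(long batchSize, long consumedBeforeStop) {
        ObservedMetric m = new ObservedMetric();
        m.add(Math.min(batchSize, consumedBeforeStop));
        return m.history;
    }
}
```

With a batch of 1000 rows and a stop after 3 rows, the add-then-remove history is 1000 followed by 3, so the metric visibly decreases. The update-once history contains only 3.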

viirya · 2018-10-11T16:11:23Z

sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala

+      |     $stopCheck
      |   }
      |   $nextIndex = $batchEnd;
+      |   $numOutput.add($localEnd);


I imagine a case: range + limit + a blocking op + ...

Now, at the range there is no stopCheck, right?

Assume a range batch is 1000 rows. Because there is no stopCheck, the loop over localIdx will run to the end. Although the limit passes only n rows into the blocking op, we still add localEnd to numOutput here.

I haven't actually tested it, so I'm not sure whether it is really a problem. Since it is late, I may check it further tomorrow if it hasn't been figured out by then.

That's expected, isn't it? The range operator does output 1000 rows; the limit operator takes 1000 input rows but outputs only, say, 100 rows.

More background: the stop check for limit is done at batch granularity, while the stop check for the result buffer is done at row granularity.

That said, even if the limit is smaller than the batch size, the range operator still physically outputs an entire batch.

If so, then there is no problem. I was thinking that the number-of-output-rows metric at the range operator should be 100 if it is followed by a limit(100) operator.
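
The behavior described in this thread can be modeled with a short Java sketch. It is a deliberately simplified model under the assumption stated above (no per-row stop check for limit inside the range loop); it does not reproduce Spark's actual limit code generation, and `RangeLimitSketch` and `runOneBatch` are hypothetical names.

```java
// Hypothetical sketch: a simplified model, not Spark's generated code.
final class RangeLimitSketch {
    /** Returns {rangeNumOutput, limitNumOutput} after one range batch. */
    static long[] runOneBatch(long batchSize, long limit) {
        long rangeNumOutput = 0;
        long limitNumOutput = 0;
        for (long localIdx = 0; localIdx < batchSize; localIdx++) {
            // The range loop runs to the end of the batch; the limit only
            // decides whether a row is passed further downstream.
            if (limitNumOutput < limit) {
                limitNumOutput++;
            }
        }
        // The range metric is updated once, after the loop, with the full batch.
        rangeNumOutput += batchSize;
        return new long[] {rangeNumOutput, limitNumOutput};
    }
}
```

For `runOneBatch(1000, 100)` the range metric is 1000 and the limit metric is 100, which matches the explanation that the range operator physically outputs the entire batch.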

SparkQA · 2018-10-11T18:40:15Z

Test build #97264 has finished for PR 22698 at commit 1071c14.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2018-10-11T18:51:46Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/RangeBenchmark.scala

+ * Benchmark to measure performance for range operator.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>


1. without sbt: bin/spark-submit --class <this class> <spark sql test jar>

->

1. without sbt: bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>

SparkQA · 2018-10-12T06:26:21Z

Test build #97293 has finished for PR 22698 at commit 4058a21.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-10-12T08:08:32Z

LGTM

viirya · 2018-10-12T08:42:01Z

LGTM

kiszk · 2018-10-12T14:37:33Z

LGTM

cloud-fan · 2018-10-13T05:55:39Z

thanks, merging to master!

Closes apache#22698 from cloud-fan/range.

Authored-by: Wenchen Fan
Signed-off-by: Wenchen Fan

range should report metrics correctly

1071c14

address comment

4058a21

asfgit closed this in 34f229b Oct 13, 2018
