17 changes: 16 additions & 1 deletion core/src/main/scala/org/apache/spark/partial/BoundedDouble.scala
@@ -21,5 +21,20 @@ package org.apache.spark.partial
* A Double value with error bars and associated confidence.
*/
class BoundedDouble(val mean: Double, val confidence: Double, val low: Double, val high: Double) {
- override def toString(): String = "[%.3f, %.3f]".format(low, high)
+ override def toString(): String = "BoundedDouble(%.3f, %.3f, %.3f, %.3f)".format(mean, confidence, low, high)
Member:
I was going to say that this could become a case class to get all of this for free, but it's part of the API. In theory that doesn't change anything, so you can try a case class and see if MiMa accepts it. Otherwise you can do this, but you'll need to fix up a few little style things: I'd not change the toString, the hashCode line might be too long, and the indent on equals is too deep.

Contributor:
Case classes bring extra concerns with binary compatibility (due to pattern matching). I'd minimize their use in public APIs.


+ override def hashCode: Int = this.mean.hashCode ^ this.confidence.hashCode ^ this.low.hashCode ^ this.high.hashCode

+ override def equals(that: Any): Boolean =
+   that match {
+     case that: BoundedDouble => {
+       this.mean == that.mean &&
+       this.confidence == that.confidence &&
+       this.low == that.low &&
+       this.high == that.high
+     }
+     case _ => false
+   }


}
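A minimal sketch of the case-class alternative discussed above, for reference; hypothetical, since the diff keeps the plain class:

```scala
// Hypothetical sketch: a case class generates toString, equals and hashCode
// for free, but it also adds apply/unapply/copy to the public API, which
// pattern-matching clients (and MiMa) then hold the project to.
case class BoundedDouble(mean: Double, confidence: Double, low: Double, high: Double)
```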
38 changes: 24 additions & 14 deletions core/src/main/scala/org/apache/spark/partial/SumEvaluator.scala
@@ -29,8 +29,9 @@ import org.apache.spark.util.StatCounter
private[spark] class SumEvaluator(totalOutputs: Int, confidence: Double)
extends ApproximateEvaluator[StatCounter, BoundedDouble] {

+ // modified in merge
var outputsMerged = 0
- var counter = new StatCounter
+ val counter = new StatCounter

override def merge(outputId: Int, taskResult: StatCounter) {
outputsMerged += 1
@@ -40,30 +41,39 @@ private[spark] class SumEvaluator(totalOutputs: Int, confidence: Double)
override def currentResult(): BoundedDouble = {
if (outputsMerged == totalOutputs) {
new BoundedDouble(counter.sum, 1.0, counter.sum, counter.sum)
- } else if (outputsMerged == 0) {
+ } else if (outputsMerged == 0 || counter.count == 0) {
new BoundedDouble(0, 0.0, Double.NegativeInfinity, Double.PositiveInfinity)
} else {
val p = outputsMerged.toDouble / totalOutputs
val meanEstimate = counter.mean
- val meanVar = counter.sampleVariance / counter.count
val countEstimate = (counter.count + 1 - p) / p
- val countVar = (counter.count + 1) * (1 - p) / (p * p)
val sumEstimate = meanEstimate * countEstimate
- val sumVar = (meanEstimate * meanEstimate * countVar) +
-   (countEstimate * countEstimate * meanVar) +
-   (meanVar * countVar)
- val sumStdev = math.sqrt(sumVar)
- val confFactor = {
-   if (counter.count > 100) {

+ val meanVar = counter.sampleVariance / counter.count

+ // branch at this point because counter.count == 1 implies counter.sampleVariance == NaN
+ // and we don't want to ever return a bound of NaN
+ if (meanVar == Double.NaN || counter.count == 1) {
Member:
NaN != NaN; you'd have to use Double.isNaN

Contributor (author):
Ah good catch
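A standalone Scala snippet illustrating the pitfall (not part of the patch):

```scala
object NaNComparison extends App {
  val meanVar = Double.NaN
  // Every comparison with NaN is false, including NaN == NaN,
  // so a guard written as `meanVar == Double.NaN` can never fire.
  println(meanVar == Double.NaN)  // false
  // isNaN is the correct test.
  println(meanVar.isNaN)          // true
}
```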

+   new BoundedDouble(sumEstimate, confidence, Double.NegativeInfinity, Double.PositiveInfinity)
+ } else {
+   val countVar = (counter.count + 1) * (1 - p) / (p * p)
+   val sumVar = (meanEstimate * meanEstimate * countVar) +
+     (countEstimate * countEstimate * meanVar) +
+     (meanVar * countVar)
+   val sumStdev = math.sqrt(sumVar)
+   val confFactor = if (counter.count > 100) {
new NormalDistribution().inverseCumulativeProbability(1 - (1 - confidence) / 2)
-   } else {
+   } else if (counter.count > 1) {
val degreesOfFreedom = (counter.count - 1).toInt
new TDistribution(degreesOfFreedom).inverseCumulativeProbability(1 - (1 - confidence) / 2)
+   } else {
+     throw new Exception("Counter.count <= 1; this should be impossible at this point")
Member:
This is unnecessary, as you say. Just remove this branch, right?

Contributor (author):
Unfortunately, without the final branch, confFactor resolves to type Any. That last branch is necessary to keep it a Double, either by returning a dummy value or by throwing an exception. I think an exception is safer, because it makes it easier to detect if the condition is ever no longer impossible.

I'll work on the tests tomorrow.
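A minimal sketch of the type-inference point; the quantile values are illustrative placeholders:

```scala
// Without a final else, the missing branch contributes Unit, so the
// expression's type widens past Double (to AnyVal here):
def incomplete(count: Long): AnyVal =
  if (count > 100) 1.96 else if (count > 1) 2.26

// A throwing final branch keeps the result type Double, because a throw
// has type Nothing, which is a subtype of every type:
def complete(count: Long): Double =
  if (count > 100) 1.96
  else if (count > 1) 2.26
  else throw new IllegalStateException(s"impossible count: $count")
```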


Member:
... but I'm suggesting there be no third branch at all, like before. I understand the types.

Contributor (author):
Right, but then you have to have an unconditional else as the final branch. This whole bug stems from that code being unsuitable for an unconditional else. With three branches, the final one throwing an exception, you'll know if changes in the logic above have re-introduced the same error condition. With an unconditional else holding the call to TDistribution, you won't.

Member:
You've already handled the count == 0 and count == 1 cases earlier. Checking count > 1 doesn't do anything, since it can't be false, so having a branch for it is odd. Tests are how we catch regressions.

Contributor (author):
I understand that the check does nothing for the computer, but it makes the code easier to read. It's slightly better than a comment because it won't lie around being incorrect and stale.

Nevertheless, I can fix it up to your preference together with the tests.


Member:
I don't think it's easier to read, since it's more code and suggests this branch can happen when it should be impossible.

}

+   val low = sumEstimate - confFactor * sumStdev
+   val high = sumEstimate + confFactor * sumStdev
+   new BoundedDouble(sumEstimate, confidence, low, high)
}
- val low = sumEstimate - confFactor * sumStdev
- val high = sumEstimate + confFactor * sumStdev
- new BoundedDouble(sumEstimate, confidence, low, high)
}
}
}
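For context, the two-tailed confidence factor computed in this diff can be exercised on its own with commons-math3; a standalone sketch, not part of the patch:

```scala
import org.apache.commons.math3.distribution.{NormalDistribution, TDistribution}

object ConfFactorDemo extends App {
  val confidence = 0.95
  // Two-tailed interval: (1 - confidence) / 2 probability mass in each tail.
  val p = 1 - (1 - confidence) / 2  // 0.975

  // Large counts use the normal approximation, small counts Student's t.
  val zFactor = new NormalDistribution().inverseCumulativeProbability(p)
  val tFactor = new TDistribution(9).inverseCumulativeProbability(p)

  println(f"normal quantile: $zFactor%.4f")  // ~1.9600
  println(f"t(9) quantile:   $tFactor%.4f")  // ~2.2622
}
```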
49 changes: 49 additions & 0 deletions core/src/test/scala/org/apache/spark/partial/SumEvaluatorSuite.scala
@@ -0,0 +1,49 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.partial


import org.apache.spark._
import org.apache.spark.util.StatCounter

class SumEvaluatorSuite extends SparkFunSuite with SharedSparkContext {

test("correct handling of count 1") {
Member:
While you're here, throw in a basic test for count == 0 too, and ideally a normal-path case for completeness.

Contributor (author):
Done.
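A count == 0 case along the requested lines might look like the following sketch (illustrative; the merged suite is authoritative):

```scala
test("correct handling of count 0") {
  // Merge one empty StatCounter: outputsMerged == 1 but counter.count == 0.
  val evaluator = new SumEvaluator(10, 0.95)
  evaluator.merge(1, new StatCounter())
  // With no observations the result is unbounded, with zero confidence.
  assert(evaluator.currentResult() ==
    new BoundedDouble(0, 0.0, Double.NegativeInfinity, Double.PositiveInfinity))
}
```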


// setup
val counter = new StatCounter(List(2.0))
// totalOutputs of 10 because it's larger than 1,
// and 0.95 because that's the default confidence
val evaluator = new SumEvaluator(10, 0.95)
// arbitrarily assign id 1
evaluator.merge(1, counter)

// execute
val res = evaluator.currentResult()
// Build a version with known precision for the equality check
val round_res = new BoundedDouble(res.mean.round.toDouble, res.confidence, res.low, res.high)
Member:
I don't see a reason to round to the nearest integer. You can assert approximate equality in a test, but I think you can assert exact equality in this case, since it's deterministic.

Contributor (author):
Unfortunately, the exact value being returned doesn't have an exact decimal representation. I'll think about the best course here.

Member:
By that logic, lots of things returning floating-point values would be untestable, but they are. You are asserting that the value returned is the expected closest-possible double. This is the best course. This test asserts that anything rounding to the same int is fine, but that's unnecessarily coarse, and we have clearer syntax for it. If it's really called for, asserting approximate equality is better.

This should be a conceptually simple fix plus tests, so I'd prefer to head straight at the simple solution and wrap this up, or else it may be simpler to merge #11981 with a test instead.

Contributor (author):
@srowen Update me then if you change your mind. In the meantime I'll make changes here.
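For what it's worth, the "clearer syntax" for approximate equality could look like this in ScalaTest, assuming the Scalactic Tolerance helpers are on the test classpath; the math.abs form works anywhere:

```scala
import org.scalactic.Tolerance._

// Asserts |res.mean - 38.0| <= 0.01 instead of exact equality.
assert(res.mean === 38.0 +- 0.01)

// Equivalent without any test-framework helpers:
assert(math.abs(res.mean - 38.0) <= 0.01)
```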


// Sanity check that equality works on BoundedDouble
assert(new BoundedDouble(2.0, 0.95, 1.1, 1.2) == new BoundedDouble(2.0, 0.95, 1.1, 1.2))
// actual test

// 38.0 because that's how the maths shakes out
assert(round_res == new BoundedDouble(38.0, 0.950, Double.NegativeInfinity, Double.PositiveInfinity))
}

}