Closed
24 commits
952887e
Add Tweedie family to GLM
actuaryzhang Dec 16, 2016
4f184ec
Fix calculation in dev resid; Add test for different var power
actuaryzhang Dec 19, 2016
7fe3910
Merge test into GLR
actuaryzhang Dec 19, 2016
bfcc4fb
Use Tweedie class instead of global object Tweedie; change variancePo…
actuaryzhang Dec 20, 2016
a8feea7
Allow Family to use GLRBase object directly
actuaryzhang Dec 21, 2016
233e2d3
Add TweedieFamily and implement specific distn within Tweedie
actuaryzhang Dec 22, 2016
17c5581
Clean up doc
actuaryzhang Dec 22, 2016
0b41825
Move defaultLink and name to subclass of TweedieFamily
actuaryzhang Dec 22, 2016
6e8e607
Change style for AIC
actuaryzhang Dec 22, 2016
8d7d34e
Rename Family methods and restore methods for tweedie subclasses
actuaryzhang Dec 23, 2016
6da7e30
Update test
actuaryzhang Dec 23, 2016
9a71e89
Clean up doc
actuaryzhang Dec 27, 2016
f461c09
Put delta in Tweedie companion object
actuaryzhang Dec 27, 2016
a839c46
Clean up doc
actuaryzhang Dec 27, 2016
fab2652
Allow more link functions in tweedie
actuaryzhang Jan 5, 2017
651ea62
Implement link power
actuaryzhang Jan 10, 2017
0310e85
remove restriction on link power; revert to use family object in check
actuaryzhang Jan 10, 2017
6f4abeb
create factory method for FamilyAndLink
actuaryzhang Jan 13, 2017
7b5d4cd
fix style issue in test
actuaryzhang Jan 14, 2017
5a40073
make model writable
actuaryzhang Jan 17, 2017
a6fe665
Merge branch 'master' into tweedie
actuaryzhang Jan 18, 2017
83deee3
resolve conflicts
actuaryzhang Jan 18, 2017
995e88f
Merge branch 'master' into tweedie
actuaryzhang Jan 23, 2017
54da2cb
update docs
actuaryzhang Jan 23, 2017
@@ -48,7 +48,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
/**
* Param for the name of family which is a description of the error distribution
* to be used in the model.
* Supported options: "gaussian", "binomial", "poisson" and "gamma".
* Supported options: "gaussian", "binomial", "poisson", "gamma" and "tweedie".
* Default is "gaussian".
*
* @group param
@@ -63,6 +63,27 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
@Since("2.0.0")
def getFamily: String = $(family)

/**
* Param for the power in the variance function of the Tweedie distribution which provides
Member

Nits: Param -> parameter, tweedie -> Tweedie (two lines below).

Contributor Author

Changed tweedie to Tweedie, but the other docs use "Param".

* the relationship between the variance and mean of the distribution.
* Used only for the tweedie family.
* (see <a href="https://en.wikipedia.org/wiki/Tweedie_distribution">
* Tweedie Distribution (Wikipedia)</a>)
* Supported value: (1, 2) and (2, Inf).
Contributor

Question: why don't we allow 0, 1 and 2? They correspond to the Gaussian, Poisson and Gamma families, respectively. I think we should support fitting a Poisson GLM through the tweedie family, as R does:

y <- rgamma(20,shape=5)
x <- 1:20
glm(y~x,family=tweedie(var.power=1,link.power=1))
glm(y~x,family=poisson(link=identity))
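
The correspondence mentioned here follows from the Tweedie variance function V(mu) = mu^p. A minimal Scala sketch of that relationship; `TweedieVarianceSketch` is a hypothetical name used only for illustration and is not part of this PR or of Spark:

```scala
// Illustrative only: TweedieVarianceSketch is a hypothetical name, not Spark API.
object TweedieVarianceSketch {
  /** Tweedie variance function V(mu) = mu^p, where p is the variance power. */
  def variance(mu: Double, variancePower: Double): Double =
    math.pow(mu, variancePower)

  def main(args: Array[String]): Unit = {
    val mu = 3.0
    println(variance(mu, 0.0)) // p = 0: constant variance (Gaussian)
    println(variance(mu, 1.0)) // p = 1: variance equals the mean (Poisson)
    println(variance(mu, 2.0)) // p = 2: variance equals the squared mean (Gamma)
  }
}
```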

*
* @group param
*/
@Since("2.2.0")
final val variancePower: Param[Double] = new Param(this, "variancePower",
"The power in the variance function of the Tweedie distribution which characterizes " +
"the relationship between the variance and mean of the distribution. " +
"Used only for the tweedie family. Supported value: (1, 2) and (2, Inf).",
(x: Double) => if (x > 1.0 && x != 2.0) true else false)
Member

You can just write => x > 1.0 && x != 2.0. if (x) true else false is redundant.
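
A small sketch of the suggested simplification. The wrapper object `VariancePowerValidatorSketch` is hypothetical and exists only to make the snippet self-contained; the predicate is the one from the diff:

```scala
// Illustrative only: the object name is an assumption, not part of the PR.
object VariancePowerValidatorSketch {
  // The comparison already yields a Boolean, so no if/else wrapper is needed.
  val isValid: Double => Boolean = x => x > 1.0 && x != 2.0
}
```

This behaves identically to the `if (...) true else false` form in the diff.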


/** @group getParam */
@Since("2.2.0")
def getVariancePower: Double = $(variancePower)

/**
* Param for the name of link function which provides the relationship
* between the linear predictor and the mean of the distribution function.
@@ -108,8 +129,9 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
featuresDataType: DataType): StructType = {
if (isDefined(link)) {
require(supportedFamilyAndLinkPairs.contains(
Family.fromName($(family)) -> Link.fromName($(link))), "Generalized Linear Regression " +
s"with ${$(family)} family does not support ${$(link)} link function.")
Family.fromName($(family), $(variancePower)) -> Link.fromName($(link))),
s"Generalized Linear Regression with ${$(family)} family " +
s"does not support ${$(link)} link function.")
}
val newSchema = super.validateAndTransformSchema(schema, fitting, featuresDataType)
if (hasLinkPredictionCol) {
@@ -128,13 +150,14 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
* Generalized linear model (Wikipedia)</a>)
* specified by giving a symbolic description of the linear
* predictor (link function) and a description of the error distribution (family).
* It supports "gaussian", "binomial", "poisson" and "gamma" as family.
* It supports "gaussian", "binomial", "poisson", "gamma" and "tweedie" as family.
* Valid link functions for each family is listed below. The first link function of each family
* is the default one.
* - "gaussian" : "identity", "log", "inverse"
* - "binomial" : "logit", "probit", "cloglog"
* - "poisson" : "log", "identity", "sqrt"
* - "gamma" : "inverse", "identity", "log"
* - "tweedie" : "identity", "log"
Contributor

- "tweedie" : "log", "identity", see L155: the first link function of each family is the default one.

*/
@Experimental
@Since("2.0.0")
@@ -157,6 +180,16 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
def setFamily(value: String): this.type = set(family, value)
setDefault(family -> Gaussian.name)

/**
* Sets the value of param [[variancePower]].
* Used only when family is "tweedie".
*
* @group setParam
*/
@Since("2.2.0")
def setVariancePower(value: Double): this.type = set(variancePower, value)
setDefault(variancePower -> 1.5)
Contributor

Why set the default value to 1.5? AFAIK, R sets the default variance power to 0, which corresponds to the Gaussian family, with identity as the default link function.

glm(formula = "b ~ .", family = tweedie, data = df, weights = w)

produces the same model as

glm(formula = "b ~ .", family = gaussian, data = df, weights = w)

h2o.glm uses the same default values as R. Should we stay consistent with them?

Contributor Author

Done. Changed the default variancePower to 0.0, which uses the Gaussian family (with the default identity link).


/**
* Sets the value of param [[link]].
*
@@ -242,7 +275,7 @@ class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val
def setLinkPredictionCol(value: String): this.type = set(linkPredictionCol, value)

override protected def train(dataset: Dataset[_]): GeneralizedLinearRegressionModel = {
val familyObj = Family.fromName($(family))
val familyObj = Family.fromName($(family), $(variancePower))
Member

I don't think we can do this either. variancePower is specific to one family, not a property of all of them.
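
One possible shape for addressing this, sketched under stated assumptions: the factory receives a parameter holder and only the tweedie branch reads the variance power. Every name below (`FamilySketch`, `GlmParamsSketch`, `FamilyFactorySketch`, and so on) is a hypothetical stand-in and not the actual Spark types:

```scala
// Hypothetical stand-ins for the real Spark types; all names are assumptions.
sealed trait FamilySketch { def name: String }
case object GaussianSketch extends FamilySketch { val name = "gaussian" }
case object PoissonSketch extends FamilySketch { val name = "poisson" }
final case class TweedieSketch(variancePower: Double) extends FamilySketch {
  val name = "tweedie"
}

// Minimal parameter holder; variancePower is optional because only one family uses it.
final case class GlmParamsSketch(family: String, variancePower: Option[Double] = None)

object FamilyFactorySketch {
  def fromParams(params: GlmParamsSketch): FamilySketch = params.family match {
    case "gaussian" => GaussianSketch
    case "poisson"  => PoissonSketch
    case "tweedie"  =>
      // Only this branch consults variancePower.
      TweedieSketch(params.variancePower.getOrElse(
        throw new IllegalArgumentException("tweedie requires variancePower")))
    case other =>
      throw new IllegalArgumentException(s"Unsupported family: $other")
  }
}
```

With this shape, callers for the other families never have to supply a value that is meaningless to them.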

val linkObj = if (isDefined(link)) {
Link.fromName($(link))
} else {
@@ -306,7 +339,8 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
Gaussian -> Identity, Gaussian -> Log, Gaussian -> Inverse,
Binomial -> Logit, Binomial -> Probit, Binomial -> CLogLog,
Poisson -> Log, Poisson -> Identity, Poisson -> Sqrt,
Gamma -> Inverse, Gamma -> Identity, Gamma -> Log
Gamma -> Inverse, Gamma -> Identity, Gamma -> Log,
Tweedie -> Identity, Tweedie -> Log
)

/** Set of family names that GeneralizedLinearRegression supports. */
@@ -404,14 +438,17 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
/**
* Gets the [[Family]] object from its name.
*
* @param name family name: "gaussian", "binomial", "poisson" or "gamma".
* @param name family name: "gaussian", "binomial", "poisson", "gamma" or "tweedie".
*/
def fromName(name: String): Family = {
def fromName(name: String, variancePower: Double): Family = {
name match {
case Gaussian.name => Gaussian
case Binomial.name => Binomial
case Poisson.name => Poisson
case Gamma.name => Gamma
case Tweedie.name =>
Tweedie.variancePower = variancePower
Tweedie
}
}
}
@@ -591,6 +628,59 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
}
}

/**
* Tweedie exponential family distribution.
* The default link for the Tweedie family is the log link.
*/
private[regression] object Tweedie extends Family("tweedie") {

val defaultLink: Link = Log

var variancePower: Double = 1.5
Member

This is a global shared variable -- we really can't do this.

Contributor Author

Could you please suggest a better way to set variancePower? I want to stay consistent with the existing code, which uses Family objects, but I also need to pass the input variancePower to the Tweedie object, which uses it to compute the variance function. Any suggestions would be highly appreciated. @srowen @yanboliang

Member

I think the Tweedie implementation needs to be able to access parameters of the GLM, to read off variancePower.

As it is, this is a global variable, and two jobs would overwrite each other's values.
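
To illustrate the concern, here is a minimal sketch of the alternative: an immutable class whose instances each carry their own variance power, so two fits cannot interfere. `TweedieFamilySketch` is a hypothetical name and deliberately omits the rest of the Family interface:

```scala
// Hypothetical sketch: each fit gets its own immutable family instance
// instead of mutating a field on a shared singleton object.
final class TweedieFamilySketch(val variancePower: Double) {
  def variance(mu: Double): Double = math.pow(mu, variancePower)
}

object TweedieFamilySketchDemo {
  def main(args: Array[String]): Unit = {
    val jobA = new TweedieFamilySketch(1.5)
    val jobB = new TweedieFamilySketch(3.0)
    // Creating jobB leaves jobA's variance power untouched.
    println(jobA.variancePower)
    println(jobB.variancePower)
  }
}
```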


override def initialize(y: Double, weight: Double): Double = {
if (variancePower > 1.0 && variancePower < 2.0) {
require(y >= 0.0, "The response variable of the specified Tweedie distribution " +
s"should be non-negative, but got $y")
math.max(y, 0.1)
Member

If we're going to use this magic 0.1 constant in many places, factor out a constant? 0.1 seems quite large as an 'epsilon' but I guess that's what R's implementation uses for whatever reason?

Contributor Author

@actuaryzhang Dec 20, 2016

I have not seen a formal justification for the choice of 0.1 in R. This seminal paper (section 4) suggests 1/6 (about 0.17) as the best constant. I would prefer to stay consistent with R so that results can be compared directly. Factoring it out into a constant is a good idea.
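
A sketch of what factoring out the constant could look like, applied to the deviance computation from this diff. The object name `TweedieDevianceSketch` and the extra `variancePower` argument are assumptions made so the snippet is self-contained; the value 0.1 is the one discussed above:

```scala
// Illustrative only: object name and method signature are assumptions.
object TweedieDevianceSketch {
  // Named constant replacing the repeated literal 0.1.
  val delta: Double = 0.1

  private def yp(y: Double, mu: Double, p: Double): Double =
    (math.pow(y, p) - math.pow(mu, p)) / p

  /** Weighted deviance term, mirroring the deviance method in this diff.
    * Assumes variancePower is neither 1 nor 2, so neither exponent is zero. */
  def deviance(y: Double, mu: Double, weight: Double, variancePower: Double): Double =
    2.0 * weight *
      (y * yp(math.max(y, delta), mu, 1.0 - variancePower) -
        yp(y, mu, 2.0 - variancePower))
}
```

When y equals mu the term is zero, and flooring y at `delta` keeps the negative power finite when y is 0.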

} else {
require(y > 0.0, "The response variable of the specified Tweedie distribution " +
s"should be non-negative, but got $y")
y
}
}

override def variance(mu: Double): Double = math.pow(mu, variancePower)

private def yp(y: Double, mu: Double, p: Double): Double = {
(math.pow(y, p) - math.pow(mu, p)) / p
}

// Force y >= 0.1 for deviance to work for (1 - variancePower). see tweedie()$dev.resid
override def deviance(y: Double, mu: Double, weight: Double): Double = {
2.0 * weight *
(y * yp(math.max(y, 0.1), mu, 1.0 - variancePower) - yp(y, mu, 2.0 - variancePower))
}

// This depends on the density of the tweedie distribution. Not yet implemented.
override def aic(
predictions: RDD[(Double, Double, Double)],
deviance: Double,
numInstances: Double,
weightSum: Double): Double = {
0.0
Member

Throw an UnsupportedOperationException?

}

override def project(mu: Double): Double = {
if (mu < epsilon) {
epsilon
} else if (mu.isInfinity) {
Double.MaxValue
Member

Out of curiosity, is it meaningful to "cap" at Double.MaxValue? By the time you get there, a lot of stuff is going to be infinite or not meaningful.

Member

@srowen Dec 21, 2016

I see, it's done that way in other implementations. OK. I'm not sure if it's going to do much.

I think there's a problem in the Gaussian project method: it appears to use Double.MinValue to mean "the smallest double" when it's actually the "smallest possible double". I'll investigate and file a bug if needed. EDIT: Ignore this side comment; Scala's Double.MinValue isn't the same as Java's Double.MIN_VALUE.
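
For reference, a short sketch of the distinction behind that retraction. The object name `DoubleBoundsSketch` is hypothetical; the constants themselves are from the Scala and Java standard libraries:

```scala
// Illustrative only: the object name is an assumption.
object DoubleBoundsSketch {
  // Scala: Double.MinValue is the most negative finite double.
  val scalaMin: Double = Double.MinValue
  // Java: Double.MIN_VALUE is the smallest positive double.
  val javaMin: Double = java.lang.Double.MIN_VALUE
  // Scala exposes that same value under a different name.
  val scalaMinPositive: Double = Double.MinPositiveValue
}
```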

} else {
mu
}
}
}
/**
* A description of the link function to be used in the model.
* The link function provides the relationship between the linear predictor
@@ -720,7 +810,7 @@ class GeneralizedLinearRegressionModel private[ml] (

import GeneralizedLinearRegression._

private lazy val familyObj = Family.fromName($(family))
private lazy val familyObj = Family.fromName($(family), $(variancePower))
private lazy val linkObj = if (isDefined(link)) {
Link.fromName($(link))
} else {
@@ -905,7 +995,8 @@ class GeneralizedLinearRegressionSummary private[regression] (
*/
@Since("2.0.0") @transient val predictions: DataFrame = model.transform(dataset)

private[regression] lazy val family: Family = Family.fromName(model.getFamily)
private[regression] lazy val family: Family =
Family.fromName(model.getFamily, model.getVariancePower)
private[regression] lazy val link: Link = if (model.isDefined(model.link)) {
Link.fromName(model.getLink)
} else {
@@ -1054,7 +1145,11 @@ class GeneralizedLinearRegressionSummary private[regression] (
case Row(label: Double, pred: Double, weight: Double) =>
(label, pred, weight)
}
family.aic(t, deviance, numInstances, weightSum) + 2 * rank
if (model.getFamily == Tweedie.name) {
throw new UnsupportedOperationException("No AIC available for the tweedie family")
} else {
family.aic(t, deviance, numInstances, weightSum) + 2 * rank
}
}
}
