Skip to content
Closed
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
67364ab
start working on SparkR tweedie API
actuaryzhang Jan 27, 2017
654551b
set link only for non-tweedie; fix issue on aic
actuaryzhang Jan 28, 2017
852dd6e
add test for tweedie
actuaryzhang Jan 28, 2017
5aa4ae7
fix style
actuaryzhang Jan 28, 2017
3682692
fix style issue
actuaryzhang Jan 28, 2017
3555afb
remove dependency on statmod
actuaryzhang Jan 28, 2017
56f6da0
create model matix directly from formula
actuaryzhang Jan 29, 2017
083849c
update glmWrapper
actuaryzhang Jan 29, 2017
fb66ce0
add comments
actuaryzhang Jan 29, 2017
0d722fd
fix style issue
actuaryzhang Jan 29, 2017
d11fc4b
remove statmod from suggest; update glm
actuaryzhang Jan 29, 2017
4c24158
clean up doc
actuaryzhang Jan 29, 2017
295711d
remove link to statmod
actuaryzhang Jan 30, 2017
c315fb1
allow R-like tweedie specification in family
actuaryzhang Feb 1, 2017
9be9c51
set default tweedie link to avoid passing functions to scala
actuaryzhang Feb 1, 2017
201939b
add tweedie in vignettes
actuaryzhang Feb 1, 2017
6737122
add internal tweedie family
actuaryzhang Feb 1, 2017
0b5ed43
fix style
actuaryzhang Feb 1, 2017
b10777e
fix doc
actuaryzhang Feb 2, 2017
7d5bd60
fix doc
actuaryzhang Feb 2, 2017
a9ac439
fix doc issue
actuaryzhang Feb 2, 2017
f540922
merge pull
actuaryzhang Mar 6, 2017
6cbc62f
make twedie internal (no export)
actuaryzhang Mar 6, 2017
ef65adc
fix test issue
actuaryzhang Mar 6, 2017
c11e57c
add back variancePower and linkPower params
actuaryzhang Mar 9, 2017
5ce4c84
update vignettes
actuaryzhang Mar 9, 2017
aeeb3f7
fix style
actuaryzhang Mar 9, 2017
4cffc40
change names of tweedie parameters to be consistent with R
actuaryzhang Mar 13, 2017
0b496a6
update doc
actuaryzhang Mar 13, 2017
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion R/pkg/DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,8 @@ Suggests:
rmarkdown,
testthat,
e1071,
survival
survival,
statmod
Collate:
'schema.R'
'generics.R'
Expand Down
24 changes: 21 additions & 3 deletions R/pkg/R/mllib_regression.R
Original file line number Diff line number Diff line change
Expand Up @@ -84,6 +84,12 @@ setClass("IsotonicRegressionModel", representation(jobj = "jobj"))
#' # can also read back the saved model and print
#' savedModel <- read.ml(path)
#' summary(savedModel)
#'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update L56 for documentation. Also we should update the programming guide and vignettes too

#' # fit tweedie model
#' require(statmod)
Copy link
Member

@felixcheung felixcheung Jan 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally people use library instead of require

#' model <- spark.glm(df, Sepal_Length ~ Sepal_Width,
#' family = tweedie(var.power = 1.2, link.power = 0))
#' summary(model)
#' }
#' @note spark.glm since 2.0.0
#' @seealso \link{glm}, \link{read.ml}
Expand All @@ -101,6 +107,16 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
stop("'family' not recognized")
}

# recover variancePower and linkPower from the specified tweedie family
if (tolower(family$family) == "tweedie") {
variancePower <- log(family$variance(exp(1)))
linkPower <- log(family$linkfun(exp(1)))
} else {
# these default values are not used
variancePower <- 0.0
linkPower <- 1.0
}

formula <- paste(deparse(formula), collapse = "")
if (is.null(weightCol)) {
weightCol <- ""
Expand All @@ -109,7 +125,8 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
# For known families, Gamma is upper-cased
jobj <- callJStatic("org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper",
"fit", formula, data@sdf, tolower(family$family), family$link,
tol, as.integer(maxIter), as.character(weightCol), regParam)
tol, as.integer(maxIter), as.character(weightCol), regParam,
as.double(variancePower), as.double(linkPower))
Copy link
Member

@felixcheung felixcheung Jan 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably don't need to as.double here since it is either set to fixed values (L116) or from a calculation (L112). Instead, we should check var.power and link.power are within the correct range - not sure if the tweedie function does that.

new("GeneralizedLinearRegressionModel", jobj = jobj)
})

Expand All @@ -124,7 +141,7 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
#' the result of a call to a family function. Refer R family at
#' \url{https://stat.ethz.ch/R-manual/R-devel/library/stats/html/family.html}.
#' Currently these families are supported: \code{binomial}, \code{gaussian},
#' \code{Gamma}, and \code{poisson}.
#' \code{poisson}, \code{Gamma}, and \code{tweedie} (\code{statmod} package).
#' @param weightCol the weight column name. If this is not set or \code{NULL}, we treat all instance
#' weights as 1.0.
#' @param epsilon positive convergence tolerance of iterations.
Expand Down Expand Up @@ -170,9 +187,10 @@ setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"),
deviance <- callJMethod(jobj, "rDeviance")
df.null <- callJMethod(jobj, "rResidualDegreeOfFreedomNull")
df.residual <- callJMethod(jobj, "rResidualDegreeOfFreedom")
aic <- callJMethod(jobj, "rAic")
iter <- callJMethod(jobj, "rNumIterations")
family <- callJMethod(jobj, "rFamily")
aic <- callJMethod(jobj, "rAic")
if (family == "tweedie" && aic == 0) aic <- NA
deviance.resid <- if (is.loaded) {
NULL
} else {
Expand Down
12 changes: 12 additions & 0 deletions R/pkg/inst/tests/testthat/test_mllib_regression.R
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,18 @@ test_that("spark.glm and predict", {
out <- capture.output(print(summary(model)))
expect_true(any(grepl("Dispersion parameter for gamma family", out)))

# tweedie family
require(statmod)
Copy link
Member

@felixcheung felixcheung Jan 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can't require this as of now - we would need to update Jenkins otherwise it would fail like it is right now, because on Jenkins we don't have the statmod package

spark.glm and predict (@test_mllib_regression.R#81) - there is no package called 'statmod'

Copy link
Member

@felixcheung felixcheung Jan 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fact, library is more correct here as it will fail if the package isn't installed, instead of the warning we see and failing later.

model <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
family = tweedie(var.power = 1.2, link.power = 1.0))
prediction <- predict(model, training)
expect_equal(typeof(take(select(prediction, "prediction"), 1)$prediction), "double")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might want to use dtypes instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you remind me what dtypes is and why we need to use it here? Thanks.

vals <- collect(select(prediction, "prediction"))
rVals <- suppressWarnings(predict(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need suppressWarnings here?

glm(Sepal.Width ~ Sepal.Length + Species, data = iris,
family = tweedie(var.power = 1.2, link.power = 1.0)), iris))
expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)

# Test stats::predict is working
x <- rnorm(15)
y <- x + rnorm(15)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -71,7 +71,9 @@ private[r] object GeneralizedLinearRegressionWrapper
tol: Double,
maxIter: Int,
weightCol: String,
regParam: Double): GeneralizedLinearRegressionWrapper = {
regParam: Double,
variancePower: Double,
linkPower: Double): GeneralizedLinearRegressionWrapper = {
val rFormula = new RFormula().setFormula(formula)
checkDataColumns(rFormula, data)
val rFormulaModel = rFormula.fit(data)
Expand All @@ -81,15 +83,20 @@ private[r] object GeneralizedLinearRegressionWrapper
.attributes.get
val features = featureAttrs.map(_.name.get)
// assemble and fit the pipeline
val glr = new GeneralizedLinearRegression()
var glr = new GeneralizedLinearRegression()
.setFamily(family)
.setLink(link)
.setFitIntercept(rFormula.hasIntercept)
.setTol(tol)
.setMaxIter(maxIter)
.setWeightCol(weightCol)
.setRegParam(regParam)
.setFeaturesCol(rFormula.getFeaturesCol)
// set variancePower and linkPower if family is tweedie; otherwise, set link function
if (family.toLowerCase == "tweedie") {
glr = glr.setVariancePower(variancePower).setLinkPower(linkPower)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to assign glr = here? generally the setter method will update the instance

} else {
glr = glr.setLink(link)
}
val pipeline = new Pipeline()
.setStages(Array(rFormulaModel, glr))
.fit(data)
Expand Down Expand Up @@ -143,7 +150,12 @@ private[r] object GeneralizedLinearRegressionWrapper
val rDeviance: Double = summary.deviance
val rResidualDegreeOfFreedomNull: Long = summary.residualDegreeOfFreedomNull
val rResidualDegreeOfFreedom: Long = summary.residualDegreeOfFreedom
val rAic: Double = summary.aic
val rAic: Double = if (family.toLowerCase == "tweedie" &&
!Array(0.0, 1.0, 2.0).contains(variancePower)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are comparing double values here, do you know how reliable is this? should it have epsilon in the comparison?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. Changed it to comparison instead.

0.0
} else {
summary.aic
}
val rNumIterations: Int = summary.numIterations

new GeneralizedLinearRegressionWrapper(pipeline, rFeatures, rCoefficients, rDispersion,
Expand Down