Closed
Changes from 1 commit
29 commits
67364ab
start working on SparkR tweedie API
actuaryzhang Jan 27, 2017
654551b
set link only for non-tweedie; fix issue on aic
actuaryzhang Jan 28, 2017
852dd6e
add test for tweedie
actuaryzhang Jan 28, 2017
5aa4ae7
fix style
actuaryzhang Jan 28, 2017
3682692
fix style issue
actuaryzhang Jan 28, 2017
3555afb
remove dependency on statmod
actuaryzhang Jan 28, 2017
56f6da0
create model matix directly from formula
actuaryzhang Jan 29, 2017
083849c
update glmWrapper
actuaryzhang Jan 29, 2017
fb66ce0
add comments
actuaryzhang Jan 29, 2017
0d722fd
fix style issue
actuaryzhang Jan 29, 2017
d11fc4b
remove statmod from suggest; update glm
actuaryzhang Jan 29, 2017
4c24158
clean up doc
actuaryzhang Jan 29, 2017
295711d
remove link to statmod
actuaryzhang Jan 30, 2017
c315fb1
allow R-like tweedie specification in family
actuaryzhang Feb 1, 2017
9be9c51
set default tweedie link to avoid passing functions to scala
actuaryzhang Feb 1, 2017
201939b
add tweedie in vignettes
actuaryzhang Feb 1, 2017
6737122
add internal tweedie family
actuaryzhang Feb 1, 2017
0b5ed43
fix style
actuaryzhang Feb 1, 2017
b10777e
fix doc
actuaryzhang Feb 2, 2017
7d5bd60
fix doc
actuaryzhang Feb 2, 2017
a9ac439
fix doc issue
actuaryzhang Feb 2, 2017
f540922
merge pull
actuaryzhang Mar 6, 2017
6cbc62f
make twedie internal (no export)
actuaryzhang Mar 6, 2017
ef65adc
fix test issue
actuaryzhang Mar 6, 2017
c11e57c
add back variancePower and linkPower params
actuaryzhang Mar 9, 2017
5ce4c84
update vignettes
actuaryzhang Mar 9, 2017
aeeb3f7
fix style
actuaryzhang Mar 9, 2017
4cffc40
change names of tweedie parameters to be consistent with R
actuaryzhang Mar 13, 2017
0b496a6
update doc
actuaryzhang Mar 13, 2017
15 changes: 7 additions & 8 deletions R/pkg/R/mllib_regression.R
@@ -114,19 +114,18 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"),
  }
  }
  if (is.function(family)) {
- # family = statmod::tweedie()
- if (tolower(family$family) == "tweedie") {
- family <- list(family = "tweedie", link = "linkNotUsed")
- variancePower <- log(family$variance(exp(1)))
- linkPower <- log(family$linkfun(exp(1)))
- } else {
- family <- family()
- }
+ family <- family()
  }
  if (is.null(family$family)) {
  print(family)
  stop("'family' not recognized")
  }
+ # family = statmod::tweedie()
+ if (tolower(family$family) == "tweedie" && !is.null(family$variance)) {
Member

I assume this handles the "fake" family created on L111 correctly? It doesn't have a `variance` element.

Contributor Author

This part only handles the case when statmod::tweedie is specified: it retrieves the var.power and link.power and constructs a list with the family name and link name to be used.
The check for non-null variance is to skip handling the "fake" family. All we need when specifying family = "tweedie" is just a list with family name and link name.
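
To make the reply above concrete, here is a minimal base-R sketch of the power-recovery trick and the non-null `variance` guard. `tweedie_like` and `extract_powers` are hypothetical names introduced only for illustration; `tweedie_like` is a stand-in that assumes a statmod-style family object exposes a power variance function and a power link (log link when the link power is 0), and it is not statmod's actual code.

```r
# Hypothetical stand-in for a statmod-style tweedie family object (an
# assumption for illustration): variance is mu^var.power, the link is
# mu^link.power, and link.power = 0 means the log link.
tweedie_like <- function(var.power = 0, link.power = 1 - var.power) {
  list(
    family = "Tweedie",
    variance = function(mu) mu^var.power,
    linkfun = function(mu) if (link.power == 0) log(mu) else mu^link.power
  )
}

# Mirrors the branch under review: only family objects that carry a
# variance function are unpacked; a plain list(family = "tweedie", ...)
# falls through untouched.
extract_powers <- function(family) {
  if (tolower(family$family) == "tweedie" && !is.null(family$variance)) {
    e <- exp(1)
    # log(e^p) = p recovers the variance power; log(e^q) = q recovers the
    # link power, and log(log(e)) = 0 covers the log-link case.
    list(variancePower = log(family$variance(e)),
         linkPower = log(family$linkfun(e)))
  } else {
    NULL
  }
}

powers <- extract_powers(tweedie_like(var.power = 1.5, link.power = 0))
skipped <- extract_powers(list(family = "tweedie", link = "linkNotUsed"))
```

Under these assumptions, `powers` holds the variance power and link power that were passed in, while `skipped` is `NULL` because the "fake" family has no `variance` element.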

+ variancePower <- log(family$variance(exp(1)))
+ linkPower <- log(family$linkfun(exp(1)))
+ family <- list(family = "tweedie", link = "linkNotUsed")
Member

ditto, link = NULL

+ }

  formula <- paste(deparse(formula), collapse = "")
  if (!is.null(weightCol) && weightCol == "") {
12 changes: 6 additions & 6 deletions R/pkg/inst/tests/testthat/test_mllib_regression.R
@@ -91,9 +91,9 @@ test_that("spark.glm and predict", {
  #' print(coef(rModel))

  rCoef <- c(0.6455409, 0.1169143, -0.3224752, -0.3282174)
- rVals <- as.numeric(model.matrix(Sepal.Width ~ Sepal.Length + Species,
- data = iris) %*% rCoef)
- expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
+ rVals <- exp(as.numeric(model.matrix(Sepal.Width ~ Sepal.Length + Species,
+ data = iris) %*% rCoef))
+ expect_true(all(abs(rVals - vals) < 1e-5), rVals - vals)

  # Test stats::predict is working
  x <- rnorm(15)
@@ -281,9 +281,9 @@ test_that("glm and predict", {
  #' print(coef(rModel))

  rCoef <- c(0.6455409, 0.1169143, -0.3224752, -0.3282174)
- rVals <- as.numeric(model.matrix(Sepal.Width ~ Sepal.Length + Species,
- data = iris) %*% rCoef)
- expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
+ rVals <- exp(as.numeric(model.matrix(Sepal.Width ~ Sepal.Length + Species,
+ data = iris) %*% rCoef))
+ expect_true(all(abs(rVals - vals) < 1e-5), rVals - vals)

  # Test stats::predict is working
  x <- rnorm(15)
9 changes: 6 additions & 3 deletions R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -682,7 +682,9 @@ There are three ways to specify the `family` argument.

  * Result returned by a family function, e.g. `family = poisson(link = log)`.

- * Note that when package `statmod` is loaded, the tweedie family is specified as `tweedie(var.power, link.power)`. Otherwise, one can use the SparkR internal definition `SparkR:::tweedie(var.power, link.power)`. In the above, `var.power` is the power index of the variance function and `link.power` is the index of the the power link function (the default value is `link.power = 1.0 - var.power`). This is consistent with the `tweedie` family defined in the `statmod` package. Some examples: `family = tweedie(0.0)` is gaussian with identity link, `family = tweedie(1.0)` poisson with log link, `family = tweedie(2.0)` Gamma with inverse link, and `family = tweedie(1.5, 0.0)` compound Poisson with log link.
+ * Note that there are two ways to specify the tweedie family:
+ a) Set `family = "tweedie"` and specify the `variancePower` and `linkPower`
+ b) When package `statmod` is loaded, the tweedie family is specified using the family definition therein, i.e., `tweedie()`.

  For more information regarding the families and their link functions, see the Wikipedia page [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model).
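
The vignette bullet removed in this hunk states a default of `link.power = 1.0 - var.power` and lists the classical special cases. The following base-R sketch spells out that mapping; `default_link_power` and `link_name` are hypothetical helpers written for this illustration only, and the sketch assumes the same default still applies to `variancePower` and `linkPower`.

```r
# Default link power as stated in the removed vignette text:
# link.power = 1 - var.power.
default_link_power <- function(var.power) 1 - var.power

# Name the classical link that a given power link corresponds to.
link_name <- function(link.power) {
  if (link.power == 1) {
    "identity"
  } else if (link.power == 0) {
    "log"
  } else if (link.power == -1) {
    "inverse"
  } else {
    paste0("mu^", link.power)
  }
}

# Variance powers of the three classical families named in the vignette.
var_powers <- c(gaussian = 0, poisson = 1, Gamma = 2)
links <- vapply(var_powers,
                function(p) link_name(default_link_power(p)),
                character(1))
print(links)
```

Under this default, variance powers 0, 1 and 2 pair with the identity, log and inverse links respectively, which matches the gaussian, poisson and Gamma examples in the vignette text.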

@@ -700,12 +702,13 @@ head(select(gaussianFitted, "model", "prediction", "mpg", "wt", "hp"))

  The following is the same fit using the tweedie family:
  ```{r}
- tweedieGLM1 <- spark.glm(carsDF, mpg ~ wt + hp, family = SparkR:::tweedie(0.0))
+ tweedieGLM1 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie", variancePower = 0.0)
  summary(tweedieGLM1)
  ```
  We can try other distributions in the tweedie family, for example, a compound Poisson distribution with a log link:
  ```{r}
- tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = SparkR:::tweedie(1.2, 0.0))
+ tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie",
+ variancePower = 1.2, linkPower = 0.0)
  summary(tweedieGLM2)
Member

Let's add an example with statmod too? Either here or in the roxygen2 API doc (the latter might be a better place?)

```
