[SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR #16729
Closed
29 commits
67364ab start working on SparkR tweedie API (actuaryzhang)
654551b set link only for non-tweedie; fix issue on aic (actuaryzhang)
852dd6e add test for tweedie (actuaryzhang)
5aa4ae7 fix style (actuaryzhang)
3682692 fix style issue (actuaryzhang)
3555afb remove dependency on statmod (actuaryzhang)
56f6da0 create model matix directly from formula (actuaryzhang)
083849c update glmWrapper (actuaryzhang)
fb66ce0 add comments (actuaryzhang)
0d722fd fix style issue (actuaryzhang)
d11fc4b remove statmod from suggest; update glm (actuaryzhang)
4c24158 clean up doc (actuaryzhang)
295711d remove link to statmod (actuaryzhang)
c315fb1 allow R-like tweedie specification in family (actuaryzhang)
9be9c51 set default tweedie link to avoid passing functions to scala (actuaryzhang)
201939b add tweedie in vignettes (actuaryzhang)
6737122 add internal tweedie family (actuaryzhang)
0b5ed43 fix style (actuaryzhang)
b10777e fix doc (actuaryzhang)
7d5bd60 fix doc (actuaryzhang)
a9ac439 fix doc issue (actuaryzhang)
f540922 merge pull (actuaryzhang)
6cbc62f make twedie internal (no export) (actuaryzhang)
ef65adc fix test issue (actuaryzhang)
c11e57c add back variancePower and linkPower params (actuaryzhang)
5ce4c84 update vignettes (actuaryzhang)
aeeb3f7 fix style (actuaryzhang)
4cffc40 change names of tweedie parameters to be consistent with R (actuaryzhang)
0b496a6 update doc (actuaryzhang)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -114,19 +114,18 @@ setMethod("spark.glm", signature(data = "SparkDataFrame", formula = "formula"), |
| 114 | 114 | } |
| 115 | 115 | } |
| 116 | 116 | if (is.function(family)) { |
| 117 | | - # family = statmod::tweedie() |
| 118 | | - if (tolower(family$family) == "tweedie") { |
| 119 | | - family <- list(family = "tweedie", link = "linkNotUsed") |
| 120 | | - variancePower <- log(family$variance(exp(1))) |
| 121 | | - linkPower <- log(family$linkfun(exp(1))) |
| 122 | | - } else { |
| 123 | | - family <- family() |
| 124 | | - } |
| | 117 | + family <- family() |
| 125 | 118 | } |
| 126 | 119 | if (is.null(family$family)) { |
| 127 | 120 | print(family) |
| 128 | 121 | stop("'family' not recognized") |
| 129 | 122 | } |
| | 123 | + # family = statmod::tweedie() |
| | 124 | + if (tolower(family$family) == "tweedie" && !is.null(family$variance)) { |
| | 125 | + variancePower <- log(family$variance(exp(1))) |
| | 126 | + linkPower <- log(family$linkfun(exp(1))) |
| | 127 | + family <- list(family = "tweedie", link = "linkNotUsed") |
| | 128 | + } |
| 130 | 129 | |
| 131 | 130 | formula <- paste(deparse(formula), collapse = "") |
| 132 | 131 | if (!is.null(weightCol) && weightCol == "") { |
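One way to read the reordering in this hunk: while `family` is still a function, `family$family` cannot be evaluated, because R does not allow `$` on a closure. Calling the generator first and inspecting the result afterwards avoids that. The following is a minimal sketch in plain R, assuming a hypothetical helper `resolve_family` that is not part of SparkR and only mirrors the first two checks shown in the diff.

```r
# Hypothetical helper mirroring the reordered checks; not SparkR's actual code.
resolve_family <- function(family) {
  # A family generator such as stats::poisson is a function: call it first.
  if (is.function(family)) {
    family <- family()
  }
  # Anything without a $family field is rejected.
  if (is.null(family$family)) {
    stop("'family' not recognized")
  }
  family
}

# Subsetting a function with `$` raises an error, so a name check on
# family$family cannot run before the generator has been called.
subset_error <- tryCatch(
  stats::poisson$family,
  error = function(e) conditionMessage(e)
)

resolved_from_function <- resolve_family(stats::poisson)
resolved_from_object <- resolve_family(stats::Gamma(link = "log"))
```

Both a generator (`stats::poisson`) and an already constructed family object (`stats::Gamma(link = "log")`) end up as a list-like object with a `family` field, which is the shape the later tweedie check relies on.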
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -682,7 +682,9 @@ There are three ways to specify the `family` argument. |
| 682 | 682 | |
| 683 | 683 | * Result returned by a family function, e.g. `family = poisson(link = log)`. |
| 684 | 684 | |
| 685 | | - * Note that when package `statmod` is loaded, the tweedie family is specified as `tweedie(var.power, link.power)`. Otherwise, one can use the SparkR internal definition `SparkR:::tweedie(var.power, link.power)`. In the above, `var.power` is the power index of the variance function and `link.power` is the index of the the power link function (the default value is `link.power = 1.0 - var.power`). This is consistent with the `tweedie` family defined in the `statmod` package. Some examples: `family = tweedie(0.0)` is gaussian with identity link, `family = tweedie(1.0)` poisson with log link, `family = tweedie(2.0)` Gamma with inverse link, and `family = tweedie(1.5, 0.0)` compound Poisson with log link. |
| | 685 | + * Note that there are two ways to specify the tweedie family: |
| | 686 | + a) Set `family = "tweedie"` and specify the `variancePower` and `linkPower` |
| | 687 | + b) When package `statmod` is loaded, the tweedie family is specified using the family definition therein, i.e., `tweedie()`. |
| 686 | 688 | |
| 687 | 689 | For more information regarding the families and their link functions, see the Wikipedia page [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model). |
| 688 | 690 | |
| | | @@ -700,12 +702,13 @@ head(select(gaussianFitted, "model", "prediction", "mpg", "wt", "hp")) |
| 700 | 702 | |
| 701 | 703 | The following is the same fit using the tweedie family: |
| 702 | 704 | ```{r} |
| 703 | | - tweedieGLM1 <- spark.glm(carsDF, mpg ~ wt + hp, family = SparkR:::tweedie(0.0)) |
| | 705 | + tweedieGLM1 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie", variancePower = 0.0) |
| 704 | 706 | summary(tweedieGLM1) |
| 705 | 707 | ``` |
| 706 | 708 | We can try other distributions in the tweedie family, for example, a compound Poisson distribution with a log link: |
| 707 | 709 | ```{r} |
| 708 | | - tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = SparkR:::tweedie(1.2, 0.0)) |
| | 710 | + tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie", |
| | 711 | + variancePower = 1.2, linkPower = 0.0) |
| 709 | 712 | summary(tweedieGLM2) |
| 710 | 713 | ``` |
| 711 | 714 | |

Member (review comment on the `tweedieGLM2` example):
let's add an example with
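The vignette text in this diff states that the default link power is `1.0 - var.power` and lists a few special cases. As a rough illustration, the sketch below tabulates which link each default implies. It is plain R with no Spark session; `default_link_power` and `link_name` are hypothetical helper names introduced here, not SparkR or `statmod` functions.

```r
# Default link power stated in the vignette text: link.power = 1 - var.power.
default_link_power <- function(var.power) {
  1 - var.power
}

# Hypothetical helper: name the link implied by a given link power.
link_name <- function(link.power) {
  if (link.power == 1) {
    "identity"
  } else if (link.power == 0) {
    "log"
  } else if (link.power == -1) {
    "inverse"
  } else {
    paste0("mu^", link.power)
  }
}

# Special cases from the vignette: var.power 0 (gaussian), 1 (poisson), 2 (Gamma).
examples <- data.frame(var.power = c(0, 1, 2))
examples$link.power <- default_link_power(examples$var.power)
examples$link <- vapply(examples$link.power, link_name, character(1))

# An explicit link power of 0 selects the log link regardless of var.power,
# as in the compound Poisson example with var.power = 1.5.
compound_poisson_link <- link_name(0)
```

With the defaults, the three rows come out as identity, log, and inverse links, matching the gaussian, poisson, and Gamma cases listed in the vignette.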
i assume it handles the "fake" family created on L111 correctly? it doesn't have `variance`
This part only handles the case when `statmod::tweedie` is specified: it retrieves the `var.power` and `link.power` and constructs a list with the family name and link name to be used.

The check for non-null `variance` is to skip handling the "fake" family. All we need when specifying `family = "tweedie"` is just a list with family name and link name.
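To illustrate why evaluating the family's functions at `exp(1)` retrieves the two powers: if the variance function is `mu^var.power`, then `log(exp(1)^var.power)` is `var.power`, and the same holds for a power link, with the log link giving `log(log(exp(1))) = 0`. The sketch below uses `toy_tweedie`, a hypothetical stand-in that models only the `family`, `variance`, and `linkfun` fields; it is an assumption about the shape of a `statmod`-style family object, not the `statmod` implementation.

```r
# Hypothetical stand-in for a statmod-style tweedie family object.
# Only the fields used by the check in the diff are modelled here.
toy_tweedie <- function(var.power = 0, link.power = 1 - var.power) {
  linkfun <- if (link.power == 0) {
    function(mu) log(mu)           # link power 0 is treated as the log link
  } else {
    function(mu) mu^link.power     # otherwise a power link
  }
  list(
    family = "Tweedie",
    variance = function(mu) mu^var.power,
    linkfun = linkfun
  )
}

# Recover both exponents by evaluating the functions at e = exp(1).
extract_powers <- function(family) {
  list(
    variancePower = log(family$variance(exp(1))),
    linkPower = log(family$linkfun(exp(1)))
  )
}

# Same condition as in the diff: tweedie by name and carrying a variance function.
is_statmod_like <- function(f) {
  tolower(f$family) == "tweedie" && !is.null(f$variance)
}

fam <- toy_tweedie(1.5, 0)
powers <- extract_powers(fam)

# The "fake" family is a plain list without a variance function,
# so the non-null check skips it.
fake <- list(family = "tweedie", link = "linkNotUsed")
handled <- c(statmod_like = is_statmod_like(fam), fake = is_statmod_like(fake))
```

The recovered values are subject to floating-point rounding, so comparisons against the original powers are best done with a tolerance, for example via `all.equal`.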