
Commit 813a33f

Merge pull request apache#64 from palantir/rk/merge

Merge upstream

2 parents: 7edbd64 + 189d143

682 files changed

Lines changed: 20151 additions & 7182 deletions


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -57,6 +57,8 @@ project/plugins/project/build.properties
 project/plugins/src_managed/
 project/plugins/target/
 python/lib/pyspark.zip
+python/deps
+python/pyspark/python
 reports/
 scalastyle-on-compile.generated.xml
 scalastyle-output.xml

NOTICE

Lines changed: 0 additions & 3 deletions
@@ -421,9 +421,6 @@ Copyright (c) 2011, Terrence Parr.
 This product includes/uses ASM (http://asm.ow2.org/),
 Copyright (c) 2000-2007 INRIA, France Telecom.
 
-This product includes/uses org.json (http://www.json.org/java/index.html),
-Copyright (c) 2002 JSON.org
-
 This product includes/uses JLine (http://jline.sourceforge.net/),
 Copyright (c) 2002-2006, Marc Prud'hommeaux <[email protected]>.
 

R/CRAN_RELEASE.md

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
+# SparkR CRAN Release
+
+To release SparkR as a package to CRAN, we would use the `devtools` package. Please work with the
+`[email protected]` community and the R package maintainer on this.
+
+### Release
+
+First, check that the `Version:` field in the `pkg/DESCRIPTION` file is updated. Also, check for stale files not under source control.
+
+Note that while `check-cran.sh` is running `R CMD check`, it is doing so with `--no-manual --no-vignettes`, which skips a few vignette and PDF checks; therefore it is preferable to run `R CMD check` on the source package built manually before uploading a release.
+
+To upload a release, we would need to update `cran-comments.md`. This should generally contain the results from running the `check-cran.sh` script along with comments on the status of any `WARNING` (there should not be any) or `NOTE` entries. As a part of `check-cran.sh` and the release process, the vignettes are built; make sure `SPARK_HOME` is set and the Spark jars are accessible.
+
+Once everything is in place, run in R under the `SPARK_HOME/R` directory:
+
+```R
+paths <- .libPaths(); .libPaths(c("lib", paths)); Sys.setenv(SPARK_HOME=tools::file_path_as_absolute("..")); devtools::release(); .libPaths(paths)
+```
+
+For more information please refer to http://r-pkgs.had.co.nz/release.html#release-check
+
+### Testing: build package manually
+
+To build the package manually, for example to inspect the resulting `.tar.gz` file content, we would also use the `devtools` package.
+
+The source package is what gets released to CRAN. CRAN would then build platform-specific binary packages from the source package.
+
+#### Build source package
+
+To build the source package locally without releasing to CRAN, run in R under the `SPARK_HOME/R` directory:
+
+```R
+paths <- .libPaths(); .libPaths(c("lib", paths)); Sys.setenv(SPARK_HOME=tools::file_path_as_absolute("..")); devtools::build("pkg"); .libPaths(paths)
+```
+
+(http://r-pkgs.had.co.nz/vignettes.html#vignette-workflow-2)
+
+Similarly, the source package is also created by `check-cran.sh` with `R CMD build pkg`.
+
+For example, this should be the content of the source package:
+
+```sh
+DESCRIPTION  R      inst   tests
+NAMESPACE    build  man    vignettes
+
+inst/doc/
+  sparkr-vignettes.html
+  sparkr-vignettes.Rmd
+  sparkr-vignettes.Rman
+
+build/
+  vignette.rds
+
+man/
+  *.Rd files...
+
+vignettes/
+  sparkr-vignettes.Rmd
+```
+
+#### Test source package
+
+To install, run this:
+
+```sh
+R CMD INSTALL SparkR_2.1.0.tar.gz
+```
+
+With "2.1.0" replaced by the version of SparkR being released.
+
+This command installs SparkR to the default libPaths. Once that is done, you should be able to start R and run:
+
+```R
+library(SparkR)
+vignette("sparkr-vignettes", package="SparkR")
+```
+
+#### Build binary package
+
+To build the binary package locally, run in R under the `SPARK_HOME/R` directory:
+
+```R
+paths <- .libPaths(); .libPaths(c("lib", paths)); Sys.setenv(SPARK_HOME=tools::file_path_as_absolute("..")); devtools::build("pkg", binary = TRUE); .libPaths(paths)
+```
+
+For example, this should be the content of the binary package:
+
+```sh
+DESCRIPTION  Meta       R     html     tests
+INDEX        NAMESPACE  help  profile  worker
+```
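
The release and build one-liners above compress several steps into a single line. As a reading aid, here is the same release snippet expanded with comments; this is a restatement of the code above, not an additional step:

```R
paths <- .libPaths()        # remember the current library paths
.libPaths(c("lib", paths))  # prepend the freshly built SparkR under SPARK_HOME/R/lib
# point SPARK_HOME at the Spark checkout (the parent of the R/ directory)
Sys.setenv(SPARK_HOME = tools::file_path_as_absolute(".."))
devtools::release()         # run checks and walk through the CRAN submission
.libPaths(paths)            # restore the original library paths
```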

R/README.md

Lines changed: 4 additions & 4 deletions
@@ -6,7 +6,7 @@ SparkR is an R package that provides a light-weight frontend to use Spark from R
 
 Libraries of SparkR need to be created in `$SPARK_HOME/R/lib`. This can be done by running the script `$SPARK_HOME/R/install-dev.sh`.
 By default the above script uses the system-wide installation of R. However, this can be changed to any user-installed location of R by setting the environment variable `R_HOME` to the full path of the base directory where R is installed, before running the install-dev.sh script.
-Example: 
+Example:
 ```bash
 # where /home/username/R is where R is installed and /home/username/R/bin contains the files R and Rscript
 export R_HOME=/home/username/R
@@ -46,19 +46,19 @@ Sys.setenv(SPARK_HOME="/Users/username/spark")
 # This line loads SparkR from the installed directory
 .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
 library(SparkR)
-sc <- sparkR.init(master="local")
+sparkR.session()
 ```
 
 #### Making changes to SparkR
 
 The [instructions](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) for making contributions to Spark also apply to SparkR.
 If you only make R file changes (i.e. no Scala changes) then you can just re-install the R package using `R/install-dev.sh` and test your changes.
 Once you have made your changes, please include unit tests for them and run existing unit tests using the `R/run-tests.sh` script as described below.
-
+
 #### Generating documentation
 
 The SparkR documentation (Rd files and HTML files) is not a part of the source repository. To generate it you can run the script `R/create-docs.sh`. This script uses `devtools` and `knitr` to generate the docs, and these packages need to be installed on the machine before using the script. Also, you may need to install these [prerequisites](https://github.com/apache/spark/tree/master/docs#prerequisites). See also `R/DOCUMENTATION.md`.
-
+
 ### Examples, Unit tests
 
 SparkR comes with several sample programs in the `examples/src/main/r` directory.

R/check-cran.sh

Lines changed: 27 additions & 6 deletions
@@ -36,11 +36,27 @@ if [ ! -z "$R_HOME" ]
 fi
 echo "USING R_HOME = $R_HOME"
 
-# Build the latest docs
+# Build the latest docs, but not the vignettes, which are built with the package next
 $FWDIR/create-docs.sh
 
-# Build a zip file containing the source package
-"$R_SCRIPT_PATH/"R CMD build $FWDIR/pkg
+# Build the source package with vignettes
+SPARK_HOME="$(cd "${FWDIR}"/..; pwd)"
+. "${SPARK_HOME}"/bin/load-spark-env.sh
+if [ -f "${SPARK_HOME}/RELEASE" ]; then
+  SPARK_JARS_DIR="${SPARK_HOME}/jars"
+else
+  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
+fi
+
+if [ -d "$SPARK_JARS_DIR" ]; then
+  # Build a zip file containing the source package with vignettes
+  SPARK_HOME="${SPARK_HOME}" "$R_SCRIPT_PATH/"R CMD build $FWDIR/pkg
+
+  find pkg/vignettes/. -not -name '.' -not -name '*.Rmd' -not -name '*.md' -not -name '*.pdf' -not -name '*.html' -delete
+else
+  echo "Error: Spark JARs not found in $SPARK_HOME"
+  exit 1
+fi
 
 # Run check as-cran.
 VERSION=`grep Version $FWDIR/pkg/DESCRIPTION | awk '{print $NF}'`
@@ -54,11 +70,16 @@ fi
 
 if [ -n "$NO_MANUAL" ]
 then
-  CRAN_CHECK_OPTIONS=$CRAN_CHECK_OPTIONS" --no-manual"
+  CRAN_CHECK_OPTIONS=$CRAN_CHECK_OPTIONS" --no-manual --no-vignettes"
 fi
 
 echo "Running CRAN check with $CRAN_CHECK_OPTIONS options"
 
-"$R_SCRIPT_PATH/"R CMD check $CRAN_CHECK_OPTIONS SparkR_"$VERSION".tar.gz
-
+if [ -n "$NO_TESTS" ] && [ -n "$NO_MANUAL" ]
+then
+  "$R_SCRIPT_PATH/"R CMD check $CRAN_CHECK_OPTIONS SparkR_"$VERSION".tar.gz
+else
+  # This will run tests and/or build vignettes, and requires SPARK_HOME
+  SPARK_HOME="${SPARK_HOME}" "$R_SCRIPT_PATH/"R CMD check $CRAN_CHECK_OPTIONS SparkR_"$VERSION".tar.gz
+fi
 popd > /dev/null
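
For reference, the script's behavior is steered by the environment variables tested above (`NO_TESTS`, `NO_MANUAL`); any non-empty value enables them. A sketch of an invocation that skips both tests and manual/vignette building, and therefore does not need `SPARK_HOME` or the Spark JARs:

```sh
# Hypothetical invocation from the repository root; "1" is an arbitrary
# non-empty value, which is all the script's [ -n ... ] tests require.
NO_TESTS=1 NO_MANUAL=1 ./R/check-cran.sh
```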

R/create-docs.sh

Lines changed: 1 addition & 18 deletions
@@ -20,7 +20,7 @@
 # Script to create API docs and vignettes for SparkR
 # This requires `devtools`, `knitr` and `rmarkdown` to be installed on the machine.
 
-# After running this script the html docs can be found in 
+# After running this script the html docs can be found in
 # $SPARK_HOME/R/pkg/html
 # The vignettes can be found in
 # $SPARK_HOME/R/pkg/vignettes/sparkr_vignettes.html
@@ -52,21 +52,4 @@ Rscript -e 'libDir <- "../../lib"; library(SparkR, lib.loc=libDir); library(knit
 
 popd
 
-# Find Spark jars.
-if [ -f "${SPARK_HOME}/RELEASE" ]; then
-  SPARK_JARS_DIR="${SPARK_HOME}/jars"
-else
-  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
-fi
-
-# Only create vignettes if Spark JARs exist
-if [ -d "$SPARK_JARS_DIR" ]; then
-  # render creates SparkR vignettes
-  Rscript -e 'library(rmarkdown); paths <- .libPaths(); .libPaths(c("lib", paths)); Sys.setenv(SPARK_HOME=tools::file_path_as_absolute("..")); render("pkg/vignettes/sparkr-vignettes.Rmd"); .libPaths(paths)'
-
-  find pkg/vignettes/. -not -name '.' -not -name '*.Rmd' -not -name '*.md' -not -name '*.pdf' -not -name '*.html' -delete
-else
-  echo "Skipping R vignettes as Spark JARs not found in $SPARK_HOME"
-fi
-
 popd

R/pkg/DESCRIPTION

Lines changed: 6 additions & 3 deletions
@@ -1,8 +1,8 @@
 Package: SparkR
 Type: Package
 Title: R Frontend for Apache Spark
-Version: 2.0.0
-Date: 2016-08-27
+Version: 2.1.0
+Date: 2016-11-06
 Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),
                     email = "[email protected]"),
              person("Xiangrui", "Meng", role = "aut",
@@ -18,7 +18,9 @@ Depends:
 Suggests:
     testthat,
     e1071,
-    survival
+    survival,
+    knitr,
+    rmarkdown
 Description: The SparkR package provides an R frontend for Apache Spark.
 License: Apache License (== 2.0)
 Collate:
@@ -48,3 +50,4 @@ Collate:
     'utils.R'
     'window.R'
 RoxygenNote: 5.0.1
+VignetteBuilder: knitr

R/pkg/NAMESPACE

Lines changed: 7 additions & 2 deletions
@@ -45,7 +45,8 @@ exportMethods("glm",
               "spark.als",
               "spark.kstest",
               "spark.logit",
-              "spark.randomForest")
+              "spark.randomForest",
+              "spark.gbt")
 
 # Job group lifecycle management methods
 export("setJobGroup",
@@ -353,7 +354,9 @@ export("as.DataFrame",
        "read.ml",
        "print.summary.KSTest",
        "print.summary.RandomForestRegressionModel",
-       "print.summary.RandomForestClassificationModel")
+       "print.summary.RandomForestClassificationModel",
+       "print.summary.GBTRegressionModel",
+       "print.summary.GBTClassificationModel")
 
 export("structField",
        "structField.jobj",
@@ -380,6 +383,8 @@ S3method(print, summary.GeneralizedLinearRegressionModel)
 S3method(print, summary.KSTest)
 S3method(print, summary.RandomForestRegressionModel)
 S3method(print, summary.RandomForestClassificationModel)
+S3method(print, summary.GBTRegressionModel)
+S3method(print, summary.GBTClassificationModel)
 S3method(structField, character)
 S3method(structField, jobj)
 S3method(structType, jobj)
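
The newly exported `spark.gbt` method and its `print.summary.GBT*` helpers follow the pattern of the other `spark.*` ML wrappers in this file. A minimal usage sketch, assuming the signature mirrors `spark.randomForest` (formula interface plus a `type` argument); the dataset and column names are illustrative only:

```R
library(SparkR)
sparkR.session()

# createDataFrame replaces "." in column names with "_", hence Sepal_Length
df <- createDataFrame(iris)

# Fit a gradient-boosted tree regression model and print its summary, which
# dispatches to the newly registered summary.GBTRegressionModel print method
model <- spark.gbt(df, Sepal_Length ~ Sepal_Width + Petal_Length, type = "regression")
summary(model)
```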

R/pkg/R/DataFrame.R

Lines changed: 8 additions & 6 deletions
@@ -788,7 +788,7 @@ setMethod("write.json",
           function(x, path, mode = "error", ...) {
             write <- callJMethod(x@sdf, "write")
             write <- setWriteOptions(write, mode = mode, ...)
-            invisible(callJMethod(write, "json", path))
+            invisible(handledCallJMethod(write, "json", path))
           })
 
 #' Save the contents of SparkDataFrame as an ORC file, preserving the schema.
@@ -819,7 +819,7 @@ setMethod("write.orc",
           function(x, path, mode = "error", ...) {
             write <- callJMethod(x@sdf, "write")
             write <- setWriteOptions(write, mode = mode, ...)
-            invisible(callJMethod(write, "orc", path))
+            invisible(handledCallJMethod(write, "orc", path))
           })
 
 #' Save the contents of SparkDataFrame as a Parquet file, preserving the schema.
@@ -851,7 +851,7 @@ setMethod("write.parquet",
           function(x, path, mode = "error", ...) {
             write <- callJMethod(x@sdf, "write")
             write <- setWriteOptions(write, mode = mode, ...)
-            invisible(callJMethod(write, "parquet", path))
+            invisible(handledCallJMethod(write, "parquet", path))
           })
 
 #' @rdname write.parquet
@@ -895,7 +895,7 @@ setMethod("write.text",
           function(x, path, mode = "error", ...) {
             write <- callJMethod(x@sdf, "write")
             write <- setWriteOptions(write, mode = mode, ...)
-            invisible(callJMethod(write, "text", path))
+            invisible(handledCallJMethod(write, "text", path))
           })
 
 #' Distinct
@@ -936,7 +936,9 @@ setMethod("unique",
 
 #' Sample
 #'
-#' Return a sampled subset of this SparkDataFrame using a random seed. 
+#' Return a sampled subset of this SparkDataFrame using a random seed.
+#' Note: this is not guaranteed to provide exactly the fraction specified
+#' of the total count of the given SparkDataFrame.
 #'
 #' @param x A SparkDataFrame
 #' @param withReplacement Sampling with replacement or not
@@ -3342,7 +3344,7 @@ setMethod("write.jdbc",
             jprops <- varargsToJProperties(...)
             write <- callJMethod(x@sdf, "write")
             write <- callJMethod(write, "mode", jmode)
-            invisible(callJMethod(write, "jdbc", url, tableName, jprops))
+            invisible(handledCallJMethod(write, "jdbc", url, tableName, jprops))
           })
 
 #' randomSplit
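
The `handledCallJMethod` wrapper that replaces `callJMethod` throughout these writers is defined elsewhere in the package and is not part of this diff. Conceptually it is `callJMethod` with JVM exceptions translated into readable R errors; a rough sketch of the idea, under that assumption (the real helper's name and behavior may differ in detail):

```R
# Sketch only: invoke a JVM method and surface Java exceptions as concise
# R errors instead of raw JVM stack traces.
handledCallJMethodSketch <- function(obj, method, ...) {
  tryCatch(
    callJMethod(obj, method, ...),
    error = function(e) {
      # keep only the first line of the (often very long) JVM message
      msg <- strsplit(conditionMessage(e), "\n", fixed = TRUE)[[1]][1]
      stop(paste0("error invoking '", method, "': ", msg), call. = FALSE)
    })
}
```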

R/pkg/R/SQLContext.R

Lines changed: 9 additions & 8 deletions
@@ -350,7 +350,7 @@ read.json.default <- function(path, ...) {
   paths <- as.list(suppressWarnings(normalizePath(path)))
   read <- callJMethod(sparkSession, "read")
   read <- callJMethod(read, "options", options)
-  sdf <- callJMethod(read, "json", paths)
+  sdf <- handledCallJMethod(read, "json", paths)
   dataFrame(sdf)
 }
 
@@ -422,7 +422,7 @@ read.orc <- function(path, ...) {
   path <- suppressWarnings(normalizePath(path))
   read <- callJMethod(sparkSession, "read")
   read <- callJMethod(read, "options", options)
-  sdf <- callJMethod(read, "orc", path)
+  sdf <- handledCallJMethod(read, "orc", path)
   dataFrame(sdf)
 }
 
@@ -444,7 +444,7 @@ read.parquet.default <- function(path, ...) {
   paths <- as.list(suppressWarnings(normalizePath(path)))
   read <- callJMethod(sparkSession, "read")
   read <- callJMethod(read, "options", options)
-  sdf <- callJMethod(read, "parquet", paths)
+  sdf <- handledCallJMethod(read, "parquet", paths)
   dataFrame(sdf)
 }
 
@@ -496,7 +496,7 @@ read.text.default <- function(path, ...) {
   paths <- as.list(suppressWarnings(normalizePath(path)))
   read <- callJMethod(sparkSession, "read")
   read <- callJMethod(read, "options", options)
-  sdf <- callJMethod(read, "text", paths)
+  sdf <- handledCallJMethod(read, "text", paths)
   dataFrame(sdf)
 }
 
@@ -914,12 +914,13 @@ read.jdbc <- function(url, tableName,
     } else {
       numPartitions <- numToInt(numPartitions)
     }
-    sdf <- callJMethod(read, "jdbc", url, tableName, as.character(partitionColumn),
-                       numToInt(lowerBound), numToInt(upperBound), numPartitions, jprops)
+    sdf <- handledCallJMethod(read, "jdbc", url, tableName, as.character(partitionColumn),
+                              numToInt(lowerBound), numToInt(upperBound), numPartitions, jprops)
   } else if (length(predicates) > 0) {
-    sdf <- callJMethod(read, "jdbc", url, tableName, as.list(as.character(predicates)), jprops)
+    sdf <- handledCallJMethod(read, "jdbc", url, tableName, as.list(as.character(predicates)),
+                              jprops)
   } else {
-    sdf <- callJMethod(read, "jdbc", url, tableName, jprops)
+    sdf <- handledCallJMethod(read, "jdbc", url, tableName, jprops)
   }
   dataFrame(sdf)
 }