Closed
Changes from 8 commits
49 commits
66cfb6c
add install_spark
junyangq Jul 19, 2016
9d52d19
add doc for install_spark
junyangq Jul 19, 2016
89efb04
changes to conform to R code style
junyangq Jul 19, 2016
7ba5213
add install into sparkR.session if spark jar is not found
junyangq Jul 19, 2016
6203223
message when SPARK_HOME is non-empty
junyangq Jul 19, 2016
98087ad
change options of spark mirror url
junyangq Jul 21, 2016
0db89b7
minor changes
junyangq Jul 21, 2016
503cb9f
fix R style issue: don't use absolute paths
junyangq Jul 21, 2016
f4522a6
remove spark version function, and use return value of packageVersion…
junyangq Jul 21, 2016
78d6f91
remove unnecessary dir create
junyangq Jul 21, 2016
9666e06
fix issue that dir.exists not available before 3.2.0
junyangq Jul 25, 2016
e4fe002
another dir.exists
junyangq Jul 26, 2016
c105d88
temporary change of test
junyangq Jul 26, 2016
d19853a
recover test changes
junyangq Jul 26, 2016
124110a
minor fix
junyangq Jul 26, 2016
03b8320
fix docs
junyangq Jul 26, 2016
d727be8
delete csv file option
junyangq Jul 26, 2016
ab3789f
rename functions of default mirror and check local master
junyangq Jul 26, 2016
626e4a1
fix mustWork issue in normalizePath, and create cache folder if not f…
junyangq Jul 26, 2016
785de93
output OS name if not matched
junyangq Jul 26, 2016
e3fa259
more specific message and reference for spark_cache_path
junyangq Jul 26, 2016
cf0f66d
more precise message under free build
junyangq Jul 26, 2016
4f4a899
fix typo on hadoop version message
junyangq Jul 26, 2016
14e4943
remove the setting of global spark.install.dir option
junyangq Jul 26, 2016
976472f
add overwrite option and adjust layout (#chars per line) of doc
junyangq Jul 26, 2016
8821b56
remove hardcoded hadoop versions
junyangq Jul 26, 2016
328408b
edit doc of install.spark
junyangq Jul 26, 2016
0091615
check installation of spark only when using ordinary R console
junyangq Jul 26, 2016
dbabb56
minor fix of overwrite message
junyangq Jul 26, 2016
6b2a897
minor change of messages
junyangq Jul 27, 2016
907f37d
set env var SPARK_HOME after installation
junyangq Jul 27, 2016
fa94e3c
minor change of doc
junyangq Jul 27, 2016
22f2f78
delete trailing space
junyangq Jul 27, 2016
7aa3239
celebrate 2.0 release
junyangq Jul 27, 2016
3ad99bc
add explanation of remote url structure, change default hadoop version
junyangq Jul 27, 2016
e421c30
change default mirror
junyangq Jul 27, 2016
aa4ba4d
add additional a step of searching for suggested url and only use def…
junyangq Jul 28, 2016
2bb00e1
separate sub functions
junyangq Jul 28, 2016
699420d
fix typos in fun defs
junyangq Jul 28, 2016
d58e080
fix mirror path and typo
junyangq Jul 28, 2016
82d24a6
fix message
junyangq Jul 28, 2016
0ebef8a
fix message
junyangq Jul 28, 2016
26d4518
fix typo
junyangq Jul 28, 2016
f37a07c
message update
junyangq Jul 28, 2016
64756de
concise message, improve doc for windows, fix regex match
junyangq Jul 28, 2016
29bdf30
Disable (temporarily) some test cases of describe and summary functions
junyangq Jul 29, 2016
5decac6
Send message of reset SPARK_HOME in installation
junyangq Aug 9, 2016
d84ba06
Specify path in jsonUrl and add alias to install function doc
junyangq Aug 9, 2016
3aeb4eb
remove comment
junyangq Aug 9, 2016
3 changes: 2 additions & 1 deletion R/pkg/DESCRIPTION
@@ -7,7 +7,7 @@ Author: The Apache Software Foundation
Maintainer: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Depends:
    R (>= 3.0),
    methods,
    methods
Suggests:
    testthat,
    e1071,
@@ -31,6 +31,7 @@ Collate:
    'context.R'
    'deserialize.R'
    'functions.R'
    'install.R'
    'mllib.R'
    'serialize.R'
    'sparkR.R'
2 changes: 2 additions & 0 deletions R/pkg/NAMESPACE
@@ -352,3 +352,5 @@ S3method(structField, character)
S3method(structField, jobj)
S3method(structType, jobj)
S3method(structType, structField)

export("install_spark")
160 changes: 160 additions & 0 deletions R/pkg/R/install.R
@@ -0,0 +1,160 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Functions to install Spark in case the user directly downloads SparkR
# from CRAN.

#' Download and Install Spark Core to Local Directory
Contributor:
  • Spark Core -> Apache Spark. This downloads the full distribution.
  • Local Directory -> a Local Directory

#'
#' \code{install_spark} downloads and installs Spark to local directory if
Contributor:
Is install_spark consistent with the SparkR naming convention? We only used underscores in wrapped SQL functions. Shall we use install.spark? cc: @shivaram

@junyangq (Author, Jul 26, 2016):
It's interesting to see two opposite opinions about object names in R. The Advanced R style guide encourages the underscore (_), while Google's R style guide says the underscore should be avoided and the dot (.) is preferred.

@junyangq (Author):
For the sake of consistency, perhaps we can use install.spark?

Contributor:
This is just a style choice, and we don't need to fight between "_" and ".". Our goal is to have consistent names in SparkR, which means following existing names.

#' it is not found. The Spark version we use is 2.0.0 (preview). Users can
Contributor:
  • 2.0.0 (preview) => The Spark version is the same as the SparkR version. (Otherwise we need to update this doc for each Spark release.)

#' specify a desired Hadoop version, the remote site, and the directory where
#' the package is installed locally.
#'
#' @param hadoop_version Version of Hadoop to install, 2.4, 2.6,
Contributor:
  • hadoop_version => hadoopVersion (camelCase for params)

#' 2.7 (default) and without
Contributor:
  • I would put without as the default, to match the default download on http://spark.apache.org/downloads.html.
  • Do not mention other Hadoop versions, as they might change. Use a @seealso link to redirect users to the download page for available Hadoop versions.
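A sketch of the suggested doc change (hypothetical wording; the parameter rename and the @seealso link follow the two review comments above):

#' @param hadoopVersion version of Hadoop to install. Default is "without",
#'                      matching the default download on the Spark download page.
#' @seealso \href{http://spark.apache.org/downloads.html}{Apache Spark download page}
#'          for the Hadoop versions that are currently available.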

#' @param mirror_url the base URL of the repositories to use
Contributor:
  • mirrorUrl
  • mention the directory layout should follow Apache mirrors (http://www.apache.org/dyn/closer.lua/spark/)
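For illustration, a sketch of how a mirrorUrl composes with the Apache mirror layout (version and package name here are examples, not the PR's exact values):

# Apache mirror layout: <mirrorUrl>/spark/<version>/<packageName>.tgz
mirrorUrl <- "http://www.apache.org/dyn/closer.lua"
packageRemotePath <- paste0(
  file.path(mirrorUrl, "spark", "spark-2.0.0", "spark-2.0.0-bin-hadoop2.7"),
  ".tgz")
# -> http://www.apache.org/dyn/closer.lua/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz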

#' @param local_dir local directory that Spark is installed to
Contributor:
  • localDir (cacheDir might be more accurate)
  • document default values for each OS

@junyangq (Author):
Could cacheDir be a little misleading if the user has already installed Spark, and in a non-cache directory?

#' @return \code{install_spark} returns the local directory
#' where Spark is found or installed
#' @rdname install_spark
#' @name install_spark
#' @export
#' @examples
#'\dontrun{
#' install_spark()
#'}
#' @note install_spark since 2.1.0
install_spark <- function(hadoop_version = NULL, mirror_url = NULL,
                          local_dir = NULL) {
  version <- paste0("spark-", spark_version_default())
@sun-rui (Contributor, Jul 21, 2016):
No need to create a function for the Spark version; just use packageVersion("SparkR"). The SparkR version should map 1:1 to the Spark version.

@junyangq (Author):
Sounds good. Thanks :)
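A one-line sketch of the suggested change, assuming the SparkR package version maps 1:1 to the Spark version:

version <- paste0("spark-", packageVersion("SparkR"))  # e.g. "spark-2.0.0"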

  hadoop_version <- match.arg(hadoop_version, supported_versions_hadoop())
Contributor:
See my comment below on supported_versions_hadoop.

  packageName <- ifelse(hadoop_version == "without",
                        paste0(version, "-bin-without-hadoop"),
                        paste0(version, "-bin-hadoop", hadoop_version))
  if (is.null(local_dir)) {
    local_dir <- getOption("spark.install.dir", spark_cache_path())
Contributor:
Where is spark.install.dir documented? If this is only used by SparkR, we should have a more specific config name.

@junyangq (Author):
On second thought, I guess we don't need to set a global option, since installation is just a one-time thing?

  } else {
    local_dir <- normalizePath(local_dir)
  }

  packageLocalDir <- file.path(local_dir, packageName)

  if (dir.exists(packageLocalDir)) {
    fmt <- "Spark %s for Hadoop %s has been installed."
Contributor:
There should be an option to force re-install Spark in case the local directory or file is corrupted. And please mention that in the message.
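A minimal sketch of such a guard; an overwrite argument was later added in commit 976472f, though the wording here is illustrative:

if (dir.exists(packageLocalDir) && !overwrite) {
  fmt <- paste0("%s for Hadoop %s is already installed in %s.\n",
                "Use overwrite = TRUE to force re-installation.")
  message(sprintf(fmt, version, hadoop_version, packageLocalDir))
  return(invisible(packageLocalDir))
}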

    msg <- sprintf(fmt, version, hadoop_version)
    message(msg)
    return(invisible(packageLocalDir))
  }

  packageLocalPath <- paste0(packageLocalDir, ".tgz")
  tarExists <- file.exists(packageLocalPath)

  if (tarExists) {
    message("Tar file found. Installing...")
  } else {
    dir.create(packageLocalDir, recursive = TRUE)
    if (is.null(mirror_url)) {
      message("Remote URL not provided. Use Apache default.")
      mirror_url <- mirror_url_default()
    }
    # This is temporary, should be removed when released
    version <- "spark-releases/spark-2.0.0-rc4-bin"
Contributor:
Why is this version different from the one above?

@junyangq (Author, Jul 21, 2016):
We did this temporarily only because the current sparkR.session() uses some methods that are not in the 2.0 preview version available from the Apache website, so we download the rc4 version instead. However, the corresponding URL has a different pattern, which forces us to hardcode part of the URL.

    packageRemotePath <- paste0(file.path(mirror_url, version, packageName),
                                ".tgz")
    fmt <- paste("Installing Spark %s for Hadoop %s.",
                 "Downloading from:\n- %s",
                 "Installing to:\n- %s", sep = "\n")
    msg <- sprintf(fmt, version, hadoop_version, packageRemotePath,
                   packageLocalDir)
    message(msg)

    fetchFail <- tryCatch(download.file(packageRemotePath, packageLocalPath),
                          error = function(e) {
                            msg <- paste0("Fetch failed from ", mirror_url, ".")
                            message(msg)
                            TRUE
                          })
    if (fetchFail) {
      message("Try the backup option.")
      mirror_sites <- tryCatch(read.csv(mirror_url_csv()),
Contributor:
Let's skip the CSV solution in this PR and use Apache mirrors as the default.

                               error = function(e) stop("No csv file found."))
      mirror_url <- mirror_sites$url[1]
      packageRemotePath <- paste0(file.path(mirror_url, version, packageName),
                                  ".tgz")
      message(sprintf("Downloading from:\n- %s", packageRemotePath))
      tryCatch(download.file(packageRemotePath, packageLocalPath),
               error = function(e) {
                 stop("Download failed. Please provide a valid mirror_url.")
               })
    }
  }

  untar(tarfile = packageLocalPath, exdir = local_dir)
  if (!tarExists) {
    unlink(packageLocalPath)
  }
  message("Installation done.")
  invisible(packageLocalDir)
}

mirror_url_default <- function() {
Contributor:
default_mirror_url

  # change to http://www.apache.org/dyn/closer.lua
  # when released
  "http://people.apache.org/~pwendell"
}

supported_versions_hadoop <- function() {
  c("2.7", "2.6", "2.4", "without")
Contributor:
I don't think we should hardcode Hadoop versions; they might change in the next release, and people might use patched version numbers for their own Hadoop releases, e.g. CDH. Shall we use "without" as the default and throw an error if the requested Hadoop version is not available for download?

}

spark_cache_path <- function() {
Contributor:
Are there references for the implementation here?

@junyangq (Author):
Yes, these actually follow the implementation of rappdirs/R/cache.r. I should add a reference here.

  if (.Platform$OS.type == "windows") {
    winAppPath <- Sys.getenv("%LOCALAPPDATA%", unset = NA)
    if (is.null(winAppPath)) {
      msg <- paste("%LOCALAPPDATA% not found.",
                   "Please define or enter an installation path in loc_dir.")
Contributor:
Where would the user enter the path?

@junyangq (Author):
Perhaps give a specific message telling the user to restart?
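As an aside, a sketch of the Windows lookup with the unset case handled explicitly. Note that Sys.getenv() takes the variable name without percent signs and returns NA (not NULL) when unset = NA, so the is.null() check above can never fire:

winAppPath <- Sys.getenv("LOCALAPPDATA", unset = NA)
if (is.na(winAppPath)) {
  stop("LOCALAPPDATA not found. Please define the environment variable ",
       "or provide an installation path in local_dir.")
}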

      stop(msg)
    } else {
      path <- file.path(winAppPath, "spark", "spark", "Cache")
    }
  } else if (.Platform$OS.type == "unix") {
    if (Sys.info()["sysname"] == "Darwin") {
      path <- file.path(Sys.getenv("HOME"), "Library/Caches", "spark")
Contributor:
Need to check whether the folder exists and create it if it does not. I got this error on the first run:

Error in normalizePath(path, mustWork = TRUE) : 
  path[1]="/Users/meng/Library/Caches/spark": No such file or directory

@junyangq (Author):
I changed it to mustWork = FALSE. The check and create operations come later :)
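A sketch of the check-then-create pattern described above, with mustWork = FALSE so the path can be normalized before it exists (dir.exists() requires R >= 3.2.0; see commit 9666e06):

path <- normalizePath(path, mustWork = FALSE)
if (!dir.exists(path)) {
  dir.create(path, recursive = TRUE)
}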

    } else {
      path <- file.path(
        Sys.getenv("XDG_CACHE_HOME", file.path(Sys.getenv("HOME"), ".cache")),
        "spark")
    }
  } else {
    stop("Unknown OS")
Contributor:
Include the OS name in the error message.
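A one-line sketch of the suggested message:

stop(sprintf("Unknown OS: %s", .Platform$OS.type))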

  }
  normalizePath(path, mustWork = TRUE)
}

mirror_url_csv <- function() {
  system.file("extdata", "spark_download.csv", package = "SparkR")
}

spark_version_default <- function() {
  "2.0.0"
}

hadoop_version_default <- function() {
  "2.7"
}
15 changes: 15 additions & 0 deletions R/pkg/R/sparkR.R
@@ -365,6 +365,21 @@ sparkR.session <- function(
    }
    overrideEnvs(sparkConfigMap, paramMap)
  }
  if (!nzchar(master) || master_is_local(master)) {
    if (!dir.exists(sparkHome)) {
      fmt <- paste0("Spark not found in SPARK_HOME: %s.\n",
                    "Search in the cache directory. ",
                    "It will be installed if not found.")
      msg <- sprintf(fmt, sparkHome)
      message(msg)
      packageLocalDir <- install_spark()
      sparkHome <- packageLocalDir
    } else {
      fmt <- "Make sure that Spark is installed in SPARK_HOME: %s"
      msg <- sprintf(fmt, sparkHome)
      message(msg)
    }
  }

  if (!exists(".sparkRjsc", envir = .sparkREnv)) {
    sparkExecutorEnvMap <- new.env()
4 changes: 4 additions & 0 deletions R/pkg/R/utils.R
@@ -689,3 +689,7 @@ getSparkContext <- function() {
  sc <- get(".sparkRjsc", envir = .sparkREnv)
  sc
}

master_is_local <- function(master) {
Contributor:
is_master_local

  grepl("^local(\\[[0-9\\*]*\\])?$", master, perl = TRUE)
Contributor:
"^local(\\[[0-9\\*]+\\])?$" (+ instead of *)

@felixcheung (Member, Jul 26, 2016):
Maybe ^local(\\[([0-9]+|\\*?)\\])?$, so it won't match local[3*]?
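To illustrate, @felixcheung's tightened pattern behaves as follows (a quick check, runnable as-is):

pattern <- "^local(\\[([0-9]+|\\*?)\\])?$"
grepl(pattern, c("local", "local[2]", "local[*]", "local[3*]"), perl = TRUE)
# [1]  TRUE  TRUE  TRUE FALSE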

}
2 changes: 2 additions & 0 deletions R/pkg/inst/extdata/spark_download.csv
@@ -0,0 +1,2 @@
"url","default"
"http://apache.osuosl.org",TRUE
Member:
I wouldn't recommend hardcoding or relying on particular mirrors; Apache mirrors are somewhat dynamic, so it's better to pull from
http://www.apache.org/dyn/closer.cgi?as_json=1
like
http://www.apache.org/dyn/closer.lua/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz

@junyangq (Author):
Combined with the previous comment: it turns out this would be similar to what sparklyr did. They provide a remote table, as well as one shipped with the package, that lists supported versions and their download addresses.

Contributor:
I don't see a strong reason for a CSV file listing available mirrors: we can support Apache dynamic mirrors, and install_spark() has a parameter allowing the user to pass a vector of mirror sites.

@junyangq (Author):
Yeah, using the Apache dynamic mirror is great as long as the address does not change. The reason I was thinking of maintaining a (remote) CSV file is that we would be able to apply any change of mirror sites without touching the R package...
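A sketch of the dynamic-mirror lookup suggested above, assuming the jsonlite package is available; closer.cgi?as_json=1 returns a JSON object whose "preferred" field names a suggested mirror (version and package name are illustrative):

# Ask Apache's mirror resolver for a suggested mirror
jsonUrl <- "http://www.apache.org/dyn/closer.cgi?as_json=1"
mirrors <- jsonlite::fromJSON(jsonUrl)
mirrorUrl <- mirrors$preferred  # e.g. "http://apache.osuosl.org/"
packageRemotePath <- paste0(mirrorUrl,
                            "spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz")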