ARROW-4262: [Website] Preview to Spark with Arrow and R improvements #3001
Closed
javierluraschi wants to merge 10 commits into apache:master from javierluraschi:post/spark-r-arrow-preview
Changes from 1 commit
Commits
c2058a8 [Website] Preview to Spark with Arrow and R improvements (javierluraschi)
259cbb5 explain sparklyr implementation, plots and stats (javierluraschi)
7f298ff increment benchmark iterations to 100 (javierluraschi)
9b5a3d0 add reference to specific sparklyr branch (javierluraschi)
23dc228 Fix some typos (wesm)
aa6aec7 update blog post with arrow 0.12 and 10m rows (javierluraschi)
fae7e01 add github references to arrow 0.12 release (javierluraschi)
f3c3e99 use more accurate multipliers describing improements (javierluraschi)
966ecf6 add reference to ongoing sparkr work (javierluraschi)
36e1bcb proper capitalization for sparkr project (javierluraschi)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -27,9 +27,9 @@ limitations under the License. |
| 27 | 27 | *[Javier Luraschi][1] is a software engineer at [RStudio][2]* |
| 28 | 28 | |
| 29 | 29 | Support for Apache Arrow in Apache Spark with R is currently under active |
| 30 | | - development in the [sparklyr][3] project. This post explores early, yet |
| 31 | | - promising, performance improvements achieved when using R with [Apache |
| 32 | | - Spark][4] and Arrow. |
| | 30 | + development in the [sparklyr][3] and [sparkR][4] projects. This post explores early, yet |
| | 31 | + promising, performance improvements achieved when using R with [Apache Spark][5], |
| | 32 | + Arrow and `sparklyr`. |
| 33 | 33 | |
| 34 | 34 | # Setup |
| 35 | 35 | |
| | | @@ -41,8 +41,8 @@ devtools::install_github("apache/arrow", subdir = "r", ref = "apache-arrow-0.12. |
| 41 | 41 | devtools::install_github("rstudio/sparklyr", ref = "apache-arrow-0.12.0") |
| 42 | 42 | ``` |
| 43 | 43 | |
| 44 | | - In this benchmark, we will use [dplyr][5], but similar improvements can |
| 45 | | - be expected from using [DBI][6], or [Spark DataFrames][7] in `sparklyr`. |
| | 44 | + In this benchmark, we will use [dplyr][6], but similar improvements can |
| | 45 | + be expected from using [DBI][7], or [Spark DataFrames][8] in `sparklyr`. |
| 46 | 46 | The local Spark connection and dataframe with 10M numeric rows was |
| 47 | 47 | initialized as follows: |
| 48 | 48 | |
| | | @@ -68,7 +68,7 @@ Spark without having to serialize this data in R or persist in disk. |
| 68 | 68 | The following example copies 10M rows from R into Spark using `sparklyr` |
| 69 | 69 | with and without `arrow`, there is close to a 16x improvement using `arrow`. |
| 70 | 70 | |
| 71 | | - This benchmark uses the [microbenchmark][8] R package, which runs code |
| | 71 | + This benchmark uses the [microbenchmark][9] R package, which runs code |
| 72 | 72 | multiple times, provides stats on total execution time and plots each |
| 73 | 73 | excecution time to understand the distribution over each iteration. |
| 74 | 74 | |
| | | @@ -182,15 +182,17 @@ Unit: seconds |
| 182 | 182 | </div> |
| 183 | 183 | |
| 184 | 184 | Additional benchmarks and fine-tuning parameters can be found under `sparklyr` |
| 185 | | - [/rstudio/sparklyr/pull/1611][9]. Looking forward to bringing this feature |
| | 185 | + [/rstudio/sparklyr/pull/1611][10] and `sparkR` [/apache/spark/pull/22954][11]. Looking forward to bringing this feature |
| 186 | 186 | to the Spark, Arrow and R communities. |
| 187 | 187 | |
| 188 | 188 | [1]: https://github.com/javierluraschi |
| 189 | 189 | [2]: https://rstudio.com |
| 190 | 190 | [3]: https://github.com/rstudio/sparklyr |
| 191 | | - [4]: https://spark.apache.org |
| 192 | | - [5]: https://dplyr.tidyverse.org |
| 193 | | - [6]: https://cran.r-project.org/package=DBI |
| 194 | | - [7]: https://spark.rstudio.com/reference/#section-spark-dataframes |
| 195 | | - [8]: https://CRAN.R-project.org/package=microbenchmark |
| 196 | | - [9]: https://github.com/rstudio/sparklyr/pull/1611 |
| | 191 | + [4]: https://spark.apache.org/docs/latest/sparkr.html |
| | 192 | + [5]: https://spark.apache.org |
| | 193 | + [6]: https://dplyr.tidyverse.org |
| | 194 | + [7]: https://cran.r-project.org/package=DBI |
| | 195 | + [8]: https://spark.rstudio.com/reference/#section-spark-dataframes |
| | 196 | + [9]: https://CRAN.R-project.org/package=microbenchmark |
| | 197 | + [10]: https://github.com/rstudio/sparklyr/pull/1611 |
| | 198 | + [11]: https://github.com/apache/spark/pull/22954 |
Member
thanks for linking this!
we usually use `SparkR` capitalization...