Commit 966ecf6

add reference to ongoing sparkr work
1 parent f3c3e99 commit 966ecf6

1 file changed

Lines changed: 15 additions & 13 deletions

site/_posts/2018-11-19-r-spark-improvements.md

@@ -27,9 +27,9 @@ limitations under the License.
 *[Javier Luraschi][1] is a software engineer at [RStudio][2]*
 
 Support for Apache Arrow in Apache Spark with R is currently under active
-development in the [sparklyr][3] project. This post explores early, yet
-promising, performance improvements achieved when using R with [Apache
-Spark][4] and Arrow.
+development in the [sparklyr][3] and [sparkR][4] projects. This post explores early, yet
+promising, performance improvements achieved when using R with [Apache Spark][5],
+Arrow and `sparklyr`.
 
 # Setup
 
@@ -41,8 +41,8 @@ devtools::install_github("apache/arrow", subdir = "r", ref = "apache-arrow-0.12.
 devtools::install_github("rstudio/sparklyr", ref = "apache-arrow-0.12.0")
 ```
 
-In this benchmark, we will use [dplyr][5], but similar improvements can
-be expected from using [DBI][6], or [Spark DataFrames][7] in `sparklyr`.
+In this benchmark, we will use [dplyr][6], but similar improvements can
+be expected from using [DBI][7], or [Spark DataFrames][8] in `sparklyr`.
 The local Spark connection and dataframe with 10M numeric rows was
 initialized as follows:
 
@@ -68,7 +68,7 @@ Spark without having to serialize this data in R or persist in disk.
 The following example copies 10M rows from R into Spark using `sparklyr`
 with and without `arrow`, there is close to a 16x improvement using `arrow`.
 
-This benchmark uses the [microbenchmark][8] R package, which runs code
+This benchmark uses the [microbenchmark][9] R package, which runs code
 multiple times, provides stats on total execution time and plots each
 excecution time to understand the distribution over each iteration.
 
@@ -182,15 +182,17 @@ Unit: seconds
 </div>
 
 Additional benchmarks and fine-tuning parameters can be found under `sparklyr`
-[/rstudio/sparklyr/pull/1611][9]. Looking forward to bringing this feature
+[/rstudio/sparklyr/pull/1611][10] and `sparkR` [/apache/spark/pull/22954][11]. Looking forward to bringing this feature
 to the Spark, Arrow and R communities.
 
 [1]: https://github.com/javierluraschi
 [2]: https://rstudio.com
 [3]: https://github.com/rstudio/sparklyr
-[4]: https://spark.apache.org
-[5]: https://dplyr.tidyverse.org
-[6]: https://cran.r-project.org/package=DBI
-[7]: https://spark.rstudio.com/reference/#section-spark-dataframes
-[8]: https://CRAN.R-project.org/package=microbenchmark
-[9]: https://github.com/rstudio/sparklyr/pull/1611
+[4]: https://spark.apache.org/docs/latest/sparkr.html
+[5]: https://spark.apache.org
+[6]: https://dplyr.tidyverse.org
+[7]: https://cran.r-project.org/package=DBI
+[8]: https://spark.rstudio.com/reference/#section-spark-dataframes
+[9]: https://CRAN.R-project.org/package=microbenchmark
+[10]: https://github.com/rstudio/sparklyr/pull/1611
+[11]: https://github.com/apache/spark/pull/22954