@@ -27,9 +27,9 @@ limitations under the License.
 *[Javier Luraschi][1] is a software engineer at [RStudio][2]*
 
 Support for Apache Arrow in Apache Spark with R is currently under active
-development in the [sparklyr][3] project. This post explores early, yet
-promising, performance improvements achieved when using R with [Apache
-Spark][4] and Arrow.
+development in the [sparklyr][3] and [sparkR][4] projects. This post explores early, yet
+promising, performance improvements achieved when using R with [Apache Spark][5],
+Arrow and `sparklyr`.
 
 # Setup
 
@@ -41,8 +41,8 @@ devtools::install_github("apache/arrow", subdir = "r", ref = "apache-arrow-0.12.
 devtools::install_github("rstudio/sparklyr", ref = "apache-arrow-0.12.0")
 ```
 
-In this benchmark, we will use [dplyr][5], but similar improvements can
-be expected from using [DBI][6], or [Spark DataFrames][7] in `sparklyr`.
+In this benchmark, we will use [dplyr][6], but similar improvements can
+be expected from using [DBI][7], or [Spark DataFrames][8] in `sparklyr`.
 The local Spark connection and dataframe with 10M numeric rows was
 initialized as follows:
 
@@ -68,7 +68,7 @@ Spark without having to serialize this data in R or persist in disk.
 The following example copies 10M rows from R into Spark using `sparklyr`
 with and without `arrow`, there is close to a 16x improvement using `arrow`.
 
-This benchmark uses the [microbenchmark][8] R package, which runs code
+This benchmark uses the [microbenchmark][9] R package, which runs code
 multiple times, provides stats on total execution time and plots each
 execution time to understand the distribution over each iteration.
 
@@ -182,15 +182,17 @@ Unit: seconds
 </div>
 
 Additional benchmarks and fine-tuning parameters can be found under `sparklyr`
-[/rstudio/sparklyr/pull/1611][9]. Looking forward to bringing this feature
+[/rstudio/sparklyr/pull/1611][10] and `sparkR` [/apache/spark/pull/22954][11]. Looking forward to bringing this feature
 to the Spark, Arrow and R communities.
 
 [1]: https://github.com/javierluraschi
 [2]: https://rstudio.com
 [3]: https://github.com/rstudio/sparklyr
-[4]: https://spark.apache.org
-[5]: https://dplyr.tidyverse.org
-[6]: https://cran.r-project.org/package=DBI
-[7]: https://spark.rstudio.com/reference/#section-spark-dataframes
-[8]: https://CRAN.R-project.org/package=microbenchmark
-[9]: https://github.com/rstudio/sparklyr/pull/1611
+[4]: https://spark.apache.org/docs/latest/sparkr.html
+[5]: https://spark.apache.org
+[6]: https://dplyr.tidyverse.org
+[7]: https://cran.r-project.org/package=DBI
+[8]: https://spark.rstudio.com/reference/#section-spark-dataframes
+[9]: https://CRAN.R-project.org/package=microbenchmark
+[10]: https://github.com/rstudio/sparklyr/pull/1611
+[11]: https://github.com/apache/spark/pull/22954