@@ -37,21 +37,21 @@ Since this work is under active development, install `sparklyr` and
 `arrow` from GitHub as follows:
 
 ``` r
-devtools::install_github("apache/arrow", subdir = "r", ref = "dc5df8f")
-devtools::install_github("rstudio/sparklyr", ref = "feature/arrow")
+devtools::install_github("apache/arrow", subdir = "r")
+devtools::install_github("rstudio/sparklyr")
 ```
 
 In this benchmark, we will use [dplyr][5], but similar improvements can
 be expected from using [DBI][6] or [Spark DataFrames][7] in `sparklyr`.
-The local Spark connection and dataframe with 1M numeric rows was
+The local Spark connection and data frame with 10M numeric rows were
 initialized as follows:
 
 ``` r
 library(sparklyr)
 library(dplyr)
 
-sc <- spark_connect(master = "local")
-data <- data.frame(y = runif(10^6, 0, 1))
+sc <- spark_connect(master = "local", config = list("sparklyr.shell.driver-memory" = "6g"))
+data <- data.frame(y = runif(10^7, 0, 1))
 ```
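As a rough sense of scale (an editorial sketch, not part of the original post): the 10M-row data frame above holds 10M doubles at 8 bytes each, roughly 80 MB before Spark ever sees it, which base R's `object.size()` can confirm:

``` r
# Sketch: approximate in-memory size of the benchmark data frame.
# 10M doubles at 8 bytes each is ~80 MB before object overhead.
data <- data.frame(y = runif(10^7, 0, 1))
size_mb <- as.numeric(object.size(data)) / 1024^2
print(round(size_mb, 1))  # roughly 76 MiB
```

This is presumably also why the connection above raises `sparklyr.shell.driver-memory` to 6g: the full dataset passes through the driver during transfer.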
 
 # Copying
@@ -65,7 +65,7 @@ transfer more data at fast speeds into Spark.
 Using `arrow` with `sparklyr`, we can transfer data directly from R to
 Spark without having to serialize this data in R or persist it to disk.
 
-The following example copies 1M rows from R into Spark using `sparklyr`
+The following example copies 10M rows from R into Spark using `sparklyr`
 with and without `arrow`; there is close to a 10x improvement using `arrow`.
 
 This benchmark uses the [microbenchmark][8] R package, which runs code
@@ -83,14 +83,15 @@ microbenchmark::microbenchmark(
     if ("arrow" %in% .packages()) detach("package:arrow")
     sparklyr_df <<- copy_to(sc, data, overwrite = T)
     count(sparklyr_df) %>% collect()
-  }
+  },
+  times = 10
 ) %T>% print() %>% ggplot2::autoplot()
 ```
 ```
-Unit: milliseconds
-      expr       min        lq      mean    median        uq      max neval
-  arrow_on  326.4083  401.3589  484.4189  428.9402  489.8033 1093.707   100
- arrow_off 2450.5797 3146.0476 3386.6042 3246.9822 3488.6524 6945.576   100
+Unit: seconds
+      expr       min        lq       mean    median         uq       max neval
+  arrow_on  3.011515  4.250025   7.257739  7.273011   8.974331  14.23325    10
+ arrow_off 50.051947 68.523081 119.946947 71.898908 138.743419 390.44028    10
 ```
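As a sanity check on the "close to 10x" figure, the ratio of the median timings printed above can be computed directly (values copied from the benchmark output; a sketch, not part of the original post):

``` r
# Median copy times from the microbenchmark output above, in seconds.
median_arrow_on  <- 7.273011
median_arrow_off <- 71.898908
speedup <- median_arrow_off / median_arrow_on
print(round(speedup, 1))  # ~9.9x, consistent with the "close to 10x" claim
```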
 
 <div align="center">
@@ -106,7 +107,7 @@ while collecting data from Spark into R. These improvements are not as
 significant as copying data, since `sparklyr` already collects data in
 columnar format.
 
-The following benchmark collects 1M rows from Spark into R and shows that
+The following benchmark collects 10M rows from Spark into R and shows that
 `arrow` can bring 2x improvements.
 
 ``` r
@@ -118,14 +119,15 @@ microbenchmark::microbenchmark(
   arrow_off = {
     if ("arrow" %in% .packages()) detach("package:arrow")
     collect(sparklyr_df)
-  }
+  },
+  times = 10
 ) %T>% print() %>% ggplot2::autoplot()
 ```
 ```
-Unit: milliseconds
-      expr      min       lq     mean   median       uq       max neval
-  arrow_on 254.4486 278.3992 313.2547 300.1484 334.9117  496.9672   100
- arrow_off 336.9897 408.1203 478.3004 450.7942 485.0992 1070.5376   100
+Unit: seconds
+      expr      min        lq      mean    median        uq       max neval
+  arrow_on 4.520593  5.609812  6.154509  5.928099  6.217447  9.432221    10
+ arrow_off 7.882841 13.358113 16.670708 16.127704 21.051382 24.373331    10
 ```
 
 <div align="center">