Commit aa6aec7

update blog post with arrow 0.12 and 10m rows
Parent: 23dc228

4 files changed: 19 additions & 17 deletions

File tree:

site/_posts/2018-11-19-r-spark-improvements.md (19 additions, 17 deletions)
````diff
@@ -37,21 +37,21 @@ Since this work is under active development, install `sparklyr` and
 `arrow` from GitHub as follows:
 
 ```r
-devtools::install_github("apache/arrow", subdir = "r", ref = "dc5df8f")
-devtools::install_github("rstudio/sparklyr", ref = "feature/arrow")
+devtools::install_github("apache/arrow", subdir = "r")
+devtools::install_github("rstudio/sparklyr")
 ```
 
 In this benchmark, we will use [dplyr][5], but similar improvements can
 be expected from using [DBI][6], or [Spark DataFrames][7] in `sparklyr`.
-The local Spark connection and dataframe with 1M numeric rows was
+The local Spark connection and dataframe with 10M numeric rows was
 initialized as follows:
 
 ```r
 library(sparklyr)
 library(dplyr)
 
-sc <- spark_connect(master = "local")
-data <- data.frame(y = runif(10^6, 0, 1))
+sc <- spark_connect(master = "local", config = list("sparklyr.shell.driver-memory" = "6g"))
+data <- data.frame(y = runif(10^7, 0, 1))
 ```
 
 # Copying
````
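Aside: the new `config` argument in the hunk above raises the Spark driver's JVM heap, which the larger 10M-row dataframe needs when copying and collecting. A configuration sketch for context (not run here; it requires a local Spark installation): `sparklyr.shell.*` options are passed through to `spark-submit` flags, so `"sparklyr.shell.driver-memory"` corresponds to `--driver-memory`.

```r
library(sparklyr)

# "sparklyr.shell.driver-memory" maps to spark-submit's --driver-memory
# flag, enlarging the driver JVM heap so 10M rows can be copied to and
# collected from Spark without exhausting the default heap.
config <- list("sparklyr.shell.driver-memory" = "6g")
sc <- spark_connect(master = "local", config = config)
```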
````diff
@@ -65,7 +65,7 @@ transfer more data at fast speeds into Spark.
 Using `arrow` with `sparklyr`, we can transfer data directly from R to
 Spark without having to serialize this data in R or persist in disk.
 
-The following example copies 1M rows from R into Spark using `sparklyr`
+The following example copies 10M rows from R into Spark using `sparklyr`
 with and without `arrow`, there is close to a 10x improvement using `arrow`.
 
 This benchmark uses the [microbenchmark][8] R package, which runs code
````
````diff
@@ -83,14 +83,15 @@ microbenchmark::microbenchmark(
     if ("arrow" %in% .packages()) detach("package:arrow")
     sparklyr_df <<- copy_to(sc, data, overwrite = T)
     count(sparklyr_df) %>% collect()
-  }
+  },
+  times = 10
 ) %T>% print() %>% ggplot2::autoplot()
 ```
 ```
-Unit: milliseconds
-      expr       min        lq      mean    median        uq      max neval
-  arrow_on  326.4083  401.3589  484.4189  428.9402  489.8033 1093.707   100
- arrow_off 2450.5797 3146.0476 3386.6042 3246.9822 3488.6524 6945.576   100
+Unit: seconds
+      expr       min        lq       mean    median         uq       max neval
+  arrow_on  3.011515  4.250025   7.257739  7.273011   8.974331  14.23325    10
+ arrow_off 50.051947 68.523081 119.946947 71.898908 138.743419 390.44028    10
 ```
 
 <div align="center">
````
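Aside: the `times = 10` added in this hunk caps `microbenchmark()` at 10 evaluations of each expression instead of its default of 100, which keeps total runtime reasonable now that each run takes seconds rather than milliseconds (the `neval` column in the output drops from 100 to 10 accordingly). A minimal, self-contained sketch of that pattern, with toy expressions standing in for the Spark calls:

```r
library(microbenchmark)

# times = 10 evaluates each expression 10 times (the default is 100);
# with multi-second Spark copies, 100 evaluations would take far too long.
res <- microbenchmark(
  small = sum(runif(10^3)),
  large = sum(runif(10^5)),
  times = 10
)

print(res)  # summary table with min/lq/mean/median/uq/max and neval = 10
stopifnot(all(summary(res)$neval == 10))
```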
````diff
@@ -106,7 +107,7 @@ while collecting data from Spark into R. These improvements are not as
 significant as copying data since, `sparklyr` already collects data in
 columnar format.
 
-The following benchmark collects 1M rows from Spark into R and shows that
+The following benchmark collects 10M rows from Spark into R and shows that
 `arrow` can bring 2x improvements.
 
 ```r
````
````diff
@@ -118,14 +119,15 @@ microbenchmark::microbenchmark(
   arrow_off = {
     if ("arrow" %in% .packages()) detach("package:arrow")
     collect(sparklyr_df)
-  }
+  },
+  times = 10
 ) %T>% print() %>% ggplot2::autoplot()
 ```
 ```
-Unit: milliseconds
-      expr      min       lq     mean   median       uq       max neval
-  arrow_on 254.4486 278.3992 313.2547 300.1484 334.9117  496.9672   100
- arrow_off 336.9897 408.1203 478.3004 450.7942 485.0992 1070.5376   100
+Unit: seconds
+      expr      min        lq      mean    median        uq       max neval
+  arrow_on 4.520593  5.609812  6.154509  5.928099  6.217447  9.432221    10
+ arrow_off 7.882841 13.358113 16.670708 16.127704 21.051382 24.373331    10
 ```
 
 <div align="center">
````
Binary image files changed, including:

site/img/arrow-r-spark-copying.png