Commit c2058a8

javierluraschi authored and wesm committed
[Website] Preview to Spark with Arrow and R improvements
1 parent 349a957 commit c2058a8

4 files changed

Lines changed: 153 additions & 0 deletions

@@ -0,0 +1,153 @@
1+
---
2+
layout: post
3+
title: "Speeding up R with Spark using Apache Arrow"
4+
date: "2018-11-29 08:00:00 -0800"
5+
author: javierluraschi
6+
categories: [application]
7+
---
8+
<!--
9+
{% comment %}
10+
Licensed to the Apache Software Foundation (ASF) under one or more
11+
contributor license agreements. See the NOTICE file distributed with
12+
this work for additional information regarding copyright ownership.
13+
The ASF licenses this file to you under the Apache License, Version 2.0
14+
(the "License"); you may not use this file except in compliance with
15+
the License. You may obtain a copy of the License at
16+
17+
http://www.apache.org/licenses/LICENSE-2.0
18+
19+
Unless required by applicable law or agreed to in writing, software
20+
distributed under the License is distributed on an "AS IS" BASIS,
21+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
22+
See the License for the specific language governing permissions and
23+
limitations under the License.
24+
{% endcomment %}
25+
-->
26+
27+
*[Javier Luraschi][1] is a software engineer at [RStudio][2]*
28+
29+
Support for Apache Arrow in Apache Spark with R is currently under
30+
active development through [sparklyr][3]. This post explores early, yet
31+
promising, performance improvements achieved when using R with [Apache
32+
Spark][4] and Arrow.
33+
34+
# Setup
35+
36+
Since this work is under active development, install `sparklyr` and
37+
`arrow` from GitHub as follows:
38+
39+
```r
# Install development versions; `ref` pins the arrow R package to a specific commit
devtools::install_github("apache/arrow", subdir = "r", ref = "dc5df8f")
devtools::install_github("rstudio/sparklyr")
```
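
Before running the benchmarks, it can help to confirm that both packages load. The following is a minimal sketch and not part of the original benchmark; the names `pkgs` and `installed` are illustrative.

```r
# Hypothetical post-install check (an assumption, not from the original post)
pkgs <- c("sparklyr", "arrow")

# TRUE for each package whose namespace can be loaded
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Report the version of every package that was found
for (p in pkgs[installed]) {
  message(p, " ", as.character(utils::packageVersion(p)))
}
```

If either entry is `FALSE`, the corresponding `install_github()` call most likely did not complete.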
43+
44+
In this benchmark, we will use [dplyr][5], but similar improvements can
be expected when using [DBI][6] or [Spark DataFrames][7] in `sparklyr`.
The local Spark connection and a data frame with 1M numeric rows were
initialized as follows:
48+
49+
```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")
# 1M rows in a single numeric column drawn uniformly from [0, 1]
data <- data.frame(y = runif(10^6, 0, 1))
```
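
To get a rough sense of how much data each benchmark moves, the in-memory size of the data frame can be estimated in base R. This is only an illustrative sketch; `size_mib` is a name introduced here, and the in-memory size is not necessarily the size of what is sent to Spark.

```r
# Same data frame as in the setup: 1M doubles in one column
data <- data.frame(y = runif(10^6, 0, 1))

# Number of rows and approximate in-memory size in MiB
nrow(data)
size_mib <- as.numeric(object.size(data)) / 1024^2
size_mib
```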
56+
57+
# Copying
58+
59+
The following benchmark, which uses [microbenchmark][8], copies 1M rows from
R into Spark using `sparklyr` with and without `arrow`. Copying is close
to 10x faster with `arrow`.
62+
63+
64+
```r
microbenchmark::microbenchmark(
  # Runs (untimed) before each benchmark expression
  setup = library(arrow),
  arrow_on = {
    library(arrow)
    # `<<-` keeps the Spark table available to the later benchmarks
    sparklyr_df <<- copy_to(sc, data, overwrite = TRUE)
    count(sparklyr_df)
  },
  arrow_off = {
    # Detach arrow so this run measures copying without it
    if ("arrow" %in% .packages()) detach("package:arrow")
    sparklyr_df <<- copy_to(sc, data, overwrite = TRUE)
    count(sparklyr_df)
  },
  times = 10
) %>% ggplot2::autoplot()
```
80+
81+
<div align="center">
82+
<img src="{{ site.base-url }}/img/arrow-r-spark-copying.png"
83+
alt="Copying data with R into Spark with and without Arrow"
84+
width="60%" class="img-responsive">
85+
</div>
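
The plot shows the timing distributions, but a single speedup figure can also be read from the benchmark summary. The post does not say how its "10x" was derived; one plausible approach, sketched here with toy workloads (`slow` and `fast` are stand-ins, not the Spark expressions above), is the ratio of median timings.

```r
library(microbenchmark)

# Toy workloads standing in for the two benchmark expressions (assumption)
mb <- microbenchmark(
  slow = Sys.sleep(0.02),
  fast = Sys.sleep(0.002),
  times = 5
)

# summary() returns one row per expression with min, median, max, etc.
s <- summary(mb, unit = "ms")

# Ratio of medians: how many times faster `fast` is than `slow`
speedup <- s$median[s$expr == "slow"] / s$median[s$expr == "fast"]
speedup
```

Assigning the result of the real benchmark to a variable instead of piping it straight into `autoplot()` would allow the same calculation on the `arrow_off` and `arrow_on` rows.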
86+
87+
# Collecting
88+
89+
The following benchmark collects 1M rows from Spark into R and shows that `arrow`
makes collection about 2x faster. The improvement is smaller than for
copying data because `sparklyr` already collects data in columnar format.
92+
93+
```r
microbenchmark::microbenchmark(
  setup = library(arrow),
  arrow_on = {
    dplyr::collect(sparklyr_df)
  },
  arrow_off = {
    # Detach arrow so this run measures collection without it
    if ("arrow" %in% .packages()) detach("package:arrow")
    dplyr::collect(sparklyr_df)
  },
  times = 10
) %>% ggplot2::autoplot()
```
106+
107+
<div align="center">
108+
<img src="{{ site.base-url }}/img/arrow-r-spark-collecting.png"
109+
alt="Collecting data with R from Spark with and without Arrow"
110+
width="60%" class="img-responsive">
111+
</div>
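
To make the term "columnar format" concrete, the sketch below contrasts a column-oriented and a row-oriented view of the same small data frame in base R. It only illustrates the two layouts, using made-up data, and says nothing about how `sparklyr` or `arrow` serialize data internally.

```r
# Small illustrative data frame (not the benchmark data)
df <- data.frame(id = 1:3, y = c(0.1, 0.2, 0.3))

# Columnar: one vector per column
columns <- as.list(df)
str(columns)

# Row-oriented: one record per row, each holding a value for every column
rows <- lapply(seq_len(nrow(df)), function(i) as.list(df[i, ]))
str(rows[[1]])
```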
112+
113+
# Transforming
114+
115+
Custom transformations of data using R functions are about 100x faster with `arrow`.
The improvement is this large because transforming data in R previously required both copying
and collecting data, and that path was not optimized to use a columnar format.
We therefore strongly recommend enabling `arrow` when performing custom R transformations
in Spark. The following example transforms 100K rows with and without `arrow` enabled.
120+
121+
```r
microbenchmark::microbenchmark(
  setup = library(arrow),
  arrow_on = {
    # Halve every value with an R function applied through spark_apply()
    sample_n(sparklyr_df, 10^5) %>% spark_apply(~ .x / 2) %>% count()
  },
  arrow_off = {
    # Detach arrow so this run measures the transformation without it
    if ("arrow" %in% .packages()) detach("package:arrow")
    sample_n(sparklyr_df, 10^5) %>% spark_apply(~ .x / 2) %>% count()
  },
  times = 10
) %>% ggplot2::autoplot()
```
134+
135+
<div align="center">
136+
<img src="{{ site.base-url }}/img/arrow-r-spark-transforming.png"
137+
alt="Transforming data with R in Spark with and without Arrow"
138+
width="60%" class="img-responsive">
139+
</div>
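
The `~ .x / 2` argument in the benchmark is a formula shorthand for an anonymous function. Assuming `spark_apply()` interprets it with rlang-style lambda semantics, the local sketch below shows what the function does to a data frame; `partition`, `halve`, and `halve_explicit` are illustrative names and no Spark connection is involved.

```r
# Local stand-in for one partition of the Spark table (illustrative data)
partition <- data.frame(y = c(0.2, 0.4, 0.8))

# rlang turns the one-sided formula into a function of `.x`
halve <- rlang::as_function(~ .x / 2)

# Equivalent explicit function
halve_explicit <- function(df) df / 2

halve(partition)
all.equal(halve(partition), halve_explicit(partition))
```

Writing the transformation as an explicit function can be easier to read once it grows beyond a single expression.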
140+
141+
Additional benchmarks and fine-tuning parameters can be found in the `sparklyr`
pull request [/rstudio/sparklyr/pull/1611][9]. We look forward to bringing this feature
to the Spark, Arrow, and R communities.
144+
145+
[1]: https://github.com/javierluraschi
146+
[2]: https://rstudio.com
147+
[3]: https://github.com/rstudio/sparklyr
148+
[4]: https://spark.apache.org
149+
[5]: https://dplyr.tidyverse.org
150+
[6]: https://cran.r-project.org/package=DBI
151+
[7]: https://spark.rstudio.com/reference/#section-spark-dataframes
152+
[8]: https://CRAN.R-project.org/package=microbenchmark
153+
[9]: https://github.com/rstudio/sparklyr/pull/1611

site/img/arrow-r-spark-copying.png
