---
layout: global
title: Java Programming Guide
---

The Spark Java API exposes all the Spark features available in the Scala version to Java.
To learn the basics of Spark, we recommend reading through the
[Scala programming guide](programming-guide.html) first; it should be
easy to follow even if you don't know Scala.
This guide will show how to use the Spark features described there in Java.

The Spark Java API is defined in the
[`org.apache.spark.api.java`](api/java/index.html?org/apache/spark/api/java/package-summary.html) package, and includes
a [`JavaSparkContext`](api/java/index.html?org/apache/spark/api/java/JavaSparkContext.html) for
initializing Spark and [`JavaRDD`](api/java/index.html?org/apache/spark/api/java/JavaRDD.html) classes,
which support the same methods as their Scala counterparts but take Java functions and return
Java data and collection types. The main differences have to do with passing functions to RDD
operations (e.g. `map`) and handling RDDs of different types, as discussed next.

# Key Differences in the Java API

There are a few key differences between the Java and Scala APIs:

* Java does not support anonymous or first-class functions, so functions are passed
  using anonymous classes that implement the
  [`org.apache.spark.api.java.function.Function`](api/java/index.html?org/apache/spark/api/java/function/Function.html),
  [`Function2`](api/java/index.html?org/apache/spark/api/java/function/Function2.html), etc.
  interfaces (or, in Java 8, lambda expressions).
* To maintain type safety, the Java API defines specialized Function and RDD
  classes for key-value pairs and doubles. For example,
  [`JavaPairRDD`](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html)
  stores key-value pairs.
* Some transformations have variants specialized by the passed function's return type.
  For example, `mapToPair()` returns
  [`JavaPairRDD`](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html),
  and `mapToDouble()` returns
  [`JavaDoubleRDD`](api/java/index.html?org/apache/spark/api/java/JavaDoubleRDD.html).
* RDD methods like `collect()` and `countByKey()` return Java collection types,
  such as `java.util.List` and `java.util.Map`.
* Key-value pairs, which are simply written as `(key, value)` in Scala, are represented
  by the `scala.Tuple2` class, and need to be created using `new Tuple2<K, V>(key, value)`
  (see the sketch below).
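
For instance, here is a minimal sketch of creating and reading a `Tuple2` in
Java (the variable names are illustrative):

{% highlight java %}
import scala.Tuple2;

Tuple2<String, Integer> pair = new Tuple2<String, Integer>("apple", 3);
String key = pair._1();    // "apple"
Integer value = pair._2(); // 3
{% endhighlight %}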

## RDD Classes

Spark defines additional operations on RDDs of key-value pairs and doubles, such
as `reduceByKey`, `join`, and `stdev`.

In the Scala API, these methods are automatically added using Scala's
[implicit conversions](http://www.scala-lang.org/node/130) mechanism.

In the Java API, the extra methods are defined in the
[`JavaPairRDD`](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html)
and [`JavaDoubleRDD`](api/java/index.html?org/apache/spark/api/java/JavaDoubleRDD.html)
classes. Instead of overloading `map`, the Java API provides type-specific
variants that take `PairFunction` and `DoubleFunction` arguments, allowing them
to return RDDs of the appropriate types. Common methods like `filter` and
`sample` are implemented by each specialized RDD class, so filtering a
`JavaPairRDD` returns a new `JavaPairRDD`, and so on (this achieves the
"same-result-type" principle used by the [Scala collections
framework](http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html)).
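
For example, because `filter` is implemented on `JavaPairRDD` itself, filtering
a pair RDD yields another pair RDD, so pair-specific methods stay available (a
minimal sketch; `counts` is assumed to be a `JavaPairRDD<String, Integer>`):

{% highlight java %}
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

// filter preserves the pair type, so methods like reduceByKey
// can still be called on the result.
JavaPairRDD<String, Integer> frequent = counts.filter(
  new Function<Tuple2<String, Integer>, Boolean>() {
    @Override public Boolean call(Tuple2<String, Integer> kv) {
      return kv._2() > 1;
    }
  }
);
{% endhighlight %}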

## Function Interfaces

The following table lists the function interfaces used by the Java API, located in the
[`org.apache.spark.api.java.function`](api/java/index.html?org/apache/spark/api/java/function/package-summary.html)
package. Each interface has a single abstract method, `call()`.

<table class="table">
<tr><th>Class</th><th>Function Type</th></tr>

<tr><td>Function&lt;T, R&gt;</td><td>T =&gt; R </td></tr>
<tr><td>DoubleFunction&lt;T&gt;</td><td>T =&gt; Double </td></tr>
<tr><td>PairFunction&lt;T, K, V&gt;</td><td>T =&gt; Tuple2&lt;K, V&gt; </td></tr>

<tr><td>FlatMapFunction&lt;T, R&gt;</td><td>T =&gt; Iterable&lt;R&gt; </td></tr>
<tr><td>DoubleFlatMapFunction&lt;T&gt;</td><td>T =&gt; Iterable&lt;Double&gt; </td></tr>
<tr><td>PairFlatMapFunction&lt;T, K, V&gt;</td><td>T =&gt; Iterable&lt;Tuple2&lt;K, V&gt;&gt; </td></tr>

<tr><td>Function2&lt;T1, T2, R&gt;</td><td>T1, T2 =&gt; R (function of two arguments)</td></tr>
</table>
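
Because each interface has the single abstract method `call()`, implementing
one is just a matter of overriding that method. For example, a minimal sketch
of a `DoubleFunction` (assuming an existing `JavaRDD<String> words`):

{% highlight java %}
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.function.DoubleFunction;

// mapToDouble applies a DoubleFunction and yields a JavaDoubleRDD,
// which adds numeric methods such as stdev().
JavaDoubleRDD lengths = words.mapToDouble(
  new DoubleFunction<String>() {
    @Override public double call(String s) {
      return s.length();
    }
  }
);
{% endhighlight %}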

## Storage Levels

RDD [storage level](programming-guide.html#rdd-persistence) constants, such as `MEMORY_AND_DISK`, are
declared in the [`org.apache.spark.api.java.StorageLevels`](api/java/index.html?org/apache/spark/api/java/StorageLevels.html) class. To
define your own storage level, you can use `StorageLevels.create(...)`.
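
For instance, a brief sketch of persisting an RDD (`counts` stands in for any
RDD; the argument list of `create(...)` varies across Spark versions, so check
the Javadoc for yours):

{% highlight java %}
import org.apache.spark.api.java.StorageLevels;

// Persist using a predefined storage level.
counts.persist(StorageLevels.MEMORY_AND_DISK);

// Or build a custom level, e.g. disk + memory, deserialized, 1 replica.
counts.persist(StorageLevels.create(true, true, true, 1));
{% endhighlight %}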

# Other Features

The Java API supports other Spark features, including
[accumulators](programming-guide.html#accumulators),
[broadcast variables](programming-guide.html#broadcast-variables), and
[caching](programming-guide.html#rdd-persistence).
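
These work much like their Scala counterparts. A brief sketch (assuming an
existing `JavaSparkContext jsc` and `JavaRDD<String> lines`):

{% highlight java %}
import java.util.Arrays;
import java.util.List;
import org.apache.spark.Accumulator;
import org.apache.spark.broadcast.Broadcast;

// Broadcast a read-only value to all workers.
Broadcast<List<String>> stopWords = jsc.broadcast(Arrays.asList("a", "the"));

// Accumulate a counter across tasks.
Accumulator<Integer> blankLines = jsc.accumulator(0);

// Cache an RDD in memory for reuse.
lines.cache();
{% endhighlight %}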

# Upgrading From Pre-1.0 Versions of Spark

In version 1.0 of Spark, the Java API was refactored to better support Java 8
lambda expressions. Users upgrading from older versions of Spark should note
the following changes:

* All of the classes in `org.apache.spark.api.java.function` have been changed from abstract
  classes to interfaces. This means that concrete implementations of these
  `Function` classes must use `implements` rather than `extends` (see the
  sketch after this list).
* Certain transformation functions now have multiple versions depending
  on the return type. In Spark core, the map functions (`map`, `flatMap`, and
  `mapPartitions`) have type-specific versions, e.g.
  [`mapToPair`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToPair(org.apache.spark.api.java.function.PairFunction))
  and [`mapToDouble`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToDouble(org.apache.spark.api.java.function.DoubleFunction)).
  Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#transformToPair(org.apache.spark.api.java.function.Function)).
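
For example, a class that previously extended `Function` would now be declared
like this (a minimal sketch; `ParseInt` is an illustrative name):

{% highlight java %}
import org.apache.spark.api.java.function.Function;

// Pre-1.0: class ParseInt extends Function<String, Integer> { ... }
// 1.0 and later: Function is an interface, so use implements.
class ParseInt implements Function<String, Integer> {
  @Override public Integer call(String s) {
    return Integer.parseInt(s);
  }
}
{% endhighlight %}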

# Example

As an example, we will implement word count using the Java API.

{% highlight java %}
import java.util.Arrays;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;

JavaSparkContext jsc = new JavaSparkContext(...);
JavaRDD<String> lines = jsc.textFile("hdfs://...");
JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    @Override public Iterable<String> call(String s) {
      return Arrays.asList(s.split(" "));
    }
  }
);
{% endhighlight %}

The word count program starts by creating a `JavaSparkContext`, which accepts
the same parameters as its Scala counterpart. `JavaSparkContext` supports the
same data loading methods as the regular `SparkContext`; here, `textFile`
loads lines from text files stored in HDFS.
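
For instance, the elided constructor arguments above could be supplied through
a `SparkConf` (a sketch; the master URL and application name are illustrative):

{% highlight java %}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
  .setMaster("local[2]")        // run locally with two threads
  .setAppName("JavaWordCount"); // application name shown in the UI
JavaSparkContext jsc = new JavaSparkContext(conf);
{% endhighlight %}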

To split the lines into words, we use `flatMap` to split each line on
whitespace. `flatMap` is passed a `FlatMapFunction` that accepts a string and
returns a `java.lang.Iterable` of strings.

Here, the `FlatMapFunction` was created inline; another option is to write a
named class that implements `FlatMapFunction` and pass an instance to `flatMap`:

{% highlight java %}
class Split implements FlatMapFunction<String, String> {
  @Override public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
}
JavaRDD<String> words = lines.flatMap(new Split());
{% endhighlight %}

Java 8+ users can also write the above `FlatMapFunction` more concisely as a
lambda expression:

{% highlight java %}
JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")));
{% endhighlight %}

This lambda syntax can replace any anonymous class that implements an interface
with a single abstract method, which is the case for all of the function
interfaces listed above.
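
For example, the pair and reduce steps introduced below can also be written as
lambdas (a sketch that anticipates the `mapToPair` and `reduceByKey` calls
explained next):

{% highlight java %}
JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);
{% endhighlight %}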

Continuing with the word count example, we map each word to a `(word, 1)` pair:

{% highlight java %}
import scala.Tuple2;

JavaPairRDD<String, Integer> ones = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    @Override public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  }
);
{% endhighlight %}

Note that `mapToPair` was passed a `PairFunction<String, String, Integer>` and
returned a `JavaPairRDD<String, Integer>`.

To finish the word count program, we will use `reduceByKey` to count the
occurrences of each word:

{% highlight java %}
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    @Override public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  }
);
{% endhighlight %}

Here, `reduceByKey` is passed a `Function2`, which implements a function with
two arguments. The resulting `JavaPairRDD` contains `(word, count)` pairs.
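
To inspect the result on the driver, `collect()` returns the pairs as a
`java.util.List`, matching the collection-type convention noted earlier (a
brief sketch):

{% highlight java %}
import java.util.List;

List<Tuple2<String, Integer>> output = counts.collect();
for (Tuple2<String, Integer> pair : output) {
  System.out.println(pair._1() + ": " + pair._2());
}
{% endhighlight %}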

In this example, we explicitly showed each intermediate RDD. It is also
possible to chain the RDD transformations, so the word count example could also
be written as:

{% highlight java %}
JavaPairRDD<String, Integer> counts = lines.flatMap(
    ...
  ).mapToPair(
    ...
  ).reduceByKey(
    ...
  );
{% endhighlight %}

There is no performance difference between these approaches; the choice is
just a matter of style.

# API Docs

[API documentation](api/java/index.html) for Spark in Java is available in Javadoc format.

# Where to Go from Here

Spark includes several sample programs using the Java API in
[`examples/src/main/java`](https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples). You can run them by passing the class name to the
`bin/run-example` script included in Spark; for example:

    ./bin/run-example JavaWordCount README.md