We could also use `counts.sortByKey()`, for example, to sort the pairs alphabetically, and finally
`counts.collect()` to bring them back to the driver program as a list of objects.

</div>
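
As a concrete sketch of those two calls in Scala (assuming `counts` is the (word, count) pair RDD built in the word-count example above):

```scala
// `counts` is assumed to be an RDD[(String, Int)] of (word, count) pairs.
// Outside the Spark shell, the pair-RDD functions need: import org.apache.spark.SparkContext._
val sorted = counts.sortByKey()   // sort the pairs alphabetically by word
val results = sorted.collect()    // bring them back to the driver as a local Array[(String, Int)]
results.foreach { case (word, count) => println(word + ": " + count) }
```
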
### Transformations

The following table lists some of the common transformations supported by Spark. Refer to the
[RDD API doc](api/scala/index.html#org.apache.spark.rdd.RDD) for details.

<tr>
<td> <b>intersection</b>(<i>otherDataset</i>) </td>
<td> Return a new RDD that contains the intersection of elements in the source dataset and the argument. </td>
</tr>
<tr>
<td> <b>distinct</b>([<i>numTasks</i>]) </td>
<td> Return a new dataset that contains the distinct elements of the source dataset.</td>
</tr>
<tr>
<td> <b>groupByKey</b>([<i>numTasks</i>]) </td>
<td> When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable&lt;V&gt;) pairs. <br />
<b>Note:</b> If you are grouping in order to perform an aggregation (such as a sum or
average) over each key, using <code>reduceByKey</code> or <code>combineByKey</code> will yield much better
performance.
<br />
<b>Note:</b> By default, the level of parallelism in the output depends on the number of partitions of the parent RDD.
You can pass an optional <code>numTasks</code> argument to set a different number of tasks.
</td>
</tr>
<tr>
<tr>
<td> <b>cogroup</b>(<i>otherDataset</i>, [<i>numTasks</i>]) </td>
<td> When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Iterable&lt;V&gt;, Iterable&lt;W&gt;) tuples. This operation is also called <code>groupWith</code>. </td>
</tr>
<tr>
<td> <b>cartesian</b>(<i>otherDataset</i>) </td>
<td> When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). </td>
</tr>
<tr>
<td> <b>takeSample</b>(<i>withReplacement</i>, <i>num</i>, [<i>seed</i>]) </td>
<td> Return an array with a random sample of <i>num</i> elements of the dataset, with or without replacement, using the given random number generator seed. </td>
</tr>
<tr>
<td> <b>takeOrdered</b>(<i>n</i>, <i>[ordering]</i>) </td>
<td> Return the first <i>n</i> elements of the RDD using either their natural order or a custom comparator. </td>
</tr>
<tr>
<td> <b>saveAsTextFile</b>(<i>path</i>) </td>
<td> Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. </td>
</tr>
<tr>
<td> <b>saveAsSequenceFile</b>(<i>path</i>) <br /> (Java and Scala) </td>
<td> Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also
available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc). </td>
</tr>
<tr>
<td> <b>saveAsObjectFile</b>(<i>path</i>) <br /> (Java and Scala) </td>
<td> Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using
<code>SparkContext.objectFile()</code>. </td>
</tr>
<tr>
<td> <b>countByKey</b>() </td>
<td> Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key. </td>
</tr>
<tr>
<td> <b>foreach</b>(<i>func</i>) </td>
<td> Run a function <i>func</i> on each element of the dataset. This is usually done for side effects such as updating an accumulator variable (see below) or interacting with external storage systems. </td>
</tr>
</table>

A complete list of actions is available in the [RDD API doc](api/scala/index.html#org.apache.spark.rdd.RDD).
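
To make the tables above concrete, here is a small Scala sketch that exercises a few of the listed operations. The file names are hypothetical placeholders, and `sc` is the `SparkContext` from the earlier examples:

```scala
import org.apache.spark.SparkContext._   // pair-RDD functions (imported automatically in the shell)

val pairs = sc.textFile("input.txt").flatMap(line => line.split(" ")).map(word => (word, 1))

// groupByKey returns (K, Iterable<V>); for a plain sum, reduceByKey is preferred because it
// combines values on each partition before shuffling.
val grouped = pairs.groupByKey()          // RDD[(String, Iterable[Int])]
val counts  = pairs.reduceByKey(_ + _)    // RDD[(String, Int)]

val uniqueWords = pairs.keys.distinct()   // transformation: distinct elements
val perWord = pairs.countByKey()          // action: per-key counts returned to the driver as a map
counts.saveAsTextFile("counts-output")    // action: writes one line of text per element
```
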
## RDD Persistence
One of the most important capabilities in Spark is *persisting* (or *caching*) a dataset in memory
across operations. When you persist an RDD, each node stores any partitions of it that it computes in
memory and reuses them in other actions on that dataset (or datasets derived from it). This allows
future actions to be much faster (often by more than 10x). Caching is a key tool for
iterative algorithms and fast interactive use.
You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time
it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant --
if any partition of an RDD is lost, it will automatically be recomputed using the transformations
that originally created it.
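
For example, a minimal caching sketch (the log file name is hypothetical; `sc` is the SparkContext):

```scala
val lines  = sc.textFile("access.log")                             // hypothetical input file
val errors = lines.filter(line => line.contains("ERROR")).cache()  // cache() == persist() at the default level

errors.count()                                            // first action: computes `errors` and caches its partitions
errors.filter(line => line.contains("timeout")).count()   // later action: reuses the cached partitions
```
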
In addition, each persisted RDD can be stored using a different *storage level*, allowing you, for example,
to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space),
replicate it across nodes, or store it off-heap in [Tachyon](http://tachyon-project.org/).
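
For instance, a non-default level can be requested by passing a `StorageLevel` to `persist()` (a sketch with a hypothetical dataset):

```scala
import org.apache.spark.storage.StorageLevel

val events = sc.textFile("events.log")          // hypothetical input file
val fields = events.map(line => line.split("\t"))

fields.persist(StorageLevel.MEMORY_ONLY_SER)    // keep in memory, but as serialized Java objects
// Other levels include MEMORY_AND_DISK, DISK_ONLY, and replicated variants such as MEMORY_ONLY_2.
```
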

**Note:** *In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level.*

Spark also automatically persists some intermediate data in shuffle operations (e.g. `reduceByKey`), even without users calling `persist`. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call `persist` on the resulting RDD if they plan to reuse it.
### Which Storage Level to Choose?
make the objects much more space-efficient, but still reasonably fast to access.
* Don't spill to disk unless the functions that computed your datasets are expensive, or they filter
a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from
disk.
* Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve
* It significantly reduces garbage collection costs.
* Cached data is not lost if individual executors crash.
### Removing Data
Spark automatically monitors cache usage on each node and drops out old data partitions in a
least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for
it to fall out of the cache, use the `RDD.unpersist()` method.
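
For example (a sketch with a hypothetical cached RDD):

```scala
val lookup = sc.textFile("lookup.txt").cache()   // hypothetical input file
lookup.count()                                   // materializes the cached partitions
lookup.unpersist()                               // manually drop them instead of waiting for LRU eviction
```
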
# Shared Variables
Normally, when a function passed to a Spark operation (such as `map` or `reduce`) is executed on a

MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers
can add support for new types.
An accumulator is created from an initial value `v` by calling `SparkContext.accumulator(v)`. Tasks
running on the cluster can then add to it using the `add` method or the `+=` operator (in Scala and Python).
However, they cannot read its value.
Only the driver program can read the accumulator's value, using its `value` method.
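
A minimal sketch in Scala:

```scala
val accum = sc.accumulator(0)                        // accumulator created from the initial value 0
sc.parallelize(1 to 100).foreach(x => accum += x)    // tasks running on the cluster add to it with +=
println(accum.value)                                 // only the driver reads the value: prints 5050
```
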
# Where to Go from Here
You can see some [example Spark programs](http://spark.apache.org/examples.html) on the Spark website.
In addition, Spark includes several samples in the `examples` directory. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them by passing the class name to the `bin/run-example` script included in Spark.