Commit 3289ea4

Pulling in changes from apache#856
1 parent 106ee31 commit 3289ea4

File tree

2 files changed: +127 lines, -26 lines

docs/configuration.md

Lines changed: 39 additions & 26 deletions
@@ -3,15 +3,8 @@ layout: global
title: Spark Configuration
---

-Spark provides three locations to configure the system:
-
-* [Spark properties](#spark-properties) control most application parameters and can be set by
-passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext,
-or through the `conf/spark-defaults.conf` properties file.
-* [Environment variables](#environment-variables) can be used to set per-machine settings, such as
-the IP address, through the `conf/spark-env.sh` script on each node.
-* [Logging](#configuring-logging) can be configured through `log4j.properties`.
-
+* This will become a table of contents (this text will be scraped).
+{:toc}

# Spark Properties

@@ -149,7 +142,8 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.executor.memory</code></td>
<td>512m</td>
<td>
-Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. <code>512m</code>, <code>2g</code>).
+Amount of memory to use per executor process, in the same format as JVM memory strings
+(e.g. <code>512m</code>, <code>2g</code>).
</td>
</tr>
<tr>
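
As the introduction removed in the first hunk notes, properties such as `spark.executor.memory` can be set either in `conf/spark-defaults.conf` or programmatically on a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf). A minimal Scala sketch of the programmatic route (the master URL, app name, and values are illustrative placeholders, not part of this commit):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: set application-level properties before creating the SparkContext.
// Master URL, app name, and values are placeholders for illustration.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // hypothetical standalone master
  .setAppName("ConfigExample")
  .set("spark.executor.memory", "2g")      // JVM memory string, e.g. 512m, 2g
  .set("spark.speculation", "true")        // re-launch tasks that run slowly in a stage

val sc = new SparkContext(conf)
// ... run jobs ...
sc.stop()
```

The same keys can instead live in `conf/spark-defaults.conf` as whitespace-separated key/value pairs, which keeps per-cluster defaults out of application code.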
@@ -422,7 +416,8 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.files.overwrite</code></td>
<td>false</td>
<td>
-Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source.
+Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not
+match those of the source.
</td>
</tr>
<tr>
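
For context on the `spark.files.overwrite` entry above, a short hedged sketch of the `SparkContext.addFile()` path it governs; the local path and file name are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Sketch: distribute a side file to every executor. spark.files.overwrite decides
// whether a re-added file whose contents changed replaces the copy already present.
val conf = new SparkConf()
  .setMaster("local[2]")                    // placeholder master
  .setAppName("AddFileExample")
  .set("spark.files.overwrite", "true")

val sc = new SparkContext(conf)
sc.addFile("/tmp/lookup-table.txt")         // hypothetical local path

val lineCounts = sc.parallelize(1 to 4).map { _ =>
  // Tasks resolve the distributed copy through SparkFiles.get
  scala.io.Source.fromFile(SparkFiles.get("lookup-table.txt")).getLines().length
}.collect()
sc.stop()
```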
@@ -446,8 +441,9 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.tachyonStore.baseDir</code></td>
<td>System.getProperty("java.io.tmpdir")</td>
<td>
-Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by <code>spark.tachyonStore.url</code>.
-It can also be a comma-separated list of multiple directories on Tachyon file system.
+Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by
+<code>spark.tachyonStore.url</code>. It can also be a comma-separated list of multiple directories
+on Tachyon file system.
</td>
</tr>
<tr>
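
The Tachyon properties above only take effect when an RDD is actually persisted off-heap. A hedged sketch (the Tachyon URL and base directory are placeholders, not values from this commit):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch: point the Tachyon-backed store at an explicit URL and base directory,
// then persist an RDD off-heap. Both settings below are illustrative placeholders.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("TachyonExample")
  .set("spark.tachyonStore.url", "tachyon://tachyon-master:19998")
  .set("spark.tachyonStore.baseDir", "/tmp/spark-tachyon")

val sc = new SparkContext(conf)
val data = sc.parallelize(1 to 1000000).persist(StorageLevel.OFF_HEAP)
println(data.count())
sc.stop()
```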
@@ -504,21 +500,33 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.akka.heartbeat.pauses</code></td>
<td>600</td>
<td>
-This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you plan to use this feature (Not recommended). Acceptable heart beat pause in seconds for akka. This can be used to control sensitivity to gc pauses. Tune this in combination of `spark.akka.heartbeat.interval` and `spark.akka.failure-detector.threshold` if you need to.
+This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you
+plan to use this feature (Not recommended). Acceptable heart beat pause in seconds for akka. This can be used to
+control sensitivity to gc pauses. Tune this in combination of `spark.akka.heartbeat.interval` and
+`spark.akka.failure-detector.threshold` if you need to.
</td>
</tr>
<tr>
<td><code>spark.akka.failure-detector.threshold</code></td>
<td>300.0</td>
<td>
-This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you plan to use this feature (Not recommended). This maps to akka's `akka.remote.transport-failure-detector.threshold`. Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to.
+This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you
+plan to use this feature (Not recommended). This maps to akka's `akka.remote.transport-failure-detector.threshold`.
+Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to.
</td>
</tr>
<tr>
<td><code>spark.akka.heartbeat.interval</code></td>
<td>1000</td>
<td>
-This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you plan to use this feature (Not recommended). A larger interval value in seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative for akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. Only positive use case for using failure detector can be, a sensitive failure detector can help evict rogue executors really quick. However this is usually not the case as gc pauses and network lags are expected in a real spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the network with those.
+This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you
+plan to use this feature (Not recommended). A larger interval value in seconds reduces network overhead and a
+smaller value ( ~ 1 s) might be more informative for akka's failure detector. Tune this in combination
+of `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. Only positive use
+case for using failure detector can be, a sensitive failure detector can help evict rogue executors really
+quick. However this is usually not the case as gc pauses and network lags are expected in a real spark cluster.
+Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the
+network with those.
</td>
</tr>
</table>
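
The three Akka settings above are meant to be tuned together. A hedged sketch of what re-enabling the failure detector might look like (the numbers are illustrative only; as the table says, leaving it disabled is the recommended default):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: tune the related Akka heartbeat/failure-detector settings as a group.
// Values are illustrative, not recommendations from the Spark docs.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("AkkaTuningExample")
  .set("spark.akka.heartbeat.interval", "5")          // seconds between heartbeats
  .set("spark.akka.heartbeat.pauses", "60")           // acceptable heartbeat pause, in seconds
  .set("spark.akka.failure-detector.threshold", "12.0")

val sc = new SparkContext(conf)
```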
@@ -578,7 +586,8 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.speculation</code></td>
<td>false</td>
<td>
-If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
+If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a
+stage, they will be re-launched.
</td>
</tr>
<tr>
@@ -739,13 +748,13 @@ Apart from these, the following properties are also available, and may be useful

# Environment Variables

-Certain Spark settings can be configured through environment variables, which are read from the `conf/spark-env.sh`
-script in the directory where Spark is installed (or `conf/spark-env.cmd` on Windows). In Standalone and Mesos modes,
-this file can give machine specific information such as hostnames. It is also sourced when running local
-Spark applications or submission scripts.
+Certain Spark settings can be configured through environment variables, which are read from the
+`conf/spark-env.sh` script in the directory where Spark is installed (or `conf/spark-env.cmd` on
+Windows). In Standalone and Mesos modes, this file can give machine specific information such as
+hostnames. It is also sourced when running local Spark applications or submission scripts.

-Note that `conf/spark-env.sh` does not exist by default when Spark is installed. However, you can copy
-`conf/spark-env.sh.template` to create it. Make sure you make the copy executable.
+Note that `conf/spark-env.sh` does not exist by default when Spark is installed. However, you can
+copy `conf/spark-env.sh.template` to create it. Make sure you make the copy executable.

The following variables can be set in `spark-env.sh`:

@@ -770,12 +779,16 @@ The following variables can be set in `spark-env.sh`:
</tr>
</table>

-In addition to the above, there are also options for setting up the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores to use on each machine and maximum memory.
+In addition to the above, there are also options for setting up the Spark
+[standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores to use on each
+machine and maximum memory.

Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might
compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface.

# Configuring Logging

-Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a `log4j.properties`
-file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there.
+Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a
+`log4j.properties` file in the `conf` directory. One way to start is to copy the existing
+`log4j.properties.template` located there.
+</table>
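
The context lines above keep the note that `SPARK_LOCAL_IP` can be computed by looking up the IP of a specific network interface. A hedged Scala sketch of that lookup (the interface name "eth0" is a placeholder; in practice this logic usually lives in `spark-env.sh` as shell):

```scala
import java.net.{Inet4Address, NetworkInterface}
import scala.collection.JavaConverters._

// Sketch: find the first non-loopback IPv4 address on a named interface,
// the same idea the docs describe for computing SPARK_LOCAL_IP.
def firstIpv4(interfaceName: String): Option[String] =
  Option(NetworkInterface.getByName(interfaceName)).flatMap { nic =>
    nic.getInetAddresses.asScala
      .find(addr => addr.isInstanceOf[Inet4Address] && !addr.isLoopbackAddress)
      .map(_.getHostAddress)
  }

// "eth0" is a hypothetical interface name.
println(firstIpv4("eth0").getOrElse("127.0.0.1"))
```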

docs/spark-standalone.md

Lines changed: 88 additions & 0 deletions
@@ -286,6 +286,94 @@ In addition, detailed log output for each job is also written to the work direct
You can run Spark alongside your existing Hadoop cluster by just launching it as a separate service on the same machines. To access Hadoop data from Spark, just use a hdfs:// URL (typically `hdfs://<namenode>:9000/path`, but you can find the right URL on your Hadoop Namenode's web UI). Alternatively, you can set up a separate cluster for Spark, and still have it access HDFS over the network; this will be slower than disk-local access, but may not be a concern if you are still running in the same local area network (e.g. you place a few Spark machines on each rack that you have Hadoop on).


+# Configuring Ports for Network Security
+
+Spark makes heavy use of the network, and some environments have strict requirements for using tight
+firewall settings. Below are the primary ports that Spark uses for its communication and how to
+configure those ports.
+
+<table class="table">
+<tr>
+<th>From</th><th>To</th><th>Default Port</th><th>Purpose</th><th>Configuration
+Setting</th><th>Notes</th>
+</tr>
+<!-- Web UIs -->
+<tr>
+<td>Browser</td>
+<td>Standalone Cluster Master</td>
+<td>8080</td>
+<td>Web UI</td>
+<td><code>master.ui.port</code></td>
+<td>Jetty-based</td>
+</tr>
+<tr>
+<td>Browser</td>
+<td>Driver</td>
+<td>4040</td>
+<td>Web UI</td>
+<td><code>spark.ui.port</code></td>
+<td>Jetty-based</td>
+</tr>
+<tr>
+<td>Browser</td>
+<td>History Server</td>
+<td>18080</td>
+<td>Web UI</td>
+<td><code>spark.history.ui.port</code></td>
+<td>Jetty-based</td>
+</tr>
+<tr>
+<td>Browser</td>
+<td>Worker</td>
+<td>8081</td>
+<td>Web UI</td>
+<td><code>worker.ui.port</code></td>
+<td>Jetty-based</td>
+</tr>
+<!-- Cluster interactions -->
+<tr>
+<td>Application</td>
+<td>Standalone Cluster Master</td>
+<td>7077</td>
+<td>Submit job to cluster</td>
+<td><code>spark.driver.port</code></td>
+<td>Akka-based. Set to "0" to choose a port randomly</td>
+</tr>
+<tr>
+<td>Worker</td>
+<td>Standalone Cluster Master</td>
+<td>7077</td>
+<td>Join cluster</td>
+<td><code>spark.driver.port</code></td>
+<td>Akka-based. Set to "0" to choose a port randomly</td>
+</tr>
+<tr>
+<td>Application</td>
+<td>Worker</td>
+<td>(random)</td>
+<td>Join cluster</td>
+<td><code>SPARK_WORKER_PORT</code> (standalone cluster)</td>
+<td>Akka-based</td>
+</tr>
+
+<!-- Other misc stuff -->
+<tr>
+<td>Driver and other Workers</td>
+<td>Worker</td>
+<td>(random)</td>
+<td>
+<ul>
+<li>File server for files and jars</li>
+<li>Http Broadcast</li>
+<li>Class file server (Spark Shell only)</li>
+</ul>
+</td>
+<td>None</td>
+<td>Jetty-based. Each of these services starts on a random port that cannot be configured</td>
+</tr>
+
+</table>
+
# High Availability

By default, standalone scheduling clusters are resilient to Worker failures (insofar as Spark itself is resilient to losing work by moving it to other workers). However, the scheduler uses a Master to make scheduling decisions, and this (by default) creates a single point of failure: if the Master crashes, no new applications can be created. In order to circumvent this, we have two high availability schemes, detailed below.
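
The ports table added above mixes SparkConf properties (for example `spark.ui.port` and `spark.driver.port`) with environment variables such as `SPARK_WORKER_PORT` that are set on the cluster side. A hedged sketch of the application-side half (the port numbers and master hostname are illustrative placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: pin the driver-side ports from the table so a firewall can allow them
// explicitly. Hostname and port numbers are placeholders for illustration.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // standalone master port from the table
  .setAppName("PortsExample")
  .set("spark.ui.port", "4040")            // driver web UI (Jetty-based)
  .set("spark.driver.port", "51000")       // Akka-based driver port; "0" picks a random port

val sc = new SparkContext(conf)
// Worker-side ports such as SPARK_WORKER_PORT are configured in the cluster's
// spark-env.sh / launch scripts rather than in application code.
```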
