This repository was archived by the owner on Nov 15, 2024. It is now read-only.

Commit 74ac1fb

[SPARK-21267][DOCS][MINOR] Follow up to avoid referencing programming-guide redirector
## What changes were proposed in this pull request?

Update internal references from programming-guide to rdd-programming-guide.

See apache/spark-website@5ddf243 and apache#18485 (comment)

Let's keep the redirector even if it's problematic to build, but not rely on it internally.

## How was this patch tested?

(Doc build)

Author: Sean Owen <sowen@cloudera.com>

Closes apache#18625 from srowen/SPARK-21267.2.
1 parent ac5d5d7 commit 74ac1fb

9 files changed: 20 additions & 14 deletions

R/pkg/R/DataFrame.R (1 addition, 1 deletion)

```diff
@@ -593,7 +593,7 @@ setMethod("cache",
 #'
 #' Persist this SparkDataFrame with the specified storage level. For details of the
 #' supported storage levels, refer to
-#' \url{http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence}.
+#' \url{http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence}.
 #'
 #' @param x the SparkDataFrame to persist.
 #' @param newLevel storage level chosen for the persistance. See available options in
```

R/pkg/R/RDD.R (1 addition, 1 deletion)

```diff
@@ -227,7 +227,7 @@ setMethod("cacheRDD",
 #'
 #' Persist this RDD with the specified storage level. For details of the
 #' supported storage levels, refer to
-#'\url{http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence}.
+#'\url{http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence}.
 #'
 #' @param x The RDD to persist
 #' @param newLevel The new storage level to be assigned
```

docs/graphx-programming-guide.md (1 addition, 1 deletion)

```diff
@@ -27,7 +27,7 @@ description: GraphX graph processing library guide for Spark SPARK_VERSION_SHORT
 [EdgeContext]: api/scala/index.html#org.apache.spark.graphx.EdgeContext
 [GraphOps.collectNeighborIds]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighborIds(EdgeDirection):VertexRDD[Array[VertexId]]
 [GraphOps.collectNeighbors]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighbors(EdgeDirection):VertexRDD[Array[(VertexId,VD)]]
-[RDD Persistence]: programming-guide.html#rdd-persistence
+[RDD Persistence]: rdd-programming-guide.html#rdd-persistence
 [Graph.cache]: api/scala/index.html#org.apache.spark.graphx.Graph@cache():Graph[VD,ED]
 [GraphOps.pregel]: api/scala/index.html#org.apache.spark.graphx.GraphOps@pregel[A](A,Int,EdgeDirection)((VertexId,VD,A)⇒VD,(EdgeTriplet[VD,ED])⇒Iterator[(VertexId,A)],(A,A)⇒A)(ClassTag[A]):Graph[VD,ED]
 [PartitionStrategy]: api/scala/index.html#org.apache.spark.graphx.PartitionStrategy$
```

docs/index.md (1 addition, 1 deletion)

```diff
@@ -87,7 +87,7 @@ options for deployment:
 **Programming Guides:**
 
 * [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
-* [RDD Programming Guide](programming-guide.html): overview of Spark basics - RDDs (core but old API), accumulators, and broadcast variables
+* [RDD Programming Guide](rdd-programming-guide.html): overview of Spark basics - RDDs (core but old API), accumulators, and broadcast variables
 * [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): processing structured data with relational queries (newer API than RDDs)
 * [Structured Streaming](structured-streaming-programming-guide.html): processing structured data streams with relation queries (using Datasets and DataFrames, newer API than DStreams)
 * [Spark Streaming](streaming-programming-guide.html): processing data streams using DStreams (old API)
```

docs/ml-guide.md (1 addition, 1 deletion)

```diff
@@ -18,7 +18,7 @@ At a high level, it provides tools such as:
 
 **The MLlib RDD-based API is now in maintenance mode.**
 
-As of Spark 2.0, the [RDD](programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
+As of Spark 2.0, the [RDD](rdd-programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
 The primary Machine Learning API for Spark is now the [DataFrame](sql-programming-guide.html)-based API in the `spark.ml` package.
 
 *What are the implications?*
```

docs/mllib-optimization.md (1 addition, 1 deletion)

```diff
@@ -116,7 +116,7 @@ is a stochastic gradient. Here `$S$` is the sampled subset of size `$|S|=$ miniB
 $\cdot n$`.
 
 In each iteration, the sampling over the distributed dataset
-([RDD](programming-guide.html#resilient-distributed-datasets-rdds)), as well as the
+([RDD](rdd-programming-guide.html#resilient-distributed-datasets-rdds)), as well as the
 computation of the sum of the partial results from each worker machine is performed by the
 standard spark routines.
 
```

docs/spark-standalone.md (1 addition, 1 deletion)

```diff
@@ -264,7 +264,7 @@ SPARK_WORKER_OPTS supports the following system properties:
 # Connecting an Application to the Cluster
 
 To run an application on the Spark cluster, simply pass the `spark://IP:PORT` URL of the master as to the [`SparkContext`
-constructor](programming-guide.html#initializing-spark).
+constructor](rdd-programming-guide.html#initializing-spark).
 
 To run an interactive Spark shell against the cluster, run the following command:
 
```

docs/streaming-programming-guide.md (10 additions, 4 deletions)

```diff
@@ -535,7 +535,7 @@ After a context is defined, you have to do the following.
 It represents a continuous stream of data, either the input data stream received from source,
 or the processed data stream generated by transforming the input stream. Internally,
 a DStream is represented by a continuous series of RDDs, which is Spark's abstraction of an immutable,
-distributed dataset (see [Spark Programming Guide](programming-guide.html#resilient-distributed-datasets-rdds) for more details). Each RDD in a DStream contains data from a certain interval,
+distributed dataset (see [Spark Programming Guide](rdd-programming-guide.html#resilient-distributed-datasets-rdds) for more details). Each RDD in a DStream contains data from a certain interval,
 as shown in the following figure.
 
 <p style="text-align: center;">
@@ -1531,7 +1531,7 @@ default persistence level is set to replicate the data to two nodes for fault-to
 
 Note that, unlike RDDs, the default persistence level of DStreams keeps the data serialized in
 memory. This is further discussed in the [Performance Tuning](#memory-tuning) section. More
-information on different persistence levels can be found in the [Spark Programming Guide](programming-guide.html#rdd-persistence).
+information on different persistence levels can be found in the [Spark Programming Guide](rdd-programming-guide.html#rdd-persistence).
 
 ***
 
@@ -1720,7 +1720,13 @@ batch interval that is at least 10 seconds. It can be set by using
 
 ## Accumulators, Broadcast Variables, and Checkpoints
 
-[Accumulators](programming-guide.html#accumulators) and [Broadcast variables](programming-guide.html#broadcast-variables) cannot be recovered from checkpoint in Spark Streaming. If you enable checkpointing and use [Accumulators](programming-guide.html#accumulators) or [Broadcast variables](programming-guide.html#broadcast-variables) as well, you'll have to create lazily instantiated singleton instances for [Accumulators](programming-guide.html#accumulators) and [Broadcast variables](programming-guide.html#broadcast-variables) so that they can be re-instantiated after the driver restarts on failure. This is shown in the following example.
+[Accumulators](rdd-programming-guide.html#accumulators) and [Broadcast variables](rdd-programming-guide.html#broadcast-variables)
+cannot be recovered from checkpoint in Spark Streaming. If you enable checkpointing and use
+[Accumulators](rdd-programming-guide.html#accumulators) or [Broadcast variables](rdd-programming-guide.html#broadcast-variables)
+as well, you'll have to create lazily instantiated singleton instances for
+[Accumulators](rdd-programming-guide.html#accumulators) and [Broadcast variables](rdd-programming-guide.html#broadcast-variables)
+so that they can be re-instantiated after the driver restarts on failure.
+This is shown in the following example.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -2182,7 +2188,7 @@ overall processing throughput of the system, its use is still recommended to ach
 consistent batch processing times. Make sure you set the CMS GC on both the driver (using `--driver-java-options` in `spark-submit`) and the executors (using [Spark configuration](configuration.html#runtime-environment) `spark.executor.extraJavaOptions`).
 
 * **Other tips**: To further reduce GC overheads, here are some more tips to try.
-  - Persist RDDs using the `OFF_HEAP` storage level. See more detail in the [Spark Programming Guide](programming-guide.html#rdd-persistence).
+  - Persist RDDs using the `OFF_HEAP` storage level. See more detail in the [Spark Programming Guide](rdd-programming-guide.html#rdd-persistence).
   - Use more executors with smaller heap sizes. This will reduce the GC pressure within each JVM heap.
 
 ***
```
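The streaming-guide hunk above describes, but does not include, the lazily instantiated singleton pattern it references. As a hedged illustration only — this is not code from the commit, it has no Spark dependency, and `LazySingleton`/`get_instance` are invented names for this sketch (real Spark Streaming code would call `sc.accumulator(...)` or `sc.broadcast(...)` in the factory) — the core idea is:

```python
class LazySingleton:
    """Create the wrapped value on first access, so it can be rebuilt
    after a driver restart instead of being restored from a checkpoint."""
    _instance = None

    @classmethod
    def get_instance(cls, factory):
        # Re-created lazily on first use after a restart; reused thereafter.
        if cls._instance is None:
            cls._instance = factory()
        return cls._instance


counter = LazySingleton.get_instance(lambda: {"dropped_words": 0})
counter["dropped_words"] += 1
same = LazySingleton.get_instance(lambda: {"dropped_words": 0})
assert same is counter  # subsequent calls reuse the single instance
```

Because the instance is produced on demand rather than serialized into the checkpoint, a restarted driver simply constructs a fresh one on first use.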

docs/tuning.md (3 additions, 3 deletions)

```diff
@@ -12,7 +12,7 @@ Because of the in-memory nature of most Spark computations, Spark programs can b
 by any resource in the cluster: CPU, network bandwidth, or memory.
 Most often, if the data fits in memory, the bottleneck is network bandwidth, but sometimes, you
 also need to do some tuning, such as
-[storing RDDs in serialized form](programming-guide.html#rdd-persistence), to
+[storing RDDs in serialized form](rdd-programming-guide.html#rdd-persistence), to
 decrease memory usage.
 This guide will cover two main topics: data serialization, which is crucial for good network
 performance and can also reduce memory use, and memory tuning. We also sketch several smaller topics.
@@ -155,7 +155,7 @@ pointer-based data structures and wrapper objects. There are several ways to do
 
 When your objects are still too large to efficiently store despite this tuning, a much simpler way
 to reduce memory usage is to store them in *serialized* form, using the serialized StorageLevels in
-the [RDD persistence API](programming-guide.html#rdd-persistence), such as `MEMORY_ONLY_SER`.
+the [RDD persistence API](rdd-programming-guide.html#rdd-persistence), such as `MEMORY_ONLY_SER`.
 Spark will then store each RDD partition as one large byte array.
 The only downside of storing data in serialized form is slower access times, due to having to
 deserialize each object on the fly.
@@ -262,7 +262,7 @@ number of cores in your clusters.
 
 ## Broadcasting Large Variables
 
-Using the [broadcast functionality](programming-guide.html#broadcast-variables)
+Using the [broadcast functionality](rdd-programming-guide.html#broadcast-variables)
 available in `SparkContext` can greatly reduce the size of each serialized task, and the cost
 of launching a job over a cluster. If your tasks use any large object from the driver program
 inside of them (e.g. a static lookup table), consider turning it into a broadcast variable.
```
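The tuning.md hunks above contrast storing live objects with serialized storage (`MEMORY_ONLY_SER`): one byte array per partition, less memory, slower access. A minimal sketch of that trade-off, using Python's `pickle` as a stand-in for Spark's serializer — illustrative only, not code from this commit:

```python
import pickle

# A toy "partition" of key/value records.
partition = [(i, "word-%d" % i) for i in range(1000)]

# MEMORY_ONLY-like: live objects, fast access, per-object overhead.
live = partition

# MEMORY_ONLY_SER-like: the whole partition as one contiguous byte array.
serialized = pickle.dumps(partition)


def read_serialized(blob):
    # Every access deserializes the partition on the fly -- the
    # slower-access downside the guide describes.
    return pickle.loads(blob)


assert read_serialized(serialized) == live
```

The same reasoning motivates the broadcast hunk: shipping one serialized copy of a large read-only value per executor is cheaper than embedding it in every task.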
