6 changes: 3 additions & 3 deletions docs/_layouts/global.html
@@ -69,14 +69,14 @@
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="quick-start.html">Quick Start</a></li>
<li><a href="programming-guide.html">Spark Programming Guide</a></li>
<li class="divider"></li>
<li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
<li><a href="sql-programming-guide.html">DataFrames, Datasets and SQL</a></li>
<li><a href="structured-streaming-programming-guide.html">Structured Streaming</a></li>
<li><a href="ml-guide.html">MLlib (Machine Learning)</a></li>
<li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
<li><a href="sparkr.html">SparkR (R on Spark)</a></li>
<li class="divider"></li>
@tdas (Contributor, Author) commented on Jun 30, 2017:

Moved RDD and DStream guides below a divider since we are trying to de-emphasize those. But kept the Accumulators and Broadcast vars because that part of the guide is still useful.

<li><a href="rdd-programming-guide.html">RDDs, Accumulators, Broadcasts Vars</a></li>
Member commented:

Actually, I think we probably still need to keep this after "Quick Start". The following sections inside this page are pretty important for most Spark modules:

  • Overview
  • Linking with Spark
  • Initializing Spark
  • Using the Shell
  • Passing Functions to Spark
  • Understanding closures
  • Local vs. cluster modes
  • Deploying to a Cluster
  • Launching Spark jobs from Java / Scala
  • Unit Testing

@tdas (Contributor, Author) replied:

@zsxwing How about we add a short section in the SQL Programming Guide that points to these sections in the RDD programming guide?

<li><a href="streaming-programming-guide.html">Spark Streaming (DStreams)</a></li>
</ul>
</li>

17 changes: 9 additions & 8 deletions docs/index.md
@@ -8,7 +8,8 @@ description: Apache Spark SPARK_VERSION_SHORT documentation homepage
Apache Spark is a fast and general-purpose cluster computing system.
It provides high-level APIs in Java, Scala, Python and R,
and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](ml-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).

It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [Structured Streaming](structured-streaming-programming-guide.html) for streaming, [MLlib](ml-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing.

# Downloading

@@ -88,13 +89,13 @@ options for deployment:
**Programming Guides:**

* [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
* [Spark Programming Guide](programming-guide.html): detailed overview of Spark
in all supported languages (Scala, Java, Python, R)
* Modules built on Spark:
* [Spark Streaming](streaming-programming-guide.html): processing real-time data streams
* [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): support for structured data and relational queries
* [MLlib](ml-guide.html): built-in machine learning library
* [GraphX](graphx-programming-guide.html): Spark's new API for graph processing
* [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): processing structured data
* [Structured Streaming](structured-streaming-programming-guide.html): processing structured data streams
* [Spark RDD Programming Guide](rdd-programming-guide.html): processing using RDDs (old API)
Member commented:

Same as above.

* Contains additional features (accumulator, broacast vars, etc.) that can be used with Datasets and DataFrames
* [Spark Streaming](streaming-programming-guide.html): processing data streams using DStreams/RDDs (old API)
* [MLlib](ml-guide.html): applying machine learning
* [GraphX](graphx-programming-guide.html): processing graphs

**API Docs:**

7 changes: 0 additions & 7 deletions docs/java-programming-guide.md

This file was deleted.

7 changes: 7 additions & 0 deletions docs/programming-guide.md
@@ -0,0 +1,7 @@
---
layout: global
title: Spark Programming Guide
redirect: rdd-programming-guide.html
---

This document has moved [here](rdd-programming-guide.html).
7 changes: 0 additions & 7 deletions docs/python-programming-guide.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/rdd-programming-guide.md
@@ -1,6 +1,6 @@
---
layout: global
title: Spark Programming Guide
title: RDD Programming Guide
description: Spark SPARK_VERSION_SHORT programming guide in Java, Scala and Python
---

7 changes: 0 additions & 7 deletions docs/scala-programming-guide.md

This file was deleted.

85 changes: 71 additions & 14 deletions docs/structured-streaming-programming-guide.md
@@ -15,7 +15,7 @@ In this guide, we are going to walk you through the programming model and the APIs.
# Quick Example
Let’s say you want to maintain a running word count of text data received from a data server listening on a TCP socket. Let’s see how you can express this using Structured Streaming. You can see the full code in
[Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java)/[Python]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/python/sql/streaming/structured_network_wordcount.py)/[R]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/r/streaming/structured_network_wordcount.R).
And if you [download Spark](http://spark.apache.org/downloads.html), you can directly run the example. In any case, let’s walk through the example step-by-step and understand how it works. First, we have to import the necessary classes and create a local SparkSession, the starting point of all functionalities related to Spark.
And if you [download Spark](http://spark.apache.org/downloads.html), you can directly [run the example](index.html#running-the-examples-and-shell). In any case, let’s walk through the example step-by-step and understand how it works. First, we have to import the necessary classes and create a local SparkSession, the starting point of all functionalities related to Spark.

<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -450,7 +450,12 @@ running counts with the new data to compute updated counts, as shown below.

![Model](img/structured-streaming-example-model.png)

This model is significantly different from many other stream processing
**Note that Structured Streaming does not materialize the entire table**. It reads the latest
available data from the streaming data source, processes it incrementally to update the result,
and then discards the source data. It only keeps around the minimal intermediate *state* data as
required to update the result (e.g. intermediate counts in the earlier example).

This model is significantly different from many other stream processing
engines. Many streaming systems require the user to maintain running
aggregations themselves, thus having to reason about fault-tolerance, and
data consistency (at-least-once, or at-most-once, or exactly-once). In this
@@ -490,7 +495,7 @@ In Spark 2.0, there are a few built-in sources.

- **File source** - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.

- **Kafka source** - Poll data from Kafka. It's compatible with Kafka broker versions 0.10.0 or higher. See the [Kafka Integration Guide](structured-streaming-kafka-integration.html) for more details.
- **Kafka source** - Reads data from Kafka. It's compatible with Kafka broker versions 0.10.0 or higher. See the [Kafka Integration Guide](structured-streaming-kafka-integration.html) for more details.

- **Socket source (for testing)** - Reads UTF8 text data from a socket connection. The listening server socket is at the driver. Note that this should be used only for testing as this does not provide end-to-end fault-tolerance guarantees.
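
All of these are created through the `DataStreamReader` returned by `spark.readStream`. A minimal sketch of reading from the socket and file sources (assuming an existing `SparkSession` named `spark`; the host, port, directory, and schema below are placeholders, not values from this PR):

{% highlight scala %}
import org.apache.spark.sql.types.StructType

// Socket source (testing only): reads UTF8 text lines from a socket
val socketDF = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// File source: reads new CSV files atomically placed into a directory;
// a schema must be provided for file sources
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark.readStream
  .option("sep", ";")
  .schema(userSchema)
  .csv("/path/to/directory")
{% endhighlight %}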

@@ -754,10 +759,22 @@ select(where(df, "signal > 10"), "device")

# Running count of the number of updates for each device type
count(groupBy(df, "deviceType"))
{% endhighlight %}
</div>
</div>

You can also register a streaming DataFrame/Dataset as a temporary view and then apply SQL commands on it.
{% highlight scala %}
Member commented:

Enclose this in

    <div class="codetabs">
      <div data-lang="scala" markdown="1">

? Or add examples in Java, Python, R too?

@tdas (Contributor, Author) replied on Jul 6, 2017:

Can you give me the example in R? I really have no idea how to write anything in R. :)

@felixcheung (Member) replied on Jul 6, 2017:

Sure. Python would look exactly the same, except `#` for the comment. For R:

    createOrReplaceTempView(df, "updates")
    sql("select count(*) from updates")  # returns another streaming SparkDataFrame


df.createOrReplaceTempView("updates")
spark.sql("select count(*) from updates") // returns another streaming DF
{% endhighlight %}

Note, you can identify whether a DataFrame/Dataset has streaming data or not by using
`df.isStreaming()`.

### Window Operations on Event Time
Aggregations over a sliding event-time window are straightforward with Structured Streaming and are very similar to grouped aggregations. In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column. In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into. Let's understand this with an illustration.

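A minimal sketch of such a windowed aggregation (assuming a streaming DataFrame `words` with columns `timestamp` and `word`, and an existing `SparkSession` named `spark`; the window sizes are illustrative):

{% highlight scala %}
import org.apache.spark.sql.functions.window
import spark.implicits._

// Count words within 10-minute windows sliding every 5 minutes,
// grouped by both the event-time window and the word itself
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
{% endhighlight %}
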
@@ -1043,7 +1060,7 @@ streamingDf \
</div>

### Arbitrary Stateful Operations
Many uscases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger. Since Spark 2.2, this can be done using the operation `mapGroupsWithState` and the more powerful operation `flatMapGroupsWithState`. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state. For more concrete details, take a look at the API documentation ([Scala](api/scala/index.html#org.apache.spark.sql.streaming.GroupState)/[Java](api/java/org/apache/spark/sql/streaming/GroupState.html)) and the examples ([Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredSessionization.java)).
Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger. Since Spark 2.2, this can be done using the operation `mapGroupsWithState` and the more powerful operation `flatMapGroupsWithState`. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state. For more concrete details, take a look at the API documentation ([Scala](api/scala/index.html#org.apache.spark.sql.streaming.GroupState)/[Java](api/java/org/apache/spark/sql/streaming/GroupState.html)) and the examples ([Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredSessionization.java)).
@tdas (Contributor, Author) commented:

typo: uscases -> usecases


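A hedged sketch of the `mapGroupsWithState` pattern described in the paragraph above; the `Event`/`SessionInfo`/`SessionUpdate` case classes and the per-session counting logic are illustrative assumptions, not code from this PR:

{% highlight scala %}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._  // `spark` is an existing SparkSession

case class Event(sessionId: String, timestamp: java.sql.Timestamp)
case class SessionInfo(numEvents: Int)
case class SessionUpdate(sessionId: String, numEvents: Int)

// `events` is a Dataset[Event] read from a streaming source
val sessionUpdates = events
  .groupByKey(event => event.sessionId)
  .mapGroupsWithState[SessionInfo, SessionUpdate](GroupStateTimeout.NoTimeout) {
    (sessionId: String, eventsInTrigger: Iterator[Event], state: GroupState[SessionInfo]) =>
      // Merge this trigger's events into the state kept for the session
      val previous = if (state.exists) state.get else SessionInfo(0)
      val updated = SessionInfo(previous.numEvents + eventsInTrigger.size)
      state.update(updated)
      SessionUpdate(sessionId, updated.numEvents)
  }
{% endhighlight %}
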
### Unsupported Operations
There are a few DataFrame/Dataset operations that are not supported with streaming DataFrames/Datasets.
@@ -1201,6 +1218,16 @@ writeStream
.start()
{% endhighlight %}

- **Kafka sink** - Stores the output to one or more topics in Kafka.

{% highlight scala %}
writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "updates")
.start()
{% endhighlight %}

- **Foreach sink** - Runs arbitrary computation on the records in the output. See later in the section for more details.

{% highlight scala %}
@@ -1253,12 +1280,19 @@ Here are the details of all the sinks in Spark.
href="api/R/write.stream.html">R</a>).
E.g. for "parquet" format options see <code>DataFrameWriter.parquet()</code>
</td>
<td>Yes</td>
<td>Yes (exactly-once)</td>
<td>Supports writes to partitioned tables. Partitioning by time may be useful.</td>
</tr>
<tr>
<td><b>Kafka Sink</b></td>
<td>Append, Update, Complete</td>
<td>See the <a href="structured-streaming-kafka-integration.html">Kafka Integration Guide</a></td>
<td>Yes (at-least-once)</td>
<td>More details in the <a href="structured-streaming-kafka-integration.html">Kafka Integration Guide</a></td>
</tr>
<tr>
<td><b>Foreach Sink</b></td>
<td>Append, Update, Compelete</td>
<td>Append, Update, Complete</td>
<td>None</td>
<td>Depends on ForeachWriter implementation</td>
<td>More details in the <a href="#using-foreach">next section</a></td>
@@ -1624,10 +1658,9 @@ Not available in R.


## Monitoring Streaming Queries
There are two APIs for monitoring and debugging active queries -
interactively and asynchronously.
There are multiple ways to monitor active streaming queries. You can either push metrics to external systems using Spark's Dropwizard Metrics support, or access them programmatically.

### Interactive APIs
### Reading Metrics Interactively

You can directly get the current status and metrics of an active query using
`streamingQuery.lastProgress()` and `streamingQuery.status()`.
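
For example (a sketch, assuming `query` is the `StreamingQuery` handle returned by `DataStreamWriter.start()`):

{% highlight scala %}
// `query` is an active StreamingQuery started elsewhere
println(query.status)        // current status, e.g. whether it is waiting for new data
println(query.lastProgress)  // metrics of the most recently completed trigger
{% endhighlight %}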
@@ -1857,7 +1890,7 @@ Will print something like the following.
</div>
</div>

### Asynchronous API
### Reporting Metrics programmatically using Asynchronous APIs

You can also asynchronously monitor all queries associated with a
`SparkSession` by attaching a `StreamingQueryListener`
@@ -1922,6 +1955,15 @@ Not available in R.
</div>
</div>
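
A sketch of attaching such a listener (assuming an existing `SparkSession` named `spark`; the println bodies are placeholders):

{% highlight scala %}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println("Query started: " + event.id)
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println("Query made progress: " + event.progress)
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println("Query terminated: " + event.id)
})
{% endhighlight %}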

### Reporting Metrics using Dropwizard
Spark supports reporting metrics using the [Dropwizard Library](monitoring.html#metrics). To enable metrics of Structured Streaming queries to be reported as well, you have to explicitly enable the configuration `spark.sql.streaming.metricsEnabled` in the SparkSession.

{% highlight bash %}
spark.conf().set("spark.sql.streaming.metricsEnabled", "true")
Member commented:

I assume this is Java code. Do you need to add examples for Scala and Python?

{% endhighlight %}
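
A Scala equivalent of this setting might look like the following (a sketch, assuming an existing `SparkSession` named `spark`):

{% highlight scala %}
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")
// or, equivalently, through SQL
spark.sql("SET spark.sql.streaming.metricsEnabled=true")
{% endhighlight %}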

All queries started in the SparkSession after this configuration has been enabled will report metrics through Dropwizard to whatever [sinks](monitoring.html#metrics) have been configured (e.g. Ganglia, Graphite, JMX, etc.).

## Recovering from Failures with Checkpointing
In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the [quick example](#quick-example)) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when [starting a query](#starting-streaming-queries).
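
A minimal sketch of setting that option when starting a query (assuming `aggDF` is a streaming DataFrame; the path is a placeholder):

{% highlight scala %}
aggDF
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .format("memory")
  .start()
{% endhighlight %}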

@@ -1971,8 +2013,23 @@ write.stream(aggDF, "memory", outputMode = "complete", checkpointLocation = "pat
</div>
</div>

# Where to go from here
- Examples: See and run the
[Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming)/[Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples/sql/streaming)/[Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/sql/streaming)/[R]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/r/streaming)
examples.
# Additional Information

**Further Reading**

- See and run the
[Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming)/[Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples/sql/streaming)/[Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/sql/streaming)/[R]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/r/streaming)
Member commented:

nit: could you change master to v{{site.SPARK_VERSION_SHORT}}?

@tdas (Contributor, Author) replied:

actually, the point is to link it to the latest example.

Member replied:

> actually, the point is to link it to the latest example.

But if we rename/remove these files, the links will become invalid in the docs for old versions. It should be fine to link to a special version when the user is looking at some old docs.

@tdas (Contributor, Author) replied:

Good point. Updating.

@tdas (Contributor, Author) added:

Also, if we add new stuff to the example, that are not present in old versions, then it will be confusing. So yeah, updating it.

examples.
- [Instructions](index.html#running-the-examples-and-shell) on how to run Spark examples
- Read about integrating with Kafka in the [Structured Streaming Kafka Integration Guide](structured-streaming-kafka-integration.html)
- Read more details about using DataFrames/Datasets in the [Spark SQL Programming Guide](sql-programming-guide.html)
- Third-party Blog Posts
- [Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1 (Databricks Blog)](https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html)
- [Real-Time End-to-End Integration with Apache Kafka in Apache Spark’s Structured Streaming (Databricks Blog)](https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html)
- [Event-time Aggregation and Watermarking in Apache Spark’s Structured Streaming (Databricks Blog)](https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html)

**Talks**

- Spark Summit 2017 Talk - [Easy, Scalable, Fault-tolerant Stream Processing with Structured Streaming in Apache Spark](https://spark-summit.org/2017/events/easy-scalable-fault-tolerant-stream-processing-with-structured-streaming-in-apache-spark/)
- Spark Summit 2016 Talk - [A Deep Dive into Structured Streaming](https://spark-summit.org/2016/events/a-deep-dive-into-structured-streaming/)

1 change: 0 additions & 1 deletion sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -520,7 +520,6 @@ class Dataset[T] private[sql](
* @group streaming
* @since 2.0.0
*/
@Experimental
@InterfaceStability.Evolving
def isStreaming: Boolean = logicalPlan.isStreaming
