[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation #18485
@@ -69,14 +69,14 @@ | |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a> | ||
| <ul class="dropdown-menu"> | ||
| <li><a href="quick-start.html">Quick Start</a></li> | ||
| <li><a href="programming-guide.html">Spark Programming Guide</a></li> | ||
| <li class="divider"></li> | ||
| <li><a href="streaming-programming-guide.html">Spark Streaming</a></li> | ||
| <li><a href="sql-programming-guide.html">DataFrames, Datasets and SQL</a></li> | ||
| <li><a href="structured-streaming-programming-guide.html">Structured Streaming</a></li> | ||
| <li><a href="ml-guide.html">MLlib (Machine Learning)</a></li> | ||
| <li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li> | ||
| <li><a href="sparkr.html">SparkR (R on Spark)</a></li> | ||
| <li class="divider"></li> | ||
| <li><a href="rdd-programming-guide.html">RDDs, Accumulators, Broadcasts Vars</a></li> | ||
| <li><a href="streaming-programming-guide.html">Spark Streaming (DStreams)</a></li> | ||
| </ul> | ||
| </li> | ||
@@ -8,7 +8,8 @@ description: Apache Spark SPARK_VERSION_SHORT documentation homepage | |
| Apache Spark is a fast and general-purpose cluster computing system. | ||
| It provides high-level APIs in Java, Scala, Python and R, | ||
| and an optimized engine that supports general execution graphs. | ||
| It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](ml-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html). | ||
| It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [Structured Streaming](structured-streaming-programming-guide.html) for streaming, [MLlib](ml-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing. | ||
| # Downloading | ||
@@ -88,13 +89,13 @@ options for deployment: | |
| **Programming Guides:** | ||
| * [Quick Start](quick-start.html): a quick introduction to the Spark API; start here! | ||
| * [Spark Programming Guide](programming-guide.html): detailed overview of Spark | ||
| in all supported languages (Scala, Java, Python, R) | ||
| * Modules built on Spark: | ||
| * [Spark Streaming](streaming-programming-guide.html): processing real-time data streams | ||
| * [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): support for structured data and relational queries | ||
| * [MLlib](ml-guide.html): built-in machine learning library | ||
| * [GraphX](graphx-programming-guide.html): Spark's new API for graph processing | ||
| * [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): processing structured data | ||
| * [Structured Streaming](structured-streaming-programming-guide.html): processing structured data streams | ||
| * [Spark RDD Programming Guide](rdd-programming-guide.html): processing using RDDs (old API) | ||
| * Contains additional features (accumulator, broadcast vars, etc.) that can be used with Datasets and DataFrames | ||
| * [Spark Streaming](streaming-programming-guide.html): processing data streams using DStreams/RDDs (old API) | ||
| * [MLlib](ml-guide.html): applying machine learning | ||
| * [GraphX](graphx-programming-guide.html): processing graphs | ||
| **API Docs:** | ||
This file was deleted.
| @@ -0,0 +1,7 @@ | ||
| --- | ||
| layout: global | ||
| title: Spark Programming Guide | ||
| redirect: rdd-programming-guide.html | ||
| --- | ||
| This document has moved [here](rdd-programming-guide.html). |
This file was deleted.
This file was deleted.
@@ -15,7 +15,7 @@ In this guide, we are going to walk you through the programming model and the AP | |
| # Quick Example | ||
| Let’s say you want to maintain a running word count of text data received from a data server listening on a TCP socket. Let’s see how you can express this using Structured Streaming. You can see the full code in | ||
| [Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java)/[Python]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/python/sql/streaming/structured_network_wordcount.py)/[R]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/r/streaming/structured_network_wordcount.R). | ||
| And if you [download Spark](http://spark.apache.org/downloads.html), you can directly run the example. In any case, let’s walk through the example step-by-step and understand how it works. First, we have to import the necessary classes and create a local SparkSession, the starting point of all functionalities related to Spark. | ||
| And if you [download Spark](http://spark.apache.org/downloads.html), you can directly [run the example](index.html#running-the-examples-and-shell). In any case, let’s walk through the example step-by-step and understand how it works. First, we have to import the necessary classes and create a local SparkSession, the starting point of all functionalities related to Spark. | ||
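For orientation, here is a minimal Scala sketch of the setup this paragraph describes — the imports and a local SparkSession. The application name is a placeholder, not taken from this PR.

{% highlight scala %}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// A local SparkSession is the entry point for all Structured Streaming functionality.
val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .master("local[*]")
  .getOrCreate()

// Needed for implicit conversions such as the $"..." column syntax used below.
import spark.implicits._
{% endhighlight %}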
| <div class="codetabs"> | ||
| <div data-lang="scala" markdown="1"> | ||
@@ -450,7 +450,12 @@ running counts with the new data to compute updated counts, as shown below. | |
|  | ||
| This model is significantly different from many other stream processing | ||
| **Note that Structured Streaming does not materialize the entire table**. It reads the latest | ||
| available data from the streaming data source, processes it incrementally to update the result, | ||
| and then discards the source data. It only keeps around the minimal intermediate *state* data as | ||
| required to update the result (e.g. intermediate counts in the earlier example). | ||
| This model is significantly different from many other stream processing | ||
| engines. Many streaming systems require the user to maintain running | ||
| aggregations themselves, thus having to reason about fault-tolerance, and | ||
| data consistency (at-least-once, or at-most-once, or exactly-once). In this | ||
@@ -490,7 +495,7 @@ In Spark 2.0, there are a few built-in sources. | |
| - **File source** - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations. | ||
| - **Kafka source** - Poll data from Kafka. It's compatible with Kafka broker versions 0.10.0 or higher. See the [Kafka Integration Guide](structured-streaming-kafka-integration.html) for more details. | ||
| - **Kafka source** - Reads data from Kafka. It's compatible with Kafka broker versions 0.10.0 or higher. See the [Kafka Integration Guide](structured-streaming-kafka-integration.html) for more details. | ||
| - **Socket source (for testing)** - Reads UTF8 text data from a socket connection. The listening server socket is at the driver. Note that this should be used only for testing as this does not provide end-to-end fault-tolerance guarantees. | ||
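As a rough sketch of the file-source bullet above — the schema fields and directory path are illustrative assumptions, not from this PR:

{% highlight scala %}
import org.apache.spark.sql.types.StructType

// File sources require a user-specified schema.
val jsonSchema = new StructType()
  .add("device", "string")
  .add("signal", "double")

// Treat new JSON files atomically placed in the directory as a stream of rows.
val fileStreamDf = spark
  .readStream
  .schema(jsonSchema)
  .json("/path/to/input/dir")
{% endhighlight %}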
@@ -754,10 +759,22 @@ select(where(df, "signal > 10"), "device") | |
| # Running count of the number of updates for each device type | ||
| count(groupBy(df, "deviceType")) | ||
| writeStream | ||
| .format("console")</div> | ||
| {% endhighlight %} | ||
| </div> | ||
| </div> | ||
| You can also register a streaming DataFrame/Dataset as a temporary view and then apply SQL commands on it. | ||
| {% highlight scala %} | ||
|
Member: enclose this in ?
Contributor (Author): Can you give me the example in R? I really have no idea how to write anything in R. :)
Member: sure. python would look exactly the same except
| df.createOrReplaceTempView("updates") | ||
| spark.sql("select count(*) from updates") // returns another streaming DF | ||
| {% endhighlight %} | ||
| Note, you can identify whether a DataFrame/Dataset has streaming data or not by using | ||
| `df.isStreaming()`. | ||
| ### Window Operations on Event Time | ||
| Aggregations over a sliding event-time window are straightforward with Structured Streaming and are very similar to grouped aggregations. In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column. In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into. Let's understand this with an illustration. | ||
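A short Scala sketch of such a windowed aggregation, assuming the `spark` session from the quick example and a streaming Dataset `words` with `timestamp` and `word` columns (those names are assumptions for illustration):

{% highlight scala %}
import org.apache.spark.sql.functions.window
import spark.implicits._

// Count words per 10-minute window, sliding every 5 minutes,
// keyed by both the event-time window and the word itself.
val windowedCounts = words
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()
{% endhighlight %}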
@@ -1043,7 +1060,7 @@ streamingDf \ | |
| </div> | ||
| ### Arbitrary Stateful Operations | ||
| Many uscases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger. Since Spark 2.2, this can be done using the operation `mapGroupsWithState` and the more powerful operation `flatMapGroupsWithState`. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state. For more concrete details, take a look at the API documentation ([Scala](api/scala/index.html#org.apache.spark.sql.streaming.GroupState)/[Java](api/java/org/apache/spark/sql/streaming/GroupState.html)) and the examples ([Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredSessionization.java)). | ||
| Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger. Since Spark 2.2, this can be done using the operation `mapGroupsWithState` and the more powerful operation `flatMapGroupsWithState`. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state. For more concrete details, take a look at the API documentation ([Scala](api/scala/index.html#org.apache.spark.sql.streaming.GroupState)/[Java](api/java/org/apache/spark/sql/streaming/GroupState.html)) and the examples ([Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredSessionization.java)). | ||
Contributor (Author): typo: uscases -> usecases
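To make the `mapGroupsWithState` flow described above concrete, here is a hedged Scala sketch of sessionization with a processing-time timeout. The `Event`, `SessionInfo`, and `SessionUpdate` types and the `events` Dataset are hypothetical stand-ins, not the PR's example classes:

{% highlight scala %}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

case class Event(sessionId: String, timestamp: java.sql.Timestamp)
case class SessionInfo(numEvents: Int)
case class SessionUpdate(sessionId: String, numEvents: Int, expired: Boolean)

// events: Dataset[Event] read from a streaming source (assumed).
val sessionUpdates = events
  .groupByKey(_.sessionId)
  .mapGroupsWithState[SessionInfo, SessionUpdate](GroupStateTimeout.ProcessingTimeTimeout) {
    (sessionId: String, eventsInTrigger: Iterator[Event], state: GroupState[SessionInfo]) =>
      if (state.hasTimedOut) {
        // No data for a while: emit a final update and drop the session state.
        val finalUpdate = SessionUpdate(sessionId, state.get.numEvents, expired = true)
        state.remove()
        finalUpdate
      } else {
        // Fold the new events into the existing (or fresh) session state.
        val updated = SessionInfo(state.getOption.map(_.numEvents).getOrElse(0) + eventsInTrigger.size)
        state.update(updated)
        state.setTimeoutDuration("10 minutes")
        SessionUpdate(sessionId, updated.numEvents, expired = false)
      }
  }
{% endhighlight %}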
| ### Unsupported Operations | ||
| There are a few DataFrame/Dataset operations that are not supported with streaming DataFrames/Datasets. | ||
@@ -1201,6 +1218,16 @@ writeStream | |
| .start() | ||
| {% endhighlight %} | ||
| - **Kafka sink** - Stores the output to one or more topics in Kafka. | ||
| {% highlight scala %} | ||
| writeStream | ||
| .format("kafka") | ||
| .option("kafka.bootstrap.servers", "host1:port1,host2:port2") | ||
| .option("topic", "updates") | ||
| .start() | ||
| {% endhighlight %} | ||
| - **Foreach sink** - Runs arbitrary computation on the records in the output. See later in the section for more details. | ||
| {% highlight scala %} | ||
@@ -1253,12 +1280,19 @@ Here are the details of all the sinks in Spark. | |
| href="api/R/write.stream.html">R</a>). | ||
| E.g. for "parquet" format options see <code>DataFrameWriter.parquet()</code> | ||
| </td> | ||
| <td>Yes</td> | ||
| <td>Yes (exactly-once)</td> | ||
| <td>Supports writes to partitioned tables. Partitioning by time may be useful.</td> | ||
| </tr> | ||
| <tr> | ||
| <td><b>Kafka Sink</b></td> | ||
| <td>Append, Update, Complete</td> | ||
| <td>See the <a href="structured-streaming-kafka-integration.html">Kafka Integration Guide</a></td> | ||
| <td>Yes (at-least-once)</td> | ||
| <td>More details in the <a href="structured-streaming-kafka-integration.html">Kafka Integration Guide</a></td> | ||
| </tr> | ||
| <tr> | ||
| <td><b>Foreach Sink</b></td> | ||
| <td>Append, Update, Compelete</td> | ||
| <td>Append, Update, Complete</td> | ||
| <td>None</td> | ||
| <td>Depends on ForeachWriter implementation</td> | ||
| <td>More details in the <a href="#using-foreach">next section</a></td> | ||
@@ -1624,10 +1658,9 @@ Not available in R. | |
| ## Monitoring Streaming Queries | ||
| There are two APIs for monitoring and debugging active queries - | ||
| interactively and asynchronously. | ||
| There are multiple ways to monitor active streaming queries. You can either push metrics to external systems using Spark's Dropwizard Metrics support, or access them programmatically. | ||
| ### Interactive APIs | ||
| ### Reading Metrics Interactively | ||
| You can directly get the current status and metrics of an active query using | ||
| `streamingQuery.lastProgress()` and `streamingQuery.status()`. | ||
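A minimal Scala sketch of reading these metrics interactively; `counts` stands for any streaming DataFrame and the console sink is used only for illustration (in Scala, `lastProgress` and `status` are parameterless):

{% highlight scala %}
// Start a query and keep its handle.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Most recent progress of the query (input/processing rates, offsets, durations).
println(query.lastProgress)

// Current status, e.g. whether the query is waiting for new data to arrive.
println(query.status)
{% endhighlight %}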
@@ -1857,7 +1890,7 @@ Will print something like the following. | |
| </div> | ||
| </div> | ||
| ### Asynchronous API | ||
| ### Reporting Metrics programmatically using Asynchronous APIs | ||
| You can also asynchronously monitor all queries associated with a | ||
| `SparkSession` by attaching a `StreamingQueryListener` | ||
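Sketched in Scala, attaching such a listener might look like the following; the `println` bodies are placeholders:

{% highlight scala %}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    println("Query started: " + event.id)
  }
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // event.progress carries the same information as streamingQuery.lastProgress().
    println("Query made progress: " + event.progress)
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    println("Query terminated: " + event.id)
  }
})
{% endhighlight %}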
@@ -1922,6 +1955,15 @@ Not available in R. | |
| </div> | ||
| </div> | ||
| ### Reporting Metrics using Dropwizard | ||
| Spark supports reporting metrics using the [Dropwizard Library](monitoring.html#metrics). To enable metrics of Structured Streaming queries to be reported as well, you have to explicitly enable the configuration `spark.sql.streaming.metricsEnabled` in the SparkSession. | ||
| {% highlight bash %} | ||
| spark.conf().set("spark.sql.streaming.metricsEnabled", "true") | ||
| {% endhighlight %} | ||
| All queries started in the SparkSession after this configuration has been enabled will report metrics through Dropwizard to whatever [sinks](monitoring.html#metrics) have been configured (e.g. Ganglia, Graphite, JMX, etc.). | ||
| ## Recovering from Failures with Checkpointing | ||
| In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the [quick example](#quick-example)) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when [starting a query](#starting-streaming-queries). | ||
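For comparison with the R call quoted in the next hunk, a Scala sketch of setting a checkpoint location when starting a query; `aggDF` and the path are placeholders:

{% highlight scala %}
aggDF
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .format("memory")
  .start()
{% endhighlight %}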
@@ -1971,8 +2013,23 @@ write.stream(aggDF, "memory", outputMode = "complete", checkpointLocation = "pat | |
| </div> | ||
| </div> | ||
| # Where to go from here | ||
| - Examples: See and run the | ||
| [Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming)/[Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples/sql/streaming)/[Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/sql/streaming)/[R]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/r/streaming) | ||
| examples. | ||
| # Additional Information | ||
| **Further Reading** | ||
| - See and run the | ||
| [Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming)/[Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples/sql/streaming)/[Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/sql/streaming)/[R]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/r/streaming) | ||
| examples. | ||
| - [Instructions](index.html#running-the-examples-and-shell) on how to run Spark examples | ||
| - Read about integrating with Kafka in the [Structured Streaming Kafka Integration Guide](structured-streaming-kafka-integration.html) | ||
| - Read more details about using DataFrames/Datasets in the [Spark SQL Programming Guide](sql-programming-guide.html) | ||
| - Third-party Blog Posts | ||
| - [Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1 (Databricks Blog)](https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html) | ||
| - [Real-Time End-to-End Integration with Apache Kafka in Apache Spark’s Structured Streaming (Databricks Blog)](https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html) | ||
| - [Event-time Aggregation and Watermarking in Apache Spark’s Structured Streaming (Databricks Blog)](https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html) | ||
| **Talks** | ||
| - Spark Summit 2017 Talk - [Easy, Scalable, Fault-tolerant Stream Processing with Structured Streaming in Apache Spark](https://spark-summit.org/2017/events/easy-scalable-fault-tolerant-stream-processing-with-structured-streaming-in-apache-spark/) | ||
| - Spark Summit 2016 Talk - [A Deep Dive into Structured Streaming](https://spark-summit.org/2016/events/a-deep-dive-into-structured-streaming/) | ||
Moved RDD and DStream guides below a divider since we are trying to de-emphasize those. But kept the Accumulators and Broadcast vars because that part of the guide is still useful.