
Commit dbb8143

ryne.yang (linehrr) authored and committed
[MINOR][SS][DOC] Added missing config maxFileAge in file streaming source
## What changes were proposed in this pull request?

Added the missing config for Structured Streaming when using the file source. From the code we have:

```scala
/**
 * Maximum age of a file that can be found in this directory, before it is ignored. For the
 * first batch all files will be considered valid. If `latestFirst` is set to `true` and
 * `maxFilesPerTrigger` is set, then this parameter will be ignored, because old files that are
 * valid, and should be processed, may be ignored. Please refer to SPARK-19813 for details.
 *
 * The max age is specified with respect to the timestamp of the latest file, and not the
 * timestamp of the current system. That this means if the last file has timestamp 1000, and the
 * current system time is 2000, and max age is 200, the system will purge files older than
 * 800 (rather than 1800) from the internal state.
 *
 * Default to a week.
 */
val maxFileAgeMs: Long = Utils.timeStringAsMs(parameters.getOrElse("maxFileAge", "7d"))
```

which is not documented. The file processing order was also not mentioned, even though the code explicitly sorts the file list by file mtime:

```scala
private val fileSortOrder = if (sourceOptions.latestFirst) {
    logWarning(
      """'latestFirst' is true. New files will be processed first, which may affect the watermark
        |value. In addition, 'maxFileAge' will be ignored.""".stripMargin)
    implicitly[Ordering[Long]].reverse
  } else {
    implicitly[Ordering[Long]]
  }

val files = allFiles.sortBy(_.getModificationTime)(fileSortOrder).map { status =>
  (status.getPath.toUri.toString, status.getModificationTime)
}
```

---

![Screen Shot 2019-05-07 at 5 55 01 PM](https://user-images.githubusercontent.com/1124115/57335683-5a8b0400-70f1-11e9-98c8-99f173872842.png)

---

![Screen Shot 2019-05-07 at 5 54 55 PM](https://user-images.githubusercontent.com/1124115/57335684-5a8b0400-70f1-11e9-996a-4bb1639e3d6b.png)

Closes #24548 from linehrr/master.
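The purge rule in the quoted comment (the age threshold is anchored to the newest file's mtime, not the current system clock) can be sketched in plain Scala. This is a hedged illustration with made-up names and timestamps, not the actual `FileStreamSource` code:

```scala
// Sketch of the maxFileAge purge rule: the cutoff is computed from the
// latest file's modification time, not from the system clock.
// Object and method names here are illustrative, not Spark internals.
object MaxFileAgeSketch {
  // Keep only files whose mtime is within maxAgeMs of the newest file.
  def validFiles(filesByMtime: Map[String, Long], maxAgeMs: Long): Map[String, Long] = {
    if (filesByMtime.isEmpty) filesByMtime
    else {
      val latest = filesByMtime.values.max
      filesByMtime.filter { case (_, mtime) => mtime >= latest - maxAgeMs }
    }
  }

  def main(args: Array[String]): Unit = {
    // Worked example from the source comment: latest file at t=1000 and
    // max age 200 => files older than 800 are purged, even if the system
    // clock currently reads 2000.
    val files = Map("a.txt" -> 700L, "b.txt" -> 850L, "c.txt" -> 1000L)
    val kept = validFiles(files, maxAgeMs = 200L)
    assert(kept == Map("b.txt" -> 850L, "c.txt" -> 1000L))
    println(kept.keys.toList.sorted)
  }
}
```

Note the consequence spelled out in the comment: with timestamps 700/850/1000 and `maxFileAge` of 200, the cutoff is 800, so only the two newest files survive regardless of wall-clock time.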
Lead-authored-by: ryne.yang <[email protected]>
Co-authored-by: Ryne Yang <[email protected]>
Co-authored-by: linehrr <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
1 parent bcd3b61 commit dbb8143

1 file changed: docs/structured-streaming-programming-guide.md (3 additions, 2 deletions)
```diff
@@ -510,8 +510,7 @@ returned by `SparkSession.readStream()`. In [R](api/R/read.stream.html), with th
 #### Input Sources
 There are a few built-in sources.
 
-- **File source** - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, orc, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
-
+- **File source** - Reads files written in a directory as a stream of data. Files will be processed in the order of file modification time. If `latestFirst` is set, order will be reversed. Supported file formats are text, CSV, JSON, ORC, Parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
 - **Kafka source** - Reads data from Kafka. It's compatible with Kafka broker versions 0.10.0 or higher. See the [Kafka Integration Guide](structured-streaming-kafka-0-10-integration.html) for more details.
 
 - **Socket source (for testing)** - Reads UTF8 text data from a socket connection. The listening server socket is at the driver. Note that this should be used only for testing as this does not provide end-to-end fault-tolerance guarantees.
@@ -541,6 +540,8 @@ Here are the details of all the sources in Spark.
 <br/>
 <code>fileNameOnly</code>: whether to check new files based on only the filename instead of on the full path (default: false). With this set to `true`, the following files would be considered as the same file, because their filenames, "dataset.txt", are the same:
 <br/>
+<code>maxFileAge</code>: Maximum age of a file that can be found in this directory, before it is ignored. For the first batch all files will be considered valid. If <code>latestFirst</code> is set to `true` and <code>maxFilesPerTrigger</code> is set, then this parameter will be ignored, because old files that are valid, and should be processed, may be ignored. The max age is specified with respect to the timestamp of the latest file, and not the timestamp of the current system.(default: 1 week)
+<br/>
 "file:///dataset.txt"<br/>
 "s3://a/dataset.txt"<br/>
 "s3n://a/b/dataset.txt"<br/>
```
