Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
143 commits
Select commit Hold shift + click to select a range
b1d6580
Added filesModifiedAfterDate parameter with method overloads
cchighman Jun 12, 2020
145102e
Add first set of Unit Tests
cchighman Jun 12, 2020
2b35de7
Updated to use a PathFilter visitor for consistency. Added support f…
cchighman Jun 13, 2020
359cd42
Add more Unit Tests
cchighman Jun 14, 2020
96ab0d8
Update Tests, Styles, an Docs
Jun 16, 2020
e03b9ec
Merge remote-tracking branch 'upstream/master'
Jun 16, 2020
ad05dbb
Update Styles
Jun 16, 2020
3d34b5f
Fix Scala Styles
Jun 16, 2020
f8eef35
Fix Tests
Jun 16, 2020
3f45acf
Added filesModifiedAfterDate parameter with method overloads
cchighman Jun 12, 2020
0a69e73
Add first set of Unit Tests
cchighman Jun 12, 2020
7a2cad0
Updated to use a PathFilter visitor for consistency. Added support f…
cchighman Jun 13, 2020
8ee8e26
Add more Unit Tests
cchighman Jun 14, 2020
45e8afe
Update Tests, Styles, an Docs
Jun 16, 2020
d9f7bea
Update Styles
Jun 16, 2020
8b62cad
Fix Scala Styles
Jun 16, 2020
2cb58e6
Fix Tests
Jun 16, 2020
0a97479
Merge branch 'master' of https://github.com/cchighman/spark
Jun 16, 2020
7906998
Merge remote-tracking branch 'upstream/master'
Jun 16, 2020
256cb1b
Fix the silly duplicate line
Jun 16, 2020
3929b84
Fix tests
cchighman Jun 19, 2020
2b1c5dc
Fix spacing
cchighman Jun 19, 2020
2126c85
Merge remote-tracking branch 'upstream/master'
cchighman Jun 20, 2020
59f7787
Account for null
cchighman Jun 20, 2020
918af78
Update Filter Name
cchighman Jun 20, 2020
7110ca5
Merge remote-tracking branch 'upstream/master'
cchighman Jun 25, 2020
58c35e6
Revert Previous Files
cchighman Jun 25, 2020
032ca2f
Update Path Matching Logic
cchighman Jun 25, 2020
a754057
Update Examples
cchighman Jun 25, 2020
0492b02
Docs and Styles
cchighman Jun 26, 2020
c9ad703
Unit Tests
cchighman Jun 26, 2020
020973f
Merge branch 'master' of https://github.com/cchighman/spark
cchighman Jun 26, 2020
e815391
Fix Scala Style Tests
cchighman Jun 26, 2020
86b5d6f
Final scala styling issues.
cchighman Jun 26, 2020
b4e0428
Python Lint and Docs
cchighman Jun 26, 2020
4a80fdc
Silly missing line
cchighman Jun 26, 2020
d2f201b
Fix Doc Linting Issues
cchighman Jun 26, 2020
5de1b44
Sphinx / Doc Build linting updates
cchighman Jun 26, 2020
5121704
Merge remote-tracking branch 'upstream/master'
cchighman Jun 27, 2020
bbe63ec
Fix indentations and JIRA ticket name
cchighman Jun 27, 2020
7ee667d
Fix newline at end of file
cchighman Jun 27, 2020
4867cbc
Merge remote-tracking branch 'upstream/master'
cchighman Jul 2, 2020
d397fa0
Remove Streaming References
cchighman Jul 2, 2020
150e77d
Revert streaming.py
cchighman Jul 2, 2020
02631e9
Introduce abstraction layer for more efficient use of FileStatus
cchighman Jul 3, 2020
772f5f5
Merge remote-tracking branch 'upstream/master'
cchighman Jul 3, 2020
eb8b3e6
Adjust asserts in unit test
cchighman Jul 3, 2020
fff7d66
Introduce Path Filter Strategies
cchighman Jul 15, 2020
b7263b3
Merge branch 'master' of https://github.com/cchighman/spark
cchighman Jul 15, 2020
3dd6c9e
Update FileBasedDataSourceSuite.scala
cchighman Jul 15, 2020
3b4acea
Unit Tests
cchighman Jul 15, 2020
13439fb
Merge remote-tracking branch 'upstream/master'
cchighman Jul 15, 2020
5126d9a
Merge branch 'master' into tz2
cchighman Jul 15, 2020
da78938
Merge Conflicts
cchighman Jul 15, 2020
69c2393
Merge conflicts
cchighman Jul 15, 2020
b45f11e
Merge branch 'master' of https://github.com/cchighman/spark
cchighman Jul 15, 2020
143a048
Path Filter Updates
cchighman Jul 16, 2020
de30391
Refinements
cchighman Jul 16, 2020
56a0293
Update Sources
cchighman Jul 25, 2020
23bbe52
Pulling from Master
cchighman Jul 25, 2020
7cb912c
Merge from Master
cchighman Jul 25, 2020
2efdb7a
Merge from Master
cchighman Jul 25, 2020
c2fba17
Fix newline
cchighman Jul 25, 2020
ac5227d
Remove new lines
cchighman Jul 25, 2020
7a18b9e
Fix Linting
cchighman Jul 25, 2020
17de795
Fix Linting
cchighman Jul 25, 2020
853c432
Fix tests
cchighman Jul 25, 2020
04de904
Final Linting Updates
cchighman Jul 25, 2020
e62568a
Removing empty python line:
cchighman Jul 25, 2020
247b63b
Fix python styles
cchighman Jul 25, 2020
8bb28eb
Reprocess python changes
cchighman Jul 25, 2020
2ae67a3
Got run-tests working
cchighman Jul 26, 2020
ea9d484
Remove whitespace
cchighman Jul 26, 2020
0e97f5e
Adjust formatting
cchighman Jul 26, 2020
b427e1c
Update example
cchighman Jul 26, 2020
5d9a02a
Fix sneaking IDE auto correct
cchighman Jul 26, 2020
724db1c
Merge remote-tracking branch 'upstream/master'
cchighman Jul 26, 2020
f849059
Add CaseInsensitiveMap Support
cchighman Jul 26, 2020
944181c
Correct auto-correct ide lint
cchighman Jul 26, 2020
ebb7ede
Update Tests and Timezone
cchighman Jul 27, 2020
5952fc2
Merge remote-tracking branch 'upstream/master'
cchighman Jul 27, 2020
d1ae21e
Adjust timezone condition
cchighman Jul 27, 2020
b7952a6
Add Locale.ROOT to test
cchighman Jul 27, 2020
11e1109
Add more unit tests to validate change
cchighman Jul 27, 2020
6e3e94f
Fix tests
cchighman Jul 28, 2020
b67c5ec
Change to lazy initialization
cchighman Jul 28, 2020
48a4faf
Re-arrange tests
cchighman Jul 28, 2020
37fbd8a
Explicitly set default sqlconf timezone
cchighman Jul 28, 2020
03c2c3b
Change thresholdTime to a def
cchighman Jul 28, 2020
46a8dbd
Remove flaky test
cchighman Jul 28, 2020
e305b40
remove whitespace
cchighman Jul 28, 2020
05c6b89
retest
cchighman Jul 28, 2020
98717dc
Whitespaces :)
cchighman Jul 28, 2020
971c6ed
Resolve python conflict
cchighman Jul 29, 2020
2e5a038
Revert "Resolve python conflict"
cchighman Jul 29, 2020
c59b4d6
Merge remote-tracking branch 'upstream/master'
cchighman Jul 29, 2020
9979b96
Merge branch 'master' of https://github.com/cchighman/spark
cchighman Jul 29, 2020
08c067a
Merge branch 'master' of https://github.com/cchighman/spark
cchighman Jul 29, 2020
0f8a3c8
Revert "Merge remote-tracking branch 'upstream/master'"
cchighman Jul 29, 2020
976c796
Revert changes
cchighman Jul 29, 2020
54e4278
Revert "Revert "Merge remote-tracking branch 'upstream/master'""
cchighman Jul 29, 2020
530295b
Restore conflicted source
cchighman Jul 29, 2020
bccc1b5
Merge remote-tracking branch 'upstream/master'
cchighman Jul 29, 2020
37584c7
Resolve conflict
cchighman Jul 29, 2020
ca44781
Remove metals config
cchighman Jul 29, 2020
b925622
Reapply latest commit
cchighman Jul 29, 2020
0f597f7
Path Filter Updates
cchighman Aug 11, 2020
e6f8928
Merge remote-tracking branch 'upstream/master'
cchighman Aug 11, 2020
042c36e
Adjust Comments
cchighman Aug 11, 2020
15a263c
Adjust Tests
cchighman Aug 11, 2020
b090639
Remove whitespace
cchighman Aug 11, 2020
cf9d2fa
Requested Updates
cchighman Aug 12, 2020
5fc630e
Manual Style Corrections
cchighman Aug 12, 2020
c211e0e
Scala Style
cchighman Aug 12, 2020
ef37e2c
Remove un-necessary line changes
cchighman Aug 12, 2020
09302dc
Remove space
cchighman Aug 12, 2020
48438b5
Formatting
cchighman Aug 12, 2020
50ebdfc
More Manual Formatting
cchighman Aug 12, 2020
54dd3cf
Requested styling updates
cchighman Aug 12, 2020
3ded6d0
Adjust test parameters
cchighman Aug 12, 2020
0e069a7
Solidfy Tests
cchighman Aug 12, 2020
04fe25c
minor fix
maropu Aug 13, 2020
74b4d33
Merge pull request #2 from maropu/pr28841
cchighman Aug 13, 2020
473f390
Update path filters source
cchighman Aug 14, 2020
4329c8a
Commit updates
cchighman Aug 14, 2020
1ee4af4
Fix method signature
cchighman Aug 14, 2020
fb5f767
Update Tests
cchighman Aug 17, 2020
234e000
Merge remote-tracking branch 'upstream/master'
cchighman Aug 17, 2020
a7c6122
Merge remote-tracking branch 'upstream/master'
cchighman Aug 17, 2020
263dd2a
Update Files
cchighman Aug 18, 2020
3a99d90
Fine Tuning
cchighman Aug 25, 2020
980f103
Refactor Tests
cchighman Aug 27, 2020
d57bf08
Formatting
cchighman Sep 5, 2020
1c8384c
Scalafmt
cchighman Sep 5, 2020
6de2346
Merge branch 'master' of http://github.com/cchighman/spark
cchighman Oct 24, 2020
8dca5e5
Merge from Master
cchighman Nov 4, 2020
4480ca7
Merge Corrections
cchighman Nov 4, 2020
40c99dc
Carefully walk thru merge
cchighman Nov 4, 2020
1d54fa5
Changes after run-tests
cchighman Nov 4, 2020
f88ef62
Reduce line length
cchighman Nov 4, 2020
aed5d30
trailing whitespace
cchighman Nov 4, 2020
6b39e06
Correct merge issue
cchighman Nov 5, 2020
bf2a665
Correct silly oversight
cchighman Nov 5, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 35 additions & 0 deletions docs/sql-data-sources-generic-options.md
Original file line number Diff line number Diff line change
Expand Up @@ -119,3 +119,38 @@ To load all files recursively, you can use:
{% include_example recursive_file_lookup r/RSparkSQLExample.R %}
</div>
</div>

### Modification Time Path Filters
`modifiedBefore` and `modifiedAfter` are options that can be
applied together or separately in order to achieve greater
granularity over which files may load during a Spark batch query.

* `modifiedBefore`: an optional timestamp to only include files with
modification times occurring before the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
* `modifiedAfter`: an optional timestamp to only include files with
modification times occurring after the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)

When a timezone option is not provided, the timestamps will be interpreted according
to the Spark session timezone (`spark.sql.session.timeZone`).

To load files with paths matching a given modified time range, you can use:

<div class="codetabs">
<div data-lang="scala" markdown="1">
{% include_example load_with_modified_time_filter scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example load_with_modified_time_filter java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example load_with_modified_time_filter python/sql/datasource.py %}
</div>

<div data-lang="r" markdown="1">
{% include_example load_with_modified_time_filter r/RSparkSQLExample.R %}
</div>
</div>
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,22 @@ private static void runGenericFileSourceOptionsExample(SparkSession spark) {
// |file1.parquet|
// +-------------+
// $example off:load_with_path_glob_filter$
// $example on:load_with_modified_time_filter$
Dataset<Row> beforeFilterDF = spark.read().format("parquet")
// Only load files modified before 7/1/2020 at 05:30
.option("modifiedBefore", "2020-07-01T05:30:00")
// Only load files modified after 6/1/2020 at 05:30
.option("modifiedAfter", "2020-06-01T05:30:00")
// Interpret both times above relative to CST timezone
.option("timeZone", "CST")
.load("examples/src/main/resources/dir1");
beforeFilterDF.show();
// +-------------+
// | file|
// +-------------+
// |file1.parquet|
// +-------------+
// $example off:load_with_modified_time_filter$
}

private static void runBasicDataSourceExample(SparkSession spark) {
Expand Down
20 changes: 20 additions & 0 deletions examples/src/main/python/sql/datasource.py
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,26 @@ def generic_file_source_options_example(spark):
# +-------------+
# $example off:load_with_path_glob_filter$

# $example on:load_with_modified_time_filter$
# Only load files modified before 07/1/2050 @ 08:30:00
df = spark.read.load("examples/src/main/resources/dir1",
format="parquet", modifiedBefore="2050-07-01T08:30:00")
df.show()
# +-------------+
# | file|
# +-------------+
# |file1.parquet|
# +-------------+
# Only load files modified after 06/01/2050 @ 08:30:00
df = spark.read.load("examples/src/main/resources/dir1",
format="parquet", modifiedAfter="2050-06-01T08:30:00")
df.show()
# +-------------+
# | file|
# +-------------+
# +-------------+
# $example off:load_with_modified_time_filter$


def basic_datasource_example(spark):
# $example on:generic_load_save_functions$
Expand Down
8 changes: 8 additions & 0 deletions examples/src/main/r/RSparkSQLExample.R
Original file line number Diff line number Diff line change
Expand Up @@ -144,6 +144,14 @@ df <- read.df("examples/src/main/resources/dir1", "parquet", pathGlobFilter = "*
# 1 file1.parquet
# $example off:load_with_path_glob_filter$

# $example on:load_with_modified_time_filter$
beforeDF <- read.df("examples/src/main/resources/dir1", "parquet", modifiedBefore= "2020-07-01T05:30:00")
# file
# 1 file1.parquet
afterDF <- read.df("examples/src/main/resources/dir1", "parquet", modifiedAfter = "2020-06-01T05:30:00")
# file
# $example off:load_with_modified_time_filter$

# $example on:manual_save_options_orc$
df <- read.df("examples/src/main/resources/users.orc", "orc")
write.orc(df, "users_with_options.orc", orc.bloom.filter.columns = "favorite_color", orc.dictionary.key.threshold = 1.0, orc.column.encoding.direct = "name")
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,27 @@ object SQLDataSourceExample {
// |file1.parquet|
// +-------------+
// $example off:load_with_path_glob_filter$
// $example on:load_with_modified_time_filter$
val beforeFilterDF = spark.read.format("parquet")
// Files modified before 07/01/2020 at 05:30 are allowed
.option("modifiedBefore", "2020-07-01T05:30:00")
.load("examples/src/main/resources/dir1");
beforeFilterDF.show();
// +-------------+
// | file|
// +-------------+
// |file1.parquet|
// +-------------+
val afterFilterDF = spark.read.format("parquet")
// Files modified after 06/01/2020 at 05:30 are allowed
.option("modifiedAfter", "2020-06-01T05:30:00")
.load("examples/src/main/resources/dir1");
afterFilterDF.show();
// +-------------+
// | file|
// +-------------+
// +-------------+
// $example off:load_with_modified_time_filter$
}

private def runBasicDataSourceExample(spark: SparkSession): Unit = {
Expand Down
Loading