
Conversation

@cchighman
Contributor

@cchighman cchighman commented Jun 16, 2020

What changes were proposed in this pull request?

Two new options, modifiedBefore and modifiedAfter, are provided, each expecting a value in 'YYYY-MM-DDTHH:mm:ss' format. PartitioningAwareFileIndex considers these options while listing files, just before applying PathFilters such as pathGlobFilter. A new PathFilter class was derived to filter the file results, and some general housekeeping was performed around the classes extending PathFilter. It became apparent that support was needed for handling multiple potential path filters, so composition logic was introduced and the associated tests were written. A sketch of the filtering idea follows.
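To make the mechanism concrete, here is a minimal Scala sketch of a modification-time filter and a composite over several filters, written against Hadoop's FileStatus API. The names FileStatusFilter, ModifiedAfterFilter, and CompositeFilter are illustrative assumptions, not the identifiers introduced by this patch:

import org.apache.hadoop.fs.FileStatus

// Illustrative sketch only: these names are assumptions, not the PR's classes.
trait FileStatusFilter {
  def accept(status: FileStatus): Boolean
}

// Keeps only files whose modification time is strictly after the bound.
// FileStatus.getModificationTime returns epoch milliseconds.
class ModifiedAfterFilter(boundMillis: Long) extends FileStatusFilter {
  override def accept(status: FileStatus): Boolean =
    status.getModificationTime > boundMillis
}

// Composes several filters: a file survives only if every filter accepts it.
class CompositeFilter(filters: Seq[FileStatusFilter]) extends FileStatusFilter {
  override def accept(status: FileStatus): Boolean =
    filters.forall(_.accept(status))
}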

Why are the changes needed?

When loading files from a data source, there can often be thousands of files within a given path. In many cases I've seen, we want to start loading from a folder path and ideally begin loading only the files modified after a certain point in time. Out of thousands of potential files, only those with modification dates greater than the specified timestamp would be considered. This saves substantial time automatically and removes significant complexity that would otherwise have to be managed in code.

Does this PR introduce any user-facing change?

This PR introduces two options that can be used with batch-based Spark file data sources. The documentation was updated to reflect examples and usage of the new data source options.

Example Usages
Load all CSV files modified after a given date:
spark.read.format("csv").option("modifiedAfter","2020-06-15T05:00:00").load()

Load all CSV files modified before a given date:
spark.read.format("csv").option("modifiedBefore","2020-06-15T05:00:00").load()

Load all CSV files modified between two dates:
spark.read.format("csv").option("modifiedAfter","2019-01-15T05:00:00").option("modifiedBefore","2020-06-15T05:00:00").load()
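The new options can also be combined with the existing pathGlobFilter option; a sketch, where the load path is illustrative:

spark.read.format("csv")
  .option("pathGlobFilter", "*.csv")
  .option("modifiedAfter", "2020-06-15T05:00:00")
  .load("/data/events")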

How was this patch tested?

A handful of unit tests were added to cover the positive, negative, and edge-case code paths. It's also live in a handful of our Databricks dev environments.

@cchighman cchighman changed the title [SPARK-31962] - Provide option to load files after a specified date when reading from a folder path [SPARK-31962][SQL] - Provide option to load files after a specified date when reading from a folder path Jun 16, 2020
@cchighman cchighman changed the title [SPARK-31962][SQL] - Provide option to load files after a specified date when reading from a folder path [SPARK-31962][SQL] Provide option to load files after a specified date when reading from a folder path Jun 16, 2020
Christopher Highman added 2 commits June 16, 2020 03:12
@bart-samwel

The option fileModifiedDate doesn't say at all that it's a minimum modified date. I can imagine use cases for lower bounds, upper bounds, and ranges. That requires at least two options, e.g. filesModifiedAfter and filesModifiedBefore.

There's also the pathGlobFilter option, which only supports globs, but there as well there may be other use cases, e.g. "files with path names lexicographically larger than a file name", or "files with names that, after parsing, satisfy some interesting condition".

It seems to me that this is asking for some more generic filtering functionality, e.g. something like .fileFilter(lambda), where the lambda receives an object argument that has not only the path but also things like the modification date. That said, specific options may be pushed down into the data source (e.g. S3 supports prefix filters and start-from), so it would make sense to keep things as options when pushdown might be possible.

Based on weighing the options, I would suggest using two options, for min and max.
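To make the generic-filter idea concrete, here is a purely hypothetical Scala sketch; neither FileMetadata nor a fileFilter method exists on Spark's DataFrameReader, both are invented here for illustration:

// Hypothetical only: FileMetadata and fileFilter are not real Spark APIs.
case class FileMetadata(path: String, modificationTime: Long, length: Long)

// A generic reader hook might then read:
//   spark.read.format("csv")
//     .fileFilter((f: FileMetadata) => f.modificationTime > cutoffMillis)
//     .load("/data/events")

As noted above, such an opaque predicate could not be pushed down to sources like S3, which is a point in favor of plain string options.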

@cchighman
Contributor Author

cchighman commented Jun 16, 2020

Thanks for your comments, @bart-samwel. I like your way of thinking; there are a lot of unique cases here. To provide more context behind the scenario I'm looking to cover, which is a current issue for consumers:

  • Imagine you have a massive, massive data lake with routine ETL operations.
  • Every couple of hours or so, a CSV file is dropped in a "Delta" folder containing perhaps 50 million events per dataset, and you have a lot of these various datasets.
  • Over time, going back a handful of years, the folder hierarchy has been rather deterministic, which seems to be a common practice, such that you have /dataset/delta/yyyy-mm-dd/dataset_guid_timestamp.csv as the folder structure.
  • A number of teams may need to begin consuming these files, but they are only interested in consuming them starting from a particular date. Prior to that date there is no interest, and they hope to consume all the delta files for events from the specified modified date up to the current date, without needing to write code that concatenates or embeds this logic for them.
  • From this perspective, enterprise consumers see value in being able to specify a modified timestamp to help checkpoint which deltas they're interested in consuming (see the sketch after this list).
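For instance, a team picking up only the deltas past its checkpoint could write something like the following, where the path and timestamp are illustrative and assume the folder layout described above:

spark.read.format("csv")
  .option("modifiedAfter", "2020-06-01T00:00:00")
  .load("/dataset/delta/*")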

Granted, this context is specific to non-streaming file data sources. I was hoping to find an equivalent with Structured Streaming, but the closest I found was latestFirst and maxFileAge, which each have their respective use cases but do not solve this particular one. The connective tissue with my change here is that Structured Streaming also leverages InMemoryFileIndex and actively passes a parameter map to its constructor. I'll provide a PR to complete support there as well, but separately from this MVP piece.
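For reference, a hedged sketch of that shared hook: InMemoryFileIndex is an internal Spark API whose constructor takes an options map. The signature used below reflects my reading of Spark 3.0-era source, so treat it as an assumption rather than a stable contract; the path is illustrative:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex

val spark = SparkSession.builder().master("local[1]").getOrCreate()
// The parameters map is where options such as modifiedAfter would travel,
// for both batch reads and (eventually) streaming file sources.
val index = new InMemoryFileIndex(
  spark,
  Seq(new Path("/data/events")),                  // illustrative path
  Map("modifiedAfter" -> "2020-06-15T05:00:00"),  // option under discussion
  userSpecifiedSchema = None)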

@cchighman
Contributor Author

@bart-samwel To your point, I wonder if "fromModifiedDate" would be more appropriate?

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35203/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35205/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35203/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35201/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35206/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35205/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35207/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35206/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35207/

@SparkQA

SparkQA commented Nov 5, 2020

Test build #130647 has finished for PR 28841 at commit 6b39e06.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35258/

@SparkQA

SparkQA commented Nov 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35259/

@SparkQA

SparkQA commented Nov 5, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35259/

@SparkQA

SparkQA commented Nov 5, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35258/

@SparkQA

SparkQA commented Nov 5, 2020

Test build #130648 has finished for PR 28841 at commit bf2a665.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

I see you updated the PR. Thanks! As you're upmerging the branch instead of rebasing, it's not easy to check the effective changes. Two questions:

  1. Does your last update only upmerge with master? It looks like it, but just to confirm.
  2. Do you plan to go through my review comments, or shall I do it myself after merging this PR?

And the last time I checked the updated diff, I saw some changed lines that are unnecessary (additional indentation or line breaks in code that was already passing the style checker). Could you please go through the diff and make sure you don't introduce unnecessary changes?

@AmplabJenkins

Can one of the admins verify this patch?

@maropu
Member

maropu commented Nov 12, 2020

I wouldn't ask an individual contributor to keep doing this heavy work consistently - this PR now has nearly 300 comments. If the remaining comments are minor (not functional or public API issues), I'll volunteer to deal with them in a follow-up PR.

In my opinion, if the author @cchighman does not have much time to keep working on this and he thinks it's okay for someone (probably @HeartSaVioR) to take this over (while keeping credit for the original author), I'm fine with that. (NOTE: I think this feature looks useful, so it would be nice if we could merge it before the next feature freeze.)

@HeartSaVioR
Contributor

HeartSaVioR commented Nov 12, 2020

The problem is the feedback cycle, not whether @cchighman is busy or not. We require contributors to stay focused for multiple months, whereas reviewers don't promise anything about their focus on the PR. When @cchighman was active, his feedback delay wasn't that long, but the PR stays as it is. Contributors are always at risk of "wasting time" if a PR loses its reviewers' focus - and it's worse when reviewers have already asked them to put in effort to reflect changes.

I'm not sure that will change even if I take this over. If I take it over, I'll lose my right to vote, making things worse, and in recent days this is the only PR I've been looking into. My recent comments are minor and can be addressed via a follow-up PR.

My last concern is that we should make sure the new options don't work on streaming, and that this fact is documented. Other than that, I'll review the PR again, mainly checking whether anything changed during the rebase. If nothing was changed, I'm +1, given that we can resolve the minor points in a follow-up PR.

@cchighman I guess you're going to be pretty busy, but could you please answer the questions from me - #28841 (comment)

and make a small change that "the new options don't work on streaming, and the fact should be documented"?

@cchighman
Contributor Author

@maropu
Thank you for your feedback. I will finish the merge pieces this evening. If one of you two would like to pick up any remaining effort, please feel free to do so.

@cchighman
Contributor Author

> @cchighman I guess you're going to be pretty busy, but could you please answer the questions from me - #28841 (comment) - and make a small change that "the new options don't work on streaming, and the fact should be documented"?

Yes, I will look into it this evening.

@cchighman
Contributor Author

@HeartSaVioR @maropu
Unfortunately, I don't have time to work further on this right now. If one of you two would like to pick up the remaining work in a subsequent PR, please feel free to do so. I hope the work I've done here is valuable and useful. I appreciate the opportunity to contribute.

@HeartSaVioR
Contributor

Thanks @cchighman for your great efforts so far, and sorry for making you struggle with the review process. I'll take this over based on the current state of the PR and address my own comments.

@gengliangwang
Member

I'm sorry that I was focusing on other tasks and couldn't follow this thread.
Thanks for the great work, @cchighman!

@HeartSaVioR
Contributor

I've submitted #30411 to take over this & address my own review comments.

@HeartSaVioR
Contributor

I'll close this PR to avoid any confusion. Thanks again @cchighman for your great contribution. I'll try my best to help get this in.

