
Conversation

@cchighman
Contributor

@cchighman cchighman commented Jun 16, 2020

What changes were proposed in this pull request?

Two new options, modifiedBefore and modifiedAfter, are provided, each expecting a value in 'YYYY-MM-DDTHH:mm:ss' format. PartitioningAwareFileIndex considers these options while listing files, just before applying PathFilters such as pathGlobFilter. A new PathFilter class was derived to filter the file results, and some general housekeeping was performed around the classes extending PathFilter. It became apparent that support was needed for handling multiple potential path filters, so composition logic was introduced and the associated tests were written. A sketch of the filtering idea follows.
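To make the mechanism concrete, here is a minimal Scala sketch of a modification-time filter and a composite over several filters, written against Hadoop's FileStatus API. The names FileStatusFilter, ModifiedAfterFilter, and CompositeFilter are illustrative assumptions, not the identifiers introduced by this patch:

import org.apache.hadoop.fs.FileStatus

// Illustrative sketch only: these names are assumptions, not the PR's classes.
trait FileStatusFilter {
  def accept(status: FileStatus): Boolean
}

// Keeps only files whose modification time is strictly after the bound.
// FileStatus.getModificationTime returns epoch milliseconds.
class ModifiedAfterFilter(boundMillis: Long) extends FileStatusFilter {
  override def accept(status: FileStatus): Boolean =
    status.getModificationTime > boundMillis
}

// Composes several filters: a file survives only if every filter accepts it.
class CompositeFilter(filters: Seq[FileStatusFilter]) extends FileStatusFilter {
  override def accept(status: FileStatus): Boolean =
    filters.forall(_.accept(status))
}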

Why are the changes needed?

When loading files from a data source, there can often be thousands of files within a given path. In many cases I've seen, we want to start loading from a folder path and ideally begin loading only the files modified after a certain point in time. Out of thousands of potential files, only those with modification dates greater than the specified timestamp would be considered. This saves substantial time automatically and removes significant complexity that would otherwise have to be managed in code.

Does this PR introduce any user-facing change?

This PR introduces two options that can be used with batch-based Spark file data sources. The documentation was updated to reflect examples and usage of the new data source options.

Example Usages
Load all CSV files modified after a given date:
spark.read.format("csv").option("modifiedAfter","2020-06-15T05:00:00").load()

Load all CSV files modified before a given date:
spark.read.format("csv").option("modifiedBefore","2020-06-15T05:00:00").load()

Load all CSV files modified between two dates:
spark.read.format("csv").option("modifiedAfter","2019-01-15T05:00:00").option("modifiedBefore","2020-06-15T05:00:00").load()
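The new options can also be combined with the existing pathGlobFilter option; a sketch, where the load path is illustrative:

spark.read.format("csv")
  .option("pathGlobFilter", "*.csv")
  .option("modifiedAfter", "2020-06-15T05:00:00")
  .load("/data/events")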

How was this patch tested?

A handful of unit tests were added to cover the positive, negative, and edge-case code paths. It's also live in a handful of our Databricks dev environments.

@cchighman cchighman changed the title [SPARK-31962] - Provide option to load files after a specified date when reading from a folder path [SPARK-31962][SQL] - Provide option to load files after a specified date when reading from a folder path Jun 16, 2020
@cchighman cchighman changed the title [SPARK-31962][SQL] - Provide option to load files after a specified date when reading from a folder path [SPARK-31962][SQL] Provide option to load files after a specified date when reading from a folder path Jun 16, 2020
Christopher Highman added 2 commits June 16, 2020 03:12
@bart-samwel

The option fileModifiedDate doesn't say at all that it's a minimum modified date. I can imagine use cases for lower bounds, upper bounds, and ranges. That requires at least two options, e.g. filesModifiedAfter and filesModifiedBefore.

There's also the pathGlobFilter option, which only supports globs, but there as well there may be other use cases, e.g. "files with path names lexicographically larger than a file name", or "files with names that, after parsing, satisfy some interesting condition".

It seems to me that this is asking for some more generic filtering functionality, e.g. something like .fileFilter(lambda), where the lambda receives an object argument that has not only the path but also things like the modification date. That said, specific options may be pushed down into the data source (e.g. S3 supports prefix filters and start-from), so it would make sense to keep things as options when pushdown might be possible.

Based on weighing the options, I would suggest using two options, for min and max.
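To make the generic-filter idea concrete, here is a purely hypothetical Scala sketch; neither FileMetadata nor a fileFilter method exists on Spark's DataFrameReader, both are invented here for illustration:

// Hypothetical only: FileMetadata and fileFilter are not real Spark APIs.
case class FileMetadata(path: String, modificationTime: Long, length: Long)

// A generic reader hook might then read:
//   spark.read.format("csv")
//     .fileFilter((f: FileMetadata) => f.modificationTime > cutoffMillis)
//     .load("/data/events")

As noted above, such an opaque predicate could not be pushed down to sources like S3, which is a point in favor of plain string options.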

@cchighman
Contributor Author

cchighman commented Jun 16, 2020

Thanks for your comments, @bart-samwel. I like your way of thinking; there are a lot of unique cases here. To provide more context behind the scenario I'm looking to cover, which is a current issue for consumers:

  • Imagine you have a massive, massive data lake with routine ETL operations.
  • Every couple of hours or so, a CSV file is dropped in a "Delta" folder containing perhaps 50 million events per dataset, and you have a lot of these various datasets.
  • Over time, going back a handful of years, the folder hierarchy has been rather deterministic, which seems to be a common practice, such that you have /dataset/delta/yyyy-mm-dd/dataset_guid_timestamp.csv as the folder structure.
  • A number of teams may need to begin consuming these files, but they are only interested in consuming them starting from a particular date. Prior to that date there is no interest, and they hope to consume all the delta files for events from the specified modified date up to the current date, without needing to write code that concatenates or embeds this logic for them.
  • From this perspective, enterprise consumers see value in being able to specify a modified timestamp to help checkpoint which deltas they're interested in consuming (see the sketch after this list).
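For instance, a team picking up only the deltas past its checkpoint could write something like the following, where the path and timestamp are illustrative and assume the folder layout described above:

spark.read.format("csv")
  .option("modifiedAfter", "2020-06-01T00:00:00")
  .load("/dataset/delta/*")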

Granted, this context is specific to non-streaming file data sources. I was hoping to find an equivalent with Structured Streaming, but the closest I found was latestFirst and maxFileAge, which each have their respective use cases but do not solve this particular one. The connective tissue with my change here is that Structured Streaming also leverages InMemoryFileIndex and actively passes a parameter map to its constructor. I'll provide a PR to complete support there as well, but separately from this MVP piece.
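For reference, a hedged sketch of that shared hook: InMemoryFileIndex is an internal Spark API whose constructor takes an options map. The signature used below reflects my reading of Spark 3.0-era source, so treat it as an assumption rather than a stable contract; the path is illustrative:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex

val spark = SparkSession.builder().master("local[1]").getOrCreate()
// The parameters map is where options such as modifiedAfter would travel,
// for both batch reads and (eventually) streaming file sources.
val index = new InMemoryFileIndex(
  spark,
  Seq(new Path("/data/events")),                  // illustrative path
  Map("modifiedAfter" -> "2020-06-15T05:00:00"),  // option under discussion
  userSpecifiedSchema = None)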

@cchighman
Contributor Author

@bart-samwel To your point, I wonder if "fromModifiedDate" would be more appropriate?

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35203/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35205/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35203/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35201/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35206/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35205/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35207/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35206/

@SparkQA

SparkQA commented Nov 4, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35207/

@SparkQA

SparkQA commented Nov 5, 2020

Test build #130647 has finished for PR 28841 at commit 6b39e06.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35258/

@SparkQA

SparkQA commented Nov 5, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35259/

@SparkQA

SparkQA commented Nov 5, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35259/

@SparkQA

SparkQA commented Nov 5, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35258/

@SparkQA

SparkQA commented Nov 5, 2020

Test build #130648 has finished for PR 28841 at commit bf2a665.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

I see you updated the PR. Thanks! As you're upmerging the branch instead of rebasing, it's not easy to check the effective changes. Two questions:

  1. Does your last update only upmerge with master? It looks like it, but just to confirm.
  2. Do you plan to go through my review comments, or shall I do it myself after merging this PR?

And the last time I checked the updated diff, I saw some changed lines that are unnecessary (additional indentation or line breaks in code that was already passing the style checker). Could you please go through the diff and make sure you don't introduce unnecessary changes?

@AmplabJenkins

Can one of the admins verify this patch?

@maropu
Member

maropu commented Nov 12, 2020

I wouldn't ask an individual contributor to keep doing this heavy work consistently - this PR now has nearly 300 comments. If the remaining comments are minor (not functional or public API issues), I'll volunteer to deal with them in a follow-up PR.

In my opinion, if the author @cchighman does not have much time to keep working on this and he thinks it's okay for someone (probably @HeartSaVioR) to take this over (while keeping credit for the original author), I'm fine with that. (NOTE: I think this feature looks useful, so it would be nice if we could merge it before the next feature freeze.)

@HeartSaVioR
Contributor

HeartSaVioR commented Nov 12, 2020

The problem is the feedback cycle, not whether @cchighman is busy or not. We require contributors to stay focused for multiple months, whereas reviewers don't promise anything about their focus on the PR. When @cchighman was active, his feedback delay wasn't that long, but the PR stays as it is. Contributors are always at risk of "wasting time" if a PR loses its reviewers' focus - and it's worse when reviewers have already asked them to put in effort to reflect changes.

I'm not sure that will change even if I take this over. If I take it over, I'll lose my right to vote, making things worse, and in recent days this is the only PR I've been looking into. My recent comments are minor and can be addressed via a follow-up PR.

My last concern is that we should make sure the new options don't work on streaming, and that this fact is documented. Other than that, I'll review the PR again, mainly checking whether anything changed during the rebase. If nothing was changed, I'm +1, given that we can resolve the minor points in a follow-up PR.

@cchighman I guess you're going to be pretty busy, but could you please answer the questions from me - #28841 (comment)

and make a small change that "the new options don't work on streaming, and the fact should be documented"?

@cchighman
Contributor Author

@maropu
Thank you for your feedback. I will finish the merge pieces this evening. If one of you two would like to pick up any remaining effort, please feel free to do so.

@cchighman
Contributor Author

> @cchighman I guess you're going to be pretty busy, but could you please answer the questions from me - #28841 (comment) - and make a small change that "the new options don't work on streaming, and the fact should be documented"?

Yes, I will look into it this evening.

@cchighman
Contributor Author

@HeartSaVioR @maropu
Unfortunately, I don't have time to work further on this right now. If one of you two would like to pick up the remaining work in a subsequent PR, please feel free to do so. I hope the work I've done here is valuable and useful. I appreciate the opportunity to contribute.

@HeartSaVioR
Contributor

Thanks @cchighman for your great efforts so far, and sorry for making you struggle with the review process. I'll take this over based on the current state of the PR and address my own comments.

@gengliangwang
Member

I'm sorry that I was focusing on other tasks and couldn't follow this thread.
Thanks for the great work, @cchighman!

@HeartSaVioR
Contributor

I've submitted #30411 to take over this & address my own review comments.

@HeartSaVioR
Contributor

I'll close this PR to avoid any confusion. Thanks again @cchighman for your great contribution. I'll try my best to help get this in.

