[SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source #28841
Conversation
…r multiple filters and refactored a bit.

It seems to me that this is asking for some more generic filtering functionality. Based on weighing the options, I would suggest using two options, for min and max.
Thanks for your comments, @bart-samwel. I like your way of thinking; there are a lot of unique cases here. To provide more context behind the scenario I'm looking to cover, which is a current issue for consumers:

Granted, this context is specific to non-streaming file data sources. I was hopeful to find an equivalent in Structured Streaming, but the closest I found was latestFirst and maxFileAge, which each have their respective use cases but do not solve this particular one. The connective tissue with my change here lies in the fact that Structured Streaming also leverages InMemoryFileIndex and actively passes a parameter map to its constructor. I'll provide a PR to complete support there as well, but separately from this MVP piece.

@bart-samwel To your point, I wonder if "fromModifiedDate" would be more appropriate?
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
...n/scala/org/apache/spark/sql/execution/datasources/pathfilters/PathFilterIgnoreNonData.scala
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test status failure

Test build #130647 has finished for PR 28841 at commit

Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test status failure

Test build #130648 has finished for PR 28841 at commit
I see you updated the PR. Thanks! Since you're upmerging the branch instead of rebasing, it's not easy to check the effective changes. Two questions:

And the last time I checked the updated diff, I saw some changed lines that are unnecessary (additional indentation or line breaks that were already passing the style checker). Could you please go through the diff and make sure you don't introduce unnecessary changes?
Can one of the admins verify this patch?
In my opinion, if the author @cchighman does not have much time to keep working on this and he thinks it's okay for someone (probably @HeartSaVioR) to take this over (while keeping credit for the original author), I'm fine with that. (NOTE: I think this feature looks useful, so it would be nice if we could merge it before the next feature freeze.)
The problem is the feedback cycle, not whether @cchighman is busy or not. We are requiring contributors to keep focus for multiple months, whereas reviewers don't promise anything about their focus on the PR. When @cchighman was active his feedback delay wasn't that long, but the PR stays as it is. Contributors are always at risk of "wasting time" if a PR loses focus from reviewers; it's worse if reviewers then ask for more effort to rework changes that were already reflected. I'm not sure that will change even if I take this over. If I take this over I'm going to lose my right to vote, making things worse, as in recent days I'm the only one looking into this. My recent comments are minor and can be addressed via a follow-up PR. My last concern is that we should make sure the new options don't work on streaming, and that fact should be documented. Other than that, I'll review the PR again, mainly checking whether anything changed during the rebase. If nothing was changed, I'm +1, given we can resolve the minor points in a follow-up PR. @cchighman I guess you're going to be pretty busy, but could you please answer the questions from me - #28841 (comment) - and make a small change so that "the new options don't work on streaming, and the fact should be documented"?
@maropu
Yes, I will look into this this evening.
@HeartSaVioR @maropu
Thanks @cchighman for your great efforts so far, and sorry to make you struggle with the review process. I'll take this over based on the current state of the PR and address my own comments.
I'm sorry that I was focusing on other tasks and couldn't follow this thread.
I've submitted #30411 to take over this & address my own review comments.
I'll close this PR to avoid any confusion. Thanks again @cchighman for your great contribution. I'll try my best to help get this in.
What changes were proposed in this pull request?
Two new options, modifiedBefore and modifiedAfter, are provided, each expecting a value in 'YYYY-MM-DDTHH:mm:ss' format. PartitioningAwareFileIndex considers these options while listing files, just before applying other PathFilters such as pathGlobFilter.

In order to filter file results, a new PathFilter class was derived for this purpose. General housekeeping around classes extending PathFilter was performed for neatness. It became apparent that support was needed for handling multiple potential path filters, so logic was introduced for this purpose and the associated tests were written.

Why are the changes needed?
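To illustrate the idea of composable path filters described above, here is a minimal, self-contained sketch. It is not the actual Spark implementation: FileInfo stands in for Hadoop's FileStatus, and the trait and class names are assumptions chosen for illustration.

```scala
// Hypothetical sketch of modification-time path filters.
// FileInfo is a stand-in for Hadoop's FileStatus; all names here are
// illustrative assumptions, not Spark's internal API.
case class FileInfo(path: String, modificationTimeMs: Long)

trait PathFilterStrategy {
  def accept(file: FileInfo): Boolean
}

class ModifiedAfterFilter(thresholdMs: Long) extends PathFilterStrategy {
  // Keep only files modified strictly after the threshold.
  def accept(file: FileInfo): Boolean = file.modificationTimeMs > thresholdMs
}

class ModifiedBeforeFilter(thresholdMs: Long) extends PathFilterStrategy {
  // Keep only files modified strictly before the threshold.
  def accept(file: FileInfo): Boolean = file.modificationTimeMs < thresholdMs
}

object PathFilters {
  // A file is listed only if every configured filter accepts it,
  // which is how multiple filters (e.g. both options together) compose.
  def acceptAll(filters: Seq[PathFilterStrategy], file: FileInfo): Boolean =
    filters.forall(_.accept(file))
}
```

Combining a ModifiedAfterFilter and a ModifiedBeforeFilter in the same sequence then naturally expresses the "between two dates" case.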
When loading files from a data source, there can often be thousands of files within a given path. In many cases I've seen, we want to start loading from a folder path and ideally load only the files whose modification dates are past a certain point. Out of thousands of potential files, only those with modification dates greater than the specified timestamp would be considered. This saves a lot of time automatically and removes significant complexity from managing this in code.
Does this PR introduce any user-facing change?
This PR introduces options that can be used with batch-based Spark file data sources. A documentation update was made to include an example and usage of the new data source options.
Example Usages

Load all CSV files modified after a date:
spark.read.format("csv").option("modifiedAfter", "2020-06-15T05:00:00").load()

Load all CSV files modified before a date:
spark.read.format("csv").option("modifiedBefore", "2020-06-15T05:00:00").load()

Load all CSV files modified between two dates:
spark.read.format("csv").option("modifiedAfter", "2019-01-15T05:00:00").option("modifiedBefore", "2020-06-15T05:00:00").load()

How was this patch tested?
A handful of unit tests were added to support the positive, negative, and edge case code paths. It's also live in a handful of our Databricks dev environments.
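The option values above use the 'YYYY-MM-DDTHH:mm:ss' timestamp format. As a rough sketch of how such a value can be converted to an epoch-millisecond threshold for comparison against file modification times, here is a self-contained example using java.time; the object and method names are assumptions for illustration, not Spark's internal code, and the time zone handling (fixed UTC here) is a simplifying assumption.

```scala
import java.time.{LocalDateTime, ZoneId}
import java.time.format.DateTimeFormatter

object OptionTimestamp {
  // Matches the documented option format, e.g. "2020-06-15T05:00:00".
  private val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")

  // Parse an option value into epoch milliseconds, interpreting it in the
  // given time zone (UTC by default, for determinism in this sketch).
  def toEpochMillis(value: String, zone: ZoneId = ZoneId.of("UTC")): Long =
    LocalDateTime.parse(value, formatter).atZone(zone).toInstant.toEpochMilli
}
```

A file would then pass a modifiedAfter check when its modification time (also in epoch milliseconds) is greater than the parsed threshold.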