-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-33784][SQL] Rename dataSourceRewriteRules batch #30808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from 1 commit
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't final.
I'll explain my current thinking here. We are going to use this batch to rewrite plans directly after operator optimization. I did not go for
planRewriteRulesdue to the comment made by @dongjoon-hyun and @viirya (please let me know if you don't feel that way anymore). I also did not go forpreCBORulesas there quite a few things that will happen before CBO like discussed in the original thread.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I do not like this name. Our catalyst is designed as a query compiler. There does not exist a concept in a query compiler called "Post Operator Optimization". If you talk about this extension point in a Database class, no one understands what it means from the name.
My point is still the same. Here, we are adding the extension point for advanced users to add the rules between the rule-based optimizer (RBO) rules and the cost-based optimizer (CBO) rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The batch/function/rule names in Optimizer.scala is not critical to me. They are private. We can change them whenever we need. However, this API is an external developer API. We should be very careful to name it.
To me, either preCBORules or postRBORules are much better than postOperatorOptimizationRule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, eventually, we should move all the statistics based rules from the optimizer to the planner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unfortunate that we don't have a clear separation between RBO and CBO. There are RBO rules before and after the only CBO rule
CostBasedJoinReorder.I think the general idea of this batch is to allow people to inject special optimizer rules that can't be run together with the main operator optimizer batch. It's indeed a Spark specific thing, as the main operator optimizer batch will be run many times until reaching the fixed point, the new batch added here will be run only once.
It's really hard to do the naming here. To match the actual purpose and to be general, how about
Phase 2 Optimizer RulesorRun Once Optimizer Rules?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phase 2 Optimizer RulesorRun Once Optimizer Rulesdoes not explain the location of the batch.How many batches do we want to add? I am afraid we will add phase 2, 3, 4, .... This is endless. It looks very random. We should not expect the users need to understand/read the source code of our optimizer every new release. It is also fragile.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do share some of the points brought by @cloud-fan. There is no clear separation between CBO and RBO and we run quite some rule-based optimizations after the existing cost-based optimization rule. Also, we need to run this batch before early scan pushdown rules to capture possible rewrites, which, in turn, run before CBO. That’s why
preCBORulesdoes not seem to ideally convey what this rule does.Our existing hook in the extensions API says
The injected rules will be executed during the operator optimization batch. That was one of my motivations for the current name as we have resolution and post-hoc resolution rules.That said, I don’t mean
postOperatorOptimizationRulesis a perfect name either. It just seems to better reflect the place in the Spark optimizer. I’ll be happy to discuss and iterate further.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be also interested to hear from @hvanhovell and other folks who commented.