-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25850][SQL] Make the split threshold for the code generated function configurable #22847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
9 commits
Select commit
Hold shift + click to select a range
188a947
[SPARK-25850][SQL] Make the split threshold for the code generated me…
yucai ee986cc
improve desc
yucai b578dd4
improve desc
yucai 0db224f
address comments
yucai b0ce2ca
address comments
yucai ba392c8
address comments
yucai 2fc6417
address comments
yucai 8eeb538
minor
yucai 65a5355
Expression.reduceCodeSize
yucai File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -812,6 +812,18 @@ object SQLConf { | |
| .intConf | ||
| .createWithDefault(65535) | ||
|
|
||
| val CODEGEN_METHOD_SPLIT_THRESHOLD = buildConf("spark.sql.codegen.methodSplitThreshold") | ||
| .internal() | ||
| .doc("The threshold of source-code splitting in the codegen. When the number of characters " + | ||
| "in a single Java function (without comment) exceeds the threshold, the function will be " + | ||
| "automatically split to multiple smaller ones. We cannot know how many bytecode will be " + | ||
| "generated, so use the code length as metric. When running on HotSpot, a function's " + | ||
| "bytecode should not go beyond 8KB, otherwise it will not be JITted; it also should not " + | ||
| "be too small, otherwise there will be many function calls.") | ||
| .intConf | ||
| .checkValue(threshold => threshold > 0, "The threshold must be a positive integer.") | ||
| .createWithDefault(1024) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. let's add a check value to make sure the value is positive. We can figure out a lower and upper bound later. |
||
|
|
||
| val WHOLESTAGE_SPLIT_CONSUME_FUNC_BY_OPERATOR = | ||
| buildConf("spark.sql.codegen.splitConsumeFuncByOperator") | ||
| .internal() | ||
|
|
@@ -1733,6 +1745,8 @@ class SQLConf extends Serializable with Logging { | |
|
|
||
| def hugeMethodLimit: Int = getConf(WHOLESTAGE_HUGE_METHOD_LIMIT) | ||
|
|
||
| def methodSplitThreshold: Int = getConf(CODEGEN_METHOD_SPLIT_THRESHOLD) | ||
|
|
||
| def wholeStageSplitConsumeFuncByOperator: Boolean = | ||
| getConf(WHOLESTAGE_SPLIT_CONSUME_FUNC_BY_OPERATOR) | ||
|
|
||
|
|
||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the description, it seems that we had better have
checkValuehere. Could you recommend the reasonable min/max values, @kiszk ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be more accurately, I think I should add
When running on HotSpot, a function's bytecode should not go beyond 8KB....There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. That could be
max.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1000 is conservative. But, there is no recommendation since the bytecode size depends on the content (e.g.
0's byte code length is 1.9's byte code lengh is 2).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kiszk agree,
1000might be not the best, see my benchmark for the wide table,2048is better.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you try some very long alias names or complex expressions? You will get different number, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like long alias names have no influence.
No
averylongprefixrepeatedmultipletimesin the expression code gen:My benchmark:
Will keep benchmarking for the complex expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "freshNamePrefix" prefix is only applied in whole-stage codegen,
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L87
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L169
It doesn't take any effect in non-whole-stage codegen.
If you intend to stress test expression codegen but don't see the prefix being prepended, you're probably not adding it in the right place. Where did you add it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, you're using the column name...that's not the right place to put the "prefix". Column names are almost never carried over to the generated code in the current framework (the only exception is the lambda variable name).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rednaxelafx the wide table benchmark I used has 400 columns, whole stage codegen is disabled by default.