-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25850][SQL] Make the split threshold for the code generated function configurable #22847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 6 commits
188a947
ee986cc
b578dd4
0db224f
b0ce2ca
ba392c8
2fc6417
8eeb538
65a5355
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -812,6 +812,18 @@ object SQLConf { | |
| .intConf | ||
| .createWithDefault(65535) | ||
|
|
||
| val CODEGEN_METHOD_SPLIT_THRESHOLD = buildConf("spark.sql.codegen.methodSplitThreshold") | ||
| .internal() | ||
| .doc("The threshold of source code length without comment of a single Java function by " + | ||
| "codegen to be split. When the generated Java function source code exceeds this threshold" + | ||
| ", it will be split into multiple small functions. We cannot know how many bytecode will " + | ||
| "be generated, so use the code length as metric. When running on HotSpot, a function's " + | ||
| "bytecode should not go beyond 8KB, otherwise it will not be JITted; it also should not " + | ||
| "be too small, otherwise there will be many function calls.") | ||
| .intConf | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. According to the description, it seems that we had better have
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To be more accurately, I think I should add
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep. That could be
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 1000 is conservative. But, there is no recommendation since the bytecode size depends on the content (e.g.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @kiszk agree,
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you try some very long alias names or complex expressions? You will get different number, right?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seems like long alias names have no influence. No My benchmark: Will keep benchmarking for the complex expression.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The "freshNamePrefix" prefix is only applied in whole-stage codegen, It doesn't take any effect in non-whole-stage codegen. If you intend to stress test expression codegen but don't see the prefix being prepended, you're probably not adding it in the right place. Where did you add it?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh I see, you're using the column name...that's not the right place to put the "prefix". Column names are almost never carried over to the generated code in the current framework (the only exception is the lambda variable name).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @rednaxelafx the wide table benchmark I used has 400 columns, whole stage codegen is disabled by default. |
||
| .checkValue(threshold => threshold > 0, "The threshold must be a positive integer.") | ||
| .createWithDefault(1024) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. let's add a check value to make sure the value is positive. We can figure out a lower and upper bound later. |
||
|
|
||
| val WHOLESTAGE_SPLIT_CONSUME_FUNC_BY_OPERATOR = | ||
| buildConf("spark.sql.codegen.splitConsumeFuncByOperator") | ||
| .internal() | ||
|
|
@@ -1733,6 +1745,8 @@ class SQLConf extends Serializable with Logging { | |
|
|
||
| def hugeMethodLimit: Int = getConf(WHOLESTAGE_HUGE_METHOD_LIMIT) | ||
|
|
||
| def methodSplitThreshold: Int = getConf(CODEGEN_METHOD_SPLIT_THRESHOLD) | ||
|
|
||
| def wholeStageSplitConsumeFuncByOperator: Boolean = | ||
| getConf(WHOLESTAGE_SPLIT_CONSUME_FUNC_BY_OPERATOR) | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The threshold of source-code splitting in the codegen. When the number of characters in a single JAVA function (without comment) exceeds the threshold, the function will be automatically split to multiple smaller ones.