[SPARK-47307][SQL] Add a config to optionally chunk base64 strings #47303
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
| """}) | ||
| if (chunkBase64) { | ||
| s"""${ev.value} = UTF8String.fromBytes( | ||
| ${classOf[JBase64].getName}.getMimeEncoder().encode($child)); |
Why don't we use the encoder directly?
`java.util.Base64$Encoder` is not serializable:
-- !query
select base64(c7), base64(c8), base64(v), ascii(s) from char_tbl4
-- !query schema
struct<>
-- !query output
java.io.NotSerializableException
java.util.Base64$Encoder
Serialization stack:
- object not serializable (class: java.util.Base64$Encoder, value: java.util.Base64$Encoder@423ed07f)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 5)
- field (class: org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory, name: org$apache$spark$sql$execution$WholeStageCodegenEvaluatorFactory$$references, type: class [Ljava.lang.Object;)
- object (class org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory, org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory@2fd9633e)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.WholeStageCodegenExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/execution/WholeStageCodegenEvaluatorFactory;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=1])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$2458/0x000002cf3e949c30, org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$2458/0x000002cf3e949c30@603a0fa7)
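For reference, a minimal standalone check (not from the PR) reproduces the failure outside Spark: the JDK's MIME encoder instance cannot be Java-serialized.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import java.util.Base64

object EncoderSerializationCheck extends App {
  // Same kind of encoder instance the generated code captured as a plan reference.
  val encoder = Base64.getMimeEncoder

  val out = new ObjectOutputStream(new ByteArrayOutputStream())
  try {
    out.writeObject(encoder) // Base64$Encoder does not implement Serializable
  } catch {
    case e: NotSerializableException => println(s"Not serializable: ${e.getMessage}")
  }
}
```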
Maybe we can follow the work I have done for `Encode` to make it `RuntimeReplaceable` with `StaticInvoke`.
Thanks, I will try it
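For readers following along, a rough sketch of the suggested shape: the expression replaces itself with a `StaticInvoke` call into a static helper, so no encoder instance is captured by the serialized plan. The names `Base64Helper`, `Base64ReplacementSketch`, and `encode` are illustrative stand-ins, not the final Spark code.

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

object Base64Helper {
  // Static helper resolved by name at codegen time; nothing stateful is captured
  // by the serialized plan, which sidesteps the NotSerializableException above.
  def encode(bytes: Array[Byte], chunk: Boolean): UTF8String = {
    val encoder =
      if (chunk) java.util.Base64.getMimeEncoder else java.util.Base64.getEncoder
    UTF8String.fromBytes(encoder.encode(bytes))
  }
}

object Base64ReplacementSketch {
  // The expression's replacement would delegate to the helper via StaticInvoke.
  def replacement(child: Expression, chunkBase64: Boolean): Expression =
    StaticInvoke(
      Base64Helper.getClass,
      StringType,
      "encode",
      Seq(child, Literal(chunkBase64)))
}
```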
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
    .booleanConf
    .createWithDefault(false)

  val CHUNK_BASE_64_STRING_ENABLED = buildConf("spark.sql.legacy.chunkBase64String.enabled")
nit: `BASE_64` to `BASE64`
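For illustration, a minimal sketch of toggling the flag (assuming a local SparkSession; the key is the one shown in this diff, before the follow-up PR promoted it to a public name):

```scala
import org.apache.spark.sql.SparkSession

object ChunkConfigDemo extends App {
  val spark = SparkSession.builder().master("local[1]").appName("base64-chunking").getOrCreate()

  // Config key as it appears in this diff; a follow-up PR later promoted it.
  spark.conf.set("spark.sql.legacy.chunkBase64String.enabled", "true")

  // With chunking enabled, output longer than 76 base64 chars is wrapped MIME-style.
  spark.sql("SELECT base64(cast(repeat('a', 100) AS binary)) AS b64").show(truncate = false)

  spark.stop()
}
```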
…change of base64 function

### What changes were proposed in this pull request?
Follow up to #47303: add a migration guide for the behavior change of the `base64` function.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Doc change.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47371 from wForget/SPARK-47307_doc.

Authored-by: wforget <[email protected]>
Signed-off-by: allisonwang-db <[email protected]>
(cherry picked from commit b2e0a4d)
@wForget is it true?

@yaooqinn @dongjoon-hyun @wForget I think we need to discuss this on the dev list before merging this PR. It will break encryption that relies on base64. I would suggest reverting it first, before reaching a common decision on the dev list.

Thank you for the feedback, @gatorsmile. I have three questions to understand your request.
I think the PR description should be improved to mention more information. It's a behavior change, so we should clearly define the behavior difference and measure the impact. Does it mean the Spark base64 result can now be decoded by more decoders? Is there any standard decoder that may fail to decode after this change?
Sorry, I didn't update the PR description after changing the default value for the new config.
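To make the compatibility question above concrete, here is a small standalone check (not from the PR) using `java.util.Base64`: the strict RFC 4648 decoder rejects MIME-chunked output, while the MIME decoder accepts both forms.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object DecoderCompatCheck extends App {
  val payload = ("a" * 100).getBytes(StandardCharsets.UTF_8)

  // RFC 2045: output wrapped with \r\n every 76 characters.
  val chunked = Base64.getMimeEncoder.encode(payload)

  // The MIME decoder is lenient: it accepts both chunked and unchunked input.
  println(Base64.getMimeDecoder.decode(chunked).length) // 100

  // A strict RFC 4648 decoder rejects the line separators.
  try Base64.getDecoder.decode(chunked)
  catch {
    case e: IllegalArgumentException => println(s"Strict decoder failed: ${e.getMessage}")
  }
}
```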
…ng.enabled from a legacy/internal config to a regular/public one

### What changes were proposed in this pull request?
+ Promote spark.sql.legacy.chunkBase64String.enabled from a legacy/internal config to a regular/public one.
+ Add test cases for unbase64.

### Why are the changes needed?
Keep the same behavior as before. More details: #47303 (comment)

### Does this PR introduce _any_ user-facing change?
Yes, it reverts the behavior change introduced in #47303.

### How was this patch tested?
Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47410 from wForget/SPARK-47307_followup.

Lead-authored-by: wforget <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
…4String.enabled from a legacy/internal config to a regular/public one

Backports #47410 to 3.5.

### What changes were proposed in this pull request?
+ Promote spark.sql.legacy.chunkBase64String.enabled from a legacy/internal config to a regular/public one.
+ Add test cases for unbase64.

### Why are the changes needed?
Keep the same behavior as before. More details: #47303 (comment)

### Does this PR introduce _any_ user-facing change?
Yes, it reverts the behavior change introduced in #47303.

### How was this patch tested?
Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47416 from wForget/SPARK-47307_followup_3.5.

Authored-by: wforget <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
Follow up apache#45408

### What changes were proposed in this pull request?
[SPARK-47307] Add a config to optionally chunk base64 strings.

### Why are the changes needed?
In apache#35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true, as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This meant that Spark-encoded base64 strings could not be decoded by decoders that do not implement RFC 2045, while the docs state RFC 4648.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suite.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47303 from wForget/SPARK-47307.

Lead-authored-by: Ted Jenks <[email protected]>
Co-authored-by: wforget <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Co-authored-by: Ted Chester Jenks <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
### What changes were proposed in this pull request?
Fix the nullability of the `Base64` expression to be based on the child's nullability, rather than always being nullable.

### Why are the changes needed?
#47303 had a side effect of changing the nullability through the switch to `StaticInvoke`. This was also backported to Spark 3.5.2 and caused schema mismatch errors for stateful streams when we upgraded. This restores the previous behavior, which is supported by `StaticInvoke` through the `returnNullable` argument: if the child is non-nullable, we know the result will be non-nullable.

### Does this PR introduce _any_ user-facing change?
Restores the nullability of the `Base64` expression to what it was in Spark 3.5.1 and earlier.

### How was this patch tested?
New UT.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47941 from Kimahriman/base64-nullability.

Lead-authored-by: Adam Binford <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
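A quick way to observe the restored behavior is to check the schema directly (a sketch assuming a local SparkSession; `range`'s `id` column is non-nullable and the casts preserve that):

```scala
import org.apache.spark.sql.SparkSession

object Base64NullabilityCheck extends App {
  val spark = SparkSession.builder().master("local[1]").appName("base64-nullability").getOrCreate()

  // `id` from range() is non-nullable, and these casts preserve that.
  val df = spark.range(1).selectExpr("cast(cast(id AS string) AS binary) AS bin")

  // With the fix, base64 over a non-nullable column is non-nullable again,
  // matching Spark 3.5.1 and earlier.
  df.selectExpr("base64(bin) AS b64").printSchema()

  spark.stop()
}
```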
Backports apache#47303 to 3.5

### What changes were proposed in this pull request?
[SPARK-47307] Add a config to optionally chunk base64 strings.

### Why are the changes needed?
In apache#35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true, as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This meant that Spark-encoded base64 strings could not be decoded by decoders that do not implement RFC 2045, while the docs state RFC 4648.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suite.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47325 from wForget/SPARK-47307_3.5.

Lead-authored-by: wforget <[email protected]>
Co-authored-by: Ted Jenks <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
* [SPARK-49476][SQL] Fix nullability of base64 function

### What changes were proposed in this pull request?
Fix the nullability of the `Base64` expression to be based on the child's nullability, rather than always being nullable.

### Why are the changes needed?
apache#47303 had a side effect of changing the nullability through the switch to `StaticInvoke`. This was also backported to Spark 3.5.2 and caused schema mismatch errors for stateful streams when we upgraded. This restores the previous behavior, which is supported by `StaticInvoke` through the `returnNullable` argument: if the child is non-nullable, we know the result will be non-nullable.

### Does this PR introduce _any_ user-facing change?
Restores the nullability of the `Base64` expression to what it was in Spark 3.5.1 and earlier.

### How was this patch tested?
New UT.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47941 from Kimahriman/base64-nullability.
(cherry picked from commit c274c5a)

* [SPARK-49476][SQL][3.5][FOLLOWUP] Fix base64 proto test

### What changes were proposed in this pull request?
Fix a test that is failing from backporting apache#47941.

### Why are the changes needed?
Fix the test.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Fixed test.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47964 from Kimahriman/base64-proto-test.

---------

Signed-off-by: Max Gekk <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
Co-authored-by: Adam Binford <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>

Follow up #45408
What changes were proposed in this pull request?
[SPARK-47307] Add a config to optionally chunk base64 strings
Why are the changes needed?
In #35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true, as the previous code called `public static byte[] encodeBase64(byte[] binaryData)`, which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that Spark-encoded base64 strings cannot be decoded by decoders that do not implement RFC 2045. The docs state RFC 4648.
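To illustrate the difference, a standalone sketch (not code from the PR) comparing the two JDK encoders:

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object ChunkingComparison extends App {
  val input = ("a" * 100).getBytes(StandardCharsets.UTF_8)

  // RFC 4648 (what commons-codec's encodeBase64 actually produced): one unbroken line.
  val rfc4648 = Base64.getEncoder.encodeToString(input)

  // RFC 2045 / MIME: output wrapped into 76-character lines separated by \r\n.
  val rfc2045 = new String(Base64.getMimeEncoder.encode(input), StandardCharsets.UTF_8)

  println(rfc4648.linesIterator.length) // 1
  println(rfc2045.linesIterator.length) // 2: 136 base64 chars wrapped at 76
}
```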
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing test suite.
Was this patch authored or co-authored using generative AI tooling?
No