
Conversation

@wForget (Member) commented Jul 11, 2024

Follow-up to #45408

What changes were proposed in this pull request?

[SPARK-47307] Add a config to optionally chunk base64 strings

Why are the changes needed?

In #35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true, as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

whose documentation states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

The RFC 2045 (MIME) base64 encoder, however, does chunk by default. This means that base64 strings encoded by Spark can no longer be decoded by decoders that do not implement RFC 2045, even though the docs state RFC 4648.
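
To make the behavior difference concrete, here is a minimal `java.util.Base64` sketch (illustrative, not part of this patch):

```scala
import java.util.Base64

// 100 input bytes encode to 136 base64 characters, which exceeds the
// 76-character MIME line limit, so the MIME encoder inserts a CRLF.
val data = Array.fill[Byte](100)('a'.toByte)

val basic = Base64.getEncoder.encodeToString(data)     // RFC 4648: no line breaks
val mime  = Base64.getMimeEncoder.encodeToString(data) // RFC 2045: CRLF every 76 chars

assert(!basic.contains("\r\n"))
assert(mime.contains("\r\n"))
```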

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suite.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions bot added the SQL label Jul 11, 2024
"""})
if (chunkBase64) {
s"""${ev.value} = UTF8String.fromBytes(
${classOf[JBase64].getName}.getMimeEncoder().encode($child));
Member:
Why don't we use the encoder directly?

Member Author (wForget):

> Why don't we use the encoder directly?

java.util.Base64$Encoder is not serializable.

```
-- !query
select base64(c7), base64(c8), base64(v), ascii(s) from char_tbl4
-- !query schema
struct<>
-- !query output
java.io.NotSerializableException
java.util.Base64$Encoder
Serialization stack:
	- object not serializable (class: java.util.Base64$Encoder, value: java.util.Base64$Encoder@423ed07f)
	- element of array (index: 2)
	- array (class [Ljava.lang.Object;, size 5)
	- field (class: org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory, name: org$apache$spark$sql$execution$WholeStageCodegenEvaluatorFactory$$references, type: class [Ljava.lang.Object;)
	- object (class org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory, org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory@2fd9633e)
	- element of array (index: 0)
	- array (class [Ljava.lang.Object;, size 1)
	- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
	- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.WholeStageCodegenExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/execution/WholeStageCodegenEvaluatorFactory;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=1])
	- writeReplace data (class: java.lang.invoke.SerializedLambda)
	- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$2458/0x000002cf3e949c30, org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$2458/0x000002cf3e949c30@603a0fa7)
```
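
The root cause can be verified outside Spark (illustrative check, not from the PR):

```scala
// java.util.Base64.Encoder does not implement java.io.Serializable, so an
// encoder instance captured in codegen references cannot survive task
// serialization, producing the NotSerializableException above.
val encoder = java.util.Base64.getMimeEncoder
assert(!encoder.isInstanceOf[java.io.Serializable])
```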

Member:

Maybe we can follow the work I have done for Encode and make it RuntimeReplaceable with StaticInvoke.

Member Author (wForget):

> Maybe we can follow the work I have done for Encode and make it RuntimeReplaceable with StaticInvoke.

Thanks, I will try it.
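
For readers unfamiliar with the suggestion, a rough sketch of the RuntimeReplaceable-plus-StaticInvoke pattern follows. `Base64Helper` and its `encode` method are hypothetical stand-ins, and Catalyst's real constructors take additional parameters; this is not the code that was eventually merged:

```scala
// Fragment of a hypothetical RuntimeReplaceable expression: at optimization
// time it rewrites itself into a static method call, so only the class
// reference and argument expressions are serialized. No encoder instance
// is captured, which sidesteps the NotSerializableException above.
override def replacement: Expression = StaticInvoke(
  staticObject = classOf[Base64Helper],   // hypothetical static helper
  dataType = StringType,
  functionName = "encode",
  arguments = Seq(child, Literal(chunkBase64)))
```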

```scala
    .booleanConf
    .createWithDefault(false)

  val CHUNK_BASE_64_STRING_ENABLED = buildConf("spark.sql.legacy.chunkBase64String.enabled")
```
Member:
nit: BASE_64 to BASE64
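
Taking the nit into account, a hedged reconstruction of the full config entry as it would sit in SQLConf (the `.doc()` text and `.version()` are illustrative assumptions, not quotes from the patch):

```scala
// Illustrative reconstruction; doc text and version are assumptions.
val CHUNK_BASE64_STRING_ENABLED =
  buildConf("spark.sql.legacy.chunkBase64String.enabled")
    .internal()
    .doc("When true, the base64 function chunks its output into lines of at " +
      "most 76 characters (RFC 2045); when false, it returns a single " +
      "continuous string (RFC 4648).")
    .version("4.0.0")
    .booleanConf
    .createWithDefault(false)
```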

allisonwang-db pushed a commit that referenced this pull request Jul 17, 2024
…change of base64 function


### What changes were proposed in this pull request?

Follow up to #47303

Add a migration guide for the behavior change of `base64` function

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
doc change

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47371 from wForget/SPARK-47307_doc.

Authored-by: wforget <[email protected]>
Signed-off-by: allisonwang-db <[email protected]>
allisonwang-db pushed a commit that referenced this pull request Jul 17, 2024
…change of base64 function


Closes #47371 from wForget/SPARK-47307_doc.

Authored-by: wforget <[email protected]>
Signed-off-by: allisonwang-db <[email protected]>
(cherry picked from commit b2e0a4d)
Signed-off-by: allisonwang-db <[email protected]>
@gatorsmile (Member)

[image]

@wForget is it true?

@yaooqinn @dongjoon-hyun @wForget I think we need to discuss this on the dev list before merging this PR. This will break encryption workflows that rely on base64.

I would suggest reverting it first, until a consensus is reached on the dev list.

@dongjoon-hyun (Member)

Thank you for the feedback, @gatorsmile.

I have three questions to understand your request.

  • Do you see the example, SPARK-47307? What is your opinion?
  • We have the following migration guide; do you want to set spark.sql.legacy.chunkBase64String.enabled=true by default in Apache Spark 3.5.2?

    > Since 3.5.2, the base64 function will return a non-chunked string. To restore the behavior of chunking base64 encoded strings into lines of at most 76 characters, set spark.sql.legacy.chunkBase64String.enabled to true.

  • If we change the default value, do we still need to revert?
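
For reference, restoring the chunked output described in that migration guide is a one-line session setting (example usage, not from the thread):

```scala
// Re-enable RFC 2045 chunking for base64 output in an existing SparkSession.
spark.conf.set("spark.sql.legacy.chunkBase64String.enabled", "true")

// 100 'a' characters encode to 136 base64 characters, so the chunked
// result contains a line break after the 76th character.
spark.sql("SELECT base64(repeat('a', 100))").show(truncate = false)
```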

@cloud-fan (Contributor)

I think the PR description should be improved to include more information. It's a behavior change, so we should clearly define the behavior difference and measure the impact. Does it mean the Spark base64 result can be decoded by more decoders now? Is there any standard decoder that may fail to decode after this change?
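
For what it's worth, `java.util.Base64` itself illustrates the failure mode: strict RFC 4648 decoders reject chunked input, while MIME decoders accept both forms (illustrative, not from the thread):

```scala
import java.util.Base64

val chunked = Base64.getMimeEncoder.encodeToString(Array.fill[Byte](100)(1))

// The lenient MIME decoder skips the CRLF line separators.
Base64.getMimeDecoder.decode(chunked)

// The strict RFC 4648 decoder throws IllegalArgumentException on '\r'.
try Base64.getDecoder.decode(chunked)
catch { case e: IllegalArgumentException => println(s"strict decoder failed: $e") }
```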

@wForget (Member Author) commented Jul 19, 2024

Sorry, I didn't update the PR description after changing the default value of the new config.
Actually, a behavior change was unexpectedly introduced in #35110; this PR aims to make the chunking behavior of base64 encoding configurable. If the current behavior change is controversial, I will first submit a PR to change the default value of the newly introduced config, which would revert to the previous behavior. WDYT?

@yaooqinn (Member)

> I will first submit a PR to change the default value of the newly introduced config, which would revert to the previous behavior.

+1

yaooqinn added a commit that referenced this pull request Jul 19, 2024
…ng.enabled from a legacy/internal config to a regular/public one

### What changes were proposed in this pull request?

+ Promote spark.sql.legacy.chunkBase64String.enabled from a legacy/internal config to a regular/public one.
+ Add test cases for unbase64

### Why are the changes needed?

Keep the same behavior as before. More details: #47303 (comment)

### Does this PR introduce _any_ user-facing change?

yes, revert behavior change introduced in #47303

### How was this patch tested?

existing unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47410 from wForget/SPARK-47307_followup.

Lead-authored-by: wforget <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
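
A minimal sketch of the kind of unbase64 round trip these tests exercise (assertion style and session setup are assumptions):

```scala
// Round-trip sanity check: unbase64(base64(x)) should return the original
// bytes whether or not the encoded output was chunked, assuming unbase64's
// decoding stays tolerant of the CRLF separators that MIME chunking inserts.
val row = spark.sql("SELECT unbase64(base64('Spark SQL')) AS b").head()
assert(new String(row.getAs[Array[Byte]]("b"), "UTF-8") == "Spark SQL")
```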
wForget added a commit to wForget/spark that referenced this pull request Jul 19, 2024
…ng.enabled from a legacy/internal config to a regular/public one


Closes apache#47410 from wForget/SPARK-47307_followup.

Lead-authored-by: wforget <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>

(cherry picked from commit af5eb08)
yaooqinn pushed a commit that referenced this pull request Jul 20, 2024
…4String.enabled from a legacy/internal config to a regular/public one

Backports #47410 to 3.5


Closes #47416 from wForget/SPARK-47307_followup_3.5.

Authored-by: wforget <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
Follow up apache#45408


Closes apache#47303 from wForget/SPARK-47307.

Lead-authored-by: Ted Jenks <[email protected]>
Co-authored-by: wforget <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Co-authored-by: Ted Chester Jenks <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
…change of base64 function


Closes apache#47371 from wForget/SPARK-47307_doc.

Authored-by: wforget <[email protected]>
Signed-off-by: allisonwang-db <[email protected]>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
…ng.enabled from a legacy/internal config to a regular/public one


Closes apache#47410 from wForget/SPARK-47307_followup.

Lead-authored-by: wforget <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
MaxGekk added a commit that referenced this pull request Sep 2, 2024
### What changes were proposed in this pull request?

Fix the nullability of the `Base64` expression to be based on the child's nullability, and not always be nullable.

### Why are the changes needed?

#47303 had the side effect of changing the nullability through the switch to `StaticInvoke`. This was also backported to Spark 3.5.2 and caused schema mismatch errors for stateful streams when we upgraded. This restores the previous behavior, which `StaticInvoke` supports through the `returnNullable` argument: if the child is non-nullable, we know the result will be non-nullable.

### Does this PR introduce _any_ user-facing change?

Restores the nullability of the `Base64` expression to what it was in Spark 3.5.1 and earlier.

### How was this patch tested?

New UT

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47941 from Kimahriman/base64-nullability.

Lead-authored-by: Adam Binford <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
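
The fix described above amounts to passing the child's nullability through `StaticInvoke`. A simplified sketch (the helper class is a hypothetical stand-in and other constructor arguments are omitted):

```scala
// Propagate the child's nullability instead of relying on StaticInvoke's
// default of returnNullable = true, so a non-nullable input column yields
// a non-nullable base64 result. Base64Helper is a hypothetical stand-in.
StaticInvoke(
  staticObject = classOf[Base64Helper],
  dataType = StringType,
  functionName = "encode",
  arguments = Seq(child),
  returnNullable = child.nullable)
```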
MaxGekk added a commit that referenced this pull request Sep 2, 2024

Closes #47941 from Kimahriman/base64-nullability.

Lead-authored-by: Adam Binford <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
(cherry picked from commit c274c5a)
Signed-off-by: Max Gekk <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
Backports apache#47303 to 3.5


Closes apache#47325 from wForget/SPARK-47307_3.5.

Lead-authored-by: wforget <[email protected]>
Co-authored-by: Ted Jenks <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
* [SPARK-49476][SQL] Fix nullability of base64 function


* [SPARK-49476][SQL][3.5][FOLLOWUP] Fix base64 proto test

### What changes were proposed in this pull request?

Fix a test that is failing from backporting apache#47941

### Why are the changes needed?

Fix test

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Fixed test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47964 from Kimahriman/base64-proto-test.

Authored-by: Adam Binford <[email protected]>
Signed-off-by: Kent Yao <[email protected]>


---------

Signed-off-by: Max Gekk <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
Co-authored-by: Adam Binford <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>