
Conversation

@wForget (Member) commented Jul 11, 2024

Follow-up to #45408

What changes were proposed in this pull request?

[SPARK-47307] Add a config to optionally chunk base64 strings

Why are the changes needed?

In #35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true, as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

whose documentation states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

The RFC 2045 (MIME) base64 encoder, however, does chunk by default. This means that base64 strings encoded by Spark can no longer be decoded by decoders that do not implement RFC 2045, even though the docs state RFC 4648.
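
To make the behavior difference concrete, here is a minimal `java.util.Base64` sketch (illustrative, not part of this patch):

```scala
import java.util.Base64

// 100 input bytes encode to 136 base64 characters, which exceeds the
// 76-character MIME line limit, so the MIME encoder inserts a CRLF.
val data = Array.fill[Byte](100)('a'.toByte)

val basic = Base64.getEncoder.encodeToString(data)     // RFC 4648: no line breaks
val mime  = Base64.getMimeEncoder.encodeToString(data) // RFC 2045: CRLF every 76 chars

assert(!basic.contains("\r\n"))
assert(mime.contains("\r\n"))
```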

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suite.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions bot added the SQL label Jul 11, 2024
"""})
if (chunkBase64) {
s"""${ev.value} = UTF8String.fromBytes(
${classOf[JBase64].getName}.getMimeEncoder().encode($child));
Member:
Why don't we use the encoder directly?

Member Author (wForget):

> Why don't we use the encoder directly?

java.util.Base64$Encoder is not serializable.

```
-- !query
select base64(c7), base64(c8), base64(v), ascii(s) from char_tbl4
-- !query schema
struct<>
-- !query output
java.io.NotSerializableException
java.util.Base64$Encoder
Serialization stack:
	- object not serializable (class: java.util.Base64$Encoder, value: java.util.Base64$Encoder@423ed07f)
	- element of array (index: 2)
	- array (class [Ljava.lang.Object;, size 5)
	- field (class: org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory, name: org$apache$spark$sql$execution$WholeStageCodegenEvaluatorFactory$$references, type: class [Ljava.lang.Object;)
	- object (class org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory, org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory@2fd9633e)
	- element of array (index: 0)
	- array (class [Ljava.lang.Object;, size 1)
	- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
	- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.WholeStageCodegenExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/execution/WholeStageCodegenEvaluatorFactory;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=1])
	- writeReplace data (class: java.lang.invoke.SerializedLambda)
	- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$2458/0x000002cf3e949c30, org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$2458/0x000002cf3e949c30@603a0fa7)
```
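
The root cause can be verified outside Spark (illustrative check, not from the PR):

```scala
// java.util.Base64.Encoder does not implement java.io.Serializable, so an
// encoder instance captured in codegen references cannot survive task
// serialization, producing the NotSerializableException above.
val encoder = java.util.Base64.getMimeEncoder
assert(!encoder.isInstanceOf[java.io.Serializable])
```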

Member:

Maybe we can follow the work I have done for Encode and make it RuntimeReplaceable with StaticInvoke.

Member Author (wForget):

> Maybe we can follow the work I have done for Encode and make it RuntimeReplaceable with StaticInvoke.

Thanks, I will try it.
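
For readers unfamiliar with the suggestion, a rough sketch of the RuntimeReplaceable-plus-StaticInvoke pattern follows. `Base64Helper` and its `encode` method are hypothetical stand-ins, and Catalyst's real constructors take additional parameters; this is not the code that was eventually merged:

```scala
// Fragment of a hypothetical RuntimeReplaceable expression: at optimization
// time it rewrites itself into a static method call, so only the class
// reference and argument expressions are serialized. No encoder instance
// is captured, which sidesteps the NotSerializableException above.
override def replacement: Expression = StaticInvoke(
  staticObject = classOf[Base64Helper],   // hypothetical static helper
  dataType = StringType,
  functionName = "encode",
  arguments = Seq(child, Literal(chunkBase64)))
```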

```scala
    .booleanConf
    .createWithDefault(false)

  val CHUNK_BASE_64_STRING_ENABLED = buildConf("spark.sql.legacy.chunkBase64String.enabled")
```
Member:
nit: BASE_64 to BASE64
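
Taking the nit into account, a hedged reconstruction of the full config entry as it would sit in SQLConf (the `.doc()` text and `.version()` are illustrative assumptions, not quotes from the patch):

```scala
// Illustrative reconstruction; doc text and version are assumptions.
val CHUNK_BASE64_STRING_ENABLED =
  buildConf("spark.sql.legacy.chunkBase64String.enabled")
    .internal()
    .doc("When true, the base64 function chunks its output into lines of at " +
      "most 76 characters (RFC 2045); when false, it returns a single " +
      "continuous string (RFC 4648).")
    .version("4.0.0")
    .booleanConf
    .createWithDefault(false)
```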

allisonwang-db pushed a commit that referenced this pull request Jul 17, 2024
…change of base64 function


### What changes were proposed in this pull request?

Follow up to #47303

Add a migration guide for the behavior change of `base64` function

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
doc change

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47371 from wForget/SPARK-47307_doc.

Authored-by: wforget <[email protected]>
Signed-off-by: allisonwang-db <[email protected]>
allisonwang-db pushed a commit that referenced this pull request Jul 17, 2024
…change of base64 function


Closes #47371 from wForget/SPARK-47307_doc.

Authored-by: wforget <[email protected]>
Signed-off-by: allisonwang-db <[email protected]>
(cherry picked from commit b2e0a4d)
Signed-off-by: allisonwang-db <[email protected]>
@gatorsmile (Member)

[image]

@wForget is it true?

@yaooqinn @dongjoon-hyun @wForget I think we need to discuss this on the dev list before merging this PR. This will break encryption workflows that rely on base64.

I would suggest reverting it first, until a consensus is reached on the dev list.

@dongjoon-hyun (Member)

Thank you for the feedback, @gatorsmile.

I have three questions to understand your request.

  • Do you see the example, SPARK-47307? What is your opinion?
  • We have the following migration guide; do you want to set spark.sql.legacy.chunkBase64String.enabled=true by default in Apache Spark 3.5.2?

    > Since 3.5.2, the base64 function will return a non-chunked string. To restore the behavior of chunking base64 encoded strings into lines of at most 76 characters, set spark.sql.legacy.chunkBase64String.enabled to true.

  • If we change the default value, do we still need to revert?
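
For reference, restoring the chunked output described in that migration guide is a one-line session setting (example usage, not from the thread):

```scala
// Re-enable RFC 2045 chunking for base64 output in an existing SparkSession.
spark.conf.set("spark.sql.legacy.chunkBase64String.enabled", "true")

// 100 'a' characters encode to 136 base64 characters, so the chunked
// result contains a line break after the 76th character.
spark.sql("SELECT base64(repeat('a', 100))").show(truncate = false)
```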

@cloud-fan (Contributor)

I think the PR description should be improved to include more information. It's a behavior change, so we should clearly define the behavior difference and measure the impact. Does it mean the Spark base64 result can be decoded by more decoders now? Is there any standard decoder that may fail to decode after this change?
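
For what it's worth, `java.util.Base64` itself illustrates the failure mode: strict RFC 4648 decoders reject chunked input, while MIME decoders accept both forms (illustrative, not from the thread):

```scala
import java.util.Base64

val chunked = Base64.getMimeEncoder.encodeToString(Array.fill[Byte](100)(1))

// The lenient MIME decoder skips the CRLF line separators.
Base64.getMimeDecoder.decode(chunked)

// The strict RFC 4648 decoder throws IllegalArgumentException on '\r'.
try Base64.getDecoder.decode(chunked)
catch { case e: IllegalArgumentException => println(s"strict decoder failed: $e") }
```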

@wForget (Member Author) commented Jul 19, 2024

Sorry, I didn't update the PR description after changing the default value of the new config.
Actually, a behavior change was unexpectedly introduced in #35110; this PR aims to make the chunking behavior of base64 encoding configurable. If the current behavior change is controversial, I will first submit a PR to change the default value of the newly introduced config, which would revert to the previous behavior. WDYT?

@yaooqinn (Member)

> I will first submit a PR to change the default value of the newly introduced config, which would revert to the previous behavior.

+1

yaooqinn added a commit that referenced this pull request Jul 19, 2024
…ng.enabled from a legacy/internal config to a regular/public one

### What changes were proposed in this pull request?

+ Promote spark.sql.legacy.chunkBase64String.enabled from a legacy/internal config to a regular/public one.
+ Add test cases for unbase64

### Why are the changes needed?

Keep the same behavior as before. More details: #47303 (comment)

### Does this PR introduce _any_ user-facing change?

yes, revert behavior change introduced in #47303

### How was this patch tested?

existing unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47410 from wForget/SPARK-47307_followup.

Lead-authored-by: wforget <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
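
A minimal sketch of the kind of unbase64 round trip these tests exercise (assertion style and session setup are assumptions):

```scala
// Round-trip sanity check: unbase64(base64(x)) should return the original
// bytes whether or not the encoded output was chunked, assuming unbase64's
// decoding stays tolerant of the CRLF separators that MIME chunking inserts.
val row = spark.sql("SELECT unbase64(base64('Spark SQL')) AS b").head()
assert(new String(row.getAs[Array[Byte]]("b"), "UTF-8") == "Spark SQL")
```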
wForget added a commit to wForget/spark that referenced this pull request Jul 19, 2024
…ng.enabled from a legacy/internal config to a regular/public one


Closes apache#47410 from wForget/SPARK-47307_followup.

Lead-authored-by: wforget <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>

(cherry picked from commit af5eb08)
yaooqinn pushed a commit that referenced this pull request Jul 20, 2024
…4String.enabled from a legacy/internal config to a regular/public one

Backports #47410 to 3.5


Closes #47416 from wForget/SPARK-47307_followup_3.5.

Authored-by: wforget <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
Follow up apache#45408


Closes apache#47303 from wForget/SPARK-47307.

Lead-authored-by: Ted Jenks <[email protected]>
Co-authored-by: wforget <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Co-authored-by: Ted Chester Jenks <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
…change of base64 function


Closes apache#47371 from wForget/SPARK-47307_doc.

Authored-by: wforget <[email protected]>
Signed-off-by: allisonwang-db <[email protected]>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
…ng.enabled from a legacy/internal config to a regular/public one


Closes apache#47410 from wForget/SPARK-47307_followup.

Lead-authored-by: wforget <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
MaxGekk added a commit that referenced this pull request Sep 2, 2024
### What changes were proposed in this pull request?

Fix the nullability of the `Base64` expression to be based on the child's nullability, and not always be nullable.

### Why are the changes needed?

#47303 had the side effect of changing the nullability through the switch to `StaticInvoke`. This was also backported to Spark 3.5.2 and caused schema mismatch errors for stateful streams when we upgraded. This restores the previous behavior, which `StaticInvoke` supports through the `returnNullable` argument: if the child is non-nullable, we know the result will be non-nullable.

### Does this PR introduce _any_ user-facing change?

Restores the nullability of the `Base64` expression to what it was in Spark 3.5.1 and earlier.

### How was this patch tested?

New UT

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47941 from Kimahriman/base64-nullability.

Lead-authored-by: Adam Binford <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
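
The fix described above amounts to passing the child's nullability through `StaticInvoke`. A simplified sketch (the helper class is a hypothetical stand-in and other constructor arguments are omitted):

```scala
// Propagate the child's nullability instead of relying on StaticInvoke's
// default of returnNullable = true, so a non-nullable input column yields
// a non-nullable base64 result. Base64Helper is a hypothetical stand-in.
StaticInvoke(
  staticObject = classOf[Base64Helper],
  dataType = StringType,
  functionName = "encode",
  arguments = Seq(child),
  returnNullable = child.nullable)
```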
MaxGekk added a commit that referenced this pull request Sep 2, 2024

Closes #47941 from Kimahriman/base64-nullability.

Lead-authored-by: Adam Binford <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
(cherry picked from commit c274c5a)
Signed-off-by: Max Gekk <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
Backports apache#47303 to 3.5


Closes apache#47325 from wForget/SPARK-47307_3.5.

Lead-authored-by: wforget <[email protected]>
Co-authored-by: Ted Jenks <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
* [SPARK-49476][SQL] Fix nullability of base64 function


* [SPARK-49476][SQL][3.5][FOLLOWUP] Fix base64 proto test

### What changes were proposed in this pull request?

Fix a test that is failing from backporting apache#47941

### Why are the changes needed?

Fix test

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Fixed test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47964 from Kimahriman/base64-proto-test.

Authored-by: Adam Binford <[email protected]>
Signed-off-by: Kent Yao <[email protected]>


---------

Signed-off-by: Max Gekk <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
Co-authored-by: Adam Binford <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>