GH-49614: [C++] Fix silent truncation in base64_decode on invalid input #49660

Open
Reranko05 wants to merge 1 commit into apache:main from Reranko05:fix-base64-invalid-input

Reranko05 commented Apr 4, 2026

Rationale for this change

arrow::util::base64_decode silently truncates output when encountering invalid base64 characters, returning partial results without signaling an error. This can lead to unintended data corruption.

What changes are included in this PR?

  • Add upfront validation of input characters in base64_decode
  • Return an empty string if invalid base64 characters are detected
  • Prevent silent truncation of decoded output

Are these changes tested?

Yes. A unit test has been added to verify that invalid input returns an empty string.

Are there any user-facing changes?

Yes. Previously, invalid base64 input could result in partial decoded output. Now, such inputs return an empty string.

This PR contains a "Critical Fix".

This change fixes a correctness issue where invalid base64 input could result in silently truncated output, leading to incorrect data being produced. The fix ensures such inputs are detected and handled safely.

github-actions bot commented Apr 4, 2026

⚠️ GitHub issue #49614 has been automatically assigned in GitHub to PR creator.


for (char c : encoded_string) {
  if (!(is_base64(c) || c == '=')) {
    return "";
Contributor

I’m not comfortable with "" as the error path here. It’s indistinguishable from a valid decode of empty input, so malformed input still fails silently. I’d prefer this API to fail explicitly (Result<std::string> / checked variant) and have Gandiva propagate that as an error.

Returning null would be slightly better than returning "", because at least it doesn’t collide with a valid decoded empty string. But I still don’t think it’s the right default behavior here as null still turns malformed input into a regular value rather than an explicit failure.
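
To make the collision concrete, here is a minimal, self-contained sketch of what a checked variant might look like. It is not the PR's code and not an Arrow API: `Base64DecodeChecked` and `Base64Value` are hypothetical names, and `std::optional` merely stands in for `arrow::Result<std::string>` so the example compiles without Arrow. It assumes the standard alphabet with mandatory padding and no embedded whitespace.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: maps a base64 alphabet character to its 6-bit value,
// or returns -1 for anything outside the standard alphabet.
inline int Base64Value(unsigned char c) {
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= 'a' && c <= 'z') return c - 'a' + 26;
  if (c >= '0' && c <= '9') return c - '0' + 52;
  if (c == '+') return 62;
  if (c == '/') return 63;
  return -1;
}

// Hypothetical checked decoder. std::nullopt signals malformed input, so an
// empty decoded string is only ever returned for genuinely empty input.
inline std::optional<std::string> Base64DecodeChecked(std::string_view in) {
  if (in.size() % 4 != 0) return std::nullopt;
  std::string out;
  out.reserve(in.size() / 4 * 3);
  for (std::size_t i = 0; i < in.size(); i += 4) {
    const bool last = (i + 4 == in.size());
    int pad = 0;
    std::uint32_t bits = 0;
    for (std::size_t j = 0; j < 4; ++j) {
      const unsigned char c = static_cast<unsigned char>(in[i + j]);
      if (c == '=') {
        // Padding is only accepted in the last two positions of the final group.
        if (!last || j < 2) return std::nullopt;
        ++pad;
        bits <<= 6;
      } else {
        const int v = Base64Value(c);
        // Reject characters outside the alphabet and data after padding.
        if (v < 0 || pad > 0) return std::nullopt;
        bits = (bits << 6) | static_cast<std::uint32_t>(v);
      }
    }
    out.push_back(static_cast<char>((bits >> 16) & 0xFF));
    if (pad < 2) out.push_back(static_cast<char>((bits >> 8) & 0xFF));
    if (pad < 1) out.push_back(static_cast<char>(bits & 0xFF));
  }
  return out;
}
```

With this shape, a caller can tell `Base64DecodeChecked("")` (a value holding an empty string) apart from `Base64DecodeChecked("hello world!")` (no value). A real implementation would presumably return an error status with a message rather than a bare empty optional.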

std::string ret;

for (char c : encoded_string) {
  if (!(is_base64(c) || c == '=')) {
Contributor

This is absolutely insufficient and will not trip on input like abcd=AAA. Please do some research on best practices for sufficient and efficient base64 input validation.
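
For illustration, a structural check that does reject `abcd=AAA` could look roughly like the sketch below. This is not the code from the PR. `IsBase64Char` and `IsWellFormedBase64` are hypothetical names, and the sketch assumes the standard alphabet, mandatory padding to a multiple of four characters, and no embedded whitespace.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not part of Arrow: true for the 64 alphabet characters.
inline bool IsBase64Char(unsigned char c) {
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
         (c >= '0' && c <= '9') || c == '+' || c == '/';
}

// Hypothetical validator: checks length, alphabet, and padding placement.
inline bool IsWellFormedBase64(std::string_view s) {
  // Padded base64 always has a length that is a multiple of four.
  if (s.size() % 4 != 0) return false;

  // Count trailing '=' characters, accepting at most two.
  std::size_t pad = 0;
  while (pad < s.size() && pad < 2 && s[s.size() - 1 - pad] == '=') ++pad;

  // Everything before the padding must be an alphabet character, so a '='
  // in the middle of the input (as in "abcd=AAA") is rejected here.
  const std::size_t data_len = s.size() - pad;
  for (std::size_t i = 0; i < data_len; ++i) {
    if (!IsBase64Char(static_cast<unsigned char>(s[i]))) return false;
  }
  return true;
}
```

The per-character test alone accepts `=` anywhere, which is why `abcd=AAA` slips through. Restricting `=` to the last one or two positions closes that gap. The sketch accepts empty input as well-formed, since it decodes to an empty string, and it does not check that the unused trailing bits before padding are zero.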

std::string input = "hello world!"; // invalid base64
std::string output = arrow::util::base64_decode(input);

EXPECT_EQ(output, "");
Contributor

More tests! In our day and age with tools that we have this is not even bare minimum. Null input? Valid input? Non-ascii input?

Did you locate other tests? I'm not seeing any other tests for base64_decode in this file so where are they?

Reranko05 force-pushed the fix-base64-invalid-input branch from 4670ec5 to 5c7db64 on April 4, 2026 20:38
@Reranko05
Author

Thanks for the feedback. I’ve updated the implementation and tests.

  • Added stricter validation (length, padding placement/count, allowed characters)
  • Removed early termination in the decode loop to avoid silent truncation
  • Expanded test coverage to include invalid inputs and edge cases

All tests pass locally. Please let me know if any further adjustments are needed.

Reranko05 force-pushed the fix-base64-invalid-input branch from 5c7db64 to 8f053b7 on April 4, 2026 21:07
Member

kou left a comment

Could you use arrow::Result<std::string> return type instead of using ARROW_LOG()?

Member

kou commented Apr 5, 2026

FYI: You can run CI on your fork by enabling GitHub Actions on your fork.

@Reranko05
Author

Could you use arrow::Result<std::string> return type instead of using ARROW_LOG()?

Thanks for the suggestion @kou !

Just to clarify, would you prefer changing the existing base64_decode API to return arrow::Result<std::string>, or introducing a separate checked variant while keeping the current API unchanged?

I want to make sure the approach aligns with existing usage and expectations.

Member

kou commented Apr 5, 2026

"changing the existing base64_decode API to return arrow::Result<std::string>".
But I want to know how many changes are required for existing code that uses base64_decode().

@Reranko05
Author

@kou I checked the current usages of base64_decode(), and it appears to be used in a very limited number of places (primarily in tests and one internal call site in flight_test.cc).

Updating to arrow::Result<std::string> would require adjusting those call sites to use ARROW_ASSIGN_OR_RAISE, but the impact seems quite localized and manageable.

I can proceed with the API change and update the affected call sites accordingly.
