GH-49614: [C++] Fix silent truncation in base64_decode on invalid input#49660
GH-49614: [C++] Fix silent truncation in base64_decode on invalid input#49660Reranko05 wants to merge 1 commit intoapache:mainfrom
Conversation
|
|
|
|
||
| for (char c : encoded_string) { | ||
| if (!(is_base64(c) || c == '=')) { | ||
| return ""; |
There was a problem hiding this comment.
I’m not comfortable with "" as the error path here. It’s indistinguishable from a valid decode of empty input, so malformed input still fails silently. I’d prefer this API to fail explicitly (Result<std::string> / checked variant) and have Gandiva propagate that as an error.
Returning null would be slightly better than returning "", because at least it doesn’t collide with a valid decoded empty string. But I still don’t think it’s the right default behavior here as null still turns malformed input into a regular value rather than an explicit failure.
cpp/src/arrow/vendored/base64.cpp
Outdated
| std::string ret; | ||
|
|
||
| for (char c : encoded_string) { | ||
| if (!(is_base64(c) || c == '=')) { |
There was a problem hiding this comment.
This is absolutely insufficient and will not trip on input like abcd=AAA. Please do some research on best practices for sufficient and efficient base64 input validation.
cpp/src/arrow/util/string_test.cc
Outdated
| std::string input = "hello world!"; // invalid base64 | ||
| std::string output = arrow::util::base64_decode(input); | ||
|
|
||
| EXPECT_EQ(output, ""); |
There was a problem hiding this comment.
More tests! In our day and age with tools that we have this is not even bare minimum. Null input? Valid input? Non-ascii input?
Did you locate other tests? I'm not seeing any other tests for base64_decode in this file so where are they?
4670ec5 to
5c7db64
Compare
|
Thanks for the feedback. I’ve updated the implementation and tests.
All tests pass locally. Please let me know if any further adjustments are needed. |
5c7db64 to
8f053b7
Compare
kou
left a comment
There was a problem hiding this comment.
Could you use arrow::Result<std::string> return type instead of using ARROW_LOG()?
|
FYI: You can run CI on your fork by enabling GitHub Actions on your fork. |
Thanks for the suggestion @kou ! Just to clarify, would you prefer changing the existing base64_decode API to return arrow::Resultstd::string, or introducing a separate checked variant while keeping the current API unchanged? I want to make sure the approach aligns with existing usage and expectations. |
|
"changing the existing base64_decode API to return arrow::Resultstd::string". |
|
@kou I checked the current usages of Updating to I can proceed with the API change and update the affected call sites accordingly. |
Rationale for this change
arrow::util::base64_decodesilently truncates output when encountering invalid base64 characters, returning partial results without signaling an error. This can lead to unintended data corruption.What changes are included in this PR?
Are these changes tested?
Yes. A unit test has been added to verify that invalid input returns an empty string.
Are there any user-facing changes?
Yes. Previously, invalid base64 input could result in partial decoded output. Now, such inputs return an empty string.
This PR contains a "Critical Fix".
This change fixes a correctness issue where invalid base64 input could result in silently truncated output, leading to incorrect data being produced. The fix ensures such inputs are detected and handled safely.