fix(rust): Bound polars-parquet thrift list capacity by remaining input #27490
Open
masumi-ryugo wants to merge 1 commit into pola-rs:main
Three sites in `crates/polars-parquet/src/parquet/handwritten_thrift`
called `Vec::with_capacity(list_size as usize)` over the
attacker-controlled list-size varint emitted by `read_list_begin`:
- `read_thrift_vec` (parquet_thrift.rs)
- `read_list` (file_metadata_thrift.rs, the local helper that
drives `decode_file_metadata`)
`read_list_begin` itself silently truncated the varint to `i32`,
so a varint that decoded above `i32::MAX` would wrap into a
negative or otherwise wrong size that downstream allocation code
then had to re-validate.
A 5-minute cargo-fuzz run of a `polars-parquet/fuzz/thrift_metadata_decode`
harness produces a 7-byte input (`2b f9 ee ee ee ee 43`) that drives
`Vec::with_capacity` past 100 GiB and OOMs the process:
==1345626== ERROR: libFuzzer: out-of-memory (malloc(107932189872))
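The arithmetic behind that allocation size can be reproduced in a few lines. This is a sketch under stated assumptions, not the crate's code: it assumes the trailing five bytes of the input (`ee ee ee ee 43`) are the list-size varint, that the old truncation behaved like `as i32`, and that each list element occupies 104 bytes in memory (a figure inferred from the reported `malloc` size, not stated anywhere in the patch). The local `read_vlq` is a hypothetical stand-in for the reader method of the same name.

```rust
/// Decode an unsigned LEB128 varint, the encoding Thrift compact uses for sizes.
/// Returns the value and the number of bytes consumed, or `None` if the input
/// ends before a byte without the continuation bit is seen.
/// Hypothetical stand-in for the crate's own `read_vlq`.
fn read_vlq(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    // A u64 varint never needs more than 10 bytes.
    for (i, &b) in bytes.iter().enumerate().take(10) {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

fn main() {
    // ASSUMPTION: these are the list-size bytes of the 7-byte fuzzer input.
    let varint = [0xee, 0xee, 0xee, 0xee, 0x43];
    let (declared, _used) = read_vlq(&varint).expect("complete varint");
    println!("declared size   = {declared}"); // 18217678702

    // Old behaviour (assumed `as i32`): keeps only the low 32 bits.
    let truncated = declared as i32;
    println!("truncated size  = {truncated}"); // 1037809518

    // New behaviour: a checked conversion rejects the value outright.
    println!("try_from is_err = {}", i32::try_from(declared).is_err()); // true

    // ASSUMPTION: 104-byte elements, inferred from the malloc size in the log.
    const ASSUMED_ELEM_SIZE: u64 = 104;
    println!("allocation      = {}", truncated as u64 * ASSUMED_ELEM_SIZE); // 107932189872
}
```

Under these assumptions the truncated size times the element size equals the `malloc(107932189872)` request in the libFuzzer report, and the checked conversion turns the same input into an error before any allocation happens.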
Three changes defuse this and keep the failure mode well-defined
under malformed input:
1. `read_list_begin` now uses `i32::try_from(self.read_vlq()?)?`,
reporting a new `IntegerOverflow` error for varints above
`i32::MAX` instead of silently wrapping.
2. A new default `ThriftCompactInputProtocol::bytes_remaining(&self)`
method returns `usize::MAX` for streaming readers and is overridden
by `ThriftSliceInputProtocol` to return `self.buf.len()`, the exact
byte count still in the slice.
3. `read_thrift_vec` and `read_list` cap the up-front capacity by
`min(declared, prot.bytes_remaining())` and then `try_reserve_exact`
as belt-and-suspenders. Each Thrift list element occupies at least
one byte on the wire, so a declared size larger than what the
reader has left is never legitimate input. A header that overstates
its list size now surfaces as `ParquetError::oos(...)` instead of
crashing the process.
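The capping scheme in item 3 can be sketched as follows. This is a minimal illustration, not the code in this PR: the trait `InputProtocol`, the struct `SliceProtocol`, the error enum `DecodeError`, and the function `read_byte_list` are all hypothetical, simplified stand-ins, and the list elements are plain bytes rather than Thrift structs.

```rust
/// Hypothetical, simplified stand-in for the input-protocol trait.
trait InputProtocol {
    fn read_byte(&mut self) -> Option<u8>;

    /// Upper bound on the bytes still available.
    /// Streaming readers cannot know, so the default is "unbounded".
    fn bytes_remaining(&self) -> usize {
        usize::MAX
    }
}

/// Slice-backed reader: knows exactly how much input is left.
struct SliceProtocol<'a> {
    buf: &'a [u8],
}

impl InputProtocol for SliceProtocol<'_> {
    fn read_byte(&mut self) -> Option<u8> {
        let (&first, rest) = self.buf.split_first()?;
        self.buf = rest;
        Some(first)
    }

    fn bytes_remaining(&self) -> usize {
        self.buf.len()
    }
}

#[derive(Debug, PartialEq)]
enum DecodeError {
    OutOfSpec(&'static str),
    Alloc,
}

/// Read `declared` one-byte elements, never reserving more than the input can hold.
fn read_byte_list<P: InputProtocol>(prot: &mut P, declared: usize) -> Result<Vec<u8>, DecodeError> {
    // Every element takes at least one byte on the wire, so the remaining
    // byte count is a safe upper bound on the number of elements.
    let cap = declared.min(prot.bytes_remaining());
    let mut out = Vec::new();
    // Fallible reservation as a second line of defence.
    out.try_reserve_exact(cap).map_err(|_| DecodeError::Alloc)?;
    for _ in 0..declared {
        let b = prot
            .read_byte()
            .ok_or(DecodeError::OutOfSpec("list longer than input"))?;
        out.push(b);
    }
    Ok(out)
}

fn main() {
    let mut ok = SliceProtocol { buf: &[1, 2, 3] };
    println!("{:?}", read_byte_list(&mut ok, 2));

    // A header that overstates its size fails cleanly after reserving 3 bytes.
    let mut bad = SliceProtocol { buf: &[1, 2, 3] };
    println!("{:?}", read_byte_list(&mut bad, usize::MAX));
}
```

In this sketch an overstated size is not rejected up front; the reservation is merely clamped, and the error surfaces when the element loop runs out of input.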
After this patch the same 7-byte input passes through the fuzz target
in <1 ms with exit 0; a 60-second confidence run on the same harness
under `-rss_limit_mb=512` yields 277,000 runs with no OOMs and a peak
RSS of 288 MiB.
`read_bytes_owned` on `ThriftReadInputProtocol` (the streaming
`Read`-backed protocol) has the same `Vec::with_capacity(len)` shape
and would benefit from a chunked-read fix in the spirit of
apache/arrow-rs#9869, but it sits on a different code path than the
slice-backed footer decoder this PR exercises and deserves its own
follow-up.
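For a `Read`-backed source, one way to avoid trusting the declared length is to let the buffer grow only as bytes actually arrive. The sketch below uses the standard `Read::take` plus `read_to_end` idiom; the function name `read_bytes_bounded` is hypothetical, and this is not necessarily the approach taken in apache/arrow-rs#9869 or the one a follow-up here would use.

```rust
use std::io::{self, Read};

/// Read exactly `len` bytes without using `len` for the initial allocation.
/// Hypothetical helper, shown only to illustrate the idea.
fn read_bytes_bounded<R: Read>(reader: &mut R, len: usize) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    // `take` stops after `len` bytes; `read_to_end` grows `out` incrementally,
    // so memory use tracks the data actually present in the stream.
    let got = reader.by_ref().take(len as u64).read_to_end(&mut out)?;
    if got != len {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "declared length exceeds available input",
        ));
    }
    Ok(out)
}

fn main() -> io::Result<()> {
    let mut src: &[u8] = &[1, 2, 3, 4];
    let bytes = read_bytes_bounded(&mut src, 3)?;
    println!("{bytes:?}, {} byte(s) left", src.len());

    // An overstated length yields an error instead of a huge allocation.
    let mut short: &[u8] = &[1, 2];
    println!("{:?}", read_bytes_bounded(&mut short, usize::MAX).map_err(|e| e.kind()));
    Ok(())
}
```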
Found by the cargo-fuzz harness being prototyped for pola-rs#27488.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What
Three sites in `crates/polars-parquet/src/parquet/handwritten_thrift` called `Vec::with_capacity(list_size as usize)` over the attacker-controlled list-size varint emitted by `read_list_begin`:
- `read_thrift_vec` (parquet_thrift.rs)
- `read_list` (file_metadata_thrift.rs, the local helper that drives `decode_file_metadata`)
`read_list_begin` itself also silently truncated the varint to `i32`, so a varint that decoded above `i32::MAX` would wrap into a negative size that downstream allocation code then has to re-validate.

Repro
A 5-minute cargo-fuzz run of a `polars-parquet/fuzz/thrift_metadata_decode` harness produces a 7-byte input that drives `Vec::with_capacity` past 100 GiB and OOMs the process.
After this patch the same 7 bytes pass through the fuzz target in <1 ms with exit 0; a 60-second confidence run on the same harness under `-rss_limit_mb=512` yields 277,000 runs with no OOMs and a peak RSS of 288 MiB.

Changes
1. `read_list_begin` now uses `i32::try_from(self.read_vlq()?)?`, reporting a new `IntegerOverflow` error for varints above `i32::MAX` instead of silently wrapping into a negative `i32`.
2. `ThriftCompactInputProtocol::bytes_remaining(&self) -> usize` is a new default method that returns `usize::MAX` for streaming readers and is overridden by `ThriftSliceInputProtocol` to return `self.buf.len()`, the exact byte count still in the slice.
3. `read_thrift_vec` and `read_list` cap the up-front capacity by `min(declared, prot.bytes_remaining())` and then `try_reserve_exact` as belt-and-suspenders.
Each Thrift list element occupies at least one byte on the wire, so a declared size larger than what the reader has left is never legitimate input. A header that overstates its list size now surfaces as `ParquetError::oos(...)` instead of crashing the process.

Why `try_reserve_exact` alone isn't enough
The same shape of fix in apache/arrow-rs (apache/arrow-rs#9868) used `try_reserve_exact` without a `bytes_remaining` clamp, and that turned out to be insufficient under libFuzzer / ASan: on Linux with default overcommit the allocator returns success for the requested size, so the OOM only surfaces inside libFuzzer's malloc hook and the `try_reserve_exact` `Err` path is never taken. Capping by `bytes_remaining` keeps the allocation request proportional to the input size, which is the actual fix; `try_reserve_exact` then catches edge cases on protocols that can't compute a tight bound. (apache/arrow-rs#9883 fixes this on the arrow-rs side; same model applied here.)

Out-of-scope (follow-up)
`read_bytes_owned` on `ThriftReadInputProtocol` (the streaming `Read`-backed protocol) has the same `Vec::with_capacity(len)` shape but sits on a different code path than the slice-backed footer decoder this PR exercises. A chunked-read fix in the spirit of apache/arrow-rs#9869 would defuse it; happy to send that as a follow-up once this lands.
xref #27488 (sibling tracking issue for cargo-fuzz / OSS-Fuzz coverage of the polars-* reader stack).