
fix(rust): Bound polars-parquet thrift list capacity by remaining input#27490

Open
masumi-ryugo wants to merge 1 commit into pola-rs:main from masumi-ryugo:fix/polars-parquet-thrift-bound-list

Conversation

@masumi-ryugo

What

Three sites in crates/polars-parquet/src/parquet/handwritten_thrift called Vec::with_capacity(list_size as usize) over the attacker-controlled list-size varint emitted by read_list_begin:

  • read_thrift_vec (parquet_thrift.rs)
  • read_list (file_metadata_thrift.rs, the local helper that drives decode_file_metadata)

read_list_begin itself also silently truncated the varint to i32, so a varint that decoded above i32::MAX would wrap into a negative size that downstream allocation code then has to re-validate.
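The wrap is easy to reproduce in isolation (the value below is illustrative, not the fuzz input):

```rust
fn main() {
    // A varint that decodes above i32::MAX, e.g. 2^32 - 1:
    let varint: u64 = u32::MAX as u64;

    // The old `as i32` cast silently wraps it into a negative size...
    assert_eq!(varint as i32, -1);

    // ...whereas try_from reports the overflow so the caller can error out.
    assert!(i32::try_from(varint).is_err());
}
```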

Repro

A 5-minute cargo-fuzz run of a polars-parquet/fuzz/thrift_metadata_decode harness produces a 7-byte input that drives Vec::with_capacity past 100 GiB and OOMs the process:

input bytes: 2b f9 ee ee ee ee 43

==1345626== ERROR: libFuzzer: out-of-memory (malloc(107932189872))

After this patch the same 7 bytes pass through the fuzz target in <1 ms with exit 0; a 60-second confidence run on the same harness under -rss_limit_mb=512 yields 277,000 runs with no OOMs and a peak RSS of 288 MiB.

Changes

  1. read_list_begin now uses i32::try_from(self.read_vlq()?)?, reporting a new IntegerOverflow error for varints above i32::MAX instead of silently wrapping into a negative i32.

  2. ThriftCompactInputProtocol::bytes_remaining(&self) -> usize — a new default method that returns usize::MAX for streaming readers and is overridden by ThriftSliceInputProtocol to return self.buf.len(), the exact byte count still in the slice.

  3. read_thrift_vec and read_list cap the up-front capacity by min(declared, prot.bytes_remaining()) and then try_reserve_exact as belt-and-suspenders:

    let declared = list_ident.size as usize;
    let bound = declared.min(prot.bytes_remaining());
    let mut res: Vec<T> = Vec::new();
    if res.try_reserve_exact(bound).is_err() {
        return Err(general_err!(...));
    }
    for _ in 0..list_ident.size {
        // EOF surfaces here if the header overstated the size
        ...
    }

Each Thrift list element occupies at least one byte on the wire, so a declared size larger than what the reader has left is never legitimate input. A header that overstates its list size now surfaces as ParquetError::oos(...) instead of crashing the process.
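The shape of changes 2 and 3 can be sketched with a toy reader (trait and type names follow the PR text, but the reader and element type here are stand-ins, not the actual polars-parquet implementation):

```rust
// Hedged sketch: default bytes_remaining() plus the capacity clamp,
// with a toy slice-backed reader standing in for the real types.
trait ThriftCompactInputProtocol {
    /// Streaming readers can't see ahead, so the default is "unbounded".
    fn bytes_remaining(&self) -> usize {
        usize::MAX
    }
    fn read_u8(&mut self) -> Result<u8, String>;
}

struct ThriftSliceInputProtocol<'a> {
    buf: &'a [u8],
}

impl<'a> ThriftCompactInputProtocol for ThriftSliceInputProtocol<'a> {
    fn bytes_remaining(&self) -> usize {
        self.buf.len() // exact byte count still in the slice
    }
    fn read_u8(&mut self) -> Result<u8, String> {
        let (&b, rest) = self
            .buf
            .split_first()
            .ok_or_else(|| "EOF: header overstated list size".to_string())?;
        self.buf = rest;
        Ok(b)
    }
}

/// Reads a list of single-byte elements with the clamped pre-allocation.
fn read_byte_list(
    prot: &mut impl ThriftCompactInputProtocol,
    declared: usize,
) -> Result<Vec<u8>, String> {
    // Every element is at least one byte on the wire, so a declared size
    // beyond bytes_remaining() can never be legitimate input.
    let bound = declared.min(prot.bytes_remaining());
    let mut res: Vec<u8> = Vec::new();
    res.try_reserve_exact(bound)
        .map_err(|e| format!("allocation failed: {e}"))?;
    for _ in 0..declared {
        res.push(prot.read_u8()?); // EOF surfaces if the header lied
    }
    Ok(res)
}

fn main() {
    // A header claiming ~4 billion elements over a 3-byte payload no longer
    // pre-allocates gigabytes: the clamp bounds the reserve at 3, and the
    // read loop errors out at EOF.
    let mut r = ThriftSliceInputProtocol { buf: &[1, 2, 3] };
    let err = read_byte_list(&mut r, u32::MAX as usize).unwrap_err();
    assert!(err.contains("EOF"));

    // Well-formed input still round-trips.
    let mut ok = ThriftSliceInputProtocol { buf: &[9, 8] };
    assert_eq!(read_byte_list(&mut ok, 2).unwrap(), vec![9, 8]);
}
```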

Why try_reserve_exact alone isn't enough

The same shape of fix in apache/arrow-rs (apache/arrow-rs#9868) used try_reserve_exact without a bytes_remaining clamp, and that turned out to be insufficient under libFuzzer / ASan: on Linux with default overcommit the allocator returns success for the requested size, so the OOM only surfaces inside libFuzzer's malloc hook and the try_reserve_exact Err path is never taken. Capping by bytes_remaining keeps the allocation request proportional to the input size, which is the actual fix; try_reserve_exact then catches edge cases on protocols that can't compute a tight bound. (apache/arrow-rs#9883 fixes this on the arrow-rs side; same model applied here.)

Out-of-scope (follow-up)

read_bytes_owned on ThriftReadInputProtocol (the streaming Read-backed protocol) has the same Vec::with_capacity(len) shape but sits on a different code path than the slice-backed footer decoder this PR exercises. A chunked-read fix in the spirit of apache/arrow-rs#9869 would defuse it; happy to send that as a follow-up once this lands.
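A chunked read of that shape could look like the following (a hypothetical helper, not the polars-parquet or arrow-rs code; the 4 KiB chunk size is an arbitrary choice):

```rust
use std::io::Read;

// Hedged sketch: grow the buffer in bounded chunks instead of trusting the
// declared length with a single up-front Vec::with_capacity(len).
fn read_bytes_chunked(mut reader: impl Read, len: usize) -> std::io::Result<Vec<u8>> {
    const CHUNK: usize = 4096; // bounds each reservation, not the total
    let mut out = Vec::new();
    let mut remaining = len;
    while remaining > 0 {
        let step = remaining.min(CHUNK);
        let start = out.len();
        out.resize(start + step, 0);
        // A short input errors here with UnexpectedEof instead of OOMing.
        reader.read_exact(&mut out[start..])?;
        remaining -= step;
    }
    Ok(out)
}

fn main() -> std::io::Result<()> {
    let data = vec![7u8; 10_000];
    let out = read_bytes_chunked(&data[..], 10_000)?;
    assert_eq!(out, data);

    // A length claiming far more than the input yields an error, not an OOM.
    assert!(read_bytes_chunked(&data[..], usize::MAX).is_err());
    Ok(())
}
```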

xref #27488 (sibling tracking issue for cargo-fuzz / OSS-Fuzz coverage of the polars-* reader stack).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

  • A-io-parquet (Area: reading/writing Parquet files)
  • first-contribution (First contribution by user)
  • fix (Bug fix)
  • rust (Related to Rust Polars)
