**parquet/src/file/serialized_reader.rs** (4 changes: 2 additions & 2 deletions)

```diff
@@ -387,8 +387,6 @@ pub(crate) fn decode_page(
         can_decompress = header_v2.is_compressed.unwrap_or(true);
     }
 
-    // TODO: page header could be huge because of statistics. We should set a
-    // maximum page header size and abort if that is exceeded.
     let buffer = match decompressor {
         Some(decompressor) if can_decompress => {
             let uncompressed_page_size = usize::try_from(page_header.uncompressed_page_size)?;
@@ -398,6 +396,8 @@ pub(crate) fn decode_page(
             let decompressed_size = uncompressed_page_size - offset;
             let mut decompressed = Vec::with_capacity(uncompressed_page_size);
             decompressed.extend_from_slice(&buffer[..offset]);
+            // decompressed size of zero corresponds to a page with only null values
+            // see https://github.com/apache/parquet-format/blob/master/README.md#data-pages
             if decompressed_size > 0 {
                 let compressed = &buffer[offset..];
                 decompressor.decompress(compressed, &mut decompressed, Some(decompressed_size))?;
```
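For context while reading the thread below, here is a minimal sketch of the decompression path the diff touches. It is not the actual parquet-rs code: the `Codec` trait and the `decompress_page_data` helper are simplified, hypothetical stand-ins for the real interfaces inside `decode_page`.

```rust
/// Simplified stand-in for parquet-rs's codec interface; not the real trait.
trait Codec {
    fn decompress(
        &mut self,
        input: &[u8],
        output: &mut Vec<u8>,
        uncompress_size: Option<usize>,
    ) -> Result<usize, Box<dyn std::error::Error>>;
}

/// Hypothetical helper mirroring the diffed region of `decode_page`.
/// `offset` is the byte length of the v2 repetition/definition level data,
/// which is stored uncompressed at the front of the page buffer.
fn decompress_page_data(
    decompressor: &mut dyn Codec,
    buffer: &[u8],
    offset: usize,
    uncompressed_page_size: usize,
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let decompressed_size = uncompressed_page_size - offset;
    let mut decompressed = Vec::with_capacity(uncompressed_page_size);
    // The level data is not compressed; copy it through unchanged.
    decompressed.extend_from_slice(&buffer[..offset]);
    // A decompressed size of zero means the page encodes no data bytes
    // (e.g. every value is null), so skip the codec rather than asking
    // it to inflate an empty input.
    if decompressed_size > 0 {
        let compressed = &buffer[offset..];
        decompressor.decompress(compressed, &mut decompressed, Some(decompressed_size))?;
    }
    Ok(decompressed)
}
```

The point under discussion is the `decompressed_size > 0` guard: a page whose values are all null encodes no data bytes, so there is nothing to hand to the codec.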
The following review thread is attached to the added comment lines.

**Member (mapleFU):**

Generally this is OK with me, but I don't know whether a page could just be empty (with num_rows == 0); I know most writers currently would not write that.

**Contributor Author:**
When I did some code coverage testing, I found it was rare but not impossible (it happens 3 times in the parquet-rs suite: #8756 (comment))

**Member (mapleFU):**

Would "page without values" or some similar wording be better here?

**etseidl (Contributor), Nov 3, 2025:**

> Would "page without values" or some similar wording be better here?

I think the current wording is correct. In this case num_values is non-zero, so there are values, but they all just happen to be null (and thus are not encoded in the data).

I'd think the ways decompressed_size can be zero are 1) a v2 page with all null values, or 2) a v1 page with no nesting and all nulls at the top level (so the definition level D is always 0). The second case would have definition level data.
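To make case 1 concrete, here is a hedged sketch of why an all-null v2 page yields a zero decompressed size. The struct below is a simplified mirror of the parquet-format `DataPageHeaderV2` Thrift fields, not the generated Rust type, and `v2_decompressed_size` is a hypothetical helper.

```rust
/// Simplified mirror of the relevant parquet-format DataPageHeaderV2 fields.
struct DataPageHeaderV2 {
    num_values: i32, // total values in the page, nulls included
    num_nulls: i32,
    repetition_levels_byte_length: i32, // v2 level data is stored uncompressed
    definition_levels_byte_length: i32,
}

/// Bytes the codec must produce for a v2 page: everything after the
/// (uncompressed) level data at the front of the page.
fn v2_decompressed_size(h: &DataPageHeaderV2, uncompressed_page_size: usize) -> usize {
    let level_bytes =
        (h.repetition_levels_byte_length + h.definition_levels_byte_length) as usize;
    uncompressed_page_size - level_bytes
}

fn main() {
    // An all-null flat column: 10 values, all null. The page body is just
    // the 2 bytes of RLE definition levels; no data bytes are encoded.
    let h = DataPageHeaderV2 {
        num_values: 10,
        num_nulls: 10,
        repetition_levels_byte_length: 0,
        definition_levels_byte_length: 2,
    };
    assert_eq!(h.num_values, h.num_nulls); // every value is null
    assert_eq!(v2_decompressed_size(&h, 2), 0);
}
```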

**Member (mapleFU):**

> I think the current wording is correct. In this case num_values is non-zero, so there are values, but they all just happen to be null (and thus are not encoded in the data).

I mean, I don't know whether a case where just num_rows (not num_values) == 0 would exist.

**Contributor Author:**
I agree a page with num_rows == 0 doesn't seem like it should happen.

However, my understanding of this comment is that it is trying to say that there are no non-null values.

I don't think I fully understand your comment @mapleFU -- if you think what is written is imprecise, could you perhaps suggest some alternate wording?

**Member (mapleFU):**

> However, my understanding of this comment is that it is trying to say that there are no non-null values.

The current comment is OK with me, but I wonder whether:

```rust
// decompressed size of zero corresponds to a page with no non-null values
```

would be better, since this also covers the case that num_rows == 0 => num_values == 0 && num_nulls == 0.

**Contributor Author:**

Got it -- that makes sense. Thank you.
