Handle sign-extension while decoding Parquet decimal stats by pramodsatya · Pull Request #22402 · rapidsai/cudf

pramodsatya · 2026-05-06T22:56:27Z

Description

Parquet decimal statistics for physical BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY
columns are signed two's-complement unscaled integers stored in big-endian byte
order. cuDF's Parquet stats filter only handled the full-width decimal128
case correctly, and it did not widen shorter stats payloads into the selected
cuDF decimal storage width with sign extension.

That can make predicate pushdown compare incorrect row-group or page min/max
values for decimal columns. The affected metadata is:

row-group ColumnMetaData.statistics.min_value / max_value
row-group deprecated ColumnMetaData.statistics.min / max
page-index ColumnIndex.min_values / max_values
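
To make the failure mode concrete, the sketch below widens a one-byte minimum of 0xFF (the unscaled value -1) into a 32-bit representation twice, once without and once with sign extension, and feeds both into a simplified pruning check. All function names here are hypothetical and none are taken from cuDF; the check is only a stand-in for the real stats filter.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: widen big-endian bytes WITHOUT sign extension.
// Assumes 1 <= n <= 4.
inline std::int32_t widen_zero_extended(std::uint8_t const* bytes, std::size_t n)
{
  std::uint32_t acc = 0;
  for (std::size_t i = 0; i < n; ++i) { acc = (acc << 8) | bytes[i]; }
  return static_cast<std::int32_t>(acc);
}

// Hypothetical helper: widen big-endian bytes WITH sign extension.
// Assumes 1 <= n <= 4.
inline std::int32_t widen_sign_extended(std::uint8_t const* bytes, std::size_t n)
{
  // Pre-fill the accumulator with the sign bit so the unused high bytes are correct.
  std::uint32_t acc = (bytes[0] & 0x80u) ? ~std::uint32_t{0} : 0u;
  for (std::size_t i = 0; i < n; ++i) { acc = (acc << 8) | bytes[i]; }
  return static_cast<std::int32_t>(acc);
}

// Simplified stand-in for a stats filter: a row group can be skipped for the
// predicate `col < literal` only when its minimum is already >= literal.
inline bool can_skip_for_less_than(std::int32_t row_group_min, std::int32_t literal)
{
  return row_group_min >= literal;
}
```

With the zero-extended minimum (255), the predicate `col < 0` would skip a row group that actually contains -1; with the sign-extended minimum (-1) the row group is kept.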

This change replaces the decimal128-specific byte-array stats decoder with a
templated decoder used for integral decimal storage representations. The new
decoder validates the stats payload size, accumulates bytes in Parquet's
big-endian order, and sign-extends negative values when the Parquet stats
payload is narrower than the cuDF storage type. Both the row-group and
page-index stats paths share this conversion helper, so the fix applies to both
levels of Parquet predicate pushdown.
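
The three steps described above (size validation, big-endian accumulation, sign extension) can be sketched as a standalone C++20 function. This is an illustration under stated assumptions and not the code from this PR: the name `decode_big_endian_signed` is hypothetical, it throws `std::invalid_argument` where cuDF would use its own error macros, and it is limited to the standard signed integer types.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <type_traits>

// Sketch only: decode a big-endian two's-complement payload into a signed integer T.
template <typename T>
T decode_big_endian_signed(std::uint8_t const* data, std::size_t size)
{
  static_assert(std::is_integral_v<T> && std::is_signed_v<T>, "T must be a signed integer");
  using U = std::make_unsigned_t<T>;

  // Validate the payload size before touching any bytes.
  if (size == 0 || size > sizeof(T)) {
    throw std::invalid_argument("unexpected decimal statistics size");
  }

  // Accumulate in the unsigned type so the arithmetic never overflows a signed value.
  U acc = 0;
  for (std::size_t i = 0; i < size; ++i) {
    acc = static_cast<U>((acc << CHAR_BIT) | data[i]);
  }

  // Negative payloads narrower than T need the unused high bytes set to 0xFF.
  bool const is_negative = (data[0] & 0x80u) != 0;
  if (is_negative && size < sizeof(T)) {
    acc |= static_cast<U>(~U{0} << (size * CHAR_BIT));
  }

  // In C++20 the unsigned-to-signed conversion is defined as modular.
  return static_cast<T>(acc);
}
```

For example, the two-byte payload `{0xFF, 0x00}` decodes to -256 as an `int32_t`, while a full-width payload is returned unchanged because no high bytes are left to fill.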

Added ParquetReaderTest.DecimalStatsFilterVariableWidthByteArrayStats to
cover positive, negative, short-width, full-width, BYTE_ARRAY,
FIXED_LEN_BYTE_ARRAY, decimal32, decimal64, and decimal128
stats-decoding cases.
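
Payloads for cases like these can be produced by truncating a value's two's-complement bits to a chosen width. The helper below is a hypothetical sketch of that idea and is not the helper used by the actual test; it assumes `1 <= width <= 8` and that `value` fits in `width` bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical test helper: big-endian two's-complement bytes of `value`,
// truncated to `width` bytes (most significant byte first).
inline std::vector<std::uint8_t> encode_big_endian(std::int64_t value, std::size_t width)
{
  auto const bits = static_cast<std::uint64_t>(value);
  std::vector<std::uint8_t> out(width);
  for (std::size_t i = 0; i < width; ++i) {
    out[i] = static_cast<std::uint8_t>(bits >> (8 * (width - 1 - i)));
  }
  return out;
}
```

Encoding -1 with width 1 gives the short-width negative case `{0xFF}`, and encoding 300 with width 2 gives the positive case `{0x01, 0x2C}`.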

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-05-06T22:56:31Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…uet-variable-width-decimal-stats

Remove incorrect `typename` keyword before `std::conditional_t` in stats_filter_helpers.hpp. The `_t` suffix indicates the type alias already produces a type, so `typename` is not needed or appropriate here. This fixes the clang-tidy modernize-type-traits warning, which the cpp-linters build treats as an error.

Co-authored-by: Cursor <[email protected]>
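
The distinction in this commit message can be shown with a small sketch. The `storage_for` trait below is hypothetical and does not appear in cuDF; it only contrasts the two spellings of the same type.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical trait: pick a storage type based on a byte width.
template <std::size_t Bytes>
struct storage_for {
  // Older spelling: `::type` is a dependent name, so `typename` is traditionally
  // written here (C++20 makes it optional in some contexts).
  using verbose = typename std::conditional<(Bytes <= 4), std::int32_t, std::int64_t>::type;

  // `std::conditional_t` is already an alias for that `::type`, so no `typename` is needed.
  using concise = std::conditional_t<(Bytes <= 4), std::int32_t, std::int64_t>;

  static_assert(std::is_same_v<verbose, concise>, "both spellings name the same type");
};
```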

mhaseeb123 · 2026-05-07T17:36:36Z

/ok to test 34cb61e

coderabbitai · 2026-05-07T17:37:31Z

📝 Walkthrough

Summary by CodeRabbit

Bug Fixes
- Improved handling of decimal value decoding when reading Parquet files with byte array and fixed-length byte array types, including proper sign-extension for signed decimal values.
Tests
- Added test coverage for Parquet statistics decoding of variable-width decimal values across multiple numeric types.

Walkthrough

This PR refactors Parquet statistics decoding to support variable-width decimal types. A new decode_byte_array_decimal helper decodes signed big-endian byte sequences from BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY statistics with sign-extension. The stats casting path now routes chrono types through representation-type decoding and decimals through the new helper. Test coverage validates decoding across multiple target types, including negative values and error cases.

Changes

Parquet Statistics Decimal Decoding

Layer / File(s)	Summary
Standard Library Includes `cpp/src/io/parquet/stats_filter_helpers.hpp`, `cpp/tests/io/parquet_reader_test.cpp`	Added `<bit>`, `<numeric>`, `<span>`, `<string_view>`, `<type_traits>` headers for byte manipulation and type introspection in implementation; included stats filter helpers header in test file.
Byte Array Decimal Decoder `cpp/src/io/parquet/stats_filter_helpers.hpp`	New `decode_byte_array_decimal` static template helper in `stats_caster_base` decodes signed two's-complement decimal payloads from big-endian byte sequences, with sign detection and sign-extension support.
Fixed-Width Decode Adjustment `cpp/src/io/parquet/stats_filter_helpers.hpp`	`decode_fixed_width_value` now copies up to the smaller of the stats size and the destination type size instead of requiring exact size equality.
Stats Casting Integration `cpp/src/io/parquet/stats_filter_helpers.hpp`	Refactored `convert` method's BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY branches: chrono types route through rep-type fixed-width decoding; decimals route through `decode_byte_array_decimal`; retained TODO for UUID support.
Test Coverage `cpp/tests/io/parquet_reader_test.cpp`	Added `ParquetStatsDecoder` test helper with templated `decode` method; implemented `DecodeVariableWidthDecimalStats` test validating decoding for int32, int64, and decimal128 types, including negative values and exception cases for invalid input size.
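
The "Fixed-Width Decode Adjustment" row above describes a bounded copy. A minimal sketch of that idea follows, assuming a plain `memcpy` into a zero-initialized value; `copy_fixed_width` is a hypothetical name and not cuDF's `decode_fixed_width_value`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a fixed-width decode: copy at most sizeof(T) bytes
// from the stats payload into a zero-initialized T.
template <typename T>
T copy_fixed_width(std::uint8_t const* stats_val, std::size_t stats_size)
{
  T value{};
  // Never read past the payload and never write past the destination.
  std::memcpy(&value, stats_val, std::min(stats_size, sizeof(T)));
  return value;
}
```

Unlike the byte-array decimal path, a copy like this performs no byte reordering and no sign extension, so bytes beyond the payload simply stay zero.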

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: handling sign-extension in Parquet decimal stats decoding, which is the core purpose of the PR.
Description check	✅ Passed	The description thoroughly explains the bug, the root cause, the solution, affected metadata, and test coverage, directly relating to all changes in the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


mhaseeb123 · 2026-05-07T17:38:11Z

/ok to test 1822c22

mhaseeb123 · 2026-05-07T17:39:09Z

@coderabbitai help

coderabbitai · 2026-05-07T17:39:52Z

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai rate limit to show your current review rate limit status.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
@coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
@coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai generate unit tests to generate unit tests for this PR.
@coderabbitai resolve merge conflict to automatically resolve merge conflicts.
@coderabbitai autofix to automatically fix issues identified in unresolved review comments.
@coderabbitai resolve to resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

Visit our status page to check the current availability of CodeRabbit.
Create a ticket on our support page for assistance with any issues or questions.
Visit our documentation site for detailed information on how to use CodeRabbit.
Join our Discord community to connect with other users and get help from the community.
Follow us on X/Twitter for updates and announcements.

mhaseeb123 · 2026-05-07T17:40:22Z

@coderabbitai full review

coderabbitai · 2026-05-07T17:41:14Z

✅ Actions performed

Full review triggered.

coderabbitai

🧹 Nitpick comments (2)

cpp/src/io/parquet/stats_filter_helpers.hpp (2)
144-167: 💤 Low value

Dead [[fallthrough]] after exhaustive returning branches.

Both if constexpr arms of the BYTE_ARRAY/FIXED_LEN_BYTE_ARRAY case return unconditionally (chrono → return; else → both fixed_point and non-fixed_point sub-branches return). The trailing [[fallthrough]] at Line 163 is therefore unreachable today. It only becomes meaningful once the UUID TODO is implemented as a non-returning branch. Consider removing it now and re-introducing it together with the UUID handling, or convert it into an explicit comment so static analyzers don't flag unreachable code.
Suggested cleanup
       case Type::BYTE_ARRAY: [[fallthrough]];
       case Type::FIXED_LEN_BYTE_ARRAY:
-        // Handle chronos, decimals and UUIDs here
+        // Handle chronos and decimals here; UUID support is pending (see TODO below)
         if constexpr (cudf::is_chrono<T>()) {
           ...
         } else {
           ...
         }
         // TODO(mh): add support for `UUID` (big-endian but no sign extension) here
-        [[fallthrough]];
       default:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/io/parquet/stats_filter_helpers.hpp` around lines 144 - 167, The
trailing [[fallthrough]] after the Type::BYTE_ARRAY / Type::FIXED_LEN_BYTE_ARRAY
handling is dead because every branch inside (cudf::is_chrono<T>(),
cudf::is_fixed_point<T>(), and the non-fixed_point branch) returns; remove the
unreachable [[fallthrough]] (or replace it with an explanatory comment) from the
switch in stats_filter_helpers.hpp where the case for Type::BYTE_ARRAY and
Type::FIXED_LEN_BYTE_ARRAY is implemented so static analyzers won't flag
unreachable code; re-introduce a fallthrough only when implementing the UUID
path as a non-returning branch.
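
The control-flow shape behind this finding can be reproduced in isolation. The sketch below uses a hypothetical `physical_type` enum and `describe` function, neither of which is cuDF's; it shows one `[[fallthrough]]` that is reachable and the spot where a second one could never execute.

```cpp
#include <string_view>
#include <type_traits>

// Hypothetical enum standing in for the Parquet physical type.
enum class physical_type { INT32, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY };

template <typename T>
std::string_view describe(physical_type t)
{
  switch (t) {
    case physical_type::BYTE_ARRAY: [[fallthrough]];  // reachable: falls into the next label
    case physical_type::FIXED_LEN_BYTE_ARRAY:
      if constexpr (std::is_integral_v<T>) {
        return "decimal";
      } else {
        return "other";
      }
      // A [[fallthrough]] placed here could never execute: both arms above return.
    default: return "unsupported";
  }
}
```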
45-71: 💤 Low value

Constraint admits unsigned T while sign-extension assumes signed semantics.

The requires-clause permits any non-boolean integral, including unsigned types. If this helper is ever invoked with an unsigned T and a payload whose first byte has its high bit set (e.g., a 1-byte payload 0x80 decoded into uint16_t), the function will OR in ~UnsignedT{0} << (stats_size * CHAR_BIT) and produce a value that does not represent the original unsigned magnitude. Today this isn't exercised because Parquet decimal payloads are always signed two's-complement and current callers route through T::rep (signed) or signed Ts in tests. As a defensive measure, consider tightening the constraint to cudf::is_signed<T>() (or asserting it) so future misuse fails at compile time rather than silently producing surprising values.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/io/parquet/stats_filter_helpers.hpp` around lines 45 - 71, The
template decode_byte_array_decimal currently allows any integral T (via
requires(cudf::is_integral<T>() and !cudf::is_boolean<T>())) but the
sign-extension logic assumes T is signed; tighten the constraint to require a
signed type (e.g., replace the requires-clause with cudf::is_signed<T>() or add
a static_assert(cudf::is_signed<T>()) at the top of
decode_byte_array_decimal) so misuse with unsigned T is a compile-time error,
keeping the existing UnsignedT, is_negative_value and sign-extension logic
intact.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e8e22625-e02b-4789-bcf0-d1aa8aec7bd5

📥 Commits

Reviewing files that changed from the base of the PR and between 65df106 and 1822c22.

📒 Files selected for processing (2)

cpp/src/io/parquet/stats_filter_helpers.hpp
cpp/tests/io/parquet_reader_test.cpp

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/io/parquet/stats_filter_helpers.hpp`:
- Around line 45-47: Add explicit static_asserts inside the template function
decode_byte_array_decimal<T> to enforce the same constraints as the
requires-clause: assert that T is an integral type and not bool (e.g., using
cudf::is_integral<T>() && !cudf::is_boolean<T>()), and assert that T is a signed
type (e.g., std::is_signed_v<T>), with clear error messages indicating the
requirement for an integral non-boolean signed type because byte-array decimal
stats require signed two's-complement storage; place these static_asserts at the
top of the decode_byte_array_decimal function so misuse fails at compile time
with the provided messages.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5871785e-468a-4ff1-93c4-88a5251ea53f

📥 Commits

Reviewing files that changed from the base of the PR and between 1822c22 and 45a2a51.

📒 Files selected for processing (1)

cpp/src/io/parquet/stats_filter_helpers.hpp

coderabbitai · 2026-05-07T19:12:43Z

+  template <typename T>
+  static inline T decode_byte_array_decimal(uint8_t const* stats_val, size_t stats_size)
+    requires(cudf::is_integral<T>() and !cudf::is_boolean<T>())


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether template helper currently lacks static_assert guards.
rg -n -C2 'decode_byte_array_decimal|static_assert' cpp/src/io/parquet/stats_filter_helpers.hpp

Repository: rapidsai/cudf

Length of output: 672

Add explicit static_assert to enforce type constraints per coding guidelines.

The template decode_byte_array_decimal lacks explicit static_assert statements. While the requires clause constrains instantiation, the coding guidelines for C++ templates require clear static_assert messages to prevent misuse. Add assertions for:

Integral non-boolean type requirement

Signed type requirement (two's-complement decimal stats require signed storage)

Suggested patch

   template <typename T>
   static inline T decode_byte_array_decimal(uint8_t const* stats_val, size_t stats_size)
     requires(cudf::is_integral<T>() and !cudf::is_boolean<T>())
   {
+    static_assert(std::is_integral_v<T> && !std::is_same_v<T, bool>,
+                  "decode_byte_array_decimal requires a non-boolean integral type");
+    static_assert(std::is_signed_v<T>,
+                  "decode_byte_array_decimal expects a signed storage type for two's-complement decimal stats");
     CUDF_EXPECTS(stats_size > 0, "Parquet reader encountered an empty decimal statistics vector");


mhaseeb123 · May 7, 2026

@coderabbitai We check for sign before running the two's complement so this is safe as is, isn't it?

@mhaseeb123, you're correct that the function itself is free of undefined behavior — accumulation is performed in UnsignedT to avoid signed overflow, the sign check examines the raw bytes, and std::bit_cast<T> reinterprets bits cleanly. The first static_assert I suggested (integral non-boolean) is indeed redundant with the requires clause, so that part can be dropped.

The std::is_signed_v<T> assertion is the only one with a remaining argument: the requires clause technically permits unsigned integral types (e.g., uint32_t). If such a T were ever passed, the function wouldn't crash or invoke UB, but a negative Parquet decimal value would be decoded as a large positive unsigned number — semantically wrong for predicate pushdown. In practice all call sites supply signed rep types (int32_t, int64_t, __int128_t), so the risk is theoretical.

Given that, I'll withdraw the suggestion — the current requires clause is sufficient for the actual usage, and adding a static_assert for signedness would be purely defensive. Your reasoning holds.

mhaseeb123 · 2026-05-07T19:18:25Z

/ok to test ad22dea

fix(parquet): Decode variable-width decimal stats

b53ad1c

pramodsatya requested a review from a team as a code owner May 6, 2026 22:56

pramodsatya requested review from mhaseeb123 and wence- May 6, 2026 22:56

github-actions Bot assigned pramodsatya May 6, 2026

github-actions Bot added the libcudf Affects libcudf (C++/CUDA) code. label May 6, 2026

pramodsatya and others added 2 commits May 6, 2026 23:08

fix formatting

0c783e6

Merge branch 'main' into fix/parquet-variable-width-decimal-stats

1733646

github-actions Bot assigned mhaseeb123 May 7, 2026

mhaseeb123 added bug Something isn't working 3 - Ready for Review Ready for review by team cuIO cuIO issue non-breaking Non-breaking change labels May 7, 2026

mhaseeb123 changed the title ~~fix(parquet): Decode variable-width decimal stats~~ Handle sign-extension when decoding variable-width decimal stats May 7, 2026

mhaseeb123 changed the title ~~Handle sign-extension when decoding variable-width decimal stats~~ Handle sign-extensions while decoding Parquet decimal stats May 7, 2026

mhaseeb123 changed the title ~~Handle sign-extensions while decoding Parquet decimal stats~~ Handle sign-extension while decoding Parquet decimal stats May 7, 2026

Minor improvements

67a958c

pramodsatya mentioned this pull request May 7, 2026

fix(cudf): Apply decimal Parquet filters post-read facebookincubator/velox#17432

Open

Minor improvements

6de5e2e

mhaseeb123 approved these changes May 7, 2026


mhaseeb123 added 4 - Needs Review Waiting for reviewer to review or respond and removed 3 - Ready for Review Ready for review by team labels May 7, 2026

Merge branch 'main' of https://github.com/rapidsai/cudf into fix/parq…

1f66eb8

…uet-variable-width-decimal-stats

mhaseeb123 added this to libcudf May 7, 2026

mhaseeb123 moved this to Burndown in libcudf May 7, 2026

Update .gitignore

4053377

Update .gitignore

1822c22

rapidsai deleted a comment from coderabbitai Bot May 7, 2026

coderabbitai Bot reviewed May 7, 2026


vuule self-requested a review May 7, 2026 19:00

mhaseeb123 added 3 commits May 7, 2026 19:05

Add nolint to the std::conditional_t

32a0f05

Apply coderabbit suggestions

83b2f26

Merge branch 'main' into fix/parquet-variable-width-decimal-stats

45a2a51

coderabbitai Bot reviewed May 7, 2026


Merge remote-tracking branch 'upstream/main' into _babysit22402

ad22dea
