Handle sign-extension while decoding Parquet decimal stats #22402

Open
pramodsatya wants to merge 13 commits into rapidsai:main from pramodsatya:fix/parquet-variable-width-decimal-stats

Conversation

@pramodsatya

@pramodsatya pramodsatya commented May 6, 2026

Description

Parquet decimal statistics for physical BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY
columns are signed two's-complement unscaled integers stored in big-endian byte
order. cuDF's Parquet stats filter only handled the full-width decimal128
case correctly, and it did not widen shorter stats payloads into the selected
cuDF decimal storage width with sign extension.
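To make the "shorter stats payload" case concrete, here is a minimal sketch of how a writer might produce such a payload. It assumes a writer that emits the shortest big-endian two's-complement encoding of the unscaled value; `encode_be_minimal` is a hypothetical helper written for this illustration and is not part of cuDF or of this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical writer-side helper, for illustration only (not cuDF code).
// Encodes an unscaled decimal value as big-endian two's complement using
// the fewest bytes that still preserve the sign.
inline std::vector<std::uint8_t> encode_be_minimal(std::int64_t unscaled)
{
  std::uint64_t bits = 0;
  std::memcpy(&bits, &unscaled, sizeof(bits));  // copy the two's-complement bit pattern

  // Emit all eight bytes, most significant first.
  std::vector<std::uint8_t> out(sizeof(bits));
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = static_cast<std::uint8_t>((bits >> (56 - 8 * i)) & 0xFFu);
  }

  // Drop leading bytes that only repeat the sign bit of the byte that follows.
  std::size_t start = 0;
  while (start + 1 < out.size()) {
    bool const next_negative = (out[start + 1] & 0x80u) != 0;
    bool const redundant     = (out[start] == 0x00 && !next_negative) ||
                           (out[start] == 0xFF && next_negative);
    if (!redundant) { break; }
    ++start;
  }
  out.erase(out.begin(), out.begin() + static_cast<std::ptrdiff_t>(start));
  return out;
}
```

Under these assumptions, an unscaled value of -123 becomes the single byte 0x85, so a reader targeting a 64-bit decimal storage type has to widen it with sign extension to recover -123 rather than 133.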

That can make predicate pushdown compare incorrect row-group or page min/max
values for decimal columns. The affected metadata is:

  • row-group ColumnMetaData.statistics.min_value / max_value
  • row-group deprecated ColumnMetaData.statistics.min / max
  • page-index ColumnIndex.min_values / max_values

This change replaces the decimal128-specific byte-array stats decoder with a
templated decoder used for integral decimal storage representations. The new
decoder validates the stats payload size, accumulates bytes in Parquet's
big-endian order, and sign-extends negative values when the Parquet stats
payload is narrower than the cuDF storage type. Both the row-group and
page-index stats paths share this conversion helper, so the fix applies to both
levels of Parquet predicate pushdown.
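The three steps described above (size validation, big-endian accumulation, sign extension) can be sketched roughly as follows. This is a standalone approximation and not the code in `stats_filter_helpers.hpp`: the name `decode_be_decimal`, the use of `std::invalid_argument` instead of `CUDF_EXPECTS`, and the choice to reject payloads wider than the target type are all assumptions made for this example.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>

// Hypothetical stand-in for the decoding helper; names and error handling are assumptions.
template <typename T>
T decode_be_decimal(std::uint8_t const* bytes, std::size_t size)
{
  static_assert(std::is_integral_v<T> && std::is_signed_v<T>,
                "decimal stats decode into a signed integral storage type");
  using U = std::make_unsigned_t<T>;

  // Step 1: validate the payload size (assumed policy: non-empty, no wider than T).
  if (bytes == nullptr || size == 0 || size > sizeof(T)) {
    throw std::invalid_argument("unsupported decimal statistics payload size");
  }

  // Step 2: accumulate most-significant byte first, in unsigned arithmetic
  // so that shifting never overflows a signed type.
  U acc = 0;
  for (std::size_t i = 0; i < size; ++i) {
    acc = static_cast<U>((acc << CHAR_BIT) | bytes[i]);
  }

  // Step 3: if the payload is negative and narrower than T, fill the
  // missing high bytes with ones (sign extension).
  if ((bytes[0] & 0x80u) != 0 && size < sizeof(T)) {
    U const all_ones = static_cast<U>(~U{0});
    acc              = static_cast<U>(acc | (all_ones << (size * CHAR_BIT)));
  }

  // Reinterpret the accumulated bits as the signed storage type.
  T value;
  std::memcpy(&value, &acc, sizeof(value));
  return value;
}
```

For example, the two-byte payload `{0xFF, 0x85}` decodes to -123 when the target is a 32-bit storage type, whereas zero-extending the same bytes would give 65413.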

Added ParquetReaderTest.DecimalStatsFilterVariableWidthByteArrayStats to
cover positive, negative, short-width, full-width, BYTE_ARRAY,
FIXED_LEN_BYTE_ARRAY, decimal32, decimal64, and decimal128
stats-decoding cases.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@pramodsatya pramodsatya requested a review from a team as a code owner May 6, 2026 22:56
@pramodsatya pramodsatya requested review from mhaseeb123 and wence- May 6, 2026 22:56
@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the libcudf (Affects libcudf (C++/CUDA) code.) label May 6, 2026
@mhaseeb123 mhaseeb123 added the bug (Something isn't working), 3 - Ready for Review (Ready for review by team), cuIO (cuIO issue), and non-breaking (Non-breaking change) labels May 7, 2026
@mhaseeb123 mhaseeb123 changed the title fix(parquet): Decode variable-width decimal stats Handle sign-extension when decoding variable-width decimal stats May 7, 2026
@mhaseeb123 mhaseeb123 changed the title Handle sign-extension when decoding variable-width decimal stats Handle sign-extensions while decoding Parquet decimal stats May 7, 2026
@mhaseeb123 mhaseeb123 changed the title Handle sign-extensions while decoding Parquet decimal stats Handle sign-extension while decoding Parquet decimal stats May 7, 2026
@mhaseeb123 mhaseeb123 added the 4 - Needs Review (Waiting for reviewer to review or respond) label and removed the 3 - Ready for Review (Ready for review by team) label May 7, 2026
@mhaseeb123 mhaseeb123 moved this to Burndown in libcudf May 7, 2026
Remove incorrect `typename` keyword before `std::conditional_t` in
stats_filter_helpers.hpp. The `_t` suffix indicates the type alias already
produces a type, so `typename` is not needed or appropriate here.

This fixes the clang-tidy modernize-type-traits error that was treating
the warning as an error in the cpp-linters build.

Co-authored-by: Cursor <[email protected]>
@mhaseeb123
Member

/ok to test 34cb61e

@coderabbitai

coderabbitai Bot commented May 7, 2026

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling of decimal value decoding when reading Parquet files with byte array and fixed-length byte array types, including proper sign-extension for signed decimal values.
  • Tests

    • Added test coverage for Parquet statistics decoding of variable-width decimal values across multiple numeric types.

Walkthrough

This PR refactors Parquet statistics decoding to support variable-width decimal types. A new decode_byte_array_decimal helper decodes signed big-endian byte sequences from BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY statistics with sign-extension. The stats casting path now dispatches chronos via representation-type decoding and decimals via the new helper. Test coverage validates decoding across multiple target types including negative values and error cases.

Changes

Parquet Statistics Decimal Decoding

| Layer / File(s) | Summary |
|---|---|
| Standard Library Includes: cpp/src/io/parquet/stats_filter_helpers.hpp, cpp/tests/io/parquet_reader_test.cpp | Added <bit>, <numeric>, <span>, <string_view>, <type_traits> headers for byte manipulation and type introspection in implementation; included stats filter helpers header in test file. |
| Byte Array Decimal Decoder: cpp/src/io/parquet/stats_filter_helpers.hpp | New decode_byte_array_decimal static template helper in stats_caster_base decodes signed two's-complement decimal payloads from big-endian byte sequences, with sign detection and sign-extension support. |
| Fixed-Width Decode Adjustment: cpp/src/io/parquet/stats_filter_helpers.hpp | decode_fixed_width_value now copies up to the smaller of the stats size and the destination type size instead of requiring exact size equality. |
| Stats Casting Integration: cpp/src/io/parquet/stats_filter_helpers.hpp | Refactored convert method's BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY branches: chrono route through rep-type fixed-width decoding; decimals route through decode_byte_array_decimal; retained TODO for UUID support. |
| Test Coverage: cpp/tests/io/parquet_reader_test.cpp | Added ParquetStatsDecoder test helper with templated decode method; implemented DecodeVariableWidthDecimalStats test validating decoding for int32, int64, and decimal128 types, including negative values and exception cases for invalid input size. |

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: handling sign-extension in Parquet decimal stats decoding, which is the core purpose of the PR. |
| Description check | ✅ Passed | The description thoroughly explains the bug, the root cause, the solution, affected metadata, and test coverage, directly relating to all changes in the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@mhaseeb123
Member

/ok to test 1822c22

@mhaseeb123
Member

@coderabbitai help

@coderabbitai

coderabbitai Bot commented May 7, 2026

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai rate limit to show your current review rate limit status.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
  • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
  • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve merge conflict to automatically resolve merge conflicts.
  • @coderabbitai autofix to automatically fix issues identified in unresolved review comments.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

@mhaseeb123
Member

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 7, 2026

✅ Actions performed

Full review triggered.

@rapidsai rapidsai deleted a comment from coderabbitai Bot May 7, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
cpp/src/io/parquet/stats_filter_helpers.hpp (2)

144-167: 💤 Low value

Dead [[fallthrough]] after exhaustive returning branches.

Both if constexpr arms of the BYTE_ARRAY/FIXED_LEN_BYTE_ARRAY case return unconditionally (chrono → return; else → both fixed_point and non-fixed_point sub-branches return). The trailing [[fallthrough]] at Line 163 is therefore unreachable today. It only becomes meaningful once the UUID TODO is implemented as a non-returning branch. Consider removing it now and re-introducing it together with the UUID handling, or convert it into an explicit comment so static analyzers don't flag unreachable code.

Suggested cleanup
       case Type::BYTE_ARRAY: [[fallthrough]];
       case Type::FIXED_LEN_BYTE_ARRAY:
-        // Handle chronos, decimals and UUIDs here
+        // Handle chronos and decimals here; UUID support is pending (see TODO below)
         if constexpr (cudf::is_chrono<T>()) {
           ...
         } else {
           ...
         }
         // TODO(mh): add support for `UUID` (big-endian but no sign extension) here
-        [[fallthrough]];
       default:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/io/parquet/stats_filter_helpers.hpp` around lines 144 - 167, The
trailing [[fallthrough]] after the Type::BYTE_ARRAY / Type::FIXED_LEN_BYTE_ARRAY
handling is dead because every branch inside (cudf::is_chrono<T>(),
cudf::is_fixed_point<T>(), and the non-fixed_point branch) returns; remove the
unreachable [[fallthrough]] (or replace it with an explanatory comment) from the
switch in stats_filter_helpers.hpp where the case for Type::BYTE_ARRAY and
Type::FIXED_LEN_BYTE_ARRAY is implemented so static analyzers won't flag
unreachable code; re-introduce a fallthrough only when implementing the UUID
path as a non-returning branch.

45-71: 💤 Low value

Constraint admits unsigned T while sign-extension assumes signed semantics.

The requires-clause permits any non-boolean integral, including unsigned types. If this helper is ever invoked with an unsigned T and a payload whose first byte has its high bit set (e.g., a 1-byte payload 0x80 decoded into uint16_t), the function will OR in ~UnsignedT{0} << (stats_size * CHAR_BIT) and produce a value that does not represent the original unsigned magnitude. Today this isn't exercised because Parquet decimal payloads are always signed two's-complement and current callers route through T::rep (signed) or signed Ts in tests. As a defensive measure, consider tightening the constraint to cudf::is_signed<T>() (or asserting it) so future misuse fails at compile time rather than silently producing surprising values.
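The hazard described in this comment can be reproduced with a small standalone sketch. `decode_unconstrained` below is a hypothetical, deliberately unconstrained re-creation of the accumulate-then-sign-extend logic, written only for this illustration (it is not the code under review), and `decode_signed_only` shows one assumed way to reject unsigned targets at compile time.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical unconstrained variant: accepts signed and unsigned T alike.
// Precondition (assumed): bytes points to `size` bytes, with 1 <= size <= sizeof(T).
template <typename T>
T decode_unconstrained(std::uint8_t const* bytes, std::size_t size)
{
  using U = std::make_unsigned_t<T>;
  U acc   = 0;
  for (std::size_t i = 0; i < size; ++i) {
    acc = static_cast<U>((acc << CHAR_BIT) | bytes[i]);
  }
  // Sign extension is applied purely from the payload's high bit,
  // regardless of whether T can represent negative values.
  if ((bytes[0] & 0x80u) != 0 && size < sizeof(T)) {
    U const all_ones = static_cast<U>(~U{0});
    acc              = static_cast<U>(acc | (all_ones << (size * CHAR_BIT)));
  }
  T value;
  std::memcpy(&value, &acc, sizeof(value));
  return value;
}

// One possible guard: refuse unsigned targets at compile time.
template <typename T>
T decode_signed_only(std::uint8_t const* bytes, std::size_t size)
{
  static_assert(std::is_signed_v<T>, "decimal stats decode into a signed storage type");
  return decode_unconstrained<T>(bytes, size);
}
```

With the one-byte payload 0x80, the unconstrained variant yields 0xFF80 (65408) for `std::uint16_t` instead of 128, while `std::int16_t` yields the expected -128; instantiating `decode_signed_only<std::uint16_t>` would fail to compile.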

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/io/parquet/stats_filter_helpers.hpp` around lines 45 - 71, The
template decode_byte_array_decimal currently allows any integral T (via
requires(cudf::is_integral<T>() and !cudf::is_boolean<T>())) but the
sign-extension logic assumes T is signed; tighten the constraint to require a
signed type (e.g., replace the requires-clause with cudf::is_signed<T>() or add
a static_assert(cudf::is_signed<T>::value) at the top of
decode_byte_array_decimal) so misuse with unsigned T is a compile-time error,
keeping the existing UnsignedT, is_negative_value and sign-extension logic
intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e8e22625-e02b-4789-bcf0-d1aa8aec7bd5

📥 Commits

Reviewing files that changed from the base of the PR and between 65df106 and 1822c22.

📒 Files selected for processing (2)
  • cpp/src/io/parquet/stats_filter_helpers.hpp
  • cpp/tests/io/parquet_reader_test.cpp

@vuule vuule self-requested a review May 7, 2026 19:00

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/io/parquet/stats_filter_helpers.hpp`:
- Around line 45-47: Add explicit static_asserts inside the template function
decode_byte_array_decimal<T> to enforce the same constraints as the
requires-clause: assert that T is an integral type and not bool (e.g., using
cudf::is_integral<T>() && !cudf::is_boolean<T>()), and assert that T is a signed
type (e.g., std::is_signed_v<T>), with clear error messages indicating the
requirement for an integral non-boolean signed type because byte-array decimal
stats require signed two's-complement storage; place these static_asserts at the
top of the decode_byte_array_decimal function so misuse fails at compile time
with the provided messages.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5871785e-468a-4ff1-93c4-88a5251ea53f

📥 Commits

Reviewing files that changed from the base of the PR and between 1822c22 and 45a2a51.

📒 Files selected for processing (1)
  • cpp/src/io/parquet/stats_filter_helpers.hpp

Comment on lines +45 to +47
template <typename T>
static inline T decode_byte_array_decimal(uint8_t const* stats_val, size_t stats_size)
requires(cudf::is_integral<T>() and !cudf::is_boolean<T>())

@coderabbitai coderabbitai Bot May 7, 2026

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether template helper currently lacks static_assert guards.
rg -n -C2 'decode_byte_array_decimal|static_assert' cpp/src/io/parquet/stats_filter_helpers.hpp

Repository: rapidsai/cudf

Length of output: 672


Add explicit static_assert to enforce type constraints per coding guidelines.

The template decode_byte_array_decimal lacks explicit static_assert statements. While the requires clause constrains instantiation, the coding guidelines for C++ templates require clear static_assert messages to prevent misuse. Add assertions for:

  • Integral non-boolean type requirement
  • Signed type requirement (two's-complement decimal stats require signed storage)
Suggested patch
 template <typename T>
 static inline T decode_byte_array_decimal(uint8_t const* stats_val, size_t stats_size)
   requires(cudf::is_integral<T>() and !cudf::is_boolean<T>())
 {
+  static_assert(std::is_integral_v<T> && !std::is_same_v<T, bool>,
+                "decode_byte_array_decimal requires a non-boolean integral type");
+  static_assert(std::is_signed_v<T>,
+                "decode_byte_array_decimal expects a signed storage type for two's-complement decimal stats");
   CUDF_EXPECTS(stats_size > 0, "Parquet reader encountered an empty decimal statistics vector");

Member

@mhaseeb123 mhaseeb123 May 7, 2026

@coderabbitai We check for sign before running the two's complement so this is safe as is, isn't it?

@mhaseeb123, you're correct that the function itself is free of undefined behavior — accumulation is performed in UnsignedT to avoid signed overflow, the sign check examines the raw bytes, and std::bit_cast<T> reinterprets bits cleanly. The first static_assert I suggested (integral non-boolean) is indeed redundant with the requires clause, so that part can be dropped.

The std::is_signed_v<T> assertion is the only one with a remaining argument: the requires clause technically permits unsigned integral types (e.g., uint32_t). If such a T were ever passed, the function wouldn't crash or invoke UB, but a negative Parquet decimal value would be decoded as a large positive unsigned number — semantically wrong for predicate pushdown. In practice all call sites supply signed rep types (int32_t, int64_t, __int128_t), so the risk is theoretical.

Given that, I'll withdraw the suggestion — the current requires clause is sufficient for the actual usage, and adding a static_assert for signedness would be purely defensive. Your reasoning holds.

@mhaseeb123
Member

/ok to test ad22dea

Labels

  • 4 - Needs Review: Waiting for reviewer to review or respond
  • bug: Something isn't working
  • cuIO: cuIO issue
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change

Projects

Status: Burndown

2 participants