Handle sign-extension while decoding Parquet decimal stats#22402
Handle sign-extension while decoding Parquet decimal stats#22402pramodsatya wants to merge 13 commits intorapidsai:mainfrom
Conversation
…uet-variable-width-decimal-stats
Remove incorrect `typename` keyword before `std::conditional_t` in stats_filter_helpers.hpp. The `_t` suffix indicates the type alias already produces a type, so `typename` is not needed or appropriate here. This fixes the clang-tidy modernize-type-traits error that was treating the warning as an error in the cpp-linters build. Co-authored-by: Cursor <[email protected]>
|
/ok to test 34cb61e |
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR refactors Parquet statistics decoding to support variable-width decimal types. A new decode_byte_array_decimal helper decodes signed big-endian byte sequences from BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY statistics with sign-extension. The stats casting path now dispatches chronos via representation-type decoding and decimals via the new helper. Test coverage validates decoding across multiple target types including negative values and error cases. ChangesParquet Statistics Decimal Decoding
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/ok to test 1822c22 |
|
@coderabbitai help |
ChatThere are 3 ways to chat with CodeRabbit:
CodeRabbit commands
Other keywords and placeholders
Status, support, documentation and community
|
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
cpp/src/io/parquet/stats_filter_helpers.hpp (2)
144-167: 💤 Low valueDead
[[fallthrough]]after exhaustive returning branches.Both
if constexprarms of theBYTE_ARRAY/FIXED_LEN_BYTE_ARRAYcase return unconditionally (chrono → return; else → both fixed_point and non-fixed_point sub-branches return). The trailing[[fallthrough]]at Line 163 is therefore unreachable today. It only becomes meaningful once the UUID TODO is implemented as a non-returning branch. Consider removing it now and re-introducing it together with the UUID handling, or convert it into an explicit comment so static analyzers don't flag unreachable code.Suggested cleanup
case Type::BYTE_ARRAY: [[fallthrough]]; case Type::FIXED_LEN_BYTE_ARRAY: - // Handle chronos, decimals and UUIDs here + // Handle chronos and decimals here; UUID support is pending (see TODO below) if constexpr (cudf::is_chrono<T>()) { ... } else { ... } // TODO(mh): add support for `UUID` (big-endian but no sign extension) here - [[fallthrough]]; default:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/io/parquet/stats_filter_helpers.hpp` around lines 144 - 167, The trailing [[fallthrough]] after the Type::BYTE_ARRAY / Type::FIXED_LEN_BYTE_ARRAY handling is dead because every branch inside (cudf::is_chrono<T>(), cudf::is_fixed_point<T>(), and the non-fixed_point branch) returns; remove the unreachable [[fallthrough]] (or replace it with an explanatory comment) from the switch in stats_filter_helpers.hpp where the case for Type::BYTE_ARRAY and Type::FIXED_LEN_BYTE_ARRAY is implemented so static analyzers won't flag unreachable code; re-introduce a fallthrough only when implementing the UUID path as a non-returning branch.
45-71: 💤 Low valueConstraint admits unsigned
Twhile sign-extension assumes signed semantics.The requires-clause permits any non-boolean integral, including unsigned types. If this helper is ever invoked with an unsigned
Tand a payload whose first byte has its high bit set (e.g., a 1-byte payload0x80decoded intouint16_t), the function will OR in~UnsignedT{0} << (stats_size * CHAR_BIT)and produce a value that does not represent the original unsigned magnitude. Today this isn't exercised because Parquet decimal payloads are always signed two's-complement and current callers route throughT::rep(signed) or signedTs in tests. As a defensive measure, consider tightening the constraint tocudf::is_signed<T>()(or asserting it) so future misuse fails at compile time rather than silently producing surprising values.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/io/parquet/stats_filter_helpers.hpp` around lines 45 - 71, The template decode_byte_array_decimal currently allows any integral T (via requires(cudf::is_integral<T>() and !cudf::is_boolean<T>())) but the sign-extension logic assumes T is signed; tighten the constraint to require a signed type (e.g., replace the requires-clause with cudf::is_signed<T>() or add a static_assert(cudf::is_signed<T>::value) at the top of decode_byte_array_decimal) so misuse with unsigned T is a compile-time error, keeping the existing UnsignedT, is_negative_value and sign-extension logic intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@cpp/src/io/parquet/stats_filter_helpers.hpp`:
- Around line 144-167: The trailing [[fallthrough]] after the Type::BYTE_ARRAY /
Type::FIXED_LEN_BYTE_ARRAY handling is dead because every branch inside
(cudf::is_chrono<T>(), cudf::is_fixed_point<T>(), and the non-fixed_point
branch) returns; remove the unreachable [[fallthrough]] (or replace it with an
explanatory comment) from the switch in stats_filter_helpers.hpp where the case
for Type::BYTE_ARRAY and Type::FIXED_LEN_BYTE_ARRAY is implemented so static
analyzers won't flag unreachable code; re-introduce a fallthrough only when
implementing the UUID path as a non-returning branch.
- Around line 45-71: The template decode_byte_array_decimal currently allows any
integral T (via requires(cudf::is_integral<T>() and !cudf::is_boolean<T>())) but
the sign-extension logic assumes T is signed; tighten the constraint to require
a signed type (e.g., replace the requires-clause with cudf::is_signed<T>() or
add a static_assert(cudf::is_signed<T>::value) at the top of
decode_byte_array_decimal) so misuse with unsigned T is a compile-time error,
keeping the existing UnsignedT, is_negative_value and sign-extension logic
intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e8e22625-e02b-4789-bcf0-d1aa8aec7bd5
📒 Files selected for processing (2)
cpp/src/io/parquet/stats_filter_helpers.hppcpp/tests/io/parquet_reader_test.cpp
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/src/io/parquet/stats_filter_helpers.hpp`:
- Around line 45-47: Add explicit static_asserts inside the template function
decode_byte_array_decimal<T> to enforce the same constraints as the
requires-clause: assert that T is an integral type and not bool (e.g., using
cudf::is_integral<T>() && !cudf::is_boolean<T>()), and assert that T is a signed
type (e.g., std::is_signed_v<T>), with clear error messages indicating the
requirement for an integral non-boolean signed type because byte-array decimal
stats require signed two's-complement storage; place these static_asserts at the
top of the decode_byte_array_decimal function so misuse fails at compile time
with the provided messages.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5871785e-468a-4ff1-93c4-88a5251ea53f
📒 Files selected for processing (1)
cpp/src/io/parquet/stats_filter_helpers.hpp
| template <typename T> | ||
| static inline T decode_byte_array_decimal(uint8_t const* stats_val, size_t stats_size) | ||
| requires(cudf::is_integral<T>() and !cudf::is_boolean<T>()) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify whether template helper currently lacks static_assert guards.
rg -n -C2 'decode_byte_array_decimal|static_assert' cpp/src/io/parquet/stats_filter_helpers.hppRepository: rapidsai/cudf
Length of output: 672
Add explicit static_assert to enforce type constraints per coding guidelines.
The template decode_byte_array_decimal lacks explicit static_assert statements. While the requires clause constrains instantiation, the coding guidelines for C++ templates require clear static_assert messages to prevent misuse. Add assertions for:
- Integral non-boolean type requirement
- Signed type requirement (two's-complement decimal stats require signed storage)
Suggested patch
template <typename T>
static inline T decode_byte_array_decimal(uint8_t const* stats_val, size_t stats_size)
requires(cudf::is_integral<T>() and !cudf::is_boolean<T>())
{
+ static_assert(std::is_integral_v<T> && !std::is_same_v<T, bool>,
+ "decode_byte_array_decimal requires a non-boolean integral type");
+ static_assert(std::is_signed_v<T>,
+ "decode_byte_array_decimal expects a signed storage type for two's-complement decimal stats");
CUDF_EXPECTS(stats_size > 0, "Parquet reader encountered an empty decimal statistics vector");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/io/parquet/stats_filter_helpers.hpp` around lines 45 - 47, Add
explicit static_asserts inside the template function
decode_byte_array_decimal<T> to enforce the same constraints as the
requires-clause: assert that T is an integral type and not bool (e.g., using
cudf::is_integral<T>() && !cudf::is_boolean<T>()), and assert that T is a signed
type (e.g., std::is_signed_v<T>), with clear error messages indicating the
requirement for an integral non-boolean signed type because byte-array decimal
stats require signed two's-complement storage; place these static_asserts at the
top of the decode_byte_array_decimal function so misuse fails at compile time
with the provided messages.
There was a problem hiding this comment.
@coderabbitai We check for sign before running the two's complement so this is safe as is, isn't it?
There was a problem hiding this comment.
@mhaseeb123, you're correct that the function itself is free of undefined behavior — accumulation is performed in UnsignedT to avoid signed overflow, the sign check examines the raw bytes, and std::bit_cast<T> reinterprets bits cleanly. The first static_assert I suggested (integral non-boolean) is indeed redundant with the requires clause, so that part can be dropped.
The std::is_signed_v<T> assertion is the only one with a remaining argument: the requires clause technically permits unsigned integral types (e.g., uint32_t). If such a T were ever passed, the function wouldn't crash or invoke UB, but a negative Parquet decimal value would be decoded as a large positive unsigned number — semantically wrong for predicate pushdown. In practice all call sites supply signed rep types (int32_t, int64_t, __int128_t), so the risk is theoretical.
Given that, I'll withdraw the suggestion — the current requires clause is sufficient for the actual usage, and adding a static_assert for signedness would be purely defensive. Your reasoning holds.
|
/ok to test ad22dea |
Description
Parquet decimal statistics for physical
BYTE_ARRAYandFIXED_LEN_BYTE_ARRAYcolumns are signed two's-complement unscaled integers stored in big-endian byte
order. cuDF's Parquet stats filter only handled the full-width
decimal128case correctly, and it did not widen shorter stats payloads into the selected
cuDF decimal storage width with sign extension.
That can make predicate pushdown compare incorrect row-group or page min/max
values for decimal columns. The affected metadata is:
ColumnMetaData.statistics.min_value/max_valueColumnMetaData.statistics.min/maxColumnIndex.min_values/max_valuesThis change replaces the
decimal128-specific byte-array stats decoder with atemplated decoder used for integral decimal storage representations. The new
decoder validates the stats payload size, accumulates bytes in Parquet's
big-endian order, and sign-extends negative values when the Parquet stats
payload is narrower than the cuDF storage type. Both the row-group and
page-index stats paths share this conversion helper, so the fix applies to both
levels of Parquet predicate pushdown.
Added
ParquetReaderTest.DecimalStatsFilterVariableWidthByteArrayStatstocover positive, negative, short-width, full-width,
BYTE_ARRAY,FIXED_LEN_BYTE_ARRAY,decimal32,decimal64, anddecimal128stats-decoding cases.
Checklist