[FEA] Add a custom decoder to convert Parquet dictionary pages into `cuco::static_set` objects #18046

**Is your feature request related to a problem? Please describe.**
We would like to evaluate filter predicates against the keys present in each Parquet column chunk's dictionary page. To do this, we need an API that decodes the dictionary page of each (filter) column chunk in each row group and materializes a corresponding `cuco::static_set_ref`.

See story issue #17896 for more context about row group filtering with dictionary pages. Also see `dictionary_page_filter.cu` and `hybrid_scan_helpers.hpp` in draft PR #18011 for some of the related infrastructure for applying predicates to dictionary key data.

**Describe the solution you'd like**
We would like an implementation of the `aggregate_reader_metadata::materialize_dictionaries` API. This API is declared in `hybrid_scan_helpers.hpp` and has a placeholder implementation in `dictionary_page_filter.cu` in draft PR #18011, which can be replaced with the actual implementation. Here is the API signature, but please feel free to modify it as needed:

```cpp
  /**
   * @brief Materializes column chunk dictionary pages into `cuco::static_set`s
   *
   * @param dictionary_page_data Raw dictionary page data device buffers for each input row group
   * @param input_row_group_indices Lists of input row group indices, one per source
   * @param output_dtypes Datatypes of output (aka filter) columns
   * @param dictionary_col_schemas Schema indices of filter columns
   * @param stream CUDA stream used for device memory operations and kernel launches
   *
   * @return A flattened list of `cuco::static_set_ref` device buffers for each predicate column
   * across row groups
   */
  [[nodiscard]] std::vector<rmm::device_buffer> materialize_dictionaries(
    cudf::host_span<rmm::device_buffer> dictionary_page_data,
    host_span<std::vector<size_type> const> input_row_group_indices,
    host_span<data_type const> output_dtypes,
    host_span<int const> dictionary_col_schemas,
    rmm::cuda_stream_view stream) const;
```

The API has these inputs and outputs:

Inputs
* Span of device buffers (one per input column per row group in the file) containing decompressed raw Parquet dictionary page bytes. The API must deduce page headers from these raw bytes.
* Schema indices and data types of filter columns (still under question)

Output
* Vector of device buffers holding `cuco::static_set_ref` bytes for later probing according to the filter predicate
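
To make the decode step concrete, here is a minimal host-side sketch of turning the payload of one dictionary page into a set of keys. It is not the proposed implementation: it assumes the page header has already been parsed (so `num_values` is known), that the column's physical type is INT32, and that the page uses PLAIN encoding. `decode_plain_int32_dictionary` is a hypothetical name, and `std::unordered_set` only stands in for the device-side `cuco::static_set`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <unordered_set>
#include <vector>

// Hypothetical helper: decodes the payload of a PLAIN-encoded INT32 dictionary
// page (page header already stripped) into a host-side set of keys.
// `std::unordered_set` is only a stand-in for the device-side `cuco::static_set`.
inline std::unordered_set<std::int32_t> decode_plain_int32_dictionary(
  std::vector<std::uint8_t> const& payload, std::size_t num_values)
{
  if (payload.size() < num_values * sizeof(std::int32_t)) {
    throw std::invalid_argument("payload is shorter than num_values implies");
  }

  std::unordered_set<std::int32_t> keys;
  keys.reserve(num_values);

  for (std::size_t i = 0; i < num_values; ++i) {
    // PLAIN encoding stores INT32 values back to back as 4-byte little-endian integers.
    std::uint8_t const* p = payload.data() + i * sizeof(std::int32_t);
    std::uint32_t const bits = std::uint32_t{p[0]} | (std::uint32_t{p[1]} << 8) |
                               (std::uint32_t{p[2]} << 16) | (std::uint32_t{p[3]} << 24);
    std::int32_t value;
    std::memcpy(&value, &bits, sizeof(value));  // reinterpret the bits as a signed value
    keys.insert(value);
  }

  // A set can never hold more keys than the number of values decoded.
  assert(keys.size() <= num_values);
  return keys;
}
```

A real implementation would run on the device, handle every supported physical type, and parse the Thrift-encoded page header from the raw bytes instead of taking `num_values` as an argument.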

**Additional context**
The `materialize_dictionaries` API is called from the `filter_row_groups_with_dictionary_pages` API in `hybrid_scan_helpers.cpp` in draft PR #18011.

The output dictionaries will be used by the `apply_dictionary_filter` API (to be implemented soon by @mhaseeb123) to query the dictionaries and prune row groups.
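
As a rough illustration of how materialized dictionaries could drive pruning, the sketch below evaluates an equality predicate (`col == literal`) against one dictionary per row group. It is not the planned `apply_dictionary_filter`: `keep_row_groups_for_equality` is a hypothetical name, `std::unordered_set` stands in for `cuco::static_set_ref`, and the logic assumes every data page in each column chunk is dictionary-encoded, so a key missing from the dictionary cannot appear in the chunk.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical host-side sketch: for the predicate `col == literal`, a row group
// is kept only if the literal is present in that row group's dictionary.
// Assumes all data pages in each column chunk are dictionary-encoded.
inline std::vector<bool> keep_row_groups_for_equality(
  std::vector<std::unordered_set<std::int32_t>> const& dictionaries, std::int32_t literal)
{
  std::vector<bool> keep;
  keep.reserve(dictionaries.size());
  for (auto const& dict : dictionaries) {
    // Absent key: no row in this row group can satisfy the predicate.
    keep.push_back(dict.count(literal) > 0);
  }
  assert(keep.size() == dictionaries.size());  // one decision per row group
  return keep;
}
```

Other predicates need different probes. For example, `col != literal` can prune a row group only when its dictionary contains the literal and nothing else.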
