Description
Is your feature request related to a problem? Please describe.
We would like to evaluate filter predicates based on the keys present in each Parquet column chunk's dictionary page. To do this, we need an API that decodes the dictionary page for each (filter) column chunk in each row group and materializes a corresponding cuco::static_set_ref.
See story issue #17896 for more context about row group filtering with dictionary pages. Also see dictionary_page_filter.cu and hybrid_scan_helpers.hpp in draft PR #18011 for some of the related infrastructure for applying predicates to dictionary key data.
Describe the solution you'd like
We would like an implementation for the aggregate_reader_metadata::materialize_dictionaries API. This API is declared in hybrid_scan_helpers.hpp and has a placeholder implementation in dictionary_page_filter.cu in draft PR #18011, which can be replaced with the actual implementation. Here is the API signature, but please feel free to modify it as needed:
/**
 * @brief Materializes column chunk dictionary pages into `cuco::static_set`s
 *
 * @param dictionary_page_data Raw dictionary page data device buffers for each input row group
 * @param input_row_group_indices Lists of input row group indices, one per source
 * @param total_row_groups Total number of row groups in `input_row_group_indices`
 * @param output_dtypes Datatypes of output (aka filter) columns
 * @param dictionary_col_schemas schema indices of filter columns
 * @param stream CUDA stream used for device memory operations and kernel launches
 *
 * @return A flattened list of `cuco::static_set_ref` device buffers for each predicate column
 * across row groups
 */
[[nodiscard]] std::vector<rmm::device_buffer> materialize_dictionaries(
  cudf::host_span<rmm::device_buffer> dictionary_page_data,
  host_span<std::vector<size_type> const> input_row_group_indices,
  host_span<data_type const> output_dtypes,
  host_span<int const> dictionary_col_schemas,
  rmm::cuda_stream_view stream) const;

The API has these inputs and outputs:
Inputs
- span of device buffers (indexed by each input column and each row group in file) of decompressed raw Parquet dictionary page bytes. The API must deduce page headers from these raw bytes.
- schema indices and data types of filter columns (whether these are needed is still under discussion)
Output
- Span of cuco::static_set_ref bytes for later probing according to the filter predicate
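
As a rough host-side illustration of the materialization step, the sketch below decodes the payload of an INT32 dictionary page into a set of keys. It relies on the Parquet rule that dictionary page values are PLAIN-encoded (4-byte little-endian for INT32). It assumes the Thrift page header has already been parsed and stripped, so the payload and its value count are known, and it uses std::unordered_set as a stand-in for the device-side cuco::static_set. The name decode_int32_dictionary is hypothetical and not part of cudf.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not a cudf API): decodes `num_values` PLAIN-encoded
// INT32 keys (4-byte little-endian each) from a dictionary page payload.
// Assumption: the caller has already parsed and removed the page header.
std::unordered_set<std::int32_t> decode_int32_dictionary(
  std::vector<std::uint8_t> const& payload, std::size_t num_values)
{
  if (payload.size() < num_values * sizeof(std::int32_t)) {
    throw std::invalid_argument("dictionary page payload is too short");
  }

  std::unordered_set<std::int32_t> keys;
  keys.reserve(num_values);

  for (std::size_t i = 0; i < num_values; ++i) {
    std::uint8_t const* p = payload.data() + i * sizeof(std::int32_t);
    // Assemble the little-endian bytes independently of host endianness
    std::uint32_t const bits = static_cast<std::uint32_t>(p[0]) |
                               (static_cast<std::uint32_t>(p[1]) << 8) |
                               (static_cast<std::uint32_t>(p[2]) << 16) |
                               (static_cast<std::uint32_t>(p[3]) << 24);
    // Reinterpret the bit pattern as a signed 32-bit value
    std::int32_t value;
    std::memcpy(&value, &bits, sizeof(value));
    keys.insert(value);
  }
  return keys;
}
```

A device implementation would presumably perform the same decode in a kernel and insert the keys into a cuco::static_set backed by one of the returned device buffers. Other physical types (for example BYTE_ARRAY) need their own decode path.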
Additional context
The materialize_dictionaries API is called from the filter_row_groups_with_dictionary_pages API in hybrid_scan_helpers.cpp in draft PR #18011.
The output dictionaries will be used by the apply_dictionary_filter API (to be implemented soon by @mhaseeb123) to query the dictionaries and prune row groups.
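
To make the intended use of the dictionaries concrete, here is a minimal host-side sketch of pruning row groups for an equality predicate on a single filter column. It assumes one dictionary per row group, uses std::unordered_set in place of cuco::static_set_ref, and assumes every column chunk is fully dictionary-encoded (if a chunk falls back to another encoding, its dictionary may not cover all values and the row group has to be kept). The name prune_row_groups_eq is hypothetical and is not the apply_dictionary_filter API.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not a cudf API): returns the indices of row groups
// that may contain rows satisfying `col == literal`. A row group is dropped
// only when its dictionary does not contain the literal.
std::vector<std::size_t> prune_row_groups_eq(
  std::vector<std::unordered_set<std::int32_t>> const& dictionaries,
  std::int32_t literal)
{
  std::vector<std::size_t> surviving;
  for (std::size_t rg = 0; rg < dictionaries.size(); ++rg) {
    // Keep the row group if the literal is one of its dictionary keys
    if (dictionaries[rg].count(literal) > 0) { surviving.push_back(rg); }
  }
  return surviving;
}
```

Other predicates need different probing logic. For instance, `col != literal` can only prune a row group whose dictionary contains the literal and nothing else.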