[FEA] Add a custom decoder to convert Parquet dictionary pages into cuco::static_set objects #18046

@GregoryKimball

Is your feature request related to a problem? Please describe.
We would like to evaluate filter predicates against the keys present in each Parquet column chunk's dictionary page. To do this, we need an API that decodes the dictionary page for each (filter) column chunk in each row group and materializes a corresponding cuco::static_set_ref.

See story issue #17896 for more context about row group filtering with dictionary pages. Also see dictionary_page_filter.cu and hybrid_scan_helpers.hpp in draft PR #18011 for some of the related infrastructure for applying predicates to dictionary key data.

Describe the solution you'd like
We would like an implementation of the aggregate_reader_metadata::materialize_dictionaries API. This API is declared in hybrid_scan_helpers.hpp and has a placeholder implementation in dictionary_page_filter.cu in draft PR #18011, which can be replaced with the actual implementation. Here is the API signature, but please feel free to modify it as needed:

   /**
    * @brief Materializes column chunk dictionary pages into `cuco::static_set`s
    *
    * @param dictionary_page_data Raw dictionary page data device buffers for each input row group
    * @param input_row_group_indices Lists of input row group indices, one per source
    * @param total_row_groups Total number of row groups in `input_row_group_indices`
    * @param output_dtypes Datatypes of output (aka filter) columns
    * @param dictionary_col_schemas schema indices of filter columns
    * @param stream CUDA stream used for device memory operations and kernel launches
    *
    * @return A flattened list of `cuco::static_set_ref` device buffers for each predicate column
    * across row groups
    */
   [[nodiscard]] std::vector<rmm::device_buffer> materialize_dictionaries(
     cudf::host_span<rmm::device_buffer> dictionary_page_data,
     host_span<std::vector<size_type> const> input_row_group_indices,
     host_span<data_type const> output_dtypes,
     host_span<int const> dictionary_col_schemas,
     rmm::cuda_stream_view stream) const;
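
To make the shape of the inputs and the flattened output concrete, here is a minimal host-side sketch. The helper names `count_row_groups` and `flat_index` are hypothetical and not part of the PR, `int` stands in for `size_type`, and the row-group-major ordering of the flattened output is an assumption, since the issue does not fix the order.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: total number of row groups across all sources,
// i.e. the sum of the sizes of the per-source index lists.
std::size_t count_row_groups(std::vector<std::vector<int>> const& input_row_group_indices)
{
  return std::accumulate(
    input_row_group_indices.begin(),
    input_row_group_indices.end(),
    std::size_t{0},
    [](std::size_t sum, std::vector<int> const& indices) { return sum + indices.size(); });
}

// Hypothetical helper: position of a (row group, filter column) pair in the
// flattened output. ASSUMPTION: row-group-major order; the issue does not specify it.
std::size_t flat_index(std::size_t row_group, std::size_t column, std::size_t num_columns)
{
  return row_group * num_columns + column;
}
```

Under these assumptions, a call with three row groups and two filter columns would return six buffers, one per (row group, filter column) pair.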

The API has these inputs and outputs:

Inputs

  • Span of device buffers (one per input column per row group in the file) holding decompressed raw Parquet dictionary page bytes. The API must parse the page headers from these raw bytes.
  • Schema indices and data types of the filter columns (whether these are needed is still an open question)

Output

  • Span of cuco::static_set_ref bytes for later probing according to the filter predicate
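
As a rough illustration of the decoding step, the sketch below turns the payload of an INT32 dictionary page into a set of keys on the host. It relies on two Parquet facts: dictionary page values are PLAIN-encoded, and PLAIN INT32 values are 4-byte little-endian. Everything else is an assumption for illustration: `decode_int32_dictionary` is a hypothetical name, `std::unordered_set` stands in for the device-side `cuco::static_set`, and the Thrift-encoded page header is assumed to have been parsed already so that `num_values` is known and `payload` starts right after it.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <unordered_set>
#include <vector>

// Host-side stand-in for the device-side set (ASSUMPTION: the real API
// would build a cuco::static_set on the GPU instead).
using key_set = std::unordered_set<std::int32_t>;

// Hypothetical helper: decode PLAIN-encoded INT32 dictionary values.
// `payload` begins right after the page header; `num_values` comes from that header.
key_set decode_int32_dictionary(std::vector<std::uint8_t> const& payload, std::size_t num_values)
{
  constexpr std::size_t width = sizeof(std::int32_t);
  if (num_values > payload.size() / width) {
    throw std::invalid_argument("dictionary payload is shorter than num_values implies");
  }

  key_set keys;
  keys.reserve(num_values);
  for (std::size_t i = 0; i < num_values; ++i) {
    std::uint8_t const* p = payload.data() + i * width;
    // Assemble the little-endian bytes explicitly so the result does not
    // depend on the host's byte order.
    std::uint32_t const bits = std::uint32_t{p[0]} | (std::uint32_t{p[1]} << 8) |
                               (std::uint32_t{p[2]} << 16) | (std::uint32_t{p[3]} << 24);
    keys.insert(static_cast<std::int32_t>(bits));
  }
  return keys;
}
```

Duplicate values collapse in the set, which is all a membership probe needs. Variable-width types such as BYTE_ARRAY would need a different loop, since each value carries its own 4-byte length prefix.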

Additional context
The materialize_dictionaries API is called from the filter_row_groups_with_dictionary_pages API in hybrid_scan_helpers.cpp in draft PR #18011.

The output dictionaries will be used by the apply_dictionary_filter API (to be implemented soon by @mhaseeb123) to query the dictionaries and prune row groups.
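
The pruning step can be pictured with a small host-side sketch for an equality predicate. This is not the planned apply_dictionary_filter implementation: `prune_row_groups_equal` is a hypothetical name, `std::unordered_set` stands in for the device-side set, and the sketch assumes every data page in each column chunk is dictionary-encoded, since otherwise the dictionary does not list every value in the chunk.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Host-side stand-in for one materialized dictionary (ASSUMPTION: the real
// code would probe a cuco::static_set_ref on the device).
using key_set = std::unordered_set<std::int32_t>;

// Hypothetical helper: for the predicate `column == literal`, return the
// indices of row groups whose dictionary contains `literal`. Row groups
// without the key cannot match and are pruned.
std::vector<std::size_t> prune_row_groups_equal(std::vector<key_set> const& dictionaries,
                                                std::int32_t literal)
{
  std::vector<std::size_t> surviving;
  for (std::size_t row_group = 0; row_group < dictionaries.size(); ++row_group) {
    if (dictionaries[row_group].count(literal) > 0) { surviving.push_back(row_group); }
  }
  return surviving;
}
```

Here `dictionaries` holds one key set per row group for a single filter column. With dictionaries {1, 2}, {3}, and {2, 5} and the literal 2, only row groups 0 and 2 survive.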

Labels

  • cuIO: cuIO issue
  • feature request: New feature or request
  • libcudf: Affects libcudf (C++/CUDA) code.
