
API for filtering / evaluation directly on encoded data #8842

@alamb

Description


Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Executing directly on "compressed" (encoded) data is a well-known technique in the academic columnar-store literature. The Parquet format currently supports several different encodings, such as:

  • Plain
  • Dictionary (and hybrid)
  • RLE/Bitpacked
  • DeltaBinaryPacked
  • DeltaByteArray
  • ByteStreamSplit

The current Rust Parquet reader supports evaluating predicates during the scan via ArrowReaderBuilder::with_row_filter. However, the current RowFilter API only permits evaluation on Arrow arrays, i.e., after decompression and decoding.
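For context, here is a minimal std-only sketch of the shape of the current path (this is not the parquet crate API, and RleColumn is an illustrative name): every value is materialized before the predicate sees it, so the predicate runs once per row.

```rust
/// A run-length encoded column: (value, run_length) pairs.
/// Illustrative only; not the parquet crate's representation.
struct RleColumn {
    runs: Vec<(i64, usize)>,
}

impl RleColumn {
    /// The decoded path: expand every run into individual values.
    fn decode(&self) -> Vec<i64> {
        self.runs
            .iter()
            .flat_map(|&(v, len)| std::iter::repeat(v).take(len))
            .collect()
    }
}

fn main() {
    // 2001 rows stored as just 3 runs...
    let col = RleColumn { runs: vec![(1, 1000), (2, 1000), (3, 1)] };
    // ...but today all 2001 values are materialized, and the predicate
    // is then evaluated once per decoded value.
    let decoded = col.decode();
    let matches = decoded.iter().filter(|&&v| v == 3).count();
    assert_eq!(decoded.len(), 2001);
    assert_eq!(matches, 1);
}
```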

It would be interesting to consider supporting predicate evaluation directly on the Parquet-encoded data, to avoid decode costs when they are unnecessary -- for example, in "needle in the haystack" queries that filter out almost all rows.
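As one illustration of why this pays off, on dictionary-encoded data a predicate only needs to be evaluated once per distinct dictionary value; rows can then be selected by key lookup alone, without decoding any values. A std-only sketch (filter_dictionary is an illustrative name, not an existing API):

```rust
/// Evaluate `pred` once per dictionary entry, then select rows by key.
/// `dict` holds the distinct values; `keys` holds one index per row.
fn filter_dictionary(dict: &[&str], keys: &[u32], pred: impl Fn(&str) -> bool) -> Vec<bool> {
    // One predicate evaluation per *distinct* value...
    let hits: Vec<bool> = dict.iter().map(|&v| pred(v)).collect();
    // ...then a cheap lookup per row, with no string decoding at all.
    keys.iter().map(|&k| hits[k as usize]).collect()
}

fn main() {
    // 3 distinct values, however many rows: the predicate runs 3 times.
    let dict = ["apple", "banana", "cherry"];
    let keys = [0u32, 1, 1, 2, 0, 2];
    let selected = filter_dictionary(&dict, &keys, |s| s == "cherry");
    assert_eq!(selected, vec![false, false, false, true, false, true]);
}
```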

As the community considers adding more encodings, such as ALP and FSST, which are even more amenable to operating on encoded data, the benefit of such an API will only increase.

Describe the solution you'd like
I would like some way to evaluate predicates directly on the encoded Parquet data

Describe alternatives you've considered

One option is to extend the existing RowFilter API so that different implementations can be provided for different encodings:

// add predicates evaluated on (decoded) Arrow arrays
let filter = RowFilter::new(arrow_predicates)
  // add specializations that can evaluate directly on RLE encoded data;
  // if the data is not RLE encoded, fall back to the Arrow predicates
  .with_rle_predicates(...);

We would have to run more experiments to determine what the API for RLE (and similar) predicates should look like.
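One possible shape for such an RLE predicate, as a std-only sketch (evaluate_runs and the range representation are assumptions, not a proposed final API): the predicate runs once per run and emits selected row ranges, never materializing per-value data.

```rust
/// (value, run_length) pairs, as an RLE-encoded column would store them.
type Runs = Vec<(i64, usize)>;

/// Evaluate `pred` once per run and return the selected half-open
/// row ranges [start, end), merging adjacent selected runs.
fn evaluate_runs(runs: &Runs, pred: impl Fn(i64) -> bool) -> Vec<(usize, usize)> {
    let mut out: Vec<(usize, usize)> = Vec::new();
    let mut row = 0;
    for &(value, len) in runs {
        if pred(value) {
            // One predicate call selects `len` rows at once.
            match out.last_mut() {
                Some((_, end)) if *end == row => *end = row + len,
                _ => out.push((row, row + len)),
            }
        }
        row += len;
    }
    out
}

fn main() {
    // Same 2001-row column as before: the predicate is evaluated
    // 3 times (once per run) instead of 2001 times (once per row).
    let runs: Runs = vec![(1, 1000), (2, 1000), (3, 1)];
    assert_eq!(evaluate_runs(&runs, |v| v == 3), vec![(2000, 2001)]);
}
```

Row ranges also compose naturally with the reader's existing row-selection machinery, which is one reason a range-based result type might be a reasonable starting point for experiments.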

Additional context

The Vortex file implementation seems to be heading toward adding its own Expression implementation for common expressions, together with the corresponding evaluation implementation.

For example, see how VortexExpr::evaluate looks (it has full knowledge of the types, etc.):
https://docs.rs/vortex-expr/0.54.0/vortex_expr/trait.VortexExpr.html#method.evaluate

This might be necessary if Parquet ends up with generic cascaded encodings.

Labels

enhancement (Any new improvement worthy of an entry in the changelog), parquet (Changes to the parquet crate)
