Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The idea of executing directly on "compressed" (encoded) data is a well-known technique in the academic columnar store literature. The Parquet format currently supports several different encodings, such as:
- Plain
- Dictionary (and hybrid)
- RLE/Bitpacked
- DeltaBinaryPacked
- DeltaByteArray
- ByteStreamSplit
The current Rust Parquet reader supports evaluating predicates during the scan via `ArrowReaderBuilder::with_row_filter`. However, the current `RowFilter` API only permits evaluation on Arrow arrays, i.e., after decompression and decoding.
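For reference, here is a minimal sketch of how a row filter is attached today, evaluated entirely on decoded Arrow data. It assumes the `parquet` and `arrow-array` crates and a hypothetical file `data.parquet` whose first leaf column is an Int32:

```rust
use std::fs::File;

use arrow_array::cast::AsArray;
use arrow_array::types::Int32Type;
use arrow_array::{BooleanArray, RecordBatch};
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // The predicate sees a *decoded* RecordBatch containing only leaf column 0
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(mask, |batch: RecordBatch| {
        let col = batch.column(0).as_primitive::<Int32Type>();
        // keep rows where the value equals 42
        Ok(BooleanArray::from_iter(col.iter().map(|v| Some(v == Some(42)))))
    });

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    for batch in reader {
        println!("{} rows passed the filter", batch?.num_rows());
    }
    Ok(())
}
```

Note that the pages for the filtered column are fully decoded into Arrow arrays before the closure ever runs, which is exactly the cost this issue proposes to avoid.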
It would be interesting to consider supporting evaluating predicates directly on Parquet-encoded data to avoid decode costs when they are unnecessary -- for example, when doing "needle in the haystack" queries that filter out almost all rows.
As the community considers [adding more encodings], such as ALP and FSST, which are even more amenable to operating on encoded data, the benefit of such an API will only increase.
Describe the solution you'd like
I would like some way to evaluate predicates directly on the encoded Parquet data.
Describe alternatives you've considered
One possibility would be to extend the existing `RowFilter` API so that different implementations could be provided for different encodings:
```rust
// add predicates evaluated on (decoded) Arrow arrays
let filter = RowFilter::new(arrow_predicates)
    // add specializations that can evaluate directly on RLE encoded data;
    // if the data is not RLE encoded, fall back to the Arrow ones
    .with_rle_predicates(...);
```
We would have to do some more experiments to determine what the API for RLE (and similar) predicates should be.
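To make that concrete, one hypothetical shape such an RLE predicate could take is sketched below. Every name here is illustrative and not part of the parquet crate today: rather than seeing one value per row, the predicate is called once per run, and its answer is fanned out to all rows in that run.

```rust
/// Hypothetical trait for evaluating a predicate directly on RLE runs.
/// The reader would call `evaluate_run` once per (value, run_length) pair,
/// skipping per-row decoding entirely.
trait RlePredicate: Send {
    /// Return whether rows in this run pass the predicate; the answer
    /// applies to all `run_length` rows at once.
    fn evaluate_run(&mut self, value: &[u8], run_length: usize) -> bool;
}

/// Example: equality against a constant, compared once per run.
struct EqualsBytes {
    needle: Vec<u8>,
}

impl RlePredicate for EqualsBytes {
    fn evaluate_run(&mut self, value: &[u8], _run_length: usize) -> bool {
        value == self.needle.as_slice()
    }
}

/// Expand per-run answers into a per-row selection, e.g. to feed the
/// existing row selection machinery.
fn runs_to_selection(
    predicate: &mut dyn RlePredicate,
    runs: &[(Vec<u8>, usize)],
) -> Vec<bool> {
    let mut selection = Vec::new();
    for (value, run_length) in runs {
        let keep = predicate.evaluate_run(value, *run_length);
        selection.extend(std::iter::repeat(keep).take(*run_length));
    }
    selection
}
```

With something like this, a run of a million identical non-matching values would cost one comparison instead of a million.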
Additional context
The Vortex file format implementation seems to be heading towards adding its own Expression implementation for common expressions, and then providing the evaluation implementation as well.
For example, see how `VortexExpr::evaluate` looks (it has all the knowledge of the types, etc.):
https://docs.rs/vortex-expr/0.54.0/vortex_expr/trait.VortexExpr.html#method.evaluate
This might be necessary if Parquet ends up with generic cascaded encodings.
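For comparison, a heavily simplified analogue of that idea in Parquet terms might look like the sketch below. These types are purely illustrative and do not exist in the parquet crate, and the real `VortexExpr` trait is considerably richer; the point is only that the expression, not the reader, owns the knowledge of how to evaluate against each encoding.

```rust
/// Hypothetical view of a page that exposes its encoding to the expression.
enum EncodedPage<'a> {
    /// Dictionary-encoded: per-row keys plus the dictionary values
    Dictionary { keys: &'a [u32], dictionary: &'a [Vec<u8>] },
    /// Fallback: values already decoded to plain bytes
    Plain { values: &'a [Vec<u8>] },
}

/// Hypothetical expression that evaluates itself against whichever
/// encoding the page uses, falling back to the decoded path only when
/// no specialized path exists.
trait EncodedExpr {
    /// Return one boolean per logical row in the page.
    fn evaluate(&self, page: &EncodedPage<'_>) -> Vec<bool>;
}
```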