Description
Is your feature request related to a problem? Please describe.
In Parquet 1.11 a new feature was added for column indexes/page indexes.
https://github.com/apache/parquet-format/blob/master/PageIndex.md
When I grep through the code I do not see support for this feature, but I could have missed it. Spark supports using this feature on the CPU to reduce the total amount of data read from disk, and it would be great to be able to write Parquet files that support this too. That way, our customers who also need to read data on the CPU can read data written by the GPU as fast as data written by the CPU.
Describe the solution you'd like
Automatically write the ColumnIndex and OffsetIndex for each Parquet file we write.
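To illustrate why a reader benefits from these two structures, here is a rough sketch in plain Python of how per-page min/max statistics (ColumnIndex) combined with page locations (OffsetIndex) let a reader skip pages for a range filter. The field names follow the Parquet format spec, but the classes, the `pages_to_read` helper, and the toy values are hypothetical and simplified (for example, the real format stores min/max as binary values). This is not cudf or Spark code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageLocation:
    # One entry of the OffsetIndex: where a page lives in the file.
    offset: int
    compressed_page_size: int
    first_row_index: int


@dataclass
class ColumnIndex:
    # Per-page statistics; simplified to ints for this sketch.
    null_pages: List[bool]
    min_values: List[int]
    max_values: List[int]


def pages_to_read(column_index: ColumnIndex,
                  page_locations: List[PageLocation],
                  lo: int, hi: int) -> List[PageLocation]:
    """Hypothetical helper: keep pages whose [min, max] overlaps [lo, hi]."""
    selected = []
    for i, loc in enumerate(page_locations):
        if column_index.null_pages[i]:
            continue  # page holds only nulls, so it cannot match the filter
        if column_index.max_values[i] < lo or column_index.min_values[i] > hi:
            continue  # value range does not overlap the filter range
        selected.append(loc)
    return selected


# Toy column chunk with four pages; the third page is all nulls.
column_index = ColumnIndex(
    null_pages=[False, False, True, False],
    min_values=[0, 100, 0, 200],
    max_values=[99, 199, 0, 299],
)
page_locations = [
    PageLocation(offset=4, compressed_page_size=1000, first_row_index=0),
    PageLocation(offset=1004, compressed_page_size=1000, first_row_index=100),
    PageLocation(offset=2004, compressed_page_size=50, first_row_index=200),
    PageLocation(offset=2054, compressed_page_size=1000, first_row_index=300),
]

# Filter: 150 <= value <= 250
hits = pages_to_read(column_index, page_locations, 150, 250)
bytes_read = sum(p.compressed_page_size for p in hits)
bytes_total = sum(p.compressed_page_size for p in page_locations)
print([p.offset for p in hits], bytes_read, bytes_total)
```

In this toy case only two of the four pages need to be fetched. Without the indexes in the file, a reader has no per-page information to prune on and has to fall back to coarser row-group statistics.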
Describe alternatives you've considered
We cannot do this without the help of cudf, so there really are no other alternatives.
Additional context
None