[FEA] Parquet writer to include Column Index feature #9268

@revans2

Is your feature request related to a problem? Please describe.
Parquet 1.11 added a new feature for column indexes (also called page indexes).

https://github.com/apache/parquet-format/blob/master/PageIndex.md

When I grep through the code I do not see support for this feature, but I could have missed it. Spark supports using this feature on the CPU to reduce the total amount of data read from disk, and it would be great to be able to write Parquet files that support this too. That way, customers who also need to read data on the CPU can read files written by the GPU as fast as files written by the CPU.

Describe the solution you'd like
Write the ColumnIndex and OffsetIndex automatically for each Parquet file we write.
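
To illustrate why these two structures matter to a reader, here is a minimal Python sketch of how per-page min/max statistics can be used to skip pages. The field names (`null_pages`, `min_values`, `max_values`, `page_locations`, `offset`, `compressed_page_size`, `first_row_index`) follow the Parquet format definition, but everything else is an assumption for illustration: the classes are simplified stand-ins, the values are plain integers rather than the binary-encoded values in the real format, and `pages_to_read` is a hypothetical helper, not an API from cudf, Spark, or any Parquet library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageLocation:
    offset: int                # byte offset of the page in the file
    compressed_page_size: int  # size of the page, header included
    first_row_index: int       # index of the first row of the page in its row group


@dataclass
class OffsetIndex:
    page_locations: List[PageLocation]


@dataclass
class ColumnIndex:
    null_pages: List[bool]  # True if the page contains only nulls
    min_values: List[int]   # simplified: real files store encoded bytes
    max_values: List[int]


def pages_to_read(ci: ColumnIndex, oi: OffsetIndex, lo: int, hi: int) -> List[PageLocation]:
    """Hypothetical helper: return the pages that may hold values in [lo, hi]."""
    selected = []
    for i, loc in enumerate(oi.page_locations):
        if ci.null_pages[i]:
            continue  # an all-null page cannot match a range predicate
        if ci.max_values[i] < lo or ci.min_values[i] > hi:
            continue  # the page's value range does not overlap the predicate
        selected.append(loc)
    return selected


# Made-up example: three pages of 100 rows each, holding sorted values.
column_index = ColumnIndex(
    null_pages=[False, False, False],
    min_values=[0, 10, 20],
    max_values=[9, 19, 29],
)
offset_index = OffsetIndex(
    page_locations=[
        PageLocation(offset=4, compressed_page_size=1000, first_row_index=0),
        PageLocation(offset=1004, compressed_page_size=1000, first_row_index=100),
        PageLocation(offset=2004, compressed_page_size=1000, first_row_index=200),
    ]
)

# A filter such as "12 <= x <= 15" only needs the second page.
needed = pages_to_read(column_index, offset_index, 12, 15)
```

In this sketch the ColumnIndex decides which pages can be skipped and the OffsetIndex tells the reader where the remaining pages live, which is why a writer needs to emit both for the reader to benefit.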

Describe alternatives you've considered
We cannot do this without the help of cudf, so there really are no other alternatives.

Additional context
None

Labels
Spark (Functionality that helps Spark RAPIDS); cuIO (cuIO issue); feature request (New feature or request); libcudf (Affects libcudf (C++/CUDA) code)
