[C++][Parquet][CI] Improve Parquet fuzzing seed corpus #43709

@pitrou

Describe the enhancement requested

Currently, our Parquet fuzzing seed corpus consists of a single file, generated here:
https://github.com/apache/arrow/blob/fb202ee66d73572f46035c5b2f21ac22f74ba951/cpp/src/parquet/arrow/generate_fuzz_corpus.cc

We should probably generate more files and/or more columns per batch, and enable more features:

  • vary data page version
  • vary compression codec
  • vary encodings (e.g. delta binary, byte stream split...)
  • enable page checksums (but do not verify them on reading: verification would reject most mutated inputs up front and prevent exercising the actual decoding most of the time)
  • enable statistics (and load them on reading)
  • enable page indices
  • enable bloom filters once #37400 (GH-34785: [C++][Parquet] Add bloom filter write support) is merged

We should also add more datatypes, at least Boolean and FixedSizeBinary, possibly also Decimal128 and Decimal256.
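
For illustration, here is a minimal sketch of how the writer-side combinations listed above could be enumerated, one seed file per combination. It is deliberately independent of the Parquet API: `CorpusVariant`, `EnumerateVariants` and `VariantFileName` are hypothetical names, the codec and encoding lists are only examples, and mapping each variant onto actual writer properties is left out.

```cpp
#include <initializer_list>
#include <string>
#include <vector>

// Hypothetical types for illustration only; none of these are Arrow/Parquet APIs.
enum class PageVersion { kV1, kV2 };
enum class Codec { kUncompressed, kSnappy, kZstd };
enum class ColumnEncoding { kPlain, kDeltaBinaryPacked, kByteStreamSplit };

// One seed file is generated per variant.
struct CorpusVariant {
  PageVersion page_version;
  Codec codec;
  ColumnEncoding encoding;    // a real generator would apply it only to compatible columns
  bool write_page_checksums;  // written, but not verified by the fuzz target
  bool write_statistics;
  bool write_page_index;
};

const char* ToString(PageVersion v) {
  return v == PageVersion::kV1 ? "v1" : "v2";
}

const char* ToString(Codec c) {
  switch (c) {
    case Codec::kUncompressed: return "uncompressed";
    case Codec::kSnappy: return "snappy";
    case Codec::kZstd: return "zstd";
  }
  return "unknown";
}

const char* ToString(ColumnEncoding e) {
  switch (e) {
    case ColumnEncoding::kPlain: return "plain";
    case ColumnEncoding::kDeltaBinaryPacked: return "delta";
    case ColumnEncoding::kByteStreamSplit: return "bss";
  }
  return "unknown";
}

// Cross product of page version, codec and encoding. The optional features
// (checksums, statistics, page index) are toggled together so that the
// number of seed files stays small.
std::vector<CorpusVariant> EnumerateVariants() {
  std::vector<CorpusVariant> variants;
  for (PageVersion version : {PageVersion::kV1, PageVersion::kV2}) {
    for (Codec codec : {Codec::kUncompressed, Codec::kSnappy, Codec::kZstd}) {
      for (ColumnEncoding encoding :
           {ColumnEncoding::kPlain, ColumnEncoding::kDeltaBinaryPacked,
            ColumnEncoding::kByteStreamSplit}) {
        for (bool extras : {false, true}) {
          variants.push_back({version, codec, encoding, extras, extras, extras});
        }
      }
    }
  }
  return variants;
}

// Builds a descriptive file name such as "seed-v2-zstd-bss-extras.parquet".
std::string VariantFileName(const CorpusVariant& v) {
  std::string name = "seed-";
  name += ToString(v.page_version);
  name += "-";
  name += ToString(v.codec);
  name += "-";
  name += ToString(v.encoding);
  name += v.write_statistics ? "-extras" : "-basic";
  return name + ".parquet";
}
```

Toggling the optional features as a group rather than independently is a size trade-off: the fuzzer mutates the seeds anyway, so the corpus mostly needs each feature to appear in at least one file, not every combination of features.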

Component(s)

C++, Continuous Integration, Parquet
