### Describe the enhancement requested

Currently, for our Parquet fuzzing seed corpus, we generate a grand total of 1 file here: https://github.com/apache/arrow/blob/fb202ee66d73572f46035c5b2f21ac22f74ba951/cpp/src/parquet/arrow/generate_fuzz_corpus.cc

We should probably generate more files (and/or more batch columns) and/or enable more features:

* vary data page version
* vary compression codec
* vary encodings (e.g. delta binary, byte stream split...)
* enable page checksums (<s>and verify them on reading</s>: that's actually a bad idea as it would prevent exercising the actual decoding most of the time)
* enable statistics (and load them on reading)
* enable page indices
* enable bloom filters once https://github.com/apache/arrow/pull/37400 is merged

We should also add more datatypes, at least Boolean and FixedSizeBinary, possibly also Decimal128 and Decimal256.
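
To make the size of the proposed matrix concrete, here is a minimal, library-free sketch that enumerates one seed file per combination of data page version, compression codec, encoding and page checksum setting. Everything in it is hypothetical: `SeedConfig`, `EnumerateSeedConfigs`, the file naming scheme and the specific codec and encoding lists are illustrative assumptions, not taken from `generate_fuzz_corpus.cc`. A real generator would translate each entry into Parquet writer properties and would probably need to skip encodings that do not apply to a given column type.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one seed file. None of these names exist in
// generate_fuzz_corpus.cc; they only illustrate the proposed feature matrix.
struct SeedConfig {
  std::string data_page_version;  // "v1" or "v2"
  std::string compression;        // codec name (illustrative list below)
  std::string encoding;           // column encoding (illustrative list below)
  bool page_checksum;             // whether page checksums are written

  // Builds a descriptive file name so each seed is easy to identify.
  std::string FileName() const {
    std::string name = "seed-" + data_page_version + "-" + compression + "-" + encoding;
    if (page_checksum) name += "-crc";
    return name + ".parquet";
  }
};

// Enumerates the cartesian product of the options listed in this issue.
std::vector<SeedConfig> EnumerateSeedConfigs() {
  // Assumed option lists; the actual set of codecs and encodings is open.
  const std::vector<std::string> versions = {"v1", "v2"};
  const std::vector<std::string> codecs = {"uncompressed", "snappy", "zstd"};
  const std::vector<std::string> encodings = {"plain", "delta_binary_packed",
                                              "byte_stream_split"};

  std::vector<SeedConfig> configs;
  for (const auto& version : versions) {
    for (const auto& codec : codecs) {
      for (const auto& encoding : encodings) {
        for (bool checksum : {false, true}) {
          configs.push_back(SeedConfig{version, codec, encoding, checksum});
        }
      }
    }
  }
  // Sanity check: one entry per combination (the final factor 2 is checksum off/on).
  assert(configs.size() == versions.size() * codecs.size() * encodings.size() * 2);
  return configs;
}
```

With these assumed lists the full product is 2 × 3 × 3 × 2 = 36 files, so it may be preferable to sample the combinations rather than generate all of them, especially once statistics, page indices and bloom filters are added as further dimensions.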

### Component(s)

C++, Continuous Integration, Parquet