59 changes: 59 additions & 0 deletions README.md
@@ -118,6 +118,65 @@ chunks they are interested in. The columns chunks should then be read sequentia

![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)

### PAR3 File Footers

PAR3 file footer footer format designed to better support wider-schemas and more control

Minor:

Suggested change:
- PAR3 file footer footer format designed to better support wider-schemas and more control
+ PAR3 file footer format designed to better support wider-schemas and more control

over the various footer size vs. compute trade-offs. Its format is as follows:
- Data pages containing serialized Thrift metadata objects that were modeled as lists
in PAR1. These are stored contiguously with offsets stored in the FileMetadata. See
parquet.thrift for more details on each.
- Serialized Thrift FileMetadata structure
- (Optional) 4-byte CRC32 of the serialized Thrift FileMetadata.
Contributor

This should not be optional.

All bytes in a file are not equal. In particular, footer bytes are very important because if those are corrupt, we can't read any bytes of the file. If anything, footers not having a required checksum for their content is a design flaw of the original Parquet specification.

Contributor Author

I tend to agree; others had concerns that it was gratuitous. I can update to make it required.

- 4-byte length in bytes (little endian) of the serialized FileMetadata structure.
- 4-byte length in bytes (little endian) of all preceding elements in the footer.
Contributor

I recommend a crc32 of the length itself. This is to detect early that a footer is corrupt and avoid reading some epic amount of garbage from the end of the file. For example, think of a bit flip in one of the top bits of the length: it will cause a reader to read 100s of MBs from the end of the file only to find that the crc doesn't match.

Contributor Author

I'm OK adding this. As a counterpoint, 100s of MB would ideally be rejected by reasonable memory limitations on the footer.

there's generally an expectation that filesystems do validation, but CRCs are relatively low cost, and help find problems in networking and NICs

- 1-byte flag field to indicate features that require special parsing of the footer.
Readers MUST raise an error if there is an unrecognized flag. Current flags:

* 0x01 - Footer encryption enabled (when set, the encryption information is written before the
FileMetadata structure, as in the PAR1 footer).
* 0x02 - CRC32 of the serialized FileMetadata is present.

- 4-byte magic number "PAR3"

When parsing the footer, implementations SHOULD read at least the last 10 bytes of the footer, then
read the entirety of the footer based on the length of all preceding elements. This avoids further
I/O cost for accessing metadata stored in the data pages. PAR3 footers can fully replace PAR1 footers.
If a file is written with only a PAR3 footer, implementations MUST write "PAR3" as the first four bytes of
the file. PAR3 footers can also be written in a backwards-compatible way after the PAR1 metadata
(see the next section for details).
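
To make the read path concrete, here is a rough, non-normative Python sketch of how a reader might locate a PAR3 footer; the helper name and the assumption that the trailing fields occupy the last 9 bytes (preceding-length, flags, magic) are interpretations of the layout above, not part of the spec:

```python
import struct

def read_par3_footer(f, file_size):
    """Sketch only: locate the PAR3 footer via its trailing fields
    (4-byte length of preceding elements, 1-byte flags, 4-byte magic)."""
    f.seek(file_size - 10)
    tail = f.read(10)                      # spec: read at least the last 10 bytes
    preceding_len, flags, magic = struct.unpack("<IB4s", tail[-9:])
    if magic != b"PAR3":
        raise ValueError("no PAR3 footer found")
    if flags & ~0x03:
        raise ValueError("unrecognized PAR3 footer flag(s)")  # readers MUST error
    # Second (and final) read: everything that precedes the preceding-length
    # field, i.e. metadata pages, FileMetadata, optional CRC32 and the
    # FileMetadata length (assuming "all preceding elements" counts that field).
    f.seek(file_size - 9 - preceding_len)
    return f.read(preceding_len), flags
```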

#### Dual Mode PAR1 and PAR3 footers

There is a desire to gradually rollout PAR3 footers to allow newer readers to take advantage of them, while
older readers can still properly parse the file. This section outlines a strategy to do this.
Member

Just my 2 cents, but I'd rather see a sketch of the actual layout, rather than a description of a strategy :-)
(especially as the strategy is best left to each implementation, depending on their specific constraints and tradeoffs)

Contributor Author

I can add this if we decide to move forward with this PR (as compared to your proposal).

Contributor Author

Change this to just describe the layout.


As background, Thrift structs are always serialized with a trailing 0 byte to delimit their end.
Therefore, for PAR1 files written before PAR3 was introduced, we always expect the files to have the following
trailing 9 bytes: [0x00, x, x, x, x, P, A, R, 1] (where x can be any value). We also expect all compliant
Thrift parsers to only parse the first available FileMetadata message and stop consuming the stream once it is read.
Today, we don't believe that any Parquet readers validate that the entire "length in bytes of file metadata"

there's going to be a need for some regression tests here, maybe a module to validate output from parquet-java against parquet readers from: older parquet-java release(s), parquet-cpp, the Impala parser, parquet-dotnet. Because these are the older/frozen releases, a docker image with everything installed should work.

Member

I think it would be sufficient to generate dummy PAR1 file with extraneous padding at the end of the FileMetadata serialization, and see if all implementations can still read it.

Contributor

There is a way to avoid this. I am preparing a proposal for it.

I was thinking of doing this just by modifying parquet-java to add some trailing bytes and see what broke in the rest of the code, that is: make no changes to the reader

Member

@steveloughran That was my suggestion as well :-)

Contributor Author
@emkornfield May 29, 2024

@alkis if this is the unknown thrift field approach, I originally started drafting it up but figured I'd hold off due to @pitrou's strong feelings, but I think it is likely worth seeing it on paper, which might help weigh the pros/cons.

Contributor Author

I'll revise this and move the design considerations elsewhere.

@alkis if this was the approach you sketched in https://github.com/apache/parquet-format/pull/242/files#r1607838732, since I'm revising this anyway, I can incorporate it into the PR so we can consolidate discussion.

Contributor Author

Left a TODO in the readme to figure out desired approach.

Contributor Author

Removed most of this phrasing in favor of a layout.

Contributor

I am talking about an enhanced version of my original proposal in #242.

https://github.com/apache/parquet-format/pull/254/files?short_path=b335630#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5

I feel this is very powerful. The only complexity is on writers, who need to incorporate the short snippet of code that appends binary to an existing thrift struct. Readers can look at the tail of any thrift serialized struct and find the extension they expect without parsing thrift.

This means that we can use the approach as a generalized extension mechanism for any thrift struct in parquet.thrift. Once a vendor has a solid extension (or part of the extension) they want to make official it can enter the official struct proper and be removed from the extension.

is consumed. Therefore, to allow both footers to exist simultaneously in the file, the following algorithm is used:

1. Serialize and write the original (PAR1) FileMetadata thrift structure
2. Transform the original FileMetadata structure to conform to PAR3
* Move data elements if necessary
* Generate data pages for elements stored in metadata pages
* Clear the lists that were transferred to metadata pages
3. Write out metadata pages
4. Serialize and write the updated Thrift FileMetadata structure.
5. Write out the remainder of the PAR3 footer (last bytes written are "PAR3").
6. Write out the total size in bytes of the serialized (PAR1) FileMetadata structure plus the
size of the PAR3 footer as the final 4-byte length.
7. Write "PAR1".

When these steps are followed, readers wishing to use PAR3 footers SHOULD read the last 12 bytes of the file
and look for "PAR3" (written out in step five) at the beginning of those 12 bytes. As noted above, there should be
no ambiguity with files generated by Parquet reference implementations, since without PAR3 we expect the first
4 of those bytes to be [x, x, x, 0x00] for PAR1 files. Any ambiguity can be completely eliminated if the CRC32 is
written in PAR3 mode and verified by readers.
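
As an illustration (a Python sketch with made-up names, not part of the specification), a reader might distinguish the three cases like this:

```python
import struct

def detect_footer_kind(f, file_size):
    """Sketch only: classify a file's footer by inspecting its last 12 bytes."""
    f.seek(file_size - 12)
    window = f.read(12)
    if window[-4:] == b"PAR3":
        return "par3-only", None
    if window[-4:] != b"PAR1":
        raise ValueError("not a Parquet file")
    if window[:4] == b"PAR3":
        # Dual-mode tail: [P,A,R,3][4-byte total footer length][P,A,R,1]
        (total_footer_len,) = struct.unpack("<I", window[4:8])
        return "dual", total_footer_len
    # Plain PAR1: window[3] is the trailing Thrift stop byte 0x00, never '3'.
    return "par1-only", None
```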

When embedded into a PAR1 file, no modification to the magic number at the beginning of the file is mandated.

## Metadata
There are three types of metadata: file metadata, column (chunk) metadata and page
header metadata. All thrift structures are serialized using the TCompactProtocol.
136 changes: 126 additions & 10 deletions src/main/thrift/parquet.thrift
@@ -537,6 +537,39 @@ enum Encoding {
Support for INT32, INT64 and FIXED_LEN_BYTE_ARRAY added in 2.11.
*/
BYTE_STREAM_SPLIT = 9;

/** Encoding for variable length binary data that allows random access of values.
*
* This encoding is designed for random access of BYTE_ARRAY values. It is mostly useful
* for non-nullable BYTE_ARRAY columns where determining the exact offset of a value does not require
* parsing definition levels.
*
* The layout consists of the following elements:
* 1. byte_arrays - BYTE_ARRAY values laid out contiguously. The BYTE_ARRAYs are immediately followed by
* the cumulative offsets.
* 2. offsets - A contiguous set of N-byte little-endian unsigned integers
* representing the end byte offset (exclusive) of a BYTE_ARRAY value from
* the beginning of the page. For simplicity of implementation the offset at index 0 is
* always zero.
* 3. The last byte indicates the number of bytes used for offsets (valid values are 1, 2, 3 and 4).
* Implementations SHOULD try to use the smallest width that meets the length requirements.
*
* Note the order of lengths is reversed from DELTA_BINARY_PACKED to allow byte array values to
* potentially benefit from incremental compression in the case of Data Page V2 or other future data pages
* where values are compressed separately from nesting information.
*
* The beginning offset of the offsets can be determined using the final offset element.
*
* An individual byte array element can be found at an index using the following pseudo-code
* (real implementations SHOULD do bounds checking):
*
* return byte_arrays[offsets[index] : offsets[index+1]]
*
*
* Example encoding of "f", "oo", "bar1" (square brackets delimit the components listed):
* [foobar1][0,1,3,7][1]
*/
RANDOM_ACCESS_BYTE_ARRAY = 10;
}
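
To make the layout concrete, the following non-normative Python sketch (helper names invented here) encodes and decodes the "f", "oo", "bar1" example, assuming the byte arrays start at offset 0 of the decompressed page data:

```python
def encode_random_access_byte_array(values):
    """Sketch of the layout above: [values][cumulative end offsets][offset width]."""
    data = b"".join(values)
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    width = max(1, (offsets[-1].bit_length() + 7) // 8)  # smallest width that fits
    packed = b"".join(o.to_bytes(width, "little") for o in offsets)
    return data + packed + bytes([width])

def decode_value(page, index):
    """Return the byte array at `index` (bounds checking omitted for brevity)."""
    width = page[-1]                                   # trailing offset-width byte
    offsets_end = len(page) - 1
    values_end = int.from_bytes(page[offsets_end - width:offsets_end], "little")
    def offset_at(i):
        start = values_end + i * width
        return int.from_bytes(page[start:start + width], "little")
    return page[offset_at(index):offset_at(index + 1)]

page = encode_random_access_byte_array([b"f", b"oo", b"bar1"])
assert page == b"foobar1" + bytes([0, 1, 3, 7]) + bytes([1])
assert decode_value(page, 2) == b"bar1"
```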

/**
Expand Down Expand Up @@ -779,8 +812,12 @@ struct ColumnMetaData {
* whether we can decode those pages. **/
2: required list<Encoding> encodings

/** Path in schema **/
3: required list<string> path_in_schema
/** Path in schema
* Example of deprecating a field for PAR3
Member

Obviously I'm a bit biased, but I find the approach of defining separate structs (FileMetadataV3, etc.) much cleaner than tediously documenting what is required in V1 and forbidden in V3, and vice-versa.

Contributor Author

I think there are trade-offs I replied on the mailing list with more details (hopefully we can centralize discussion there).

* PAR1 Footer: Required
* PAR3 Footer: Deprecated (don't populate)
*/
3: optional list<string> path_in_schema

/** Compression codec **/
4: required CompressionCodec codec
@@ -792,7 +829,11 @@
6: required i64 total_uncompressed_size
Contributor

Can we make num_values, total_uncompressed_size, total_compressed_size i32s?

This doesn't matter much for Thrift, but if we are happy with such a change, it makes a difference for other encodings like flatbuffers.

In addition num_values can be optional and if left unset it can inherit RowGroup.num_rows. Most column chunks are dense and we can save repeating the same value over and over for every column.

Contributor Author

> Can we make num_values, total_uncompressed_size, total_compressed_size i32s?

No, we've had bugs in the past due to i32 overflow in various implementations and incompatibilities with Arrow's i32 offsets because the data stored is larger. I don't recall which of these fields had issues exactly but based on that it would indicate that there are in fact some users that overflow at least signed representations, so even unsigned int32 seems like a potential risk.

> In addition num_values can be optional and if left unset it can inherit RowGroup.num_rows. Most column chunks are dense and we can save repeating the same value over and over for every column.

I agree this is a reasonable optimization.

Contributor

I would think uncompressed_size at a minimum would have to remain i64. I've seen single row groups with string columns that exceed 2GB in size. I'd argue the same for total_compressed_size on uncompressed data.


/** total byte size of all compressed, and potentially encrypted, pages
* in this column chunk (including the headers) **/
* in this column chunk (including the headers)
*
* Fetching the range starting at min(dictionary_page_offset, data_page_offset) with length
* total_compressed_size should fetch all data in the given column chunk.
*/
7: required i64 total_compressed_size

/** Optional key/value metadata **/
@@ -812,7 +853,7 @@ struct ColumnMetaData {

/** Set of all encodings used for pages in this column chunk.
* This information can be used to determine if all data pages are
* dictionary encoded for example **/
* dictionary encoded for example **/
13: optional list<PageEncodingStats> encoding_stats;

/** Byte offset from beginning of file to Bloom filter data. **/
@@ -881,15 +922,21 @@ struct ColumnChunk {
/** Crypto metadata of encrypted columns **/
8: optional ColumnCryptoMetaData crypto_metadata

/** Encrypted column metadata for this chunk **/
/** Encrypted column metadata for this chunk
*
* PAR3: Not set; see columns_page on the FileMetadata struct
**/
9: optional binary encrypted_column_metadata
}

struct RowGroup {
/** Metadata for each column chunk in this row group.
* This list must have the same order as the SchemaElement list in FileMetaData.
*
* PAR1: Required
* PAR3: Not populated. Use columns_page on FileMetadata.
**/
1: required list<ColumnChunk> columns
1: optional list<ColumnChunk> columns

/** Total byte size of all the uncompressed column data in this row group **/
2: required i64 total_byte_size
@@ -1115,6 +1162,35 @@ union EncryptionAlgorithm {
2: AesGcmCtrV1 AES_GCM_CTR_V1
}

/**
* Description of location of a metadata page.
*
* A metadata page is a data page used to store metadata about
* the data stored in the file. This is a key feature of PAR3
* footers which allow for deferred decoding of metadata.
Comment on lines +1195 to +1197
Contributor

So the page will have a PageHeader/DataPageHeader at the top? Will nulls or repetition be allowed, i.e. do we need definition and repetition level data? If not, then should we define a new page type instead so we don't have to encode unused level encoding types? Then we could also drop the language below about not writing statistics.

Contributor Author

This is also a reasonable approach to define a new page type. I was leaving this open in case in the future we want nulls. Whether nulls are allowed and exact structure is dictated by its use on the field and also to minimize spec changes in this draft. The nice thing about this approach is it can work transparently if/when a new page type is added.

*
* For common use cases the current recommendation is to use
* an encoding that supports random access (e.g. PLAIN for fixed-size types
* and RANDOM_ACCESS_BYTE_ARRAY for variable-sized types). Implementations
* SHOULD consider allowing configurability per page to allow end-users
* to optimize the size vs. compute trade-offs that make sense for their use case.
*
* Statistics for Metadata pages SHOULD NOT be written.
*/
struct MetadataPageLocation {
// Offset from the beginning of the PAR3 footer to the header
// of the data page.
1: optional i32 footer_offset

// The length of the serialized page (header + data) in bytes. This
// is redundant with information in the header but allows
// for more robust checks before doing any Thrift parsing.
2: optional i32 full_page_size

// Optional compression applied to the page.
3: optional CompressionCodec compression
}
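
For illustration only (a Python sketch; `split_page_header` and the codec callback are hypothetical, and per-element footer encryption is ignored), reading a metadata page described by a MetadataPageLocation could look like:

```python
def read_metadata_page(f, par3_footer_start, loc, decompress):
    """Sketch: fetch the raw bytes of one metadata page.
    `loc` is a MetadataPageLocation; `decompress(codec, data)` is caller-supplied."""
    f.seek(par3_footer_start + loc.footer_offset)
    raw = f.read(loc.full_page_size)        # page header + page data
    header, data = split_page_header(raw)   # hypothetical Thrift PageHeader parse
    if loc.compression is not None:
        data = decompress(loc.compression, data)
    return header, data
```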

/**
* Description for file metadata
*/
@@ -1127,16 +1203,52 @@ struct FileMetaData {
* are flattened to a list by doing a depth-first traversal.
* The column metadata contains the path in the schema for that column which can be
* used to map columns to nodes in the schema.
* The first element is the root **/
2: required list<SchemaElement> schema;
* The first element is the root
*
* PAR1: Required
* PAR3: Use schema_page
*
* TODO: This might be too much (i.e. leave as a list for PAR3), but potentially useful for
* wide schemas if a "schema index" is ever added.
**/
2: optional list<SchemaElement> schema;

/** Required BYTE_ARRAY data where each element is REQUIRED.
*
* Each element is a serialized SchemaElement. The order and content should
* have a one-to-one correspondence with schema.
*
* If encryption is applied to the footer, each element is encrypted individually.
*/
10: optional MetadataPageLocation schema_page;

/** Number of rows in this file **/
3: required i64 num_rows

/** Row groups in this file **/
/** Row groups in this file
*
* TODO: Decide if this should be moved to a metadata page.
**/
4: required list<RowGroup> row_groups

/** Optional key/value metadata **/
/** Required BYTE_ARRAY data where each element is REQUIRED.
*
* Each element is a serialized ColumnChunk. The number of
* elements is M * N, where M is the number of row groups in the file
* and N is the number of columns storing data. A column's metadata
* object is stored at `m * N + column_index`, where m is the row-group
* index (for example, with N = 3 columns, the ColumnChunk for row group 2,
* column 1 is at index 2 * 3 + 1 = 7).
*
* If encryption applies to the footer, each element in the page is encrypted
* individually.
*
* PAR1: Don't include
* PAR3: Required **/
11: optional MetadataPageLocation columns_page

/** Optional key/value metadata
* TODO: Consider if this should be moved to use a data page as well
**/
5: optional list<KeyValue> key_value_metadata

/** String for application that wrote this file. This should be in the format
@@ -1160,6 +1272,10 @@
*
* The obsolete min and max fields in the Statistics object are always sorted
* by signed comparison regardless of column_orders.
*
* TODO: consider moving to a data page. While fast to decode, this potentially
* compresses/encodes extremely well since it is only a single value at the
* moment.
*/
7: optional list<ColumnOrder> column_orders;
