Commit 5b564f3

PARQUET-2139: Deprecate ColumnChunk::file_offset field (#440)
This field is not consistently set or read by implementations.
1 parent 3857dc1 commit 5b564f3

2 files changed

Lines changed: 27 additions & 18 deletions

README.md

Lines changed: 13 additions & 13 deletions
@@ -89,38 +89,38 @@ more pages.
 This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.

 4-byte magic number "PAR1"
-<Column 1 Chunk 1 + Column Metadata>
-<Column 2 Chunk 1 + Column Metadata>
+<Column 1 Chunk 1>
+<Column 2 Chunk 1>
 ...
-<Column N Chunk 1 + Column Metadata>
-<Column 1 Chunk 2 + Column Metadata>
-<Column 2 Chunk 2 + Column Metadata>
+<Column N Chunk 1>
+<Column 1 Chunk 2>
+<Column 2 Chunk 2>
 ...
-<Column N Chunk 2 + Column Metadata>
+<Column N Chunk 2>
 ...
-<Column 1 Chunk M + Column Metadata>
-<Column 2 Chunk M + Column Metadata>
+<Column 1 Chunk M>
+<Column 2 Chunk M>
 ...
-<Column N Chunk M + Column Metadata>
+<Column N Chunk M>
 File Metadata
 4-byte length in bytes of file metadata (little endian)
 4-byte magic number "PAR1"

 In the above example, there are N columns in this table, split into M row
-groups. The file metadata contains the locations of all the column metadata
+groups. The file metadata contains the locations of all the column chunk
 start locations. More details on what is contained in the metadata can be found
 in the Thrift definition.

-Metadata is written after the data to allow for single pass writing.
+File Metadata is written after the data to allow for single pass writing.

 Readers are expected to first read the file metadata to find all the column
 chunks they are interested in. The columns chunks should then be read sequentially.

 ![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)

 ## Metadata
-There are three types of metadata: file metadata, column (chunk) metadata and page
-header metadata. All thrift structures are serialized using the TCompactProtocol.
+There are two types of metadata: file metadata and page header metadata. All thrift structures
+are serialized using the TCompactProtocol.

 ![Metadata diagram](https://github.com/apache/parquet-format/raw/master/doc/images/FileFormat.gif)
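The layout in this hunk implies that a reader starts from the end of the file: check the trailing magic number, read the 4-byte little-endian length, and step back that many bytes to find the file metadata. A minimal Python sketch of that lookup follows. The function name `locate_file_metadata` is hypothetical, the length is assumed to be unsigned, and the sketch only locates the metadata bytes; it does not decode the Thrift structure.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"


def locate_file_metadata(data: bytes) -> Tuple[int, int]:
    """Return (start, length) of the serialized file metadata inside `data`."""
    # Smallest possible layout: leading magic + 4-byte length + trailing magic.
    if len(data) < 12:
        raise ValueError("too short to be a Parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # The 4 bytes before the trailing magic hold the metadata length.
    # Assumption: the length is treated as an unsigned little-endian integer.
    (length,) = struct.unpack("<I", data[-8:-4])
    start = len(data) - 8 - length
    if start < 4:
        raise ValueError("metadata length exceeds file size")
    return start, length


# Synthetic buffer for illustration only: the "metadata" is placeholder
# bytes, not a real Thrift-encoded FileMetaData.
fake_chunks = b"\x00" * 10
fake_metadata = b"\x01" * 5
blob = MAGIC + fake_chunks + fake_metadata + struct.pack("<I", len(fake_metadata)) + MAGIC
start, length = locate_file_metadata(blob)
```

Because the length sits at a fixed distance from the end of the file, a writer can stream column chunks first and append the metadata last, which is the single-pass property the README describes.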

src/main/thrift/parquet.thrift

Lines changed: 14 additions & 5 deletions
@@ -867,12 +867,21 @@ struct ColumnChunk {
 **/
 1: optional string file_path

-/** Byte offset in file_path to the ColumnMetaData **/
-2: required i64 file_offset
+/** Deprecated: Byte offset in file_path to the ColumnMetaData
+*
+* Past use of this field has been inconsistent, with some implementations
+* using it to point to the ColumnMetaData and some using it to point to
+* the first page in the column chunk. In many cases, the ColumnMetaData at this
+* location is wrong. This field is now deprecated and should not be used.
+* Writers should set this field to 0 if no ColumnMetaData has been written outside
+* the footer.
+*/
+2: required i64 file_offset = 0

-/** Column metadata for this chunk. This is the same content as what is at
-* file_path/file_offset. Having it here has it replicated in the file
-* metadata.
+/** Column metadata for this chunk. Some writers may also replicate this at the
+* location pointed to by file_path/file_offset.
+* Note: while marked as optional, this field is in fact required by most major
+* Parquet implementations. As such, writers MUST populate this field.
 **/
 3: optional ColumnMetaData meta_data
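With `file_offset` deprecated, a reader has to find the start of a column chunk from `meta_data` instead. The Python sketch below shows one way that could look. The dataclasses are hypothetical stand-ins for the Thrift structs, and the field names `data_page_offset` and `dictionary_page_offset` are assumed from the `ColumnMetaData` struct, which this hunk does not show. Preferring the dictionary page offset only when it is positive and precedes the first data page is a defensive assumption, not something the diff prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnMetaData:
    # Hypothetical subset of the Thrift struct; offsets are byte positions in the file.
    data_page_offset: int
    dictionary_page_offset: Optional[int] = None


@dataclass
class ColumnChunk:
    meta_data: Optional[ColumnMetaData] = None
    # Deprecated: writers set 0 when no ColumnMetaData is written outside the footer.
    file_offset: int = 0
    file_path: Optional[str] = None


def chunk_start_offset(chunk: ColumnChunk) -> int:
    """Locate the first page of a column chunk without reading file_offset."""
    md = chunk.meta_data
    if md is None:
        # meta_data is optional in Thrift, but the updated comment says
        # writers MUST populate it, so a missing value is treated as an error.
        raise ValueError("ColumnChunk.meta_data is missing")
    dict_off = md.dictionary_page_offset
    # Assumption: a dictionary page, when present, precedes the data pages.
    if dict_off is not None and 0 < dict_off < md.data_page_offset:
        return dict_off
    return md.data_page_offset


# Illustrative values only.
with_dict = ColumnChunk(meta_data=ColumnMetaData(data_page_offset=120, dictionary_page_offset=4))
without_dict = ColumnChunk(meta_data=ColumnMetaData(data_page_offset=300))
starts = (chunk_start_offset(with_dict), chunk_start_offset(without_dict))
```

Note that `file_offset` is never read here, so the result is the same whether a writer left it at 0 or stored one of the inconsistent legacy values.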
