Commit e52693d

Nong Li committed: Added thrift struct file format definitions.

1 parent d3c396b commit e52693d

5 files changed
Lines changed: 305 additions & 0 deletions

File tree

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
generated/*
Encodings.txt

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
This file contains the specification of all supported encodings.

Plain:
  - Supported Types: all
    This is the plain encoding that must be supported for all types. It is
    intended to be the simplest encoding. Values are encoded back to back.
    - For native types, this outputs the data as little endian. Floating
      point types are encoded in IEEE format.
    - For the byte array type, it encodes the length as a 4-byte little-endian
      integer, followed by the bytes.
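A minimal sketch of the Plain encoding in Python (the helper names are illustrative and not part of the format; only the byte layout comes from the spec above):

```python
import struct

def plain_encode_int32(values):
    # Each INT32 is 4 bytes, little-endian, written back to back.
    return b"".join(struct.pack("<i", v) for v in values)

def plain_encode_byte_array(values):
    # Each byte array is a 4-byte little-endian length, then the bytes.
    return b"".join(struct.pack("<I", len(v)) + v for v in values)
```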
GroupVarInt:
  - Supported Types: INT32, INT64
    32-bit ints are encoded in groups of 4, with 1 leading byte to encode the
    byte lengths of the following 4 ints. 64-bit ints are encoded in groups of
    5, with 2 leading bytes to encode the byte lengths of the 5 ints.

    For 32-bit ints, the leading byte contains 2 bits per int. Each length
    encoding specifies the number of bytes minus 1 for that int. For example,
    a byte value of 0b00101101 indicates that:
      the first int has 1 byte (0b00 + 1),
      the second int has 3 bytes (0b10 + 1),
      the third int has 4 bytes (0b11 + 1), and
      the 4th int has 2 bytes (0b01 + 1).

    In this case, the entire group would be: 1 + (1 + 3 + 4 + 2) = 11 bytes.
    The bytes that follow the leading byte are just the int data encoded in
    little endian. With this example:
      the first int starts at byte offset 1 with a max value of 0xFF,
      the second int starts at byte offset 2 with a max value of 0xFFFFFF,
      the third int starts at byte offset 5 with a max value of 0xFFFFFFFF, and
      the 4th int starts at byte offset 9 with a max value of 0xFFFF.

    For 64-bit ints, the length of each of the 5 ints is encoded as 3 bits.
    Combined, this uses 15 bits and fits in 2 bytes. The msb of the two bytes
    is unused. Like the 32-bit case, after the length bytes, the data bytes
    follow.

    In the case where the data does not make a complete group (e.g. 3 32-bit
    ints), a complete group should still be output, with 0's filling in the
    remainder. For example, if the input was (1,2,3,4,5), the resulting
    encoding should behave as if the input was (1,2,3,4,5,0,0,0) and the two
    groups should be encoded back to back.
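The 32-bit scheme above can be sketched as follows. The helper names are hypothetical; the msb-first placement of the 2-bit length codes is inferred from the 0b00101101 example in the spec:

```python
def encode_group_varint32(values):
    """Encode exactly 4 unsigned 32-bit ints as one GroupVarInt group."""
    assert len(values) == 4
    tag = 0
    data = bytearray()
    for v in values:
        nbytes = max(1, (v.bit_length() + 7) // 8)  # 1..4 bytes per int
        tag = (tag << 2) | (nbytes - 1)             # 2-bit length code, msb first
        data += v.to_bytes(nbytes, "little")
    return bytes([tag]) + bytes(data)

def decode_group_varint32(buf):
    """Decode one group; returns (4 values, bytes consumed)."""
    tag, pos, values = buf[0], 1, []
    for shift in (6, 4, 2, 0):                      # read length codes msb first
        nbytes = ((tag >> shift) & 0b11) + 1
        values.append(int.from_bytes(buf[pos:pos + nbytes], "little"))
        pos += nbytes
    return values, pos
```

With the spec's example lengths, the group for (0xFF, 0xFFFFFF, 0xFFFFFFFF, 0xFFFF) is 11 bytes and its leading byte is 0b00101101.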

Makefile

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
thrift:
	thrift --gen cpp -o generated src/thrift/redfile.thrift
	thrift --gen java -o generated src/thrift/redfile.thrift

README.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
redfile
======

## Glossary
- Block (hdfs block): This means a block in hdfs and the meaning is
  unchanged for describing this file format. The file format is
  designed to work well on top of hdfs.

- File: An hdfs file that must include the metadata for the file.
  It does not need to actually contain the data.

- Row group: A logical horizontal partitioning of the data into rows.
  There is no physical structure that is guaranteed for a row group.
  A row group consists of a column chunk for each column in the dataset.

- Column chunk: A chunk of the data for a particular column. Column chunks
  live in a particular row group and are guaranteed to be contiguous in the
  file.

- Page: Column chunks are divided up into pages. A page is conceptually
  an indivisible unit (in terms of compression and encoding). There can
  be multiple page types, which are interleaved in a column chunk.

Hierarchically, a file consists of one or more row groups. A row group
contains exactly one column chunk per column. Column chunks contain one or
more pages.

## Unit of parallelization
- MapReduce - File/Row Group
- IO - Column chunk
- Encoding/Compression - Page

## File format
This file and the thrift definition should be read together to understand the
format.

    4-byte magic number "RED1"
    <Column 1 Chunk 1 + Column Metadata>
    <Column 2 Chunk 1 + Column Metadata>
    ...
    <Column N Chunk 1 + Column Metadata>
    <Column 1 Chunk 2 + Column Metadata>
    <Column 2 Chunk 2 + Column Metadata>
    ...
    <Column N Chunk 2 + Column Metadata>
    ...
    <Column 1 Chunk M + Column Metadata>
    <Column 2 Chunk M + Column Metadata>
    ...
    <Column N Chunk M + Column Metadata>
    File Metadata
    4-byte offset from end of file to start of file metadata
    4-byte magic number "RED1"

In the above example, there are N columns in this table, split into M row
groups. The file metadata contains the locations of all the column metadata
start locations. More details on what is contained in the metadata can be
found in the thrift files.

Metadata is written after the data to allow for single-pass writing.

Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The column chunks should then be read
sequentially.
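A sketch of how a reader could locate the file metadata from the trailing footer described above. The function name is illustrative, and little-endian byte order for the trailing offset is an assumption (the layout text does not state it):

```python
import os
import struct

def locate_file_metadata(path):
    """Find the byte range of the file metadata in a redfile.

    Assumes the trailing layout above: FileMetaData, then a 4-byte offset
    (from the end of the file to the start of the file metadata), then the
    4-byte magic "RED1".
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(size - 8)                 # last 8 bytes: offset + magic
        offset = struct.unpack("<I", f.read(4))[0]
        if f.read(4) != b"RED1":
            raise ValueError("bad trailing magic; not a redfile")
    meta_start = size - offset
    meta_len = (size - 8) - meta_start   # metadata ends where the footer begins
    return meta_start, meta_len
```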
## Column chunks
Column chunks are composed of pages written back to back. The pages share a
common header, and readers can skip over pages they are not interested in.
The data for the page follows the header and can be compressed and/or
encoded. The compression and encoding are specified in the metadata.

## Checksumming
Data pages are individually checksummed. This allows checksums to be disabled
at the HDFS file level, to better support single-row lookups.
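A sketch of per-page checksum verification. The format only says "32bit crc", so standard CRC-32 (`zlib.crc32`) is an assumption here, as are the helper names:

```python
import zlib

def page_crc(page_bytes):
    # 32-bit CRC over the page data; the polynomial is not specified by
    # the format, so standard CRC-32 is assumed.
    return zlib.crc32(page_bytes) & 0xFFFFFFFF

def verify_page(page_bytes, stored_crc):
    # A page that fails verification can be skipped on its own, instead
    # of failing the whole column chunk.
    return page_crc(page_bytes) == stored_crc
```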
## Error recovery
66+
If the file metadata is corrupt, the file is lost. If the column metdata is corrupt, that column chunk is lost (but column chunks for this column in order row groups are okay). If a page header is corrupt, the remaining pages in that chunk are lost. If the data within a page is corrupt, that page is lost. The file will be more resilient to corruption with smaller row groups.
67+
68+
Potential extension: With smaller row groups, the biggest issue is lowing the file metadata at the end. If this happens in the write path, all the data written will be unreadable. This can be fixed by writing the file metadata every Nth row group. Each file metadata would be cumulative and include all the row groups written so far. Combining this with the strategy used for rc or avro files using sync markers, a reader could recovery partially written files.
69+
70+
## Configurations
71+
- Row group size: Larger row groups allow for larger column chunks which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two pass write). We recommend large row groups (512GB - 1GB). Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file.
72+
- Data page size: Data pages should be considered indivisible so smaller data pages allow for more fine grained reading (e.g. single row lookup). Larger page sizes incur less space overhead (less page headers) and potentially less parsing overhead (processing headers). Note: for sequential scans, it is not expected to read a page at a time; this is not the IO chunk. We recommend 8KB for page sizes.
73+
74+
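The trade-off in the recommendations above can be made concrete with some rough arithmetic (the helper is purely illustrative and assumes equally sized column chunks, which real data rarely has):

```python
def pages_per_column_chunk(row_group_bytes, num_cols, page_bytes=8 * 1024):
    # With the recommended 8KB pages, each column chunk in a row group
    # holds roughly chunk_size / page_size pages, each with its own header.
    chunk_bytes = row_group_bytes // num_cols
    return chunk_bytes // page_bytes
```

For a 1GB row group with a single column, that is on the order of 131072 page headers; with 16 columns, about 8192 per chunk.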
## Extensibility
There are many places in the format for compatible extensions:
- File Version: The file metadata contains a version.
- Encodings: Encodings are specified by enum and more can be added in the
  future.
- Page types: Additional page types can be added and safely skipped.

src/thrift/redfile.thrift

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * File format description for the redfile file format
 */
namespace cpp redfile
namespace java com.apache.redfile

/**
 * Types supported by redfile. These types are intended to be for the storage
 * format, and in particular how they interact with different encodings.
 */
enum Type {
  BOOLEAN = 0;
  INT32 = 1;
  INT64 = 2;
  INT96 = 3;
  FLOAT = 4;
  DOUBLE = 5;
  BYTE_ARRAY = 6;
}

/**
 * Encodings supported by redfile. Not all encodings are valid for all types.
 */
enum Encoding {
  /** Default encoding.
   * BOOLEAN - 1 bit per value.
   * INT32 - 4 bytes per value. Stored as little-endian.
   * INT64 - 8 bytes per value. Stored as little-endian.
   * FLOAT - 4 bytes per value. IEEE. Stored as little-endian.
   * DOUBLE - 8 bytes per value. IEEE. Stored as little-endian.
   * BYTE_ARRAY - 4-byte length stored as little-endian, followed by the bytes.
   */
  PLAIN = 0;

  /** Group VarInt encoding for INT32/INT64. **/
  GROUP_VAR_INT = 1;
}

/**
 * Supported compression algorithms.
 */
enum Compression {
  UNCOMPRESSED = 0;
  SNAPPY = 1;
  GZIP = 2;
  LZO = 3;
}

enum PageType {
  DATA_PAGE = 0;
  INDEX_PAGE = 1;
}

/** Data page header **/
struct DataPageHeader {
  /** Number of values in this data page **/
  1: required i32 num_values

  /** Encoding used for this data page **/
  2: required Encoding encoding
}

struct IndexPageHeader {
  /** TODO: **/
}

struct PageHeader {
  /** Type of this page **/
  1: required PageType type

  /** Uncompressed page size in bytes **/
  2: required i32 uncompressed_page_size

  /** Compressed page size in bytes **/
  3: required i32 compressed_page_size

  /** 32-bit crc for the data below. This allows checksumming to be skipped
   * if only a few pages need to be read.
   **/
  4: required i32 crc

  5: optional DataPageHeader data_page;
  6: optional IndexPageHeader index_page;
}

/**
 * Wrapper struct to store key/value pairs
 */
struct KeyValue {
  1: required string key
  2: optional string value
}

/**
 * Description for column metadata
 */
struct ColumnMetaData {
  /** Type of this column **/
  1: required Type type

  /** Set of all encodings used for this column **/
  2: required list<Encoding> encodings

  /** Path in schema **/
  3: required list<string> path_in_schema

  /** Compression codec **/
  4: required Compression codec

  /** Number of values in this column **/
  5: required i64 num_values

  /** Max definition and repetition levels **/
  6: required i32 max_definition_level
  7: required i32 max_repetition_level

  /** Byte offset from beginning of file to first data page **/
  8: optional i64 data_page_offset

  /** Byte offset from beginning of file to root index page **/
  9: optional i64 index_page_offset

  /** Optional key/value metadata **/
  10: list<KeyValue> key_value_metadata
}

struct ColumnStart {
  /** File where column data is stored. If not set, assumed to be the same
   * file as the metadata.
   **/
  1: optional string file_path

  /** Byte offset in file_path to the ColumnMetaData **/
  2: required i64 file_offset
}

struct RowGroup {
  1: required list<ColumnStart> columns

  /** Total byte size of all the uncompressed column data in this row group **/
  2: required i64 total_byte_size
}

/**
 * Description for file metadata
 */
struct FileMetaData {
  /** Version of this file **/
  1: required i32 version

  /** Number of rows in this file **/
  2: required i64 num_rows

  /** Number of cols in the schema for this file **/
  3: required i32 num_cols

  /** Row groups in this file **/
  4: list<RowGroup> row_groups

  /** Optional key/value metadata **/
  5: list<KeyValue> key_value_metadata

  /** 32-bit crc for the file metadata **/
  6: optional i32 meta_data_crc
}
