Commit e52693d

Nong Li committed: Added thrift struct file format definitions.

1 parent d3c396b commit e52693d

5 files changed
Lines changed: 305 additions & 0 deletions

File tree

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
generated/*
Encodings.txt

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
This file contains the specification of all supported encodings.

Plain:
  - Supported Types: all
    This is the plain encoding that must be supported for all types. It is
    intended to be the simplest encoding. Values are encoded back to back.
    - For native types, this outputs the data as little endian. Floating
      point types are encoded in IEEE format.
    - For the byte array type, it encodes the length as a 4-byte little-endian
      integer, followed by the bytes.
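A minimal sketch of the Plain encoding in Python (the helper names are illustrative and not part of the format; only the byte layout comes from the spec above):

```python
import struct

def plain_encode_int32(values):
    # Each INT32 is 4 bytes, little-endian, written back to back.
    return b"".join(struct.pack("<i", v) for v in values)

def plain_encode_byte_array(values):
    # Each byte array is a 4-byte little-endian length, then the bytes.
    return b"".join(struct.pack("<I", len(v)) + v for v in values)
```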
GroupVarInt:
  - Supported Types: INT32, INT64
    32-bit ints are encoded in groups of 4, with 1 leading byte to encode the
    byte lengths of the following 4 ints. 64-bit ints are encoded in groups of
    5, with 2 leading bytes to encode the byte lengths of the 5 ints.

    For 32-bit ints, the leading byte contains 2 bits per int. Each length
    encoding specifies the number of bytes minus 1 for that int. For example,
    a byte value of 0b00101101 indicates that:
      the first int has 1 byte (0b00 + 1),
      the second int has 3 bytes (0b10 + 1),
      the third int has 4 bytes (0b11 + 1), and
      the 4th int has 2 bytes (0b01 + 1).

    In this case, the entire group would be: 1 + (1 + 3 + 4 + 2) = 11 bytes.
    The bytes that follow the leading byte are just the int data encoded in
    little endian. With this example:
      the first int starts at byte offset 1 with a max value of 0xFF,
      the second int starts at byte offset 2 with a max value of 0xFFFFFF,
      the third int starts at byte offset 5 with a max value of 0xFFFFFFFF, and
      the 4th int starts at byte offset 9 with a max value of 0xFFFF.

    For 64-bit ints, the length of each of the 5 ints is encoded as 3 bits.
    Combined, this uses 15 bits and fits in 2 bytes. The msb of the two bytes
    is unused. Like the 32-bit case, after the length bytes, the data bytes
    follow.

    In the case where the data does not make a complete group (e.g. 3 32-bit
    ints), a complete group should still be output, with 0's filling in the
    remainder. For example, if the input was (1,2,3,4,5), the resulting
    encoding should behave as if the input was (1,2,3,4,5,0,0,0) and the two
    groups should be encoded back to back.
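The 32-bit scheme above can be sketched as follows. The helper names are hypothetical; the msb-first placement of the 2-bit length codes is inferred from the 0b00101101 example in the spec:

```python
def encode_group_varint32(values):
    """Encode exactly 4 unsigned 32-bit ints as one GroupVarInt group."""
    assert len(values) == 4
    tag = 0
    data = bytearray()
    for v in values:
        nbytes = max(1, (v.bit_length() + 7) // 8)  # 1..4 bytes per int
        tag = (tag << 2) | (nbytes - 1)             # 2-bit length code, msb first
        data += v.to_bytes(nbytes, "little")
    return bytes([tag]) + bytes(data)

def decode_group_varint32(buf):
    """Decode one group; returns (4 values, bytes consumed)."""
    tag, pos, values = buf[0], 1, []
    for shift in (6, 4, 2, 0):                      # read length codes msb first
        nbytes = ((tag >> shift) & 0b11) + 1
        values.append(int.from_bytes(buf[pos:pos + nbytes], "little"))
        pos += nbytes
    return values, pos
```

With the spec's example lengths, the group for (0xFF, 0xFFFFFF, 0xFFFFFFFF, 0xFFFF) is 11 bytes and its leading byte is 0b00101101.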

Makefile

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
thrift:
	thrift --gen cpp -o generated src/thrift/redfile.thrift
	thrift --gen java -o generated src/thrift/redfile.thrift

README.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
redfile
======

## Glossary
- Block (hdfs block): This means a block in hdfs and the meaning is
  unchanged for describing this file format. The file format is
  designed to work well on top of hdfs.

- File: An hdfs file that must include the metadata for the file.
  It does not need to actually contain the data.

- Row group: A logical horizontal partitioning of the data into rows.
  There is no physical structure that is guaranteed for a row group.
  A row group consists of a column chunk for each column in the dataset.

- Column chunk: A chunk of the data for a particular column. Column chunks
  live in a particular row group and are guaranteed to be contiguous in the
  file.

- Page: Column chunks are divided up into pages. A page is conceptually
  an indivisible unit (in terms of compression and encoding). There can
  be multiple page types, which are interleaved in a column chunk.

Hierarchically, a file consists of one or more row groups. A row group
contains exactly one column chunk per column. Column chunks contain one or
more pages.

## Unit of parallelization
- MapReduce - File/Row Group
- IO - Column chunk
- Encoding/Compression - Page

## File format
This file and the thrift definition should be read together to understand the
format.

    4-byte magic number "RED1"
    <Column 1 Chunk 1 + Column Metadata>
    <Column 2 Chunk 1 + Column Metadata>
    ...
    <Column N Chunk 1 + Column Metadata>
    <Column 1 Chunk 2 + Column Metadata>
    <Column 2 Chunk 2 + Column Metadata>
    ...
    <Column N Chunk 2 + Column Metadata>
    ...
    <Column 1 Chunk M + Column Metadata>
    <Column 2 Chunk M + Column Metadata>
    ...
    <Column N Chunk M + Column Metadata>
    File Metadata
    4-byte offset from end of file to start of file metadata
    4-byte magic number "RED1"

In the above example, there are N columns in this table, split into M row
groups. The file metadata contains the locations of all the column metadata
start locations. More details on what is contained in the metadata can be
found in the thrift files.

Metadata is written after the data to allow for single-pass writing.

Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The column chunks should then be read
sequentially.
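A sketch of how a reader could locate the file metadata from the trailing footer described above. The function name is illustrative, and little-endian byte order for the trailing offset is an assumption (the layout text does not state it):

```python
import os
import struct

def locate_file_metadata(path):
    """Find the byte range of the file metadata in a redfile.

    Assumes the trailing layout above: FileMetaData, then a 4-byte offset
    (from the end of the file to the start of the file metadata), then the
    4-byte magic "RED1".
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(size - 8)                 # last 8 bytes: offset + magic
        offset = struct.unpack("<I", f.read(4))[0]
        if f.read(4) != b"RED1":
            raise ValueError("bad trailing magic; not a redfile")
    meta_start = size - offset
    meta_len = (size - 8) - meta_start   # metadata ends where the footer begins
    return meta_start, meta_len
```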
## Column chunks
Column chunks are composed of pages written back to back. The pages share a
common header, and readers can skip over pages they are not interested in.
The data for the page follows the header and can be compressed and/or
encoded. The compression and encoding are specified in the metadata.

## Checksumming
Data pages are individually checksummed. This allows checksums to be disabled
at the HDFS file level, to better support single-row lookups.
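A sketch of per-page checksum verification. The format only says "32bit crc", so standard CRC-32 (`zlib.crc32`) is an assumption here, as are the helper names:

```python
import zlib

def page_crc(page_bytes):
    # 32-bit CRC over the page data; the polynomial is not specified by
    # the format, so standard CRC-32 is assumed.
    return zlib.crc32(page_bytes) & 0xFFFFFFFF

def verify_page(page_bytes, stored_crc):
    # A page that fails verification can be skipped on its own, instead
    # of failing the whole column chunk.
    return page_crc(page_bytes) == stored_crc
```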
## Error recovery
66+
If the file metadata is corrupt, the file is lost. If the column metdata is corrupt, that column chunk is lost (but column chunks for this column in order row groups are okay). If a page header is corrupt, the remaining pages in that chunk are lost. If the data within a page is corrupt, that page is lost. The file will be more resilient to corruption with smaller row groups.
67+
68+
Potential extension: With smaller row groups, the biggest issue is lowing the file metadata at the end. If this happens in the write path, all the data written will be unreadable. This can be fixed by writing the file metadata every Nth row group. Each file metadata would be cumulative and include all the row groups written so far. Combining this with the strategy used for rc or avro files using sync markers, a reader could recovery partially written files.
69+
70+
## Configurations
71+
- Row group size: Larger row groups allow for larger column chunks which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two pass write). We recommend large row groups (512GB - 1GB). Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file.
72+
- Data page size: Data pages should be considered indivisible so smaller data pages allow for more fine grained reading (e.g. single row lookup). Larger page sizes incur less space overhead (less page headers) and potentially less parsing overhead (processing headers). Note: for sequential scans, it is not expected to read a page at a time; this is not the IO chunk. We recommend 8KB for page sizes.
73+
74+
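The trade-off in the recommendations above can be made concrete with some rough arithmetic (the helper is purely illustrative and assumes equally sized column chunks, which real data rarely has):

```python
def pages_per_column_chunk(row_group_bytes, num_cols, page_bytes=8 * 1024):
    # With the recommended 8KB pages, each column chunk in a row group
    # holds roughly chunk_size / page_size pages, each with its own header.
    chunk_bytes = row_group_bytes // num_cols
    return chunk_bytes // page_bytes
```

For a 1GB row group with a single column, that is on the order of 131072 page headers; with 16 columns, about 8192 per chunk.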
## Extensibility
There are many places in the format for compatible extensions:
- File Version: The file metadata contains a version.
- Encodings: Encodings are specified by enum and more can be added in the
  future.
- Page types: Additional page types can be added and safely skipped.

src/thrift/redfile.thrift

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/**
 * File format description for the redfile file format
 */
namespace cpp redfile
namespace java com.apache.redfile

/**
 * Types supported by redfile. These types are intended to be for the storage
 * format, and in particular how they interact with different encodings.
 */
enum Type {
  BOOLEAN = 0;
  INT32 = 1;
  INT64 = 2;
  INT96 = 3;
  FLOAT = 4;
  DOUBLE = 5;
  BYTE_ARRAY = 6;
}

/**
 * Encodings supported by redfile. Not all encodings are valid for all types.
 */
enum Encoding {
  /** Default encoding.
   * BOOLEAN - 1 bit per value.
   * INT32 - 4 bytes per value. Stored as little-endian.
   * INT64 - 8 bytes per value. Stored as little-endian.
   * FLOAT - 4 bytes per value. IEEE. Stored as little-endian.
   * DOUBLE - 8 bytes per value. IEEE. Stored as little-endian.
   * BYTE_ARRAY - 4-byte length stored as little-endian, followed by the bytes.
   */
  PLAIN = 0;

  /** Group VarInt encoding for INT32/INT64. **/
  GROUP_VAR_INT = 1;
}

/**
 * Supported compression algorithms.
 */
enum Compression {
  UNCOMPRESSED = 0;
  SNAPPY = 1;
  GZIP = 2;
  LZO = 3;
}

enum PageType {
  DATA_PAGE = 0;
  INDEX_PAGE = 1;
}

/** Data page header **/
struct DataPageHeader {
  /** Number of values in this data page **/
  1: required i32 num_values

  /** Encoding used for this data page **/
  2: required Encoding encoding
}

struct IndexPageHeader {
  /** TODO: **/
}

struct PageHeader {
  /** Type of this page **/
  1: required PageType type

  /** Uncompressed page size in bytes **/
  2: required i32 uncompressed_page_size

  /** Compressed page size in bytes **/
  3: required i32 compressed_page_size

  /** 32-bit crc for the data below. This allows checksumming to be skipped
   * if only a few pages need to be read.
   **/
  4: required i32 crc

  5: optional DataPageHeader data_page;
  6: optional IndexPageHeader index_page;
}

/**
 * Wrapper struct to store key/value pairs
 */
struct KeyValue {
  1: required string key
  2: optional string value
}

/**
 * Description for column metadata
 */
struct ColumnMetaData {
  /** Type of this column **/
  1: required Type type

  /** Set of all encodings used for this column **/
  2: required list<Encoding> encodings

  /** Path in schema **/
  3: required list<string> path_in_schema

  /** Compression codec **/
  4: required Compression codec

  /** Number of values in this column **/
  5: required i64 num_values

  /** Max definition and repetition levels **/
  6: required i32 max_definition_level
  7: required i32 max_repetition_level

  /** Byte offset from beginning of file to first data page **/
  8: optional i64 data_page_offset

  /** Byte offset from beginning of file to root index page **/
  9: optional i64 index_page_offset

  /** Optional key/value metadata **/
  10: list<KeyValue> key_value_metadata
}

struct ColumnStart {
  /** File where column data is stored. If not set, assumed to be the same
   * file as the metadata.
   **/
  1: optional string file_path

  /** Byte offset in file_path to the ColumnMetaData **/
  2: required i64 file_offset
}

struct RowGroup {
  1: required list<ColumnStart> columns

  /** Total byte size of all the uncompressed column data in this row group **/
  2: required i64 total_byte_size
}

/**
 * Description for file metadata
 */
struct FileMetaData {
  /** Version of this file **/
  1: required i32 version

  /** Number of rows in this file **/
  2: required i64 num_rows

  /** Number of cols in the schema for this file **/
  3: required i32 num_cols

  /** Row groups in this file **/
  4: list<RowGroup> row_groups

  /** Optional key/value metadata **/
  5: list<KeyValue> key_value_metadata

  /** 32-bit crc for the file metadata **/
  6: optional i32 meta_data_crc
}
