diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index 62bb922afdd..73a34e57873 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -765,6 +765,68 @@ application.
 We discuss dictionary encoding as it relates to serialization further
 below.
 
+.. _run-length-encoded-layout:
+
+Run-Length Encoded Layout
+-------------------------
+
+Run-length encoding is a data representation that stores data as sequences
+of the same value, called runs. Each run is represented as a value and an
+integer describing how often this value is repeated.
+
+Any array can be run-length encoded. A run-length encoded array has no buffers
+by itself, but has two child arrays. The first one holds a signed integer
+called a "run end" for each run. The run ends array can hold 16-, 32-, or
+64-bit integers. The actual values of each run are held in the second child
+array.
+
+The values in the first child array encode the length of each run. They do
+not hold the length of the respective run directly, but the accumulated length
+of all runs from the first to the current one, i.e. the logical index where the
+current run ends. This allows random access from a logical index using binary
+search over the run ends. The length of an individual run can be determined by
+subtracting two adjacent values.
+
+A run must have a length of at least 1. This means the values in the run
+ends array are all positive and in strictly ascending order. A run end cannot
+be null.
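The run-end scheme described above can be illustrated with a minimal Python sketch. It is not part of the Arrow specification or of any Arrow library API: the function names are hypothetical, plain Python lists stand in for the two child arrays, and ``None`` stands in for null.

```python
from bisect import bisect_right


def run_length_encode(values):
    """Encode a list as (run_ends, run_values); None stands in for null."""
    run_ends, run_values = [], []
    for i, value in enumerate(values):
        if run_ends and run_values[-1] == value:
            # Same value as the current run: move its end one position further.
            run_ends[-1] = i + 1
        else:
            # New run: its end is the accumulated length so far.
            run_ends.append(i + 1)
            run_values.append(value)
    return run_ends, run_values


def physical_index(run_ends, logical_index):
    """Find the run containing logical_index by binary search on run ends."""
    length = run_ends[-1] if run_ends else 0
    if not 0 <= logical_index < length:
        raise IndexError("logical index out of range")
    # The first run whose end is strictly greater than the logical index.
    return bisect_right(run_ends, logical_index)


def run_lengths(run_ends):
    """Recover individual run lengths by subtracting adjacent run ends."""
    previous_ends = [0] + run_ends[:-1]
    return [end - prev for end, prev in zip(run_ends, previous_ends)]


run_ends, run_values = run_length_encode([1.0, 1.0, 1.0, 1.0, None, None, 2.0])
value_at_4 = run_values[physical_index(run_ends, 4)]
```

Because run ends are strictly ascending, a standard binary search applies directly. The first run has no predecessor, so its length is its run end itself, which the sketch handles by treating the preceding run end as 0.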
+
+As an example, you could have the following data: ::
+
+    type: Float32
+    [1.0, 1.0, 1.0, 1.0, null, null, 2.0]
+
+In run-length encoded form, this could appear as:
+
+::
+
+    * Length: 7, Null count: 2
+    * Children arrays:
+
+      * run ends (Int32):
+        * Length: 3, Null count: 0
+        * Validity bitmap buffer: Not required
+        * Values buffer
+
+          | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-63           |
+          |-------------|-------------|-------------|-----------------------|
+          | 4           | 6           | 7           | unspecified (padding) |
+
+      * values (Float32):
+        * Length: 3, Null count: 1
+        * Validity bitmap buffer:
+
+          | Byte 0 (validity bitmap) | Bytes 1-63            |
+          |--------------------------|-----------------------|
+          | 00000101                 | 0 (padding)           |
+
+        * Values buffer
+
+          | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-63           |
+          |-------------|-------------|-------------|-----------------------|
+          | 1.0         | unspecified | 2.0         | unspecified (padding) |
+
+
 Buffer Listing for Each Layout
 ------------------------------
 
@@ -784,6 +846,7 @@ of memory buffers for each layout.
    "Dense Union",type ids,offsets,
    "Null",,,
    "Dictionary-encoded",validity,data (indices),
+   "Run-length encoded",,,
 
 Logical Types
 =============