Skip to content
61 changes: 60 additions & 1 deletion docs/source/format/Columnar.rst
Original file line number Diff line number Diff line change
Expand Up @@ -765,6 +765,65 @@ application.
We discuss dictionary encoding as it relates to serialization further
below.

.. _run-length-encoded-layout:

Run-Length-encoded Layout
-------------------------

Run-Length is a data representation that represents data as sequences of the
same value, called runs. Each run is represented as a value, and an integer
describing how often this value is repeated.

Any array can be run-length-encoded. A run-length encoded array has a single
buffer holding as many signed 32-bit integers, as there are runs. The actual
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually not true if we use 32 bits. An array whose length is larger than a 32 bit integer cannot be represented

values are held in a child array, which is just a regular array.

The values in the parent array buffer represent the length of each run. They do
not hold the length of the respective run directly, but the accumulated length
of all runs from the first to the current one. This allows relatively efficient
random access from a logical index using binary search. The length of an
individual run can be determined by subtracting two adjacent values.

A run has to have a length of at least 1. This means the values in the
accumulated run lengths buffer are all positive and in strictly ascending
order.

An accumulated run length cannot be null, therefore the parent array has no
validity buffer.

As an example, you could have the following data: ::

type: Float32
[1.0, 1.0, 1.0, 1.0, null, null, 2.0]

In Run-length-encoded form, this could appear as:

::

* Length: 3, Null count: 2
* Accumulated run lengths buffer:

| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 |
|-------------|-------------|-------------|-----------------------|
| 4 | 6 | 7 | unspecified (padding) |

* Children arrays:

* values (Float32):
* Length: 3, Null count: 1
* Validity bitmap buffer:

| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00000101 | 0 (padding) |

* Values buffer

| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 |
|-------------|-------------|-------------|-----------------------|
| 1.0 | unspecified | 2.0 | unspecified (padding) |


Buffer Listing for Each Layout
------------------------------

Expand Down Expand Up @@ -957,7 +1016,7 @@ The ``Buffer`` Flatbuffers value describes the location and size of a
piece of memory. Generally these are interpreted relative to the
**encapsulated message format** defined below.

The ``size`` field of ``Buffer`` is not required to account for padding
The ``size`` field of ``Buffer`` is not required to account for paddingeng-career-mgmt
bytes. Since this metadata can be used to communicate in-memory pointer
addresses between libraries, it is recommended to set ``size`` to the actual
memory size rather than the padded size.
Expand Down