-
Notifications
You must be signed in to change notification settings - Fork 3.9k
ARROW-16773: [Docs][Format] Document Run-Length encoding in Arrow columnar format #13333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
8186f56
70ea2fa
2ef0a24
3486009
7c348ad
cba824e
f1e1a16
e3d3e9b
77bb500
b501674
7211923
86e12d1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -765,6 +765,68 @@ application. | |||||
| We discuss dictionary encoding as it relates to serialization further | ||||||
| below. | ||||||
|
|
||||||
| .. _run-length-encoded-layout: | ||||||
|
|
||||||
| Run-Length Encoded Layout | ||||||
| ------------------------- | ||||||
|
|
||||||
| Run-Length is a data representation that represents data as sequences of the | ||||||
| same value, called runs. Each run is represented as a value, and an integer | ||||||
| describing how often this value is repeated. | ||||||
|
|
||||||
| Any array can be run-length encoded. A run-length encoded array has no buffers | ||||||
| by itself, but has two child arrays. The first one holds a signed integer | ||||||
| called a "run end" for each run. The run ends array can hold either 16, 32, or | ||||||
| 64-bit integers. The actual values of each run are held | ||||||
| the second child array. | ||||||
|
|
||||||
| The values in the first child array represent the length of each run. They do | ||||||
| not hold the length of the respective run directly, but the accumulated length | ||||||
| of all runs from the first to the current one, i.e. the logical index where the | ||||||
| current run ends. This allows relatively efficient random access from a logical | ||||||
| index using binary search. The length of an individual run can be determined by | ||||||
| subtracting two adjacent values. | ||||||
|
|
||||||
| A run must have have a length of at least 1. This means the values in the | ||||||
| run ends array all positive and in strictly ascending order. A run end cannot be | ||||||
| null. | ||||||
alamb marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
|
||||||
| As an example, you could have the following data: :: | ||||||
|
|
||||||
| type: Float32 | ||||||
| [1.0, 1.0, 1.0, 1.0, null, null, 2.0] | ||||||
|
|
||||||
| In Run-length-encoded form, this could appear as: | ||||||
|
|
||||||
| :: | ||||||
|
|
||||||
| * Length: 7, Null count: 2 | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This formulation of null count is a little surprising, it implies that when computing the parent ArrayData from a set of children, it must iterate the null mask in conjunction with the run ends. This is also inconsistent with how null counts work for other nested types. I've asked for clarification of this on the mailing list - https://lists.apache.org/thread/4x14b0h3fcfwzk68jpoq3n5xvr241qz5
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You're right, it should actually say 0, since the array has no validity bitmap (I tried to reply via the reply button in the ml web ui but that does not seem to have come through)
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Is this a general restriction, or is it just in this case there is no validity mask?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, this is a general restriction (at least I think so, and that's how the code works). The idea so far is that Null is just one of the possible values for a run. The is somewhat consistent with how union types work. (if we were to allow the RLE array parent to have an additional null mask, the null count field would represent that - there seems to be a generall assumption in Arrow code that a non-zero (or array length for the NULL) null count means the presence of the standard null mask)
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
We should probably document that if so, personally it seems a shame as it complicates things like nullif kernels which now have to explicitly handle the case of RLE data, whereas normally they don't need concern themselves with anything but the array data.
Yeah, union types are a bit of a special snowflake though 😆
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it is actually documented by saying there are no buffers, but that's a bit indirect, so guess mentioning it explicitly won't hurt
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I submitted a PR for this here, please take a look: |
||||||
| * Children arrays: | ||||||
|
|
||||||
| * run ends (Int32): | ||||||
| * Length: 3, Null count: 0 | ||||||
| * Validity bitmap buffer: Not required | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
See comment above
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I still think it would be best to be explicit here that no validity bitmap is allowed to avoid ambiguity |
||||||
| * Values buffer | ||||||
|
|
||||||
| | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 | | ||||||
| |-------------|-------------|-------------|-----------------------| | ||||||
| | 4 | 6 | 7 | unspecified (padding) | | ||||||
|
|
||||||
| * values (Float32): | ||||||
| * Length: 3, Null count: 1 | ||||||
| * Validity bitmap buffer: | ||||||
|
|
||||||
| | Byte 0 (validity bitmap) | Bytes 1-63 | | ||||||
| |--------------------------|-----------------------| | ||||||
| | 00000101 | 0 (padding) | | ||||||
|
|
||||||
| * Values buffer | ||||||
|
|
||||||
| | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 | | ||||||
| |-------------|-------------|-------------|-----------------------| | ||||||
| | 1.0 | unspecified | 2.0 | unspecified (padding) | | ||||||
|
|
||||||
|
|
||||||
| Buffer Listing for Each Layout | ||||||
| ------------------------------ | ||||||
|
|
||||||
|
|
@@ -784,6 +846,7 @@ of memory buffers for each layout. | |||||
| "Dense Union",type ids,offsets, | ||||||
| "Null",,, | ||||||
| "Dictionary-encoded",validity,data (indices), | ||||||
| "Run-length encoded",,, | ||||||
|
|
||||||
| Logical Types | ||||||
| ============= | ||||||
|
|
||||||
Uh oh!
There was an error while loading. Please reload this page.