141 changes: 104 additions & 37 deletions doc/modules/ROOT/pages/exports.adoc
@@ -141,7 +141,11 @@ As far as we know, the remainder of the first page after the table pointers is u
The table header is followed by the table pages themselves.
These each have the size specified by __len_page__ in the above diagram, and the following structure:

.Table page.
==== Table Page Header

All pages, regardless of type, begin with a common header structure.

.Table Page Header
[bytefield]
----
(draw-column-headers)
@@ -157,7 +161,37 @@ These each have the size specified by __len_page__ in the above diagram, and the
(draw-box (text "p" :math [:sub "f"]))
(draw-box (text "free" :math [:sub "s"]) {:span 2})
(draw-box (text "used" :math [:sub "s"]) {:span 2})
----

The first four bytes of a table page always seem to be zero.
This is followed by a four-byte value _page_index_ which identifies the index of this page within the list of table pages (the header has index 0, the first actual data page the index 1, and so on).
This value seems to be redundant, because it can be calculated by dividing the offset of the
start of the page by _len_page_, but perhaps it serves as a sanity check.

This is followed by another four-byte value, _type_, which identifies the type of the page, using the values shown in the <<table-types,preceding table>>.
This again seems redundant because the table header which was followed to reach this page also identified the table type, but perhaps it is another sanity check, or an alternate way to tell, when following page links, that you have reached the end of the table you are interested in.
Speaking of which, the next four-byte value, __next_page__, is that link: it identifies the index at which the next page of this table can be found, as long as we have not already reached the final page of the table, as described in <<file-header>>.

The exact meaning of _unknown~1~_ is unclear. Mr. Lesniak said “sequence number (0→1: 8→13, 1→2: 22, 2→3: 27)” but I don’t know how to interpret that.
Even less is known about _unknown~2~_.
But __num_rows_small__ at byte `18` within the page (abbreviated _n~rs~_ in the byte field diagram above) holds the number of rows that are present in the page, unless a corresponding value in the page-type-specific header (__num_rows_large__ in Data Pages) is larger.

The purpose of the next two bytes is also unclear.
Of _u~3~_ Mr. Lesniak said “a bitmask (first track: 32)”, and he described _u~4~_ as “often 0, sometimes larger, especially for pages with a high number of rows (e.g. 12 for 101 rows)”.

Byte{nbsp}``1b`` is called __page_flags__ (abbreviated _p~f~_ in the diagram). It is used to differentiate between Data pages and Index pages.
Data pages, which contain table rows, have flag values of `24` or `34`. Meanwhile Index pages, which contain pointers to other pages, have the flag value `64`.
According to previous analysis, there are also "strange" pages with the flag value `44`, which are not yet understood. I have not found pages with this flag value in any of the analyses I have done myself.
Bytes{nbsp}``1c``-`1d` are called __free_size__ (abbreviated _free~s~_ in the diagram), and store the amount of unused space in the page heap; __used_size__ at bytes{nbsp}``1e``-`1f` (abbreviated _used~s~_) stores the number of bytes that are in use in the page heap. For index pages, these are both zero.
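
As a rough illustration of the layout described above, the common header can be unpacked with a fixed-size structure. This is only a sketch under stated assumptions: the names `PageHeader`, `parse_page_header` and `page_kind` are hypothetical, the eight bytes before __num_rows_small__ are assumed to split into two 32-bit unknowns, and little-endian byte order is inferred from the way the magic value `03ec` appears as the bytes `ec 03` in the index page diagram.

```python
import struct
from collections import namedtuple

# Field order follows the common header description. Splitting the eight
# bytes before num_rows_small into two 32-bit unknowns is an assumption.
PageHeader = namedtuple(
    "PageHeader",
    "zero page_index page_type next_page unknown1 unknown2 "
    "num_rows_small u3 u4 page_flags free_size used_size",
)
HEADER = struct.Struct("<6I4B2H")  # little-endian, 0x20 bytes in total


def parse_page_header(page: bytes) -> PageHeader:
    """Unpack the 0x20-byte header found at the start of every page."""
    return PageHeader._make(HEADER.unpack_from(page, 0))


def page_kind(page_flags: int) -> str:
    """Classify a page by the page_flags values listed in the text."""
    if page_flags in (0x24, 0x34):
        return "data"
    if page_flags == 0x64:
        return "index"
    return "unknown"  # includes the not-yet-understood 0x44 pages
```

A reader could additionally compare `page_index` against the page's file offset divided by _len_page_, which is the sanity check suggested earlier.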

==== Data Pages

Data pages follow the common header with a data-specific header and a heap for row data.

.Data Page Details
[bytefield]
----
(draw-column-headers)
(draw-box (text "u" :math [:sub "5"]) {:span 2})
(draw-box (text "num" :math [:sub "rl"]) {:span 2})
(draw-box (text "u" :math [:sub "6"]) {:span 2})
@@ -179,50 +213,22 @@ These each have the size specified by __len_page__ in the above diagram, and the
(draw-box (text "pad" :math [:sub "0"]) {:span 2})
----

Data pages all seem to have the header structure described here, but not all of them actually store data.
Some of them are “strange” and we have not yet figured out why.
The discussion below describes how to recognize a strange page, and avoid trying to read it as a data page.

The first four bytes of a table page always seem to be zero.
This is followed by a four-byte value _page_index_ which identifies the index of this page within the list of table pages (the header has index 0, the first actual data page the index 1, and so on).
This value seems to be redundant, because it can be calculated by dividing the offset of the
start of the page by _len_page_, but perhaps it serves as a sanity check.

This is followed by another four-byte value, _type_, which identifies the type of the page, using the values shown in the <<table-types,preceding table>>.
This again seems redundant because the table header which was followed to reach this page also identified the table type, but perhaps it is another sanity check, or an alternate way to tell, when following page links, that you have reached the end of the table you are interested in.
Speaking of which, the next four-byte value, __next_page__, is that link: it identifies the index at which the next page of this table can be found, as long as we have not already reached the final page of the table, as described in <<file-header>>.

The exact meaning of _unknown~1~_ is unclear. Mr. Lesniak said “sequence number (0→1: 8→13, 1→2: 22, 2→3: 27)” but I don’t know how to interpret that.
Even less is known about _unknown~2~_.
But __num_rows_small__ at byte `18` within the page (abbreviated _n~rs~_ in the byte field diagram above) holds the number of rows that are present in the page, unless __num_rows_large__ (below) holds a value that is larger than it (but not equal to `1fff`).
This seems like a strange mechanism for dealing with the fact that some tables (like playlist entries) have a lot of very small rows, too many to count with a single byte.
But then why not just always use __num_rows_large__?

NOTE: The row counter entries represent the number of rows that have ever been allocated in the page, but some will no longer be valid due to deletion or updates.
To find the actual rows, you need to scan all 16 entries of each of the row groups present in the page, ignoring any whose <<#row-presence-bits,row presence bit>> is zero.

The purpose of the next two bytes is also unclear.
Of _u~3~_ Mr. Lesniak said “a bitmask (first track: 32)”, and he described _u~4~_ as “often 0, sometimes larger, especially for pages with a high number of rows (e.g. 12 for 101 rows)”.

Byte{nbsp}``1b`` is called __page_flags__ (abbreviated _p~f~_ in the diagram).
According to Mr. Lesniak, “strange” (non-data) pages will have the value `44` or `64`, and other pages have had the values `24` or `34`.
Crate Digger considers a page to be a data page if __page_flags__&``40``{nbsp}={nbsp}`0`.
Bytes{nbsp}``0``-`1`, _u~5~_ , are of unclear purpose. Mr. Lesniak labeled them “(0→1: 2).”

Bytes{nbsp}``1c``-`1d` are called __free_size__ (abbreviated _free~s~_ in the diagram), and store the amount of unused space in the page heap (excluding the row index which is built backwards from the end of the page); __used_size__ at bytes{nbsp}``1e``-`1f` (abbreviated _used~s~_) stores the number of bytes that are in use in the page heap.
Bytes{nbsp}``2``-`3`, __num_rows_large__ (abbreviated _num~rl~_ in the diagram) hold the number of entries in the row index at the end of the page when that value is too large to fit into __num_rows_small__ (as mentioned above), and that situation seems to be indicated when this value is larger than __num_rows_small__, but not equal to `1fff`.
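
The rule for choosing between the two row counters can be written as a small helper. This is a minimal sketch of the rule as stated in the text, not a confirmed description of how the original software decides; the function name `row_index_entries` is hypothetical.

```python
def row_index_entries(num_rows_small: int, num_rows_large: int) -> int:
    """Pick the row counter that applies to a data page.

    num_rows_large wins only when it exceeds num_rows_small and is not
    the value 0x1fff, as described in the text.
    """
    if num_rows_large > num_rows_small and num_rows_large != 0x1FFF:
        return num_rows_large
    return num_rows_small
```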

Bytes{nbsp}``20``-`21`, _u~5~_ , are of unclear purpose. Mr. Lesniak labeled them “(0→1: 2).”

Bytes{nbsp}``22``-`23`, __num_rows_large__ (abbreviated _num~rl~_ in the diagram) hold the number of entries in the row index at the end of the page when that value is too large to fit into __num_rows_small__ (as mentioned above), and that situation seems to be indicated when this value is larger than __num_rows_small__, but not equal to `1fff`.

_u~6~_ at bytes{nbsp}``24``-`25` seems to have the value `1004` for strange pages, and `0000` for data pages.
And Mr. Lesniak describes _u~7~_ at bytes{nbsp}``26``-`27` as “always 0 except 1 for history pages, num entries for strange pages?”
_u~6~_ at bytes{nbsp}``4``-`5` seems to have the value `0000` for data pages.
And Mr. Lesniak describes _u~7~_ at bytes{nbsp}``6``-`7` as “always 0 except 1 for history pages”.

After these header fields comes the page heap.
Rows are allocated within this heap starting at byte `28`.
Rows are allocated within this heap starting at byte `8`.
Since rows can be different sizes, there needs to be a way to locate them.
This takes the form of a row index, which is built from the end of the page backwards, in groups of up to sixteen row pointers along with a bitmask saying which of those rows are still part of the table (they might have been deleted).
The number of row index entries is determined, as described above, by the value of either __num_rows_small__ or __num_rows_large__.

NOTE: The row counter entries represent the number of rows that have ever been allocated in the page, but some will no longer be valid due to deletion or updates.
To find the actual rows, you need to scan all 16 entries of each of the row groups present in the page, ignoring any whose <<#row-presence-bits,row presence bit>> is zero.

[#row-presence-bits]
The bit mask for the first group of up to sixteen rows, labeled _row~pf0~_ in the diagram (meaning “row presence flags group 0”), is found near the end of the page.
The last two bytes after each row bitmask (for example _pad~0~_ after _row~pf0~_) have an unknown purpose and may always be zero, and the _row~pf0~_ bitmask takes up the two bytes that precede them.
@@ -235,6 +241,67 @@ As more rows are added to the page, space is allocated for them in the heap, and
Once there have been sixteen rows added, all the bits in _row~pf0~_ are accounted for, and when another row is added, before its offset entry _ofs~16~_ can be added, another row bit-mask entry _row~pf1~_ needs to be allocated, followed by its corresponding _pad~1~_.
And so the row index grows backwards towards the rows that are being added forwards, and once they are too close for a new row to fit, the page is full, and another page gets allocated to the table.
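
The backwards-growing row index can be walked roughly as follows. This sketch rests on several assumptions that the text here does not state outright: each row offset is a two-byte little-endian value, offsets are relative to the start of the heap (byte `28` of the page), bit _n_ of a presence bitmask corresponds to the _n_th row of its group, and a full group therefore occupies `24` (hex) bytes. The name `present_row_offsets` is hypothetical.

```python
import struct

HEAP_START = 0x28  # first heap byte: 0x20-byte common header + 8-byte data header
GROUP_SIZE = 0x24  # assumed: 16 two-byte offsets + 2-byte bitmask + 2-byte pad


def present_row_offsets(page: bytes, num_rows: int) -> list[int]:
    """Return in-page byte positions of rows whose presence bit is set."""
    rows = []
    num_groups = (num_rows + 15) // 16  # groups of up to sixteen rows
    for group in range(num_groups):
        base = len(page) - group * GROUP_SIZE  # one past the end of this group
        # Layout from the end: pad (2 bytes), then the presence bitmask.
        (flags,) = struct.unpack_from("<H", page, base - 4)
        for slot in range(16):
            if (flags >> slot) & 1:
                # Offsets sit before the bitmask, growing toward the heap.
                (ofs,) = struct.unpack_from("<H", page, base - 6 - 2 * slot)
                rows.append(HEAP_START + ofs)
    return rows
```

Scanning every bit of each group, rather than stopping at the row counter, follows the note above about deleted or updated rows.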

==== Index Pages

Index pages have a __page_flags__ value of `64`. They follow the common header with an index-specific header and a list of index entries.

.Index Page Details
[bytefield]
----
(draw-column-headers)
(draw-box (text "u" :math [:sub "a"]) {:span 2})
(draw-box (text "u" :math [:sub "b"]) {:span 2})
(draw-related-boxes [(text "ec" :hex) (text "03" :hex)] {:span 1})
(draw-box (text "next" :math [:sub "o"]) {:span 2})
(draw-box (text "page_index" :math) {:span 4})
(draw-box (text "next_page" :math) {:span 4})
(draw-related-boxes [(text "ff" :hex) (text "ff" :hex) (text "ff" :hex) (text "03" :hex)] {:span 1})
(draw-related-boxes [(text "00" :hex) (text "00" :hex) (text "00" :hex) (text "00" :hex)] {:span 1})
(draw-box (text "num" :math [:sub "e"]) {:span 2})
(draw-box (text "first" :math [:sub "e"]) {:span 2})
(draw-gap "Index Entries")
(draw-bottom)
----

The header of an index page contains several fields whose purpose is not yet fully understood.

The first two bytes are called _unknown~a~_; they are usually `1fff` or `0001`, although values like `0002` have been observed in tables with a large number of rows.

After that, bytes `2-3` are called _unknown~b~_; they are usually `1fff` or `0000`, although values like `00b3` have been observed in tables with a large number of rows.

Next comes a 2-byte magic value `03ec`.

__next_offset__ (bytes `6-7`, abbreviated _next~o~_) specifies the byte offset at which the next index entry will be inserted, relative to the start of the index entries. It is usually zero for empty pages.

Bytes `8-11` and `12-15` are _page_index_ and __next_page__, which mirror the fields of the same names in the common page header. (Byte offsets in this section are decimal and count from the start of the index-specific header.)

The two following 32-bit fields contain another magic value, `03ffffff`, and a run of zeros, `00000000`.

The __num_entries__ field at bytes `24-25` (abbreviated _num~e~_) indicates how many index entries are present in the page. The __first_empty__ field at bytes `26-27` (abbreviated _first~e~_) points to the first empty index entry, and is `1fff` if there are none. An empty index entry just contains `1ffffff8`.
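
The index-specific header described above could be unpacked along these lines. This is a sketch, assuming that the header starts immediately after the `20` (hex) byte common header and that all fields are little-endian; the names `IndexHeader` and `parse_index_header` are hypothetical, and rejecting pages whose magic values differ is an illustrative choice rather than documented behavior.

```python
import struct
from collections import namedtuple

IndexHeader = namedtuple(
    "IndexHeader",
    "unknown_a unknown_b magic next_offset page_index next_page "
    "magic2 zeros num_entries first_empty",
)
INDEX_HEADER = struct.Struct("<4H4I2H")  # 28 bytes, little-endian
COMMON_HEADER_SIZE = 0x20  # the index header is assumed to start here


def parse_index_header(page: bytes) -> IndexHeader:
    """Unpack the index-specific header that follows the common header."""
    header = IndexHeader._make(INDEX_HEADER.unpack_from(page, COMMON_HEADER_SIZE))
    if header.magic != 0x03EC or header.magic2 != 0x03FFFFFF:
        raise ValueError("magic values do not match an index page")
    return header
```

Under these assumptions the index entries would begin at byte `3c` (hex) of the page, directly after the 28-byte header.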

.Index Entry
[bytefield]
----
(def box-width 30)
(def boxes-per-row 32)
(def left-margin 1)

;; Create a list of 32 labels, from "31" down to "0"
(def bit-labels (mapv str (reverse (range 32))))

;; Label the column headers with bit numbers instead of byte offsets
(draw-column-headers {:labels bit-labels})

;; Draw the 29-bit page index followed by the 3-bit flags field
(draw-box (text "page_index" :math [:sub "28-0"]) {:span 29})
(draw-box (text "index_flags" :math) {:span 3})
----

Each entry is a four-byte value. As we have said above, an index entry of `1ffffff8` indicates an empty slot.
Otherwise, the entry is a bitfield that contains a page index and some flags.

Bits `31-3` hold a page index (the page that this index entry points to); since only 29 bits are available, the three most significant bits of the 32-bit page index are dropped. Bits `2-0` (called `index_flags`) have an unknown purpose, and their most common value is `000`.
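
Splitting an entry into its two parts is a shift and a mask. The following is a minimal sketch of the bit layout just described; the names `EMPTY_ENTRY` and `decode_index_entry` are hypothetical, and the entry is assumed to have already been read as a 32-bit little-endian integer.

```python
EMPTY_ENTRY = 0x1FFFFFF8  # value documented for an empty slot


def decode_index_entry(entry: int):
    """Return (page_index, index_flags), or None for an empty slot."""
    if entry == EMPTY_ENTRY:
        return None
    page_index = entry >> 3      # bits 31-3
    index_flags = entry & 0b111  # bits 2-0, purpose unknown
    return page_index, index_flags
```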

== Table Rows

The structure of the rows themselves is determined by the _type_ of the table, using the values shown in <<table-types,Table types>>.