@blambov blambov commented Sep 17, 2025

What is the issue

https://github.com/riptano/cndb/issues/10302

What does this PR fix and why was it fixed

Implements the trie machinery needed to work with trie sets, range tries and deletion-aware tries, together with a memtable that uses this machinery to store deletions in separate per-partition branches of the memtable trie.

Implements a method of skipping over tombstones when converting UnfilteredRowIterator to the filtered RowIterator, which has the effect of ignoring all tombstones when looking for data and speeds up next-live lookups dramatically. Adds a test to demonstrate this effect with the new memtable.

The changes are described in further detail with each commit.

@blambov blambov force-pushed the CNDB-10302 branch 3 times, most recently from e358a60 to 6914cba on September 29, 2025 13:00
blambov added 16 commits October 3, 2025 18:57
This also changes the behaviour of subtries to always
include boundaries, their prefixes and their descendant
branches.

This is necessary for well-defined reverse walks and helps
present metadata on the path of queried ranges, and is not
a real limitation for the prefix-free keys that we use.
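
To make the inclusion rule concrete, here is a minimal sketch that models trie paths as plain strings. It is an illustration only: `SubtrieInclusionSketch` and `inSubtrie` are hypothetical names, and the real change operates on trie cursors rather than on strings.

```java
/** Hypothetical string-keyed model of the inclusive subtrie rule; not the PR's cursor code. */
final class SubtrieInclusionSketch {
    /**
     * Returns true if a node with path {@code key} belongs to the subtrie bounded by
     * {@code left} and {@code right} under the inclusive rule described above.
     */
    static boolean inSubtrie(String key, String left, String right) {
        // Ordinary lexicographic range check, with both boundaries included.
        boolean withinRange = key.compareTo(left) >= 0 && key.compareTo(right) <= 0;
        // Prefixes of a boundary lie on the path to it and are kept as well.
        boolean prefixOfBoundary = left.startsWith(key) || right.startsWith(key);
        // Everything below a boundary (its descendant branch) is kept too.
        boolean belowBoundary = key.startsWith(left) || key.startsWith(right);
        return withinRange || prefixOfBoundary || belowBoundary;
    }

    public static void main(String[] args) {
        for (String key : new String[] { "a", "aa", "b", "cde", "ce" })
            System.out.println(key + " included: " + inSubtrie(key, "ab", "cd"));
    }
}
```

In this model, when the boundaries are themselves keys of a prefix-free set, the two extra clauses cannot match any other stored key, which is consistent with the remark that the rule is not a real limitation for such keys.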
Range tries are tries made of ranges of coverage, which
track applicable ranges and are mainly to be used to store
deletions and deletion ranges.
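
As a rough mental model of "ranges of coverage", the sketch below answers the question "which deletion, if any, applies at this key?" with a sorted map over integer keys. This is a stand-in and not the range trie implementation; `RangeCoverageSketch`, its methods, the integer keys and the non-overlap restriction are all assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of range coverage over integer keys; a stand-in, not the range trie itself. */
final class RangeCoverageSketch {
    /** End bound and deletion timestamp of one covered range. */
    private static final class Coverage {
        final int endInclusive;
        final long timestamp;

        Coverage(int endInclusive, long timestamp) {
            this.endInclusive = endInclusive;
            this.timestamp = timestamp;
        }
    }

    // Ranges indexed by their start bound; this sketch assumes they never overlap.
    private final TreeMap<Integer, Coverage> byStart = new TreeMap<>();

    void cover(int startInclusive, int endInclusive, long timestamp) {
        byStart.put(startInclusive, new Coverage(endInclusive, timestamp));
    }

    /** Returns the timestamp of the range covering {@code key}, or null if none applies. */
    Long coveringTimestamp(int key) {
        // The only candidate is the range with the greatest start at or before the key.
        Map.Entry<Integer, Coverage> candidate = byStart.floorEntry(key);
        if (candidate == null || key > candidate.getValue().endInclusive)
            return null;
        return candidate.getValue().timestamp;
    }
}
```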
Deletion-aware tries combine data and deletion tries. The cursor
of a deletion-aware trie walks the data part of the trie, but
also provides a `deletionBranchCursor` that can return a deletion/
tombstone branch covering the current position and the branch below
it as a range trie. Such a branch can be given only once for any
path in the trie (i.e. there cannot be a deletion branch covering
another deletion branch).

Deletion-aware merges and updates to in-memory tries take deletion
branches into account when merging data so that deleted data is
not produced in the resulting merge.
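
The effect of a deletion-aware merge can be sketched on flat maps. The example below assumes integer keys, data represented only by its write timestamp, and the usual rule that a deletion removes data whose timestamp is not newer than the deletion's own; the class and method names are hypothetical and the code does not use the PR's trie classes.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical flat-map model of a deletion-aware merge; not the PR's trie code. */
final class DeletionAwareMergeSketch {
    /** A deletion covering [startInclusive, endInclusive], issued at the given timestamp. */
    static final class Deletion {
        final int startInclusive;
        final int endInclusive;
        final long timestamp;

        Deletion(int startInclusive, int endInclusive, long timestamp) {
            this.startInclusive = startInclusive;
            this.endInclusive = endInclusive;
            this.timestamp = timestamp;
        }

        /** Assumed rule: data is deleted if it is covered and not newer than the deletion. */
        boolean deletes(int key, long dataTimestamp) {
            return key >= startInclusive && key <= endInclusive && dataTimestamp <= timestamp;
        }
    }

    /** Merges two key-to-timestamp maps, keeping the newest entry per key and dropping deleted data. */
    static TreeMap<Integer, Long> merge(Map<Integer, Long> left,
                                        Map<Integer, Long> right,
                                        List<Deletion> deletions) {
        TreeMap<Integer, Long> merged = new TreeMap<>(left);
        right.forEach((key, timestamp) -> merged.merge(key, timestamp, Long::max));
        // Deleted data is removed here, so it never appears in the merge result.
        merged.entrySet().removeIf(entry ->
            deletions.stream().anyMatch(d -> d.deletes(entry.getKey(), entry.getValue())));
        return merged;
    }
}
```

For example, merging `{1=10, 2=10, 3=10}` with `{2=30, 4=5}` under a deletion of keys 2 to 4 at timestamp 20 keeps key 1 (not covered) and key 2 (its newest write is later than the deletion), and drops keys 3 and 4.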
Implements a row-level trie memtable that uses deletion-aware
tries to store deletions separately from live data, together
with the associated TrieBackedPartition and TriePartitionUpdate.

Every deletion is first converted to its range version and then
stored in the deletion path of the trie: a deleted row is
represented as the range WHERE ck >= x AND ck <= x, and a deleted
partition as a deletion covering from LT_EXCLUDED to
GT_NEXT_COMPONENT, so that it includes the static row and all
normal rows.
To make tests work, all such ranges are converted back to rows
and partition deletion times on conversion to UnfilteredPartitionIterator.
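
The row-to-range conversion and its inverse can be sketched as follows. This is a simplified model with integer clustering keys; `DeletionNormalizationSketch`, `RangeDeletion` and both methods are hypothetical names, and partition deletions are left out.

```java
import java.util.OptionalInt;

/** Hypothetical model of converting row deletions to ranges and back; integer keys assumed. */
final class DeletionNormalizationSketch {
    /** A deletion of all clustering keys in [startInclusive, endInclusive]. */
    static final class RangeDeletion {
        final int startInclusive;
        final int endInclusive;
        final long timestamp;

        RangeDeletion(int startInclusive, int endInclusive, long timestamp) {
            this.startInclusive = startInclusive;
            this.endInclusive = endInclusive;
            this.timestamp = timestamp;
        }
    }

    /** A deleted row at ck = x becomes the range ck >= x AND ck <= x. */
    static RangeDeletion fromRowDeletion(int clusteringKey, long timestamp) {
        return new RangeDeletion(clusteringKey, clusteringKey, timestamp);
    }

    /** Converts a range with equal bounds back to a row deletion; other ranges stay ranges. */
    static OptionalInt toRowDeletion(RangeDeletion deletion) {
        return deletion.startInclusive == deletion.endInclusive
             ? OptionalInt.of(deletion.startInclusive)
             : OptionalInt.empty();
    }
}
```

One consequence visible in this model: a range that was issued with equal bounds cannot be told apart from a row deletion once it has been stored.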
Adds a new method to UnfilteredRowIterator that is implemented
by the new trie-backed partitions to ask them to stop issuing
tombstones. This is done on filtering (i.e. conversion from
UnfilteredRowIterator to RowIterator) where tombstones have already
done their job and are no longer needed.

Adds JMH tests of tombstones that demonstrate the improvement.
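
A minimal sketch of the idea, assuming a source that holds live rows and tombstones separately and merges them in key order only while tombstones are still wanted. All names here, including `stopIssuingTombstones`, are hypothetical stand-ins; the actual method added to UnfilteredRowIterator is not shown in this conversation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of skipping tombstones during filtering; not the PR's iterator code. */
final class TombstoneSkippingSketch {
    /** One item of the unfiltered stream: a live row or a tombstone at a clustering key. */
    static final class Unfiltered {
        final int key;
        final boolean isTombstone;

        Unfiltered(int key, boolean isTombstone) {
            this.key = key;
            this.isTombstone = isTombstone;
        }
    }

    /** Walks live rows and tombstones, which are held separately, in key order. */
    static final class UnfilteredSource {
        private final int[] rows;        // keys of live rows, sorted
        private final int[] tombstones;  // keys of tombstones, sorted
        private int nextRow = 0;
        private int nextTombstone = 0;
        private boolean issueTombstones = true;
        int itemsVisited = 0;

        UnfilteredSource(int[] rows, int[] tombstones) {
            this.rows = rows;
            this.tombstones = tombstones;
        }

        /** Stand-in for the new method: from now on only live rows are produced. */
        void stopIssuingTombstones() {
            issueTombstones = false;
        }

        boolean hasNext() {
            return nextRow < rows.length || (issueTombstones && nextTombstone < tombstones.length);
        }

        /** Returns the next item in key order; callers must check hasNext() first. */
        Unfiltered next() {
            itemsVisited++;
            boolean tombstoneIsNext = issueTombstones && nextTombstone < tombstones.length
                && (nextRow >= rows.length || tombstones[nextTombstone] <= rows[nextRow]);
            return tombstoneIsNext
                 ? new Unfiltered(tombstones[nextTombstone++], true)
                 : new Unfiltered(rows[nextRow++], false);
        }
    }

    /** Filtering step: asks the source to drop tombstones, then collects the live row keys. */
    static List<Integer> liveRows(UnfilteredSource source) {
        source.stopIssuingTombstones();
        List<Integer> live = new ArrayList<>();
        while (source.hasNext()) {
            Unfiltered item = source.next();
            if (!item.isTombstone)
                live.add(item.key);
        }
        return live;
    }
}
```

In this model a partition with 998 tombstones followed by 2 live rows is filtered by visiting only the 2 rows, whereas a walk that still issues tombstones visits all 1000 items.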
In the initial implementation row deletions were mapped to range tombstones.
This works, but it is incompatible with the many tests that require
deletions to be returned in the form in which they were made.

This commit changes the representation of deleted rows to use point tombstones.
In addition to making the tests pass, this improves the memory usage of memtables
with row deletions.

Although they only add complexity at this stage, point tombstones (expanded to
apply to the covered branch) will be needed in the next stage of development.

blambov commented Oct 6, 2025

Some benchmark results demonstrating the effect:

Benchmark                                (BATCH)  (count)  (deletionPattern)    (deletionSpec)  (deletionsRatio)  (flush)     (memtableClass)  (partitions)  (threadCount)  (useNet)  Mode  Cnt    Score    Error  Units
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START  RANGE_FROM_START             0.997    INMEM        TrieMemtable           999              1     false  avgt   10    8.924 ±  0.117  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START  RANGE_FROM_START             0.997    INMEM  TrieMemtableStage2           999              1     false  avgt   10  185.442 ±  6.448  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START  RANGE_FROM_START             0.997    INMEM  TrieMemtableStage1           999              1     false  avgt   10   48.197 ±  3.383  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START   SINGLETON_RANGE             0.997    INMEM        TrieMemtable           999              1     false  avgt   10   11.465 ±  0.225  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START   SINGLETON_RANGE             0.997    INMEM  TrieMemtableStage2           999              1     false  avgt   10  436.228 ± 14.452  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START   SINGLETON_RANGE             0.997    INMEM  TrieMemtableStage1           999              1     false  avgt   10  261.936 ±  8.704  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START             EQUAL             0.997    INMEM        TrieMemtable           999              1     false  avgt   10   11.073 ±  0.206  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START             EQUAL             0.997    INMEM  TrieMemtableStage2           999              1     false  avgt   10  190.903 ±  4.218  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START             EQUAL             0.997    INMEM  TrieMemtableStage1           999              1     false  avgt   10   81.501 ±  1.221  ms/op

In the table above, the first 99.7% of all rows in each partition are deleted with one of the following operations:

RANGE_FROM_START:  DELETE ... WHERE userid = ? AND picid <= ?
 SINGLETON_RANGE:  DELETE ... WHERE userid = ? AND picid >= ? AND picid <= ?
           EQUAL:  DELETE ... WHERE userid = ? AND picid = ?

and then a SELECT ... WHERE userid = ? AND picid >= ? was issued.
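
For a rough reading of the first table, the ratios between the scores can be computed directly. The sketch below copies the ms/op scores from the table and ignores the reported error margins; `SpeedupSketch` and `speedup` are hypothetical names used only for this calculation.

```java
import java.util.Locale;

/** Computes score ratios from the benchmark table above; error margins are ignored. */
final class SpeedupSketch {
    /** How many times faster the new score is than the old one (both in ms/op). */
    static double speedup(double oldMsPerOp, double newMsPerOp) {
        return oldMsPerOp / newMsPerOp;
    }

    public static void main(String[] args) {
        String[] specs = { "RANGE_FROM_START", "SINGLETON_RANGE", "EQUAL" };
        // Scores in ms/op, per deletionSpec: { TrieMemtable, TrieMemtableStage2, TrieMemtableStage1 }
        double[][] scores = {
            { 8.924, 185.442, 48.197 },
            { 11.465, 436.228, 261.936 },
            { 11.073, 190.903, 81.501 },
        };
        for (int i = 0; i < specs.length; i++) {
            System.out.println(String.format(Locale.ROOT,
                "%-16s  vs Stage2: %.1fx  vs Stage1: %.1fx",
                specs[i],
                speedup(scores[i][1], scores[i][0]),
                speedup(scores[i][2], scores[i][0])));
        }
    }
}
```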

An example with hundreds of thousands of tombstones per read partition:

Benchmark                                (BATCH)  (count)  (deletionPattern)  (deletionSpec)  (deletionsRatio)  (flush)  (memtableClass)  (partitions)  (threadCount)  (useNet)  Mode  Cnt   Score   Error  Units
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START           EQUAL             0.997    INMEM     TrieMemtable             3              1     false  avgt   10  10.280 ± 0.117  ms/op

(With all other memtable types this throws TombstoneOverwhelmingException, or takes tens of seconds per query when the guardrail is disabled.)

Here is the table of the build time and memory usage:

memtableClass        count partitions  deletionsRatio      deletionSpec  build time  on-heap memory  off-heap memory
TrieMemtable       1000000        999           0.997  RANGE_FROM_START     20.910s       23.639MiB        28.149MiB
TrieMemtableStage2 1000000        999           0.997  RANGE_FROM_START     12.965s       88.500MiB        75.671MiB
TrieMemtableStage1 1000000        999           0.997  RANGE_FROM_START     13.512s      116.381MiB        32.013MiB
TrieMemtable       1000000        999           0.997   SINGLETON_RANGE     17.828s       61.638MiB       106.289MiB
TrieMemtableStage2 1000000        999           0.997   SINGLETON_RANGE    106.892s      336.825MiB        75.671MiB
TrieMemtableStage1 1000000        999           0.997   SINGLETON_RANGE     17.354s      598.587MiB        42.013MiB
TrieMemtable       1000000        999           0.997             EQUAL     16.648s       42.617MiB        75.865MiB
TrieMemtableStage2 1000000        999           0.997             EQUAL     13.937s       65.407MiB        75.671MiB
TrieMemtableStage1 1000000        999           0.997             EQUAL     15.645s       93.272MiB        42.013MiB 

Unlike the previous memtables, the new implementation deletes data from the trie when it receives a range tombstone, which in some cases results in a longer build time but lower memory usage.

Full benchmark run to be posted soon.


cbornet commented Oct 27, 2025

Impressive work, @blambov!
I added a few comments, mostly cosmetic.
Otherwise LGTM!

and use it to avoid a couple of intermediate objects in set union
Fix Cursor.skipToWhenAhead for reverse iteration
Add Cursor.dumpBranch for debugging
Fix various methods to return Preencoded byte-comparables
Fix deletion-aware collection merge cursor's reporting of deletion branch at tail points
Fix deletion-aware collection merge cursor's failure on one deletion branch
@cassci-bot

✔️ Build ds-cassandra-pr-gate/PR-2005 approved by Butler



@lesnik2u lesnik2u self-requested a review November 15, 2025 10:05