Feat/overlay support #94

Open
dryajov wants to merge 177 commits into main from feat/overlay-support
Conversation


@dryajov dryajov commented Feb 12, 2026

This PR introduces several important changes:

  • It switches to a CAS-based, atomic kvstore in place of the old dumb datastore. This enables consistent cross-key operations, something that wasn't possible before.
  • It introduces the concept of Overlays, which are a fancy way of saying datasets, but a better fit since not all sets of blocks are datasets, e.g. slots.
  • It consolidates expirations by moving them from blocks to the overlay. In the past, each block carried its own refcount and expiration, which led to drift and inconsistencies. This change removes those inconsistencies: blocks keep a refcount and can still be shared across several treeCids (original dataset, protected, and verifiable), but each treeCid can have a different lifecycle without stepping on the others. Expirations are handled atomically at the overlay level, so no multi-block updates are needed.
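The overlay model in the last bullet can be sketched in a few lines. This is an illustrative toy in Python, not the archivist Nim code, and every name in it is made up: blocks carry only a refcount, while each overlay (keyed by a tree CID) owns the single expiry for its set of blocks.

```python
class BlockStore:
    """Toy model of the overlay idea: per-block refcounts, per-overlay
    expiry. All names here are illustrative, not the archivist API."""

    def __init__(self):
        self.refcounts = {}  # block id -> refcount
        self.overlays = {}   # tree cid -> {"blocks": set, "expiry": int}

    def put_overlay(self, tree_cid, blocks, expiry):
        # One metadata write per overlay, instead of touching the
        # expiry stored on every individual block.
        self.overlays[tree_cid] = {"blocks": set(blocks), "expiry": expiry}
        for b in blocks:
            self.refcounts[b] = self.refcounts.get(b, 0) + 1

    def drop_overlay(self, tree_cid):
        overlay = self.overlays.pop(tree_cid)
        for b in overlay["blocks"]:
            self.refcounts[b] -= 1
            if self.refcounts[b] == 0:
                del self.refcounts[b]  # last reference: reclaim the block

store = BlockStore()
store.put_overlay("original", ["b1", "b2"], expiry=100)
store.put_overlay("verifiable", ["b1", "b2", "parity"], expiry=200)
store.drop_overlay("original")
# b1/b2 survive because the verifiable overlay still references them
```

Dropping one overlay leaves shared blocks intact, and changing an expiry is a single overlay-level write rather than a multi-block update.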

There are more improvements coming after this:

  • Consistent batched operations which should speed things up even further
  • Block exchange engine improvements that bring better connection handling and batched block transfers
  • True parallelized encoding in erasure coding (several encoding/decoding jobs running at the same time)
  • Parallel merkle tree building (tree splitting and merkleization in parallel)
  • And probably a few more, but those should improve speed and stability across the board

The change is quite big due to its crosscutting nature.

This depends on durability-labs/nim-kvstore#2, durability-labs/archivist-dht#7 and durability-labs/nim-metrics#1

@dryajov dryajov force-pushed the feat/overlay-support branch 3 times, most recently from 86b7062 to 4abbbea Compare February 15, 2026 08:51
@dryajov dryajov marked this pull request as ready for review February 16, 2026 20:47
leoDecoderProvider, self.taskpool,
)
encodedManifest = ?await erasure.encode(manifest, ecK, ecM)
manifestBlk = ?await self.repoStore.storeManifest(encodedManifest)
@benbierens benbierens Feb 17, 2026


Adding `manifestBlk = ?await self.repoStore.storeManifest(encodedManifest)` causes a test failure. The `manifestBlk` variable is unused, and storing the manifest is already performed at node.nim:587.

This extra storeManifest call causes the protected manifest to be stored in addition to the verifiable manifest. That isn't necessary, since all marketplace and EC operations work with verifiable manifests. In the current flow, the protected manifest is never persisted; it's just an intermediate stage between basic and verifiable. We have no use case for protected manifests.

The test fails because it expects to find the basic manifest and the verifiable manifest stored in the node. Instead of 2 it finds 3, but two of them carry the exact same information (as far as the API is concerned), just under a different CID.

(Test failure: DeletesExpiredDataUsedByStorageRequests at 8ba752c)

Contributor Author


There is no harm in storing the protected manifest. Even though the verifiable manifest is an extension of the protected one and the top-level treeCid is the same, it might be fine to store the protected manifest for consistency reasons. It's not critical, though, so I'll remove it for now...


benbierens commented Feb 17, 2026

Upload performance has decreased by more than 10x. Here are some numbers:

7e8c0c8 - Feat/make submodules great again (#96)

| Size  | Upload         | Download |
|-------|----------------|----------|
| 1MB   | 26 ms          | 112 ms   |
| 10MB  | 87 ms          | 111 ms   |
| 100MB | 756 ms         | 432 ms   |
| 1GB   | 6 secs         | 4 secs   |
| 10GB  | 1 mins, 9 secs | 43 secs  |

1ad57bf - format

| Size  | Upload          | Download |
|-------|-----------------|----------|
| 1MB   | 286 ms          | 108 ms   |
| 10MB  | 2 secs          | 212 ms   |
| 100MB | 20 secs         | 1 sec    |
| 1GB   | 4 mins, 56 secs | 12 secs  |
| 10GB  | > 30 mins       |          |

I would expect the erasure-coding and blockexchange performance to be affected as well, since they also involve adding blocks to local storage. But I have no evidence of this, because I haven't run those tests yet.

(the above test involves a single node. Download numbers do not include blockexchange.)

self: ArchivistNodeRef, manifestCid: Cid, expiry: SecondsSince1970
): Future[?!void] {.async: (raises: [CancelledError]).} =
without manifest =? await self.fetchManifest(manifestCid, expiry), error:
without manifest =? await self.fetchManifest(manifestCid), error:

@benbierens benbierens Feb 17, 2026


Blocks for expired storage requests are not being cleaned up correctly. Blocks are downloaded as part of slots. The slots are successfully filled, but the request fails to start. (This is part of the test.) We expect the data blocks and the manifest of the failed request to be cleaned up in sync with the request's expiry as provided by the marketplace storeSlot callback.

The overlay does get dropped and reaches the "Deleting" state.

TRC 2026-02-17 14:25:43.169+00:00 Dropping overlay                           topics="archivist maintenance" tid=1 treeCid=zE2*VghTLV status=Failure expiry=1771424733 count=14647
TRC 2026-02-17 14:25:43.169+00:00 Dropping overlay and cleaning up blocks    topics="archivist repostore overlays" tid=1 treeCid=zE2*VghTLV count=14648
TRC 2026-02-17 14:25:43.169+00:00 Overlay metadata stored                    topics="archivist repostore overlays" tid=1 treeCid=zE2*VghTLV status=Deleting count=14649

(Problem revealed by tests in DeceptiveContractTest at 8ba752c)

Contributor Author


thanks, I'll check it out - that is the point of this change, so...


dryajov commented Feb 17, 2026

> Upload performance has decreased by more than 10x. Here are some numbers:
>
> 7e8c0c8 - Feat/make submodules great again (#96)
>
> | Size  | Upload         | Download |
> |-------|----------------|----------|
> | 1MB   | 26 ms          | 112 ms   |
> | 10MB  | 87 ms          | 111 ms   |
> | 100MB | 756 ms         | 432 ms   |
> | 1GB   | 6 secs         | 4 secs   |
> | 10GB  | 1 mins, 9 secs | 43 secs  |
>
> 1ad57bf - format
>
> | Size  | Upload          | Download |
> |-------|-----------------|----------|
> | 1MB   | 286 ms          | 108 ms   |
> | 10MB  | 2 secs          | 212 ms   |
> | 100MB | 20 secs         | 1 sec    |
> | 1GB   | 4 mins, 56 secs | 12 secs  |
> | 10GB  | > 30 mins       |          |
>
> I would expect the erasure-coding and blockexchange performance to be affected as well, since they also involve adding blocks to local storage. But I have no evidence of this, because I haven't run those tests yet.
>
> (The above test involves a single node. Download numbers do not include blockexchange.)

Yeah, looking into this... there will be some overhead due to CAS and atomic operations, but it shouldn't be this much.

@dryajov dryajov force-pushed the feat/overlay-support branch from 2bb7721 to 5b1513e Compare February 17, 2026 22:37
@benbierens

There seems to be a crash. It's somewhere in the area of receiving/downloading blocks from peers and storing them, when quota runs out. It may not reproduce reliably... I need to test more. Which I will!

TRC 2026-02-20 16:38:06.814+00:00 Updating counters                          topics="archivist repostore" tid=1 quotaDelta=65536 reservedDelta=0 blocksDelta=1 count=765
TRC 2026-02-20 16:38:06.814+00:00 Updating block count to                    topics="archivist repostore" tid=1 totalBlocks=19 count=766
TRC 2026-02-20 16:38:06.814+00:00 Updating quota to                          topics="archivist repostore" tid=1 quotaUsed=1179775'NByte quotaReserved=0'NByte count=767
TRC 2026-02-20 16:38:06.814+00:00 Storing Leafs and Blocks                   topics="archivist repostore" tid=1 treeCid=zDz*LLZqHD totalItems=1 count=768
TRC 2026-02-20 16:38:06.814+00:00 Putting blocks                             topics="archivist repostore" tid=1 actualBlocks=1 totalSize=65536 treeCid=zDz*LLZqHD totalItems=1 count=769
ERR 2026-02-20 16:38:06.814+00:00 Unhandled exception in async proc, aborting topics="archivist" tid=1 msg="value out of range: -131199 notin 0 .. 9223372036854775807" count=770


dryajov commented Feb 20, 2026

> There seems to be a crash. It's somewhere in the area of receiving/downloading blocks from peers and storing them, when quota runs out. It may not reproduce reliably... I need to test more. Which I will!
>
> TRC 2026-02-20 16:38:06.814+00:00 Updating counters                          topics="archivist repostore" tid=1 quotaDelta=65536 reservedDelta=0 blocksDelta=1 count=765
> TRC 2026-02-20 16:38:06.814+00:00 Updating block count to                    topics="archivist repostore" tid=1 totalBlocks=19 count=766
> TRC 2026-02-20 16:38:06.814+00:00 Updating quota to                          topics="archivist repostore" tid=1 quotaUsed=1179775'NByte quotaReserved=0'NByte count=767
> TRC 2026-02-20 16:38:06.814+00:00 Storing Leafs and Blocks                   topics="archivist repostore" tid=1 treeCid=zDz*LLZqHD totalItems=1 count=768
> TRC 2026-02-20 16:38:06.814+00:00 Putting blocks                             topics="archivist repostore" tid=1 actualBlocks=1 totalSize=65536 treeCid=zDz*LLZqHD totalItems=1 count=769
> ERR 2026-02-20 16:38:06.814+00:00 Unhandled exception in async proc, aborting topics="archivist" tid=1 msg="value out of range: -131199 notin 0 .. 9223372036854775807" count=770

Keep in mind that I'm still in the process of improving perf. The degradation from CAS should be within 10-20%, but not 10x...

I'll look into the crash - thanks!
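For context on that "value out of range" error: the message suggests a counter being driven below zero during quota accounting and then assigned into a non-negative range type. A defensive pattern, sketched here in Python purely for illustration (the function name is made up; the actual repostore code is Nim), is to validate the delta before applying it:

```python
def apply_quota_delta(quota_used: int, delta: int) -> int:
    """Apply a signed delta to a non-negative counter, refusing to
    underflow instead of crashing. Illustrative sketch only; this is
    not the archivist repostore code."""
    new_value = quota_used + delta
    if new_value < 0:
        # Mirrors the failure mode in the log above:
        # "value out of range: -131199 notin 0 .. 9223372036854775807"
        raise ValueError(f"quota underflow: {quota_used} {delta:+d} -> {new_value}")
    return new_value

ok = apply_quota_delta(1179775, 65536)  # a normal accounting step
try:
    apply_quota_delta(65536, -196735)   # releasing more than is accounted
    underflow_caught = False
except ValueError:
    underflow_caught = True
```

In Nim the equivalent would be checking the signed result before converting into the unsigned/range type, since it's the conversion itself that raises the unhandled defect seen in the log.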

@dryajov dryajov force-pushed the feat/overlay-support branch from 871859f to a6ccdb1 Compare February 27, 2026 22:30
@benbierens

Reran the 2GB upload test, just to keep an eye on performance.
Main ccbc239 - Revert "feat: re-enable http pipelining for json-rpc" = (12 secs)
Branch 0109665 - bump kvstore = (2 mins, 45 secs)

On previous commits of this branch, performance was so slow that the test would time out and fail. So performance has definitely improved. It's still far behind when compared to main.

@markspanbroek

I'm currently reviewing this. I've reviewed about 20% of the files in this PR (excluding the dependency PRs) in one day, so this is going to take a while 😅


dryajov commented Mar 2, 2026

> Reran the 2GB upload test, just to keep an eye on performance.
> Main ccbc239 - Revert "feat: re-enable http pipelining for json-rpc" = (12 secs)
> Branch 0109665 - bump kvstore = (2 mins, 45 secs)
>
> On previous commits of this branch, performance was so slow that the test would time out and fail. So performance has definitely improved. It's still far behind when compared to main.

Apparently, there is something going on under Linux: I'm getting abnormally slow uploads, so I'm still looking into it. On Mac, this is about 20-30% faster than our current main.


@markspanbroek markspanbroek left a comment


Thanks @dryajov, this is a much-needed change that ensures that concurrent updates to the repostore are handled correctly.

What I'm still missing is a way to handle multiple marketplace requests to store the same slot data, for instance when someone posts a new request for data that's about to expire on the network so that it stays on the network. These requests have different expiries for the slot data. But that's probably not something to address in this PR.

Also, the verbosity of the update mechanism in the kvstore makes this PR hard to read. Most updates boil down to the following form:

```nim
?await store.update(key, value):
  value.foo = bar
  value.baz = value.baz + qux
```

But you have to look through a lot of boilerplate to see this.
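The read-modify-write pattern this boils down to is a classic optimistic CAS loop. A generic sketch in Python (illustrative only; the store API and names here are hypothetical, not the nim-kvstore interface):

```python
def cas_update(store, key, mutate, max_retries=10):
    """Optimistic read-modify-write: snapshot the value and its version,
    apply the mutation, and write back only if nobody raced us.
    Hypothetical store API, for illustration."""
    for _ in range(max_retries):
        value, version = store.get(key)
        new_value = mutate(value)
        if store.compare_and_swap(key, new_value, expected_version=version):
            return new_value  # write succeeded at the observed version
    raise RuntimeError("too much contention on key: " + key)

class InMemoryStore:
    """Minimal versioned store so the sketch is runnable."""
    def __init__(self):
        self.data = {}  # key -> (value, version)

    def get(self, key):
        return self.data.get(key, (None, 0))

    def compare_and_swap(self, key, new_value, expected_version):
        _, version = self.get(key)
        if version != expected_version:
            return False  # someone else updated the key; caller retries
        self.data[key] = (new_value, version + 1)
        return True

store = InMemoryStore()
cas_update(store, "counter", lambda v: (v or 0) + 1)
cas_update(store, "counter", lambda v: (v or 0) + 1)
```

A template like markspanbroek's `store.update(key, value)` form hides exactly this retry loop behind the mutation body, which is what would cut the boilerplate.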

jobs:
build:
strategy:
fail-fast: false
Contributor


Probably a left-over from testing? I would not disable fail-fast in general, because we have too few github runners to use them on jobs in PRs that are failing anyway.

Contributor Author


yeah, I'll clean it up before merge, but I need this to be able to see what breaks on which platform.

let inconsistencies = (await repo.verifyBlockBitState(treeCid)).tryGet()
check inconsistencies.len == 0

test "Concurrent put and delete operations maintain consistency":
Contributor


I'm missing a test where there are concurrent putOverlay() and dropOverlay() calls for the same cid

Contributor Author


good thinking!

Contributor Author


ok, I reworked a lot of the tests. The bottom line is that concurrent deletes/puts are now handled properly. If an overlay is marked as deleted, meaning a delete is in progress, puts to that overlay are no longer possible until the delete is stopped or finished.
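The guard described here can be sketched as a small state machine. This is a Python toy, not the archivist implementation; the status names are assumptions based on the log lines earlier in this thread:

```python
import enum

class OverlayStatus(enum.Enum):
    ACTIVE = "Active"
    DELETING = "Deleting"  # a drop is in progress

class OverlayTable:
    """Toy guard: once an overlay enters Deleting, puts are rejected
    until the delete finishes (or is cancelled)."""
    def __init__(self):
        self.status = {}  # tree cid -> OverlayStatus

    def put_overlay(self, tree_cid):
        if self.status.get(tree_cid) == OverlayStatus.DELETING:
            raise RuntimeError("overlay is being deleted; put rejected")
        self.status[tree_cid] = OverlayStatus.ACTIVE

    def begin_drop(self, tree_cid):
        self.status[tree_cid] = OverlayStatus.DELETING

    def finish_drop(self, tree_cid):
        del self.status[tree_cid]  # delete completed; cid is free again

table = OverlayTable()
table.put_overlay("tree-1")
table.begin_drop("tree-1")
try:
    table.put_overlay("tree-1")  # concurrent put during the delete
    raced = False
except RuntimeError:
    raced = True
table.finish_drop("tree-1")
table.put_overlay("tree-1")      # allowed again after the drop
```

The key property is that the put/delete race is resolved by a single status check against the overlay record, rather than by coordinating across all of its blocks.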

Contributor Author


note that "stopping the delete" is possible, because we're using a delete future handle, but it hasn't been fully exposed yet. I'll do this in subsequent iterations - this PR is way too big already.

var
repoDs: Datastore
metaDs: Datastore
path = currentSourcePath() # get this file's name
Contributor


It's probably nicer to use a temporary directory, instead of putting it next to the test sources. You can use createTempDir.


test "validator marks proofs as missing":
let node = await testbed.node.persistence.start()
let node = await testbed.node.log("archivist", "validator").persistence.start()
Contributor


I'm guessing that these are left-overs from some testing that you did? They probably shouldn't be committed.

For all integration tests, I've made sure that logging is disabled by default, so that when you enable it for some test on your local machine, you don't need to search through unrelated logs.

Also, when you start a validator, the "validator" log topic is added automatically.

arguments.add("--block-ttl=" & $blockTtl)
if blockMaintenanceInterval =? builder.blockMaintenanceInterval:
arguments.add("--block-mi=" & $blockMaintenanceInterval)
arguments.add("--circuit-dir=" & builder.dataDirResolved / "circuits")
Contributor


What is the reason for adding this? The --circom-r1cs, --circom-wasm, --circom-zkey and --circom-graph are already set above, and they use a different circuit directory (the one in hardhat, to match the circuit that is used for the smart contracts)

Contributor Author


Without this, it will look for circuit files in the OS common directory first. If you ever ran archivist without passing --data-dir, the circuit files would have been downloaded there, and that obviously breaks the tests.

Contributor


Testbed always passes --data-dir to a node:

arguments &= "--data-dir=" & $node.dataDir

dryajov and others added 10 commits March 5, 2026 11:59
@dryajov dryajov force-pushed the feat/overlay-support branch from faeb0ce to 930d1f4 Compare March 5, 2026 19:33

@markspanbroek markspanbroek left a comment


The recent changes look good, thanks!

I'm approving this, so that you can merge it as soon as you've addressed the remaining review comments.
