Conversation
Force-pushed 86b7062 to 4abbbea
    leoDecoderProvider, self.taskpool,
  )
  encodedManifest = ?await erasure.encode(manifest, ecK, ecM)
  manifestBlk = ?await self.repoStore.storeManifest(encodedManifest)
Adding manifestBlk = ?await self.repoStore.storeManifest(encodedManifest) causes a test failure.
The manifestBlk variable is unused, and storing of the manifest is already performed at node.nim:587.
This extra storeManifest causes the protected manifest to be stored in addition to the verifiable manifest. This isn't necessary, since all marketplace and EC operations have been working with verifiable manifests. In the current flow, the protected manifest is never persisted; it's just an intermediate stage of going from basic to verifiable. We have no use case where protected manifests are used.
The test fails because it expects to find the basic manifest and the verifiable manifest stored in the node. Instead of 2 it finds 3, but two of them contain exactly the same information (as far as the API is concerned), just with a different CID.
(Test failure: DeletesExpiredDataUsedByStorageRequests at 8ba752c)
There is no harm in storing the protected manifest. Even though the verifiable manifest is an extension of the protected one and the top-level treeCid is the same, it might be fine to store the protected manifest for consistency reasons - though it's not critical, and I'll remove it for now...
Upload performance has decreased by more than 10x. Here are some numbers:
I would expect the erasure-coding and blockexchange performance to be affected as well, since they also involve adding blocks to local storage. But I have no evidence of this because I haven't run those tests yet. (The above test involves a single node; download numbers do not include blockexchange.)
    self: ArchivistNodeRef, manifestCid: Cid, expiry: SecondsSince1970
  ): Future[?!void] {.async: (raises: [CancelledError]).} =
    without manifest =? await self.fetchManifest(manifestCid, expiry), error:
    without manifest =? await self.fetchManifest(manifestCid), error:
Blocks for expired storage requests are not being cleaned up correctly. Blocks are downloaded as part of slots. The slots are successfully filled, but the request fails to start. (This is part of the test.) We expect the data blocks and the manifest of the failed request to be cleaned up in sync with the request's expiry as provided by the marketplace storeSlot callback.
The overlay does get dropped and reaches the "Deleting" state.
TRC 2026-02-17 14:25:43.169+00:00 Dropping overlay topics="archivist maintenance" tid=1 treeCid=zE2*VghTLV status=Failure expiry=1771424733 count=14647
TRC 2026-02-17 14:25:43.169+00:00 Dropping overlay and cleaning up blocks topics="archivist repostore overlays" tid=1 treeCid=zE2*VghTLV count=14648
TRC 2026-02-17 14:25:43.169+00:00 Overlay metadata stored topics="archivist repostore overlays" tid=1 treeCid=zE2*VghTLV status=Deleting count=14649
(Problem revealed by tests in DeceptiveContractTest at 8ba752c)
Thanks, I'll check it out - that is the point of this change, so...
Yeah, looking into this... there will be some overhead due to CAS and atomic operations, but it shouldn't be this much.
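For intuition on why a CAS-based update path should add only modest overhead, here is a hedged, language-neutral Python sketch (the project is Nim; `VersionedStore`, `compare_and_put`, and `update` are hypothetical names for illustration, not the actual repostore API):

```python
import threading

class VersionedStore:
    """Toy key-value store with optimistic concurrency control.
    Each write must present the version it read; a version mismatch
    means another writer got there first and the caller must retry."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def get(self, key):
        with self._lock:
            return self._data.get(key, (0, None))

    def compare_and_put(self, key, expected_version, value):
        with self._lock:
            version, _ = self._data.get(key, (0, None))
            if version != expected_version:
                return False  # conflict: caller must re-read and retry
            self._data[key] = (version + 1, value)
            return True

def update(store, key, fn):
    """Optimistic read-modify-write loop. Uncontended, it costs one
    extra read plus a version check per write, so a modest slowdown is
    expected; only heavy contention (many retries) multiplies the cost."""
    while True:
        version, value = store.get(key)
        if store.compare_and_put(key, version, fn(value)):
            return

store = VersionedStore()
update(store, "blocks", lambda v: (v or 0) + 1)
update(store, "blocks", lambda v: (v or 0) + 1)
```

The point of the sketch: each successful write does a bounded amount of extra work, which is consistent with a 10-20% hit but not with a 10x regression.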
Force-pushed 2bb7721 to 5b1513e
There seems to be a crash. It's somewhere in the area of receiving/downloading blocks from peers and storing them, when quota runs out. It may not reproduce reliably... I need to test more. Which I will!
Keep in mind that I'm still in the process of improving perf. The degradation from CAS should be within 10-20%, but not 10x... I'll look into the crash - thanks!
Force-pushed 871859f to a6ccdb1
Reran the 2GB upload test, just to keep an eye on performance. On previous commits of this branch, performance was so slow that the test would time out and fail. So performance has definitely improved. It's still far behind compared to main.
I'm currently reviewing this. I've reviewed about 20% of the files in this PR (excluding the dependency PRs) in one day, so this is going to take a while 😅
Apparently, there is something going on under Linux - I'm getting abnormally slow uploads, so I'm still looking into it. On Mac, this is about 20%-30% faster than our current main.
markspanbroek
left a comment
Thanks @dryajov, this is a much-needed change that ensures that concurrent updates to the repostore are handled correctly.
What I'm still missing is a way to handle multiple marketplace requests to store the same slot data, for instance when someone posts a new request for data that's about to expire, so that it stays on the network. These requests have different expiries for the slot data. But that's probably not something to address in this PR.
Also, the verbosity of the update mechanism in the kvstore makes this PR hard to read. Most updates boil down to the following form:
?await store.update(key, value):
  value.foo = bar
  value.baz = value.baz + qux
But you have to look through a lot of boilerplate to see this.
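For readers unfamiliar with the pattern, here is a hypothetical Python sketch of roughly what such an update template condenses (a plain dict stands in for the async kvstore; `update` and the field names here are illustrative, not the real API, and the real code additionally handles async I/O, serialization, and error propagation):

```python
def update(store, key, body):
    """Read-modify-write helper: fetch the current record, let `body`
    mutate it in place, then write it back under the same key."""
    value = dict(store.get(key, {}))  # work on a copy of the stored record
    body(value)
    store[key] = value

store = {"slot": {"foo": 1, "baz": 2}}
# Equivalent in spirit to:
#   ?await store.update(key, value):
#     value.foo = 7
#     value.baz = value.baz + 3
update(store, "slot", lambda v: v.update(foo=7, baz=v["baz"] + 3))
```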
jobs:
  build:
    strategy:
      fail-fast: false
Probably a left-over from testing? I would not disable fail-fast in general, because we have too few GitHub runners to spend them on jobs in PRs that are failing anyway.
yeah, I'll clean it up before merge, but I need this to be able to see what breaks on which platform.
  let inconsistencies = (await repo.verifyBlockBitState(treeCid)).tryGet()
  check inconsistencies.len == 0

  test "Concurrent put and delete operations maintain consistency":
I'm missing a test where there's concurrent putOverlay() and dropOverlay() for the same cid
OK, I reworked a lot of the tests. Bottom line: concurrent deletes/puts are now handled properly. If an overlay is marked as deleted, which means that a delete is in progress, puts to that overlay are no longer possible until the delete is stopped or finished.
Note that "stopping the delete" is possible because we're using a delete future handle, but it hasn't been fully exposed yet. I'll do this in subsequent iterations - this PR is way too big already.
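As a rough model of the semantics described above, here is a hypothetical Python sketch (`OverlayStore` and the status names are illustrative only; the real implementation is async and uses a delete future handle rather than a synchronous cleanup):

```python
from enum import Enum, auto

class OverlayStatus(Enum):
    ACTIVE = auto()
    DELETING = auto()

class OverlayStore:
    """Toy model of the described behavior: once an overlay is marked
    as deleting, puts for that cid are rejected until the delete is
    stopped or finished."""

    def __init__(self):
        self._status = {}  # cid -> OverlayStatus

    def put_overlay(self, cid):
        if self._status.get(cid) is OverlayStatus.DELETING:
            raise RuntimeError(f"delete in progress for {cid}")
        self._status[cid] = OverlayStatus.ACTIVE

    def drop_overlay(self, cid):
        # marks the overlay; actual block cleanup would run asynchronously
        self._status[cid] = OverlayStatus.DELETING

    def stop_or_finish_delete(self, cid):
        # after this, puts to the cid are allowed again
        self._status.pop(cid, None)

s = OverlayStore()
s.put_overlay("zE2...")
s.drop_overlay("zE2...")
try:
    s.put_overlay("zE2...")   # rejected while the delete is in progress
    rejected = False
except RuntimeError:
    rejected = True
s.stop_or_finish_delete("zE2...")
s.put_overlay("zE2...")       # allowed again once the delete is done
```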
  var
    repoDs: Datastore
    metaDs: Datastore
    path = currentSourcePath() # get this file's name
It's probably nicer to use a temporary directory, instead of putting it next to the test sources. You can use createTempDir.
  test "validator marks proofs as missing":
    let node = await testbed.node.persistence.start()
    let node = await testbed.node.log("archivist", "validator").persistence.start()
I'm guessing that these are left-overs from some testing that you did? They probably shouldn't be committed.
For all integration tests, I've made sure that logging is disabled by default, so that when you enable it for some test on your local machine, you don't need to search through unrelated logs.
Also, when you start a validator, the "validator" log topic is added automatically.
  arguments.add("--block-ttl=" & $blockTtl)
  if blockMaintenanceInterval =? builder.blockMaintenanceInterval:
    arguments.add("--block-mi=" & $blockMaintenanceInterval)
  arguments.add("--circuit-dir=" & builder.dataDirResolved / "circuits")
What is the reason for adding this? The --circom-r1cs, --circom-wasm, --circom-zkey and --circom-graph are already set above, and they use a different circuit directory (the one in hardhat, to match the circuit that is used for the smart contracts)
Without this, it will look for circuit files in the OS common directory first. If you ever ran archivist without passing --data-dir, you would have circuit files downloaded there, and that obviously breaks the tests.
Testbed always passes --data-dir to a node:
Reason: nimble is very inconsistent with larger dependency graphs, and doesn't allow overriding stint with our forked version.
Co-authored-by: markspanbroek <mark@spanbroek.net>
Force-pushed faeb0ce to 930d1f4
markspanbroek
left a comment
The recent changes look good, thanks!
I'm approving this, so that you can merge it as soon as you've addressed the remaining review comments.
This PR introduces several important changes:
There are more improvements coming after this:
The change is quite big due to its crosscutting nature.
This depends on durability-labs/nim-kvstore#2, durability-labs/archivist-dht#7 and durability-labs/nim-metrics#1