# IPIP-499: UnixFS CID Profiles #499
@@ -0,0 +1,132 @@
---
title: 'IPIP-0499: CID Profiles'
date: 2025-11-14
ipip: proposal
editors:
  - name: Michelle Lee
    github: mishmosh
    affiliation:
      name: IPFS Foundation
      url: https://ipfsfoundation.org
  - name: Daniel Norman
    github: 2color
    affiliation:
      name: Independent
      url: https://norman.life
relatedIssues:
  - https://discuss.ipfs.tech/t/should-we-profile-cids/18507
order: 0499
tags: ['ipips']
---
## Summary

This proposal introduces **configuration profiles** for CIDs that represent files and directories using [UnixFS](https://specs.ipfs.tech/unixfs/).

## Motivation

UnixFS CIDs are currently non-deterministic. The same file or directory can produce different CIDs across implementations, because parameters like chunk size, DAG width, and layout vary between implementations. Often, these parameters are not even configurable by users.

This creates three problems:

- **Verification difficulty:** The same content produces different CIDs across tools, making content verification unreliable.
- **Additional overhead:** Users must store and transfer UnixFS merkle proofs to verify CIDs, adding storage overhead, network bandwidth, and complexity.
- **Broken expectations:** Unlike standard hash functions where identical input produces identical output, UnixFS CIDs behave unpredictably.

Configuration profiles solve this by explicitly defining all parameters that affect CID generation. This preserves UnixFS flexibility (users can still choose parameters) while enabling deterministic results.

## Detailed design

We introduce a set of **named configuration profiles**, each specifying the complete set of parameters for generating UnixFS CIDs. When implementations use these profiles, they guarantee that the same input, processed with the same profile, will yield the same CID across different tools and implementations.
### UnixFS parameters

Here is the complete set of UnixFS parameters that affect the resulting string encoding of the CID:

1. CID version, e.g. CIDv0 or CIDv1
2. Multibase encoding for the CID, e.g. base32
3. Hash function used for all nodes in the DAG, e.g. sha2-256
4. UnixFS file chunking algorithm
5. UnixFS file chunk size or target (if required by the chunking algorithm)
6. UnixFS DAG layout (e.g. balanced, trickle, etc.)
7. UnixFS DAG width (max number of links per `File` node)
8. `HAMTDirectory` fanout, i.e. the number of bits that determines the fanout of the `HAMTDirectory` (the default bitwidth is 8, i.e. 256 leaves)
9. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size obtained by counting the size of PBNode.Links
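For illustration only, here is a minimal sketch (in Go) of what a named profile covering the parameters above could look like. The type, field names, and the example values are assumptions made for this sketch and are not part of the proposal.

```go
package main

import "fmt"

// UnixFSProfile is a hypothetical grouping of the parameters listed above.
// Field names and values are illustrative assumptions, not the normative
// profile format this IPIP would define.
type UnixFSProfile struct {
	Name            string // profile identifier (made-up example)
	CIDVersion      int    // 0 or 1
	Multibase       string // e.g. "base32"
	HashFunction    string // e.g. "sha2-256"
	Chunker         string // chunking algorithm, e.g. "fixed-size"
	ChunkSize       int    // bytes per chunk, if the chunker needs one
	DAGLayout       string // "balanced" or "trickle"
	MaxLinksPerNode int    // DAG width: max links per File node
	HAMTFanoutBits  int    // bitwidth; 8 => 256 leaves per HAMT node
	HAMTThreshold   int    // Directory size (bytes) before switching to HAMTDirectory
}

func main() {
	// An example profile with commonly seen values (assumed, not normative).
	p := UnixFSProfile{
		Name:            "example-balanced-v1",
		CIDVersion:      1,
		Multibase:       "base32",
		HashFunction:    "sha2-256",
		Chunker:         "fixed-size",
		ChunkSize:       256 * 1024,
		DAGLayout:       "balanced",
		MaxLinksPerNode: 174,
		HAMTFanoutBits:  8,
		HAMTThreshold:   256 * 1024,
	}
	fmt.Printf("%+v\n", p)
}
```

A real profile definition would presumably live in the spec as a table or registry rather than code; the sketch only illustrates that every parameter above needs an explicit, fixed value for CIDs to be reproducible.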
Suggested change to item 9 above:

> 9. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PBNode.Links. We do not include details about the estimation algorithm as we do not encourage implementations to support it.
Bit odd to discourage this, when both of the most popular implementations in Go and JS use a size-based heuristic - #499 (comment)
Unsure how to handle this. Perhaps clarify that the heuristic is implementation-specific, and that when deterministic behavior is expected, a specific heuristic should be used?
I don't think we should be estimating the block size as it's trivial to calculate it exactly. Can we not just define this (and punt to the spec for the details) to make it less hand-wavey?
Suggested change to item 9 above:

> 9. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on the final size of the serialized form of the [PBNode protobuf message](https://specs.ipfs.tech/unixfs/#dag-pb-node) that represents the directory.
If this number is dynamic based on the lengths of the actual link entries in the dag, we will need to specify what algorithm that estimation follows. I would put such things in a special "ipfs legacy" profile to be honest, along with cidv0, non-raw leaves etc. We probably should heavily discourage coming up with profiles that do weird things, like dynamically setting params or not using raw-leaves for things.
So, each layout would have its own set of layout-params:
Yeah, that's exactly what we're doing by defining this profile.
wait is kubo dynamically assigning HAMT Directory threshold, currently? i was assuming this was a static number!
The current spec mentions fanout but not threshold, so i'm a little confused about what current implementations are doing and whether it's even worth fitting into the profile system, or just giving up and letting a significant portion of HAMT-sharded legacy data be unprofiled/not-recreatable using the profiles...
@lidel Is this written down in any of the specs? Or is it just in the code at this point?
@lidel @hsanjuan Trying to understand/resolve this thread. Can you confirm if this is current kubo behavior?
AFAIK the decision of when to use a HAMTDirectory is implementation-specific behavior. So far the rule of thumb is to keep blocks under 1-2 MiB, and it is usually a good idea to match the chunk size in use (default or defined by the user).

Implementation-wise, both Go (Boxo/Kubo) and JS (Helia) have a size-based heuristic that decides when to switch from a normal Directory to a HAMTDirectory.

iirc (from 2-year-old memory, something to check/confirm) the size estimation details may be / are likely different between Go and JS. They both estimate the serialized DAGNode size by calculating the aggregate byte length of directory entries (link names + CIDs), though the JavaScript implementation appears to include additional metadata in its calculation:

    estimatedSize = sum(len(link.Name) + len(link.Cid.Bytes()) for each link)

If true, the slight differences in calculation methods might result in directories sharding at marginally different sizes.
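To make the described heuristic concrete, here is a minimal Go sketch of a size-based estimate and shard decision of this kind. The `dirLink` type and the threshold value are assumptions for illustration; this is not the actual Boxo or Helia code.

```go
package main

import "fmt"

// dirLink is a stand-in for a directory entry: a name plus the raw CID bytes.
// It is a hypothetical type for illustration, not a Boxo/Helia API.
type dirLink struct {
	Name     string
	CIDBytes []byte
}

// estimatedSize mirrors the heuristic above: sum the byte lengths of each
// link's name and CID. It ignores protobuf framing overhead and any
// UnixFSv1.5 metadata, which is exactly why it is only an estimate.
func estimatedSize(links []dirLink) int {
	total := 0
	for _, l := range links {
		total += len(l.Name) + len(l.CIDBytes)
	}
	return total
}

// shouldShard decides whether to switch from a plain Directory to a
// HAMTDirectory. The 256 KiB threshold is an assumed example value.
func shouldShard(links []dirLink, threshold int) bool {
	return estimatedSize(links) > threshold
}

func main() {
	links := []dirLink{
		{Name: "file-a.txt", CIDBytes: make([]byte, 36)}, // ~36 bytes for a CIDv1 sha2-256
		{Name: "file-b.txt", CIDBytes: make([]byte, 36)},
	}
	fmt.Println(estimatedSize(links), shouldShard(links, 256*1024))
}
```

Because the estimate ignores protobuf framing and metadata fields, two implementations with slightly different formulas can shard at different directory sizes, which is the source of non-determinism discussed in this thread.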
If you want to be exact you have to take into account any non-zero value fields in the serialized root UnixFS metadata since these affect the block size.
It's quite possible that Kubo will produce a HAMT block that's too big with a certain combination of directory entry names if someone has also changed the encoded directory's default mtime or whatever, probably because the "should-I-shard" feature pre-dates Kubo's ability to add UnixFSv1.5 metadata to things.
Really there's no need to estimate anything - it's trivial to count the actual bytes that a block will take up and then shard if necessary.
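As a sketch of what counting the actual bytes could look like, the Go snippet below computes the exact encoded size of a dag-pb node directly from the protobuf wire format (PBNode.Data = field 1, PBNode.Links = field 2; PBLink.Hash = 1, Name = 2, Tsize = 3, per the dag-pb schema referenced in the UnixFS spec). The helper names are made up; a real implementation might simply serialize the PBNode and measure the result.

```go
package main

import "fmt"

// varintLen returns the number of bytes needed to encode v as a protobuf varint.
func varintLen(v uint64) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

// lenDelimited returns the encoded size of a length-delimited field:
// one tag byte (valid for field numbers 1-15) + length varint + payload.
func lenDelimited(payloadLen int) int {
	return 1 + varintLen(uint64(payloadLen)) + payloadLen
}

// pbLinkSize is the exact encoded size of one PBLink message
// (Hash = field 1, bytes; Name = field 2, string; Tsize = field 3, uint64).
func pbLinkSize(cidBytes []byte, name string, tsize uint64) int {
	return lenDelimited(len(cidBytes)) + // Hash
		lenDelimited(len(name)) + // Name
		1 + varintLen(tsize) // Tsize: tag byte + varint
}

// pbNodeSize is the exact encoded size of a PBNode with the given
// Data payload (field 1) and repeated Links (field 2).
func pbNodeSize(data []byte, linkSizes []int) int {
	size := lenDelimited(len(data))
	for _, ls := range linkSizes {
		size += lenDelimited(ls)
	}
	return size
}

func main() {
	link := pbLinkSize(make([]byte, 36), "file-a.txt", 1234)
	// A directory whose UnixFS Data field happens to be 4 bytes, with one link.
	fmt.Println(pbNodeSize(make([]byte, 4), []int{link}))
}
```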
Documented the need for