Commits (41; changes shown from 17):

- 8842176 Create ipip-0000.md (mishmosh, Apr 3, 2025)
- 4ba68f0 Update and rename ipip-0000.md to ipip-0499.md (mishmosh, Apr 3, 2025)
- 6cc64cb add extra attributes proposed in review (lidel, Apr 15, 2025)
- d8b8389 incorporate kubo#10774 (lidel, Apr 15, 2025)
- 600d1fc Merge branch 'main' into patch-1 (bumblefudge, May 5, 2025)
- 595588c Update src/ipips/ipip-0499.md (2color, Aug 12, 2025)
- 41f9b86 add daniel as editor (2color, Aug 12, 2025)
- 229988f edit summary and motivation (2color, Aug 12, 2025)
- f37e610 edit summary (2color, Aug 12, 2025)
- 7a12f0a edit parameters and design (2color, Aug 12, 2025)
- ff69e56 edit user benefit and compatibility (2color, Aug 12, 2025)
- 09baf68 refine parameters and introduce a named profile (2color, Aug 12, 2025)
- cffade8 Apply suggestions from code review (2color, Aug 20, 2025)
- 0402c84 edit based on hector's feedback (2color, Aug 20, 2025)
- ec07e30 Apply suggestions from code review (2color, Aug 20, 2025)
- f454912 add multibase encoding (2color, Aug 20, 2025)
- 9c621ba address feedback from rvagg (2color, Aug 20, 2025)
- c109c1a Update ipip-0499.md (mishmosh, Nov 15, 2025)
- 383f9e3 Update src/ipips/ipip-0499.md (mishmosh, Nov 20, 2025)
- e564968 Update src/ipips/ipip-0499.md (mishmosh, Nov 20, 2025)
- bbd547f Update src/ipips/ipip-0499.md (lidel, Nov 20, 2025)
- 70514b9 fix typo (the the) (mishmosh, Nov 21, 2025)
- 89c9c62 Merge branch 'main' into patch-1 (lidel, Dec 12, 2025)
- 92352d7 feat(ipip-0499): add chunking algorithm and align profile tables (lidel, Dec 12, 2025)
- 9d0d415 fix(ipip-0499): correct kubo legacy profile (lidel, Dec 12, 2025)
- a3dc7e2 fix(ipip-0499): document legacy profile filtering behavior (lidel, Dec 13, 2025)
- 94a1b79 fix(ipip-0499): note that legacy table includes non-UnixFS implementa… (lidel, Dec 13, 2025)
- 7a8d6ab feat(ipip-0499): add implementation versions to legacy profiles table (lidel, Dec 13, 2025)
- a3044d6 fix(ipip-0499): update HAMTDirectory threshold and clean up parameters (lidel, Dec 13, 2025)
- 5b19f2b feat(ipip-0499): document symlink handling in profiles (lidel, Dec 13, 2025)
- 3a092a4 fix(ipip-0499): clarify HAMTDirectory threshold calculation methods (lidel, Dec 13, 2025)
- 123be3d fix(ipip-0499): update metadata and add contributors (lidel, Dec 13, 2025)
- 263892a feat(ipip-0499): document HAMTDirectory threshold estimation methods (lidel, Jan 13, 2026)
- e2f95dd chore: bump spec-generator to 1.7.0 (lidel, Jan 13, 2026)
- b832bcc feat(ipip-0499): restructure document and rename profiles (lidel, Jan 14, 2026)
- d7e81d7 feat(ipip-0499): document efficiency benefits of modern profile param… (lidel, Jan 14, 2026)
- 26162e2 Merge branch 'main' into patch-1 (lidel, Jan 16, 2026)
- 37132f1 feat(ipip-0499): add singularity to divergence table (lidel, Jan 16, 2026)
- 0188e10 feat(ipip-0499): add test fixtures section with deterministic CIDs (lidel, Jan 24, 2026)
- 273a2d3 fix(ipip-0499): update test fixtures with chunk threshold vectors (lidel, Jan 27, 2026)
- 62d3cae refactor(ipip-0499): restructure Motivation section (lidel, Jan 28, 2026)

132 changes: 132 additions & 0 deletions src/ipips/ipip-0499.md
---
title: 'IPIP-0499: CID Profiles'
date: 2025-04-03
ipip: proposal
editors:
- name: Michelle Lee
github: mishmosh
affiliation:
name: IPFS Foundation
- name: Daniel Norman
github: 2color
affiliation:
name: Shipyard
url: https://ipshipyard.com
relatedIssues:
- https://discuss.ipfs.tech/t/should-we-profile-cids/18507
order: 0499
tags: ['ipips']
---

## Summary

This proposal introduces configuration profiles for CIDs used to represent files and directories with UnixFS. These profiles ensure deterministic CID generation for the same data, regardless of the implementation.

Profiles explicitly define the UnixFS parameters that affect the resulting CID, e.g. DAG width, hash algorithm, and chunk size, so that given the same profile and input data, different implementations will generate identical CIDs.

## Motivation

UnixFS CIDs are not deterministic. The same file tree can yield different CIDs depending on the parameters an implementation uses to generate them, which in some cases aren't even configurable by the user. For example, the chunk size, DAG width, and layout can vary between implementations or even between different versions of the same implementation.

This lack of determinism has a number of drawbacks:

- It is difficult to verify content across different tools and implementations, as the same content may yield different CIDs.
- Users are required to store and transfer UnixFS merkle proofs in order to verify CIDs, adding storage overhead, network bandwidth, and complexity to the verification process.
- In terms of developer experience, it deviates from the mental model of a hash function, where the same input should always yield the same output. This leads to confusion and frustration when working with UnixFS CIDs.

By introducing profiles which define the parameters that affect the root CID of the DAG, we can benefit from both the optionality offered by UnixFS, where users are free to choose their own parameters, and determinism through profiles.

## Detailed design

We introduce a set of named profiles, each defining the parameters for generating UnixFS CIDs. These profiles can be used by implementations to ensure that the same content will yield the same CID across different tools and implementations.

### UnixFS parameters

The profiles define the set of parameters that affect the resulting CID and its string encoding. These parameters are based on the UnixFS specification and are used to generate the CID for a given file tree. The parameters include:

1. CID version, e.g. CIDv0 or CIDv1
1. Multibase encoding for the CID, e.g. base32
1. Hash function used for all nodes in the DAG, e.g. sha2-256
1. UnixFS file chunking algorithm
1. UnixFS file chunk size or target (if required by the chunking algorithm)
1. UnixFS DAG layout (e.g. balanced, trickle etc...)
1. UnixFS DAG width (max number of links per `File` node)
1. `HAMTDirectory` bitwidth, i.e. the number of bits that determines the fanout of the `HAMTDirectory` (the default bitwidth of 8 yields a fanout of 2^8 = 256 leaves).
1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PBNode.Links
Contributor:
If this number is dynamic based on the lengths of the actual link entries in the dag, we will need to specify what algorithm that estimation follows. I would put such things in a special "ipfs legacy" profile to be honest, along with cidv0, non-raw leaves etc. We probably should heavily discourage coming up with profiles that do weird things, like dynamically setting params or not using raw-leaves for things.

Contributor:
So, each layout would have its own set of layout-params:

- balanced:
  - max-links: N
- trickle:
  - max-leaves-per-level: N

Member:
> We probably should heavily discourage coming up with profiles that do weird things, like dynamically setting params or not using raw-leaves for things.

Yeah, that's exactly what we're doing by defining this profile.

Collaborator:
wait is kubo dynamically assigning HAMT Directory threshold, currently? i was assuming this was a static number!

Collaborator:
The current spec mentions fanout but not threshold, so I'm a little confused what current implementations are doing and whether it's even worth fitting into the profile system, or just giving up and letting a significant portion of HAMT-sharded legacy data be unprofiled/not-recreatable using the profiles...

Contributor:
@lidel Is this written down in any of the specs? Or is it just in the code at this point?

Contributor Author:
@lidel @hsanjuan Trying to understand/resolve this thread. Can you confirm if this is current kubo behavior?

> HAMTDirectory threshold (max Directory size before switching to HAMTDirectory): based on an estimate of the block size by counting the size of PBNode.Links

Member (@lidel, Nov 13, 2025):
AFAIK the decision when to switch to HAMTDirectory is implementation-specific behavior. So far the rule of thumb is to keep blocks under 1-2MiB, and it is usually a good idea to match the defined chunk size (default or defined by user).

Implementation-wise, both GO (Boxo/Kubo) and JS (Helia) have a size-based heuristic that decides when to switch from a normal Directory to a HAMTDirectory:

iirc (from 2-year-old memory, something to check/confirm) the size estimation details are likely different between GO and JS. Both estimate the serialized DAGNode size by calculating the aggregate byte length of directory entries (link names + CIDs), though the JavaScript implementation appears to include additional metadata in its calculation:

- Kubo's size estimation method is likely `estimatedSize = sum(len(link.Name) + len(link.Cid.Bytes()) for each link)`
- Helia's is likely "the size of the final DAGNode (including link names, sizes, optional metadata fields etc)"

If true, the slight differences in calculation methods might result in directories sharding at marginally different sizes.
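
For illustration, the link-based estimate described above amounts to something like the following Go sketch. The `Link` type and function names are hypothetical and are not Boxo's or Helia's actual code:

```go
// Rough sketch of the link-based estimate described in the comment above.
// The Link type and function names are hypothetical; this is not Boxo's or
// Helia's actual code.
package sharding

// Link is a minimal stand-in for a dag-pb directory entry.
type Link struct {
	Name     string // entry name
	CIDBytes []byte // binary CID of the entry
}

// estimatedDirSize approximates the serialized directory size by summing the
// byte length of each entry's name and CID, as described above for Kubo.
func estimatedDirSize(links []Link) int {
	size := 0
	for _, l := range links {
		size += len(l.Name) + len(l.CIDBytes)
	}
	return size
}

// shouldShard reports whether the estimate exceeds a threshold (e.g. 256 KiB),
// i.e. whether the plain Directory should become a HAMTDirectory.
func shouldShard(links []Link, threshold int) bool {
	return estimatedDirSize(links) > threshold
}
```

An exact alternative, as suggested in the next comment, would serialize the PBNode and measure the resulting bytes instead of summing name and CID lengths.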

Member:
If you want to be exact you have to take into account any non-zero value fields in the serialized root UnixFS metadata since these affect the block size.

It's quite possible that Kubo will produce a HAMT block that's too big with a certain combination of directory entry names if someone has also changed the encoded directory's default mtime or whatever, probably because the "should-I-shard" feature pre-dates Kubo's ability to add UnixFSv1.5 metadata to things.

Really there's no need to estimate anything - it's trivial to count the actual bytes that a block will take up and then shard if necessary.

Member:
Documented the need for:

- kubo implementing correct estimation based on total block size
- kubo configuration to switch between old and new estimation method

Contributor:
Suggested change (original line, then proposed replacement):
1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links
1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links. We do not include details about the estimation algorithm as we do not encourage implementations to support it.

Member (@lidel, Nov 13, 2025):
Bit odd to discourage, when both of the most popular implementations in GO and JS use a size-based heuristic - #499 (comment)

Unsure how to handle this. Perhaps clarify the heuristic is implementation-specific, and when deterministic behavior is expected, a specific heuristic should be used?

Member (@achingbrain, Nov 13, 2025):
I don't think we should be estimating the block size as it's trivial to calculate it exactly. Can we not just define this (and punt to the spec for the details) to make it less hand-wavey?

Suggested change (original line, then proposed replacement):
1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on an estimate of the block size by counting the size of PNNode.Links
1. `HAMTDirectory` threshold (max `Directory` size before switching to `HAMTDirectory`): based on the final size of the serialized form of the [PBNode protobuf message](https://specs.ipfs.tech/unixfs/#dag-pb-node) that represents the directory.

1. Leaf Envelope: either `dag-pb` or `raw`
1. Whether empty directories are included in the DAG. Some implementations apply filtering before merkleizing filesystem entries in the DAG.
1. Directory wrapping for single files: in order to retain the name of a single file, some implementations have the option to wrap the file in a `Directory` with a link to the file.
1. Presence and accurate setting of `Tsize`.

This would be specified as a table in the forthcoming [UnixFS spec](https://github.com/ipfs/specs/pull/331/files).
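
To make the parameter list above concrete, a profile can be thought of as a plain configuration record. The following Go sketch is illustrative only; the type and field names are not taken from Boxo, Helia, or any existing library, they simply mirror the list above:

```go
// Illustrative sketch only: these type and field names are not from Boxo,
// Helia, or any other library; they simply mirror the parameter list above.
package profile

// Layout enumerates the DAG layouts referenced in the parameter list.
type Layout string

const (
	LayoutBalanced Layout = "balanced"
	LayoutTrickle  Layout = "trickle"
)

// Profile captures every parameter that influences the resulting CID
// (and its string form) for a UnixFS file tree.
type Profile struct {
	Name             string // e.g. "unixfs-2025"
	CIDVersion       int    // 0 or 1
	Multibase        string // e.g. "base32"; affects the string form only
	HashFunction     string // e.g. "sha2-256"
	Chunker          string // chunking algorithm, e.g. fixed-size
	MaxChunkSize     int64  // bytes; only if the chunker needs a size or target
	DAGLayout        Layout // balanced, trickle, ...
	MaxLinksPerFile  int    // DAG width: max links per File node
	HAMTFanout       int    // 2^bitwidth, e.g. 2^8 = 256
	HAMTThreshold    int64  // max Directory size before switching to HAMTDirectory
	RawLeaves        bool   // leaf envelope: raw (true) or dag-pb (false)
	IncludeEmptyDirs bool   // whether empty directories are merkleized
	WrapSingleFile   bool   // wrap a lone file in a Directory to keep its name
	SetTsize         bool   // whether Tsize is present and accurately set
}
```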

## Named profiles

To make it easier for users and implementations to choose a set of parameters, we define a named profile `unixfs-2025` to encapsulate the parameters established as the baseline default by multiple implementations as of 2025.

The **`unixfs-2025`** profile name is designed to be referenced by implementations and users to ensure that the same content will yield the same CID across different tools and implementations.

The profile is defined as follows:

| Parameter | Value |
| ----------------------------- | ------------------------------------------------------- |
| CID version | CIDv1 |
| Hash function | sha2-256 |
| Max chunk size | 1MiB |
| DAG layout | balanced |
| DAG width (children per node) | 1024 |
| `HAMTDirectory` fanout | 256 blocks |
| `HAMTDirectory` threshold | 256KiB (estimated by counting the size of PBNode.links) |
| Leaves | raw |
| Empty directories | TODO |
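
Using the hypothetical `Profile` type sketched in the previous section, the table above could be expressed as a value like this; fields the table marks as TODO or does not cover are deliberately left unset:

```go
// The unixfs-2025 profile from the table above, expressed as a value of the
// hypothetical Profile type sketched earlier.
var UnixFS2025 = Profile{
	Name:            "unixfs-2025",
	CIDVersion:      1,
	HashFunction:    "sha2-256",
	MaxChunkSize:    1 << 20, // 1 MiB max chunk size
	DAGLayout:       LayoutBalanced,
	MaxLinksPerFile: 1024, // DAG width (children per node)
	HAMTFanout:      256,
	HAMTThreshold:   256 << 10, // 256 KiB, estimated from PBNode.Links sizes
	RawLeaves:       true,
	// IncludeEmptyDirs: TODO in the table above.
}
```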

## Current defaults

Here is a summary table of current (2025-Q2) defaults:

| Parameter                     | Helia default | Kubo `legacy-cid-v0` (default) | Storacha default | Kubo `test-cid-v1` | Kubo `test-cid-v1-wide` | DASL          |
| ----------------------------- | ------------- | ------------------------------ | ---------------- | ------------------ | ----------------------- | ------------- |
| CID version | CIDv1 | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 |
| Hash function | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 |
| Max chunk size | 1MiB | 256KiB | 1MiB | 1MiB | 1MiB | not specified |
| DAG layout | balanced | balanced | balanced | balanced | balanced | not specified |
| DAG width (children per node) | 1024 | 174 | 1024 | 174 | **1024** | not specified |
| `HAMTDirectory` fanout | 256 blocks | 256 blocks | 256 blocks | 256 blocks | **1024** | not specified |
| `HAMTDirectory` threshold | 256KiB (est) | 256KiB (est:links[name+cid]) | 1000 **links** | 256KiB | **1MiB** | not specified |
| Leaves | raw | raw | raw | raw | raw | not specified |
| Empty directories | Included | Included | Ignored | Included | Included | not specified |

See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/18507/

### User benefit

Profiles reduce the burden of verifying UnixFS content, as users can simply choose a profile and know that the resulting CIDs will be deterministic across implementations. This eliminates the need for users to understand the underlying parameters that affect CID generation, and allows them to focus on the content itself.

Moreover, profiles allow users to verify content without needing to rely on additional merkle proofs and CAR files, which can be cumbersome and inefficient.

Finally, profiles improve the developer experience by aligning with the mental model of a hash function.

### Compatibility

UnixFS data encoded with the profiles defined in this IPIP is fully compatible with existing implementations, as it is fully compliant with the UnixFS specification.

To produce CIDs that are compliant with this IPIP, implementations will need to support the parameters defined in the profiles. This may require changes to existing implementations to expose configuration options for the parameters, or to implement new functionality to support the profiles.

Kubo 0.35 will have an [`Import.*` configuration](https://github.com/ipfs/kubo/blob/master/docs/config.md#import) option to control DAG width.

### Alternatives

As an alternative to profiles, users can store and transfer CAR files of UnixFS content, which include the merkle DAG nodes needed to verify the CID.

## Test fixtures
Member:
Just noting this is (imo) a blocker.

We did not merge the UnixFS spec until we had a sensible set of fixtures that people could use as a reference.

The spec may be incomplete, but a fixture will let people reverse-engineer any details, and then PR improvements to the spec.

Without fixtures for each UnixFS node type, we risk unknown unknowns silently impacting the final CID (e.g. because we did not know that someone may decide to place leaves one level sooner as an "optimization" and someone else always at the bottom, as "formal consistency").

Contributor Author:
Tracking this in ipfs/kubo#11071

Member:
Thanks!


TODO

List relevant CIDs. Describe how implementations can use them to determine
specification compliance. This section can be skipped if IPIP does not deal
with the way IPFS handles content-addressed data, or the modified specification
file already includes this information.
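
As a sketch of how implementations might consume such fixtures once they are published, a table-driven test could compare computed root CIDs against the reference values. Everything below is hypothetical: `addWithProfile` is a stand-in for whatever import API an implementation exposes, and the expected CIDs are placeholders, not real fixture values.

```go
// Hypothetical compliance harness for published fixtures. addWithProfile and
// the expected CIDs are placeholders, not a real API or real fixture values.
package fixtures_test

import "testing"

// addWithProfile should return the root CID produced for the given input
// path when imported under the named profile. It is a placeholder.
func addWithProfile(path, profile string) (string, error) {
	panic("placeholder: wire up to a real implementation")
}

func TestUnixFS2025Fixtures(t *testing.T) {
	cases := []struct {
		name    string
		input   string // fixture input path
		wantCID string // expected root CID under the unixfs-2025 profile
	}{
		{"single small file", "fixtures/hello.txt", "TODO: published fixture CID"},
		{"file above max chunk size", "fixtures/2mib.bin", "TODO: published fixture CID"},
		{"directory above HAMT threshold", "fixtures/many-entries", "TODO: published fixture CID"},
	}
	for _, c := range cases {
		got, err := addWithProfile(c.input, "unixfs-2025")
		if err != nil {
			t.Fatalf("%s: %v", c.name, err)
		}
		if got != c.wantCID {
			t.Errorf("%s: got %s, want %s", c.name, got, c.wantCID)
		}
	}
}
```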

### Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).