diff --git a/src/ipips/ipip-0499.md b/src/ipips/ipip-0499.md
new file mode 100644
index 00000000..d5340478
--- /dev/null
+++ b/src/ipips/ipip-0499.md
@@ -0,0 +1,178 @@
+---
+title: 'IPIP-0499: UnixFS CID Profiles'
+date: 2025-12-13
+ipip: proposal
+editors:
+  - name: Michelle Lee
+    github: mishmosh
+    affiliation:
+      name: IPFS Foundation
+      url: https://ipfsfoundation.org
+  - name: Daniel Norman
+    github: 2color
+    affiliation:
+      name: Independent
+      url: https://norman.life
+  - name: Marcin Rataj
+    github: lidel
+    affiliation:
+      name: Shipyard
+      url: https://ipshipyard.com/
+relatedIssues:
+  - https://discuss.ipfs.tech/t/should-we-profile-cids/18507
+thanks:
+  - name: Alex Potsides
+    github: achingbrain
+    affiliation:
+      name: Shipyard
+      url: https://ipshipyard.com/
+  - name: Juan Caballero
+    github: bumblefudge
+    affiliation:
+      name: IPFS Foundation
+      url: https://ipfsfoundation.org
+  - name: Hector Sanjuan
+    github: hsanjuan
+    affiliation:
+      name: Shipyard
+      url: https://ipshipyard.com/
+  - name: Steven Vandevelde
+    github: icidasset
+  - name: Christian Paul
+    github: jaller94
+  - name: Rod Vagg
+    github: rvagg
+  - name: Seth Docherty
+    github: SethDocherty
+order: 0499
+tags: ['ipips']
+---
+
+## Summary
+
+This proposal introduces **configuration profiles** for CIDs that represent files and directories using [UnixFS](https://specs.ipfs.tech/unixfs/). The legacy profiles table also documents non-UnixFS implementations for reference.
+
+## Motivation
+
+UnixFS CIDs are currently non-deterministic. The same file or directory can produce different CIDs across implementations, because parameters like chunk size, DAG width, and layout vary between implementations. Often, these parameters are not even configurable by users.
+
+This creates three problems:
+
+- **Verification difficulty:** The same content produces different CIDs across tools, making content verification unreliable.
+- **Additional overhead:** Users must store and transfer UnixFS Merkle proofs to verify CIDs, adding storage overhead, network bandwidth, and complexity.
+- **Broken expectations:** Unlike standard hash functions, where identical input produces identical output, UnixFS CIDs behave unpredictably.
+
+Configuration profiles solve this by explicitly defining all parameters that affect CID generation. This preserves UnixFS flexibility (users can still choose parameters) while enabling deterministic results.
+
+## Detailed design
+
+We introduce a set of **named configuration profiles**, each specifying the complete set of parameters for generating UnixFS CIDs. When implementations use these profiles, they guarantee that the same input, processed with the same profile, will yield the same CID across different tools and implementations.
+
+### UnixFS parameters
+
+Here is the complete set of UnixFS parameters that affect the resulting string encoding of the CID:
+
+1. CID version, e.g. CIDv0 or CIDv1
+1. Multibase encoding for the CID, e.g. `base32`
+1. Hash function used for all nodes in the DAG, e.g. `sha2-256`
+1. UnixFS file chunking algorithm
+1. UnixFS file chunk size or target (if required by the chunking algorithm)
+1. UnixFS DAG layout, e.g. `balanced`, `trickle`
+1. UnixFS DAG width (max number of links per `File` node)
+1. `HAMTDirectory` fanout: the bitwidth determines the fanout of the `HAMTDirectory` (the default bitwidth of 8 gives 2^8 = 256 leaves).
+1. `HAMTDirectory` threshold: max `Directory` size before switching to `HAMTDirectory`. Size can be calculated from the full serialized [PBNode](https://specs.ipfs.tech/unixfs/#dag-pb-node) size (recommended), estimated from the `PBNode.Links` size (name + CID), or approximated by link count (naive).
+1. Leaf Envelope: either `dag-pb` or `raw`
+1. Whether empty directories are included in the DAG. Some implementations may apply filtering.
+1. Whether hidden entities (including dot files) are included in the DAG. Some implementations may apply filtering.
+1. Directory wrapping for single files: in order to retain the name of a single file, some implementations have the option to wrap the file in a `Directory` with a link to the file.
+1. Presence and accurate setting of `Tsize`.
+1. Symlink handling: preserved as UnixFS Type=4 nodes, or followed (dereferenced to target).
+
+The [UnixFS spec](https://specs.ipfs.tech/unixfs/) defines Type=4 for symlinks, with the target path stored in the `Data` field.
+
+## CID profiles
+
+To enable consistent CID generation, we define a series of named profiles that specify complete UnixFS parameter sets. Profile names may have any prefix, but must end in `YYYY-MM`.
+
+The initial profile in the series, **`unixfs-2025`**, captures the baseline default parameters used by multiple implementations as of November 2025.
+
+| Parameter                     | `unixfs-2025` |
+| ----------------------------- | ------------------------------------------------------- |
+| CID version                   | CIDv1 |
+| Hash function                 | sha2-256 |
+| Chunking algorithm            | fixed-size |
+| Max chunk size                | 1MiB |
+| DAG layout                    | balanced |
+| DAG width (children per node) | 1024 |
+| `HAMTDirectory` fanout        | 256 blocks |
+| `HAMTDirectory` threshold     | TODO (likely entire block size, as in Helia) |
+| Leaves                        | raw |
+| Empty directories             | TODO (kubo needs opt-out flag) |
+| Hidden entities               | TODO |
+| Symlinks                      | TODO (preserved?) |
+
+## Legacy profiles
+
+We also define a series of **legacy profiles**, used by various implementations as of November 2025:
+
+| Parameter                     | `kubo-legacy-2025` (v0.39) | `helia-2025` | `storacha-2025` | `kubo-2025` | `kubo-wide-2025` | `dasl-2025` |
+| ----------------------------- | ------------------------------ | --------------- | ------------------ | ------------------ | ----------------------- | ------------- |
+| Based on                      | kubo v0.39 (`legacy-cid-v0`) | @helia/unixfs 6.0.4 | w3cli 7.12.0 | kubo v0.39 (`test-cid-v1`) | kubo v0.39 (`test-cid-v1-wide`) | 2025-12 |
+| CID version                   | CIDv0 | CIDv1 | CIDv1 | CIDv1 | CIDv1 | CIDv1 |
+| Hash function                 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 | sha2-256 |
+| Chunking algorithm            | fixed-size | fixed-size | fixed-size | fixed-size | fixed-size | not specified |
+| Max chunk size                | 256KiB | 1MiB | 1MiB | 1MiB | 1MiB | not specified |
+| DAG layout                    | balanced | balanced | balanced | balanced | balanced | not specified |
+| DAG width (children per node) | 174 | 1024 | 1024 | 174 | **1024** | not specified |
+| `HAMTDirectory` fanout        | 256 blocks | 256 blocks | 256 blocks | 256 blocks | **1024** | not specified |
+| `HAMTDirectory` threshold     | 256KiB (est:links[name+cid]) | 256KiB (est) | 1000 **links** | 256KiB (est:links[name+cid]) | **1MiB** (est:links[name+cid]) | not specified |
+| Leaves                        | dag-pb | raw | raw | raw | raw | not specified |
+| Empty directories             | included | included | excluded | included | included | not specified |
+| Hidden entities               | opt-in | opt-in | opt-in | opt-in | opt-in | not specified |
+| Symlinks                      | preserved | followed | followed | preserved | preserved | not specified |
+
+**Terminology:**
+- `included`: Always included in the DAG (no option to exclude)
+- `excluded`: Always excluded from the DAG (no option to include)
+- `opt-in`: Excluded by default; implementations provide a flag to include (e.g., `--hidden` in Kubo/Storacha, `hidden: true` in Helia)
+- `opt-out`: Included by default; implementations provide a flag to exclude
+- `preserved`: Symlinks stored as UnixFS Type=4 nodes with target path (per [UnixFS spec](https://specs.ipfs.tech/unixfs/)). Note: Kubo (v0.39) `--dereference-args` only follows symlinks passed as CLI arguments; symlinks found during recursive traversal are always preserved.
+- `followed`: Symlinks dereferenced and treated as target files/directories
+
+See related discussion at https://discuss.ipfs.tech/t/should-we-profile-cids/18507/
+
+### User benefit
+
+Profiles provide three key advantages for working with content-addressed data:
+
+1. **Predictable, deterministic behavior:** Profiles restore the expected property of content addressing: identical input data always produces identical CIDs, regardless of which implementation generates them.
+
+2. **Lightweight verification:** Users can verify content without relying on additional Merkle proofs or CAR files.
+
+3. **Simplified workflow:** Users can select a profile and automatically get consistent CIDs across all implementations, without needing to configure or understand the underlying parameters.
+
+### Compatibility
+
+UnixFS data encoded with the CID profiles defined in this IPIP remains fully compatible with existing implementations, since it conforms to the [specification](https://specs.ipfs.tech/unixfs/).
+
+To generate CIDs in compliance with this IPIP, implementations MUST support the parameters defined in the profiles and support the set of named profiles. They MAY also support legacy profiles. For existing implementations, reaching compliance may involve:
+
+* Adding new functionality to support parameters and/or profiles
+* Exposing configuration options for profiles
+
+### Alternatives
+
+As an alternative to profiles, users can store and transfer CAR files of UnixFS content, which include the Merkle DAG nodes needed to verify the CID.
+
+## Test fixtures
+
+TODO
+
+List relevant CIDs. Describe how implementations can use them to determine
+specification compliance. This section can be skipped if IPIP does not deal
+with the way IPFS handles content-addressed data, or the modified specification
+file already includes this information.
+
+### Copyright
+
+Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).