Status: drafted 2026-04-29, not yet started. This is the successor plan to
`PLAN.md` and supersedes the "promotion happens through a new sequenced doc" requirement at the top of `OPTIMIZATIONS.md`. Items here were promoted deliberately from `OPTIMIZATIONS.md`; the rest of that file remains deferred. Do not pull from `OPTIMIZATIONS.md` while this plan is in flight — finish a phase, demo it, then move on.
This plan covers two rounds of work, in order:
- Format support — broaden the set of archives `peel` can extract.
- Pipeline & UX improvements — make the existing extraction loop faster, more observable, and more robust under real-world conditions.
The same sequencing discipline as `PLAN.md` applies: each phase ends with
a runnable demo, and §N+1 does not begin until §N's demo passes.
- Std-first; the `ENGINEERING_STANDARDS.md` §2 allowlist still rules. Several phases in this plan call out crates that are not on the allowlist — those calls are flagged as "needs approval before Cargo.toml lands" and require an explicit decision before implementation begins.
- No async runtime. The io_uring phase (§7) is the most likely place for a slip; it must use direct syscalls or a sync wrapper, not a new reactor.
- Linux-first. Phases that add macOS- or Windows-specific code are additive behind the existing `PunchHole` trait pattern.
- Hand-rolled HTTP/1.1 stays hand-rolled. Multi-mirror (§13) and bandwidth limiting (§14) extend the existing client; they do not introduce `hyper`/`reqwest`.
- Backwards-compatible checkpoints. Any phase that grows the checkpoint format bumps `format_version` and provides a clean rejection path for older readers (per `checkpoint.rs`, `PLAN.md` §9.2). Migration tables are still out of scope (O.12).
- Items O.4 (parallel zstd block decoding), O.5 (NUMA placement), O.9 (native peel container format), O.12 (resume across version changes), O.15 (Windows support), O.16 (daemon / library mode), O.17 (HTTP/2 / HTTP/3), O.18 (pluggable destination), and O.20–O.25 (continuous fuzzing, real-world corpus tests, differential testing, file modes, xattrs, links). Those stay in `OPTIMIZATIONS.md` for a future planning round.
- Format detection by content sniffing of every byte: we sniff the magic-byte prefix only, not arbitrary content. Heuristic auto-detect beyond magic bytes is out of scope.
The format-support phases share one cross-cutting prerequisite: today
the `DecoderRegistry` keys decoder factories on file-name suffixes only.
Each new format extends both the suffix map and a new magic-byte map.
§1 lays the groundwork; §2–§5 each register their format on both maps.
What: extend the `DecoderRegistry` so that, in addition to the existing
suffix lookup, it can identify a format from the first ≤ 1 KiB of the
source. Use it as a fallback (or override) when the URL's suffix is
ambiguous, missing, or contradicts the bytes.
Why now: every later format-support phase needs to plug into this. Doing it once, well, and refactoring zstd/gzip behind it costs less than retrofitting four formats later.
Sketch:
- Add `MagicSignature { offset: u16, bytes: &'static [u8] }` and a parallel `magics: Vec<(MagicSignature, DecoderFactory)>` to the registry. Most formats live at offset 0; tar lives at offset 257.
- Add `factory_for_prefix(&[u8]) -> Option<DecoderFactory>` — longest-match wins, same rule as suffix lookup.
- Coordinator change: after the first `CHUNK_SIZE` or `max_magic_window` bytes are downloaded (whichever is smaller), call `factory_for_prefix` if `factory_for_path` returned `None` or if the resolved factories disagree. Disagreement aborts with a clean `FormatMismatch` error rather than silently picking one — the user intent (URL suffix) and the file content (magic) disagree, and we refuse to guess.
- Retrofit the `.zst`/`.zstd` and `.gz` registrations to also publish their magic signatures (zstd: `28 B5 2F FD` at 0; gzip: `1F 8B` at 0).
- CLI flags for the disagreement / unknown cases: `--format <name>` forces a specific decoder, bypassing both suffix and magic detection (e.g., for URLs like `https://example.com/download?id=42` where the suffix is unhelpful). `--force-format-from-magic` opts in to "trust the magic when suffix and magic disagree" — same workflow without having to name the format. Mutually exclusive with `--format` (clap `ArgGroup`). Both default off; the bare `peel <url>` path keeps the hard-error behavior so silent format swaps cannot happen by accident.
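The suffix-plus-magic lookup above can be sketched roughly like this. `MagicSignature` and `factory_for_prefix` follow the plan's names; `FormatName` and `demo_registry` are illustrative stand-ins for the real `DecoderFactory` wiring:

```rust
// Illustrative sketch: a string stands in for the real DecoderFactory.
type FormatName = &'static str;

struct MagicSignature {
    offset: u16,
    bytes: &'static [u8],
}

struct Registry {
    magics: Vec<(MagicSignature, FormatName)>,
}

impl Registry {
    /// Longest-match wins, mirroring the suffix-lookup rule: if two
    /// signatures both match the probe window, the longer one decides.
    fn factory_for_prefix(&self, prefix: &[u8]) -> Option<FormatName> {
        self.magics
            .iter()
            .filter(|(sig, _)| {
                let start = sig.offset as usize;
                prefix
                    .get(start..start + sig.bytes.len())
                    .map_or(false, |window| window == sig.bytes)
            })
            .max_by_key(|(sig, _)| sig.bytes.len())
            .map(|(_, name)| *name)
    }
}

fn demo_registry() -> Registry {
    Registry {
        magics: vec![
            (MagicSignature { offset: 0, bytes: &[0x28, 0xB5, 0x2F, 0xFD] }, "zstd"),
            (MagicSignature { offset: 0, bytes: &[0x1F, 0x8B] }, "gzip"),
            // tar's ustar magic + version sits at offset 257, so the
            // probe window must cover at least 265 bytes.
            (MagicSignature { offset: 257, bytes: b"ustar\x0000" }, "tar"),
        ],
    }
}
```

The `max_by_key` over signature length is the same tie-break the suffix map already uses, so the two lookups stay behaviorally symmetric.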
Demo: a test that points the coordinator at four local files —
`x.zst`, `x.bin` (zstd content, no suffix), `x.gz` (zstd content,
wrong magic vs. suffix), and the same wrong-suffix case with
`--force-format-from-magic` — and verifies: (1) the suffix path works,
(2) the magic-only path works when no suffix is registered,
(3) a suffix/magic mismatch without the override aborts cleanly with
`FormatMismatch`, (4) the same mismatch with the override extracts
correctly and emits a `warn!` recording which signal won.
What: extract uncompressed tar archives streamed over HTTP.
Why now: it's the simplest new format (no decompression layer),
exercises the §1 magic-byte path (tar's `ustar` magic is at offset
257, the only registered format that doesn't live at 0), and shakes
out any "decoder == identity" assumptions in the extractor that the
zstd-only MVP lets through.
Sketch:
- New `decode/identity.rs`: a `StreamingDecoder` that owns its source and copies bytes verbatim into the sink in bounded chunks (matches the ~1 MiB step size the zstd impl uses). `bytes_consumed()` equals bytes written so far. `frame_boundary()` returns `Some(bytes_consumed)` on every step — the whole stream is one big restart-aligned region, so the existing checkpoint cadence policy still works without special-casing.
- Register `.tar` (suffix) and the `ustar\0` / `ustar \0` magic at offset 257 (the read window must be ≥ 265 bytes).
- The extractor's existing `TarSink` already handles the byte stream; no change needed there.
- CLI: `peel https://example.com/dataset.tar -C ./out` should work without any new flag.
- Edge cases worth testing explicitly: archives < 1 KiB (the magic window is bigger than the file); archives where the first member is a PAX/GNU extended header (offset 257 is still ustar but the parser needs to traverse the extension); empty archives (two zero blocks only).
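A minimal sketch of the identity decoder's step loop, assuming a simplified decoder shape (the in-tree `StreamingDecoder` trait may differ in signature):

```rust
use std::io::{Read, Write};

// ~1 MiB bounded step, matching the zstd implementation's step size.
const STEP: usize = 1 << 20;

/// Hypothetical simplified shape; names follow the plan's sketch.
struct IdentityDecoder<R> {
    source: R,
    consumed: u64,
}

impl<R: Read> IdentityDecoder<R> {
    fn new(source: R) -> Self {
        Self { source, consumed: 0 }
    }

    /// Copy up to one bounded step of bytes verbatim. Returns Ok(0) at EOF.
    fn step<W: Write>(&mut self, sink: &mut W) -> std::io::Result<usize> {
        let mut buf = vec![0u8; STEP];
        let n = self.source.read(&mut buf)?;
        sink.write_all(&buf[..n])?;
        self.consumed += n as u64;
        Ok(n)
    }

    /// Bytes consumed equals bytes written for the identity stream.
    fn bytes_consumed(&self) -> u64 {
        self.consumed
    }

    /// Every step ends on a restart-aligned boundary.
    fn frame_boundary(&self) -> Option<u64> {
        Some(self.consumed)
    }
}
```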
Demo: the north-star use case but against a `.tar` URL; the crash-test
harness covers it.
What: support `.tar.xz` and bare `.xz` archives.
Why now: tar.xz is the dominant format for several large public
datasets (some Wikipedia dumps, several Linux distro source tarballs)
that users will hit before they hit `.tar.lz4` or `.zip`.
Dependency: `xz2` (bindings to upstream liblzma). Approved
2026-04-29 and added to `ENGINEERING_STANDARDS.md` §2.2. The
pure-Rust alternative `lzma-rs` was considered and rejected (less
battle-tested, performance gap, similar block-introspection
weakness).
Sketch:
- `decode/xz.rs`: a streaming wrapper around `xz2::stream::Stream` (the raw-API layer, not `XzDecoder`'s `Read` adapter — we want explicit step boundaries, same shape as the zstd implementation).
- Frame boundaries: xz Streams contain Blocks; resume to a Block start is correct. The pragmatic MVP-of-xz approach is to treat the whole xz Stream as one frame for checkpointing — i.e., `frame_boundary()` returns `None` until the Stream ends, then returns the post-Stream offset. That means a resume after a crash mid-xz re-decompresses from the Stream start (we still don't re-download), which is acceptable because xz decompression is much faster than xz network downloads and dataset-scale users are network-bound. Smarter block-boundary detection is a follow-on (filed below).
- Register `.xz`, `.tar.xz`, and the magic `FD 37 7A 58 5A 00` at offset 0.
- `decoder_position` semantics: continues to be the source-stream offset, which is Stream-byte progress — coarse for xz but correct.
- Multi-Stream xz files (`xz --keep` concat output): the wrapper loops `Stream::new_stream_decoder` per Stream and exposes each Stream-end as a frame boundary.
Decision (2026-04-29): round one ships per-Stream granularity. The
limitation is real for default-encoded `.tar.xz` (which is most of
them, and which is single-Block — meaning no implementation can
checkpoint within the file; the format itself does not contain a
restart point). Per-Block granularity helps only the multi-threaded /
concatenated case. File a per-Block follow-on as O.6b in
`OPTIMIZATIONS.md` once §3 lands; promote it deliberately if real
users hit the slow-resume cost.
Demo: `peel https://.../linux-6.x.tar.xz -C ./linux` extracts
correctly; the crash-test harness covers `.tar.xz`.
What: support `.tar.lz4` and bare `.lz4` archives in the
LZ4 Frame Format (RFC-style spec at github.com/lz4/lz4/blob/dev/doc).
Why now: not because lz4 is widely used in published archives —
it isn't (per O.7) — but because lz4 frames have clean, cheap
block boundaries (every block carries a 4-byte length prefix), so it
gives us a second post-MVP format with better checkpoint granularity
than xz, which is useful for testing the §1 magic-byte path against
something that isn't ustar-at-257.
Dependency: `lz4_flex` (pure Rust, no C dep, MIT/Apache).
Approved 2026-04-29 and added to `ENGINEERING_STANDARDS.md`
§2.2. The C-binding alternative `lz4` was considered and rejected
(extra build-time complexity for no functional gain in our use
case).
Sketch:
- `decode/lz4.rs`: a streaming wrapper that parses the Frame Format header itself (it's small — magic + flags + content size + HC) and feeds blocks one at a time through `lz4_flex::block::decompress_into`. This avoids relying on whatever streaming API the crate exposes for its frame layer (which has been less stable than the block layer).
- Frame boundaries: every decoded block ends at a known source offset; `frame_boundary()` returns the post-block offset on every step.
- Register `.lz4`, `.tar.lz4`, magic `04 22 4D 18` at offset 0. (Skippable frames `0x184D2A50`–`0x184D2A5F` are also valid prefixes — skip them transparently before locating the real frame.)
- Content-checksum and block-checksum flags: validate when present; surface a `DecodeError::Read` on mismatch with offset context.
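The frame-locating step can be sketched as a small classifier. The enum and function names are illustrative, but the magic values are the published LZ4 Frame Format constants (frame magic `0x184D2204`, skippable range `0x184D2A50`–`0x184D2A5F`, both stored little-endian):

```rust
const LZ4_MAGIC: u32 = 0x184D2204;
const SKIPPABLE_LO: u32 = 0x184D2A50;
const SKIPPABLE_HI: u32 = 0x184D2A5F;

#[derive(Debug, PartialEq)]
enum FramePrefix {
    /// A real LZ4 frame starts here.
    Frame,
    /// A skippable frame: skip 8 header bytes plus this payload length.
    Skippable { payload_len: u32 },
    NotLz4,
}

/// Classify the first bytes of the source. Returns None if the buffer is
/// too short to decide (the caller should read more and retry).
fn classify_prefix(buf: &[u8]) -> Option<FramePrefix> {
    let magic = u32::from_le_bytes(buf.get(0..4)?.try_into().ok()?);
    Some(if magic == LZ4_MAGIC {
        FramePrefix::Frame
    } else if (SKIPPABLE_LO..=SKIPPABLE_HI).contains(&magic) {
        // Skippable frames carry a 4-byte little-endian payload length
        // immediately after the magic.
        let len = u32::from_le_bytes(buf.get(4..8)?.try_into().ok()?);
        FramePrefix::Skippable { payload_len: len }
    } else {
        FramePrefix::NotLz4
    })
}
```

Looping `classify_prefix` and advancing past each `Skippable` payload until a `Frame` appears implements the "skip them transparently" bullet.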
Demo: round-trip a 100 MB synthetic file through lz4 and `peel`,
verify byte-identical extraction; the crash-test harness covers `.tar.lz4`.
What: stream-extract .zip archives.
Why now and why last: zip's central-directory-at-the-end design is fundamentally hostile to the §6.x streaming pipeline. This phase is deliberately the largest in Phase A because it is essentially a second pipeline architecture. We do it last in this plan so all the other format-support work — and the §1 magic-byte detection — is already in place to fall back on.
Dependency: none. The CD parser and per-entry header parser are
hand-rolled, matching the precedent set by tar in `PLAN.md` §7.3.
The PKWARE APPNOTE is the spec. DEFLATE comes from the existing
`flate2`; STORED is a passthrough; zstd comes from the existing
decoder.
Round-one scope (locked 2026-04-29): STORED + DEFLATE + zstd
entries. AES / password-protected entries, the Zip64 end-of-central-
directory locator, multi-disk archives, and any other PKWARE
extensions are out of scope and will be filed as O.8b in
`OPTIMIZATIONS.md` after §5 lands. Encountering an unsupported
method or feature returns a clean `ZipError::UnsupportedFeature`
naming the specific feature, not a generic parse failure — the user
should see "AES encryption is not supported", not "malformed
header".
Sketch:
- New `download/zip_pipeline.rs` (separate from the streaming pipeline): the URL is opened with `HEAD`; a small ranged GET fetches the trailing `ZIP_END_OF_CENTRAL_DIRECTORY` record (variable-size, but bounded by 64 KiB + comment) — typically the last 64 KiB suffices, with a fallback to a larger window if the comment overflows.
- Parse the central directory into a list of entries with `(name, method, compressed_size, uncompressed_size, local_header_offset, crc32)`.
- For each entry, issue a ranged GET for the local file header + compressed data. The download scheduler is reused but with a per-entry chunk plan rather than a global one — workers are assigned to entries, not to byte chunks of one big file.
- Decompress per entry using either an identity passthrough (`STORED`), DEFLATE (the existing `flate2`), or zstd (we already have the decoder). Other methods return `ZipError::UnsupportedFeature` with the method name surfaced.
- Sink: the existing `TarSink` is replaced with a new `ZipSink` that handles per-entry path safety (same anti-traversal rules as `TarSink`) and writes one entry at a time.
- Hole-punching: less effective for zip because we hole-punch per entry rather than continuously. Still useful for very large single entries; degrades gracefully.
- The checkpoint format gains a `ZipState { entries_completed: Vec<u32>, current_entry: Option<u32>, current_entry_offset: u64 }`.
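Locating the end-of-central-directory record inside the tail window can be sketched as a backward scan. The helper name is illustrative; the signature bytes (`PK\x05\x06`), 22-byte minimum record size, and comment-length field offset come from the PKWARE APPNOTE:

```rust
const EOCD_SIG: [u8; 4] = [0x50, 0x4B, 0x05, 0x06];
const EOCD_MIN: usize = 22; // fixed fields; a variable comment follows

/// Returns the offset of the EOCD record within `tail`, if present.
/// `tail` is assumed to end exactly at the end of the file (the ranged
/// GET's suffix window).
fn find_eocd(tail: &[u8]) -> Option<usize> {
    if tail.len() < EOCD_MIN {
        return None;
    }
    // Scan backward: the archive comment may itself contain the signature
    // bytes, so only accept a candidate whose comment-length field is
    // consistent with the bytes remaining after it.
    for pos in (0..=tail.len() - EOCD_MIN).rev() {
        if tail[pos..pos + 4] == EOCD_SIG {
            let comment_len =
                u16::from_le_bytes([tail[pos + 20], tail[pos + 21]]) as usize;
            if pos + EOCD_MIN + comment_len == tail.len() {
                return Some(pos);
            }
        }
    }
    None
}
```

If this returns `None` on the 64 KiB window, the fallback in the first bullet (re-fetch a larger suffix) kicks in.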
Demo: `peel https://.../release.zip -C ./out` extracts a real
multi-MB zip (e.g., a GitHub release artifact). A crash mid-entry
resumes correctly. A crash mid-CD-fetch falls back to retrying the CD fetch.
After Phase A lands, peel handles a much wider input set. Phase B
makes the extraction loop itself better — observability, throughput,
robustness — without expanding the format set further.
What: replace the single redrawn line with a multi-field progress display that shows, at minimum:
- Percent complete (overall — see "what does percent mean").
- Compressed bytes downloaded out of total.
- Decompressed bytes written out of estimated total (we do not always know decompressed size; use the actual value for tar / uncompressed total when known; otherwise show "extracted: 156 MiB, unknown total").
- Download rate (rolling 5 s average).
- Disk write rate (rolling 5 s average).
- ETA to completion (whichever of the two rates is the bottleneck).
- Active worker count (currently doing IO, not just spawned).
Why now: comes first in Phase B because most of the later phases add new state we'll want to surface. Doing the renderer once now — with a rate-tracking primitive that later phases plug into — costs less than rebuilding the renderer four times.
No new TUI dependency. `crossterm`/`ratatui` are not on the
allowlist and the requirement does not justify them. We render with
hand-rolled ANSI: cursor save/restore (`\x1b7`/`\x1b8`), erase-to-EOL
(`\x1b[K`), and `\r` for line returns. Three lines of output, redrawn
in place. On a non-TTY stdout (detected via `IsTerminal`), fall back
to periodic structured log lines instead.
Sketch:
- `progress.rs`: a `ProgressState` struct holding atomics (`bytes_downloaded`, `bytes_extracted`, `total_size`, `active_workers`) and a small ring buffer for rate computation.
- A `ProgressRenderer` trait with two implementations: `TtyRenderer` (the multi-line redrawn block) and `LogRenderer` (writes `tracing::info!` lines at a configurable interval). The coordinator already accepts a `progress: Option<...>` callback; expose `ProgressState` and let the renderer choose.
- ETA: `max(remaining_to_download / rate_dl, remaining_to_extract / rate_ex)` — the two stages run concurrently, so the slower one (the bottleneck) dominates. Display `--:--` until we have ≥ 5 s of rate data.
- "Percent complete" heuristic: if the decoder reports a known uncompressed total, use extraction progress; otherwise use download progress. Document the choice next to the field so users aren't surprised when a `.tar.zst` shows download-percent.
- Worker activity tracking: download workers `fetch_add`/`fetch_sub` on `active_workers` around their per-chunk read loops.
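The rate-tracking primitive that the renderer and the later phases share might look roughly like this. Timestamps are injected (in millis) so the logic is deterministic and testable — an assumption about the eventual `ProgressState` design, which would read a monotonic clock instead:

```rust
use std::collections::VecDeque;

const WINDOW_MS: u64 = 5_000; // the "rolling 5 s average" window

struct RateTracker {
    /// (timestamp_ms, cumulative_bytes) samples inside the window.
    samples: VecDeque<(u64, u64)>,
}

impl RateTracker {
    fn new() -> Self {
        Self { samples: VecDeque::new() }
    }

    fn record(&mut self, now_ms: u64, cumulative_bytes: u64) {
        self.samples.push_back((now_ms, cumulative_bytes));
        // Drop samples that fell out of the rolling window.
        while let Some(&(t, _)) = self.samples.front() {
            if now_ms - t > WINDOW_MS {
                self.samples.pop_front();
            } else {
                break;
            }
        }
    }

    /// Bytes per second over the retained window; None until there are
    /// two distinct samples to difference (the `--:--` display case).
    fn rate(&self) -> Option<f64> {
        let (t0, b0) = *self.samples.front()?;
        let (t1, b1) = *self.samples.back()?;
        if t1 == t0 {
            return None;
        }
        Some((b1 - b0) as f64 * 1000.0 / (t1 - t0) as f64)
    }
}
```

Two instances (download and extraction) feed the two rate fields; §8's adaptive chunk sizing reads the same samples.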
Demo: against a 1 GiB `.tar.zst`, the progress block redraws
smoothly without flicker, all seven fields populate, the ETA converges
to a sensible number, and the non-TTY fallback logs every 2 s.
What: replace blocking `pwrite_at` / `pread_at` on the sparse
file with io_uring submission queues on Linux when the kernel
supports it, with a safe automatic fallback to the existing blocking
path when it doesn't.
Scope split (added 2026-04-29). §7 covers file IO only
(`pwrite_at` / `pread_at`). The download path's TCP connect /
send / recv are the subject of §7b, which builds on §7's trait,
IO thread, and capability probe. Splitting the work keeps each diff
focused and lets the file-IO half land and bake before we touch the
hand-rolled HTTP client + rustls byte transport. Together §7 and §7b
deliver O.2 from `OPTIMIZATIONS.md`.
Why now: §8 (adaptive chunk size — adaptive sizing makes more sense once IO concurrency can actually scale) and §9 (mmap sparse file — io_uring registered buffers compose with mmap'd regions) both sit behind this phase. Doing io_uring before either of those is a forcing function to keep the IO abstraction clean.
Dependency: `io-uring` (the tokio-rs/io-uring crate; raw
bindings, no async runtime). Approved 2026-04-29 and added to
`ENGINEERING_STANDARDS.md` §2.2.
Hard rule for this phase: no async runtime. We use `io-uring`'s
sync API (`Submitter::submit_and_wait`) on a dedicated IO thread.
Anything that asks us to add `tokio` is rejected.
Sketch:
- New `IoBackend` trait wrapping the file-IO operations the sparse file currently calls directly: `pwrite_at`, `pread_at`, plus `sync_all`. The default implementation is the existing blocking path; `UringBackend` is a Linux-only implementation gated on `#[cfg(target_os = "linux")]`. The trait is shaped to grow the socket operations §7b introduces without breaking either backend.
- Capability probe at startup: open a ring with `IoUring::new(MIN_DEPTH)`. If construction fails (kernel too old, container without uring, seccomp blocking), log a `warn!` to stderr ("io_uring unavailable: <reason>; using blocking IO") and fall back. If `RLIMIT_MEMLOCK` is below the threshold required for our default ring depth, log a `warn!` and reduce the ring depth (still uring, just smaller).
- ulimit checks: `RLIMIT_NOFILE` should be ≥ workers × 2; below that, log a `warn!` with the recommended `ulimit -n` value.
- Submission strategy: workers post write SQEs into a per-worker ring slot, then drain CQEs in batches. The IO thread owns the ring; workers communicate with it via a bounded channel of submission requests.
- Tests: `IoBackend`-level unit tests stub both impls; integration tests run end-to-end on a real ring (gated behind a `linux-uring` feature flag in CI so non-Linux runners skip).
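A rough cut of the `IoBackend` seam, assuming a simplified method set (the real trait would likely manage file handles differently and later grow §7b's socket methods). The blocking implementation is just std's unix positional IO, which is what the sparse file calls today:

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Hypothetical trait shape for the §7 seam.
trait IoBackend: Send + Sync {
    fn pwrite_at(&self, file: &File, buf: &[u8], offset: u64) -> io::Result<()>;
    fn pread_at(&self, file: &File, buf: &mut [u8], offset: u64) -> io::Result<()>;
    fn sync_all(&self, file: &File) -> io::Result<()>;
    // §7b grows this with connect/send/recv without breaking either backend.
}

/// The default path: plain blocking positional IO.
struct BlockingBackend;

impl IoBackend for BlockingBackend {
    fn pwrite_at(&self, file: &File, buf: &[u8], offset: u64) -> io::Result<()> {
        file.write_all_at(buf, offset)
    }
    fn pread_at(&self, file: &File, buf: &mut [u8], offset: u64) -> io::Result<()> {
        file.read_exact_at(buf, offset)
    }
    fn sync_all(&self, file: &File) -> io::Result<()> {
        file.sync_all()
    }
}
```

A `UringBackend` would implement the same trait behind `#[cfg(target_os = "linux")]`, routing each call through the bounded submission channel to the ring-owning IO thread.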
Demo: download of a 5 GiB file completes with a measurable
throughput improvement on a localhost mock under high parallelism
(N=64 workers). The `--io-backend blocking` flag forces the old path for
A/B comparison. Falls back cleanly on a kernel < 5.6 and prints the
expected stderr warning.
Status: drafted 2026-04-29, not yet started. Added to the plan after §7 was scoped down to file IO. §7b assumes §7's trait, IO thread, capability probe, and `--io-backend` CLI flag are already in tree.
What: extend the `IoBackend` trait from §7 to cover the
download workers' network IO — TCP connect, send, recv — via
io_uring, sharing the dedicated IO thread and capability probe with
the file-IO path. rustls continues to drive the TLS state machine;
the uring path swaps the byte transport beneath it.
Why now: §7 establishes the trait, the IO thread, the bounded submission channel, and the capability probe. Doing the network half in a separate phase keeps each diff focused, lets the file-IO half ship and bake before we touch the wire path, and lets us A/B the two halves independently against the same demo (the §7 demo runs with network blocking; this phase's demo runs with both halves on uring).
Dependency: no new crate. Uses the same io-uring dependency
approved in §7.
Hard rule for this phase: still no async runtime. Each socket
operation submits an SQE on the per-process IO thread and the calling
worker thread blocks on a per-op completion notifier. The ring
batches across workers — one ring, N workers — but each individual
`Read::read` / `Write::write` call is synchronous from the worker's
point of view, so rustls and the hand-rolled HTTP client work
unchanged.
Sketch:
- `IoBackend` (introduced in §7) gains three methods: `connect(addr: SocketAddr) -> io::Result<UringSocket>`, plus `send` and `recv` operating on a backend-owned socket handle. Returning a typed handle (rather than a raw `RawFd`) keeps the ownership story crisp: the backend owns the socket lifecycle, the caller borrows a `Read + Write` adapter.
- `UringSocket` is a thin wrapper around the kernel-side socket plus a back-reference to the `IoBackend`. It implements `std::io::Read` and `std::io::Write` by submitting `recv`/`send` SQEs and waiting on the per-op completion. This is the surface `rustls::StreamOwned` plugs into in place of `TcpStream`.
- `http::client::Client` gains a backend handle (`Arc<dyn IoBackend>`), defaulting to the blocking backend it has today. Connect / send / recv go through the backend; the rest of the client (request framing, response parsing, redirects, pool) is unchanged.
- TLS: `rustls::StreamOwned<ClientConnection, UringSocket>` Just Works because `UringSocket: Read + Write`. We do not interpose below the TLS state machine — the rustls / `webpki-roots` exception in `ENGINEERING_STANDARDS.md` §2.4 covers the unchanged bits.
- Worker `cancel` flag: a worker that's blocked inside a uring `recv` completion wait must respond to cancellation. Either (a) submit the `recv` SQE with a linked timeout SQE and re-check `cancel` on timeout, or (b) submit a uring cancel SQE from the scheduler when it flips the flag. (a) is simpler; pick that and document the worst-case latency (1× timeout) in the trait's contract.
- Capability is shared with §7's probe. An `--io-backend uring` selection that hits a kernel without ring support falls back to both halves blocking, with the same `warn!` line. There is no "uring file IO + blocking sockets" mixed mode — the trait stays coherent across halves.
- Tests: a localhost mock server (reusing the existing test harness) plus the same crash-test path with the uring backend forced. Run the property-style "identical extraction output across backends for a fixed seed" test from §7 against the network half too.
Demo: the §7 demo, repeated under simulated network latency
(`tc qdisc add dev lo root netem delay 50ms` on Linux runners). At
N=64 workers and 50 ms RTT, the blocking-sockets variant pays a
context switch per recv; the uring variant batches recvs across
workers and finishes faster. The exact speedup is hardware- and
kernel-dependent; the demo bar is "uring is not slower than blocking
on the same workload, and is meaningfully faster at high parallelism
plus high RTT".
What: tune chunk size during download based on observed throughput and connection RTT, instead of the fixed 4 MiB.
Why now: §7 makes chunk size matter more (deep IO queues benefit from larger units; shallow queues from smaller). Without io_uring, any adaptive policy hits the blocking-IO ceiling first.
Sketch:
- Track per-worker chunk completion times in a small ring buffer (already added to `ProgressState` in §6 — reuse it).
- Policy:
  - When all workers consistently complete chunks in < 1 s and the bitmap reports ≥ 2 × workers chunks remaining, double the chunk size up to a 64 MiB cap.
  - When latency spikes (p95 > 5 s) or the retry rate exceeds 10 %, halve down to a 1 MiB floor.
  - Hysteresis: do not change size more than once every 30 s.
- The bitmap and checkpoint format encode chunk size; resume must honor the size that was active when the checkpoint was written (do not re-plan). New chunks added past the checkpoint can use the new size — but the simpler implementation just freezes chunk size at the size present at the first checkpoint and doesn't re-tune across resumes. Document the limitation.
- CLI: `--chunk-size <N>` continues to force a specific size (disabling adaptive sizing); `--no-adaptive-chunk-size` disables adaptive behavior while keeping the default starting size.
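The policy bullets above reduce to a pure decision function, sketched here with illustrative field names so the thresholds (double under 1 s completions, halve on p95 > 5 s or > 10 % retries, 64 MiB cap / 1 MiB floor, 30 s hysteresis) can be unit-tested without any IO:

```rust
const FLOOR: u64 = 1 << 20; // 1 MiB
const CAP: u64 = 64 << 20; // 64 MiB
const HYSTERESIS_S: u64 = 30;

/// Illustrative snapshot of what the scheduler observed recently.
struct Observed {
    p95_completion_s: f64,
    retry_rate: f64, // 0.0..=1.0
    chunks_remaining: u64,
    workers: u64,
    secs_since_last_change: u64,
}

fn next_chunk_size(current: u64, obs: &Observed) -> u64 {
    // Hysteresis: at most one size change per 30 s.
    if obs.secs_since_last_change < HYSTERESIS_S {
        return current;
    }
    // Trouble signals win: halve down to the floor.
    if obs.p95_completion_s > 5.0 || obs.retry_rate > 0.10 {
        return (current / 2).max(FLOOR);
    }
    // Plenty of headroom and enough chunks left to keep workers busy.
    if obs.p95_completion_s < 1.0 && obs.chunks_remaining >= 2 * obs.workers {
        return (current * 2).min(CAP);
    }
    current
}
```

Keeping the policy side-effect-free means the "logged size changes match the policy" demo assertion can replay recorded observations through the same function.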
Demo: against a localhost mock that throttles per-connection, chunk size grows when bandwidth is plentiful and shrinks when the mock injects 503s; logged size changes match the policy.
What: mmap the partial download file so workers write into
memory and the kernel handles flushing; replace `fallocate(PUNCH_HOLE)`
with `madvise(MADV_REMOVE)` for hole release.
Why now: real benefit appears only at high IO concurrency (per
O.3), which §7 + §8 together unlock. This is also the phase where
we revisit the puncher abstraction — MADV_REMOVE semantics differ
from fallocate and the existing trait shouldn't have to lie about
that.
Sketch:
- New `MmapSparseFile` alongside `SparseFile`. Selection is via config (`coordinator.rs`), not auto-detection — we don't want a silent backend switch.
- New `Puncher::madv_remove(addr, len)` variant. The existing `LinuxPuncher` gains a `for_mmap()` constructor that issues `madvise(MADV_REMOVE)` instead of `fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)`. Filesystem support varies (tmpfs and ext4 support `MADV_REMOVE`; some others don't); fallback semantics mirror the existing ENOTSUP path.
- The worker write path becomes a memcpy into the mapped region. We still flush via `msync(MS_ASYNC)` at frame boundaries to bound the dirty-page window the kernel is holding.
- Safety: each `mmap` call lives in an `unsafe` block with a `// SAFETY:` comment documenting alignment, length, and lifetime. The exposed surface is safe. New unsafe code requires human review before merging (per standards doc §4).
- CLI: `--io-backend mmap` selects this path; the default stays blocking pwrite for now until §9 has been benchmarked in production.
- Tests: parallel-write correctness, sparseness verification after `MADV_REMOVE`, behavior on filesystems that reject `MADV_REMOVE` (the test runs on tmpfs, which supports it; assertion-skip on filesystems that don't).
Demo: 1 GiB synthetic download, 16 workers, both backends — same
extraction result, mmap backend's time shows lower kernel CPU.
What: a `--sha256 <hex>` flag verifies the assembled compressed
file's hash. The hash is updated incrementally as bytes are consumed by
the decoder; verification runs at clean completion. A mismatch is a
hard error: extraction completed, but the source did not match the
expected hash.
Why now: independent of §6–§9 in scope, but it benefits from §6's
`ProgressState` (the running hash makes a nice progress field).
Implementation: hand-rolled SHA-256 (~150 LOC, FIPS 180-4
reference). Standards doc §2.1 already nudges toward hand-rolling
trivial primitives; the deciding factor here is that resumable
hashing requires serializable internal state, and the `sha2` crate
does not expose it without an unsafe transmute against private fields.
Pure-Rust SHA-256 measures 300–500 MiB/s on a single core, well
above the network-bound ceiling we operate under, so the asm /
AVX2 acceleration `sha2` would give us is irrelevant. `sha2` is
added as a dev-dependency only (`[dev-dependencies]` in
`Cargo.toml`) so unit tests can cross-check our implementation
against a known-correct reference; the runtime binary does not
link it.
Sketch:
- New `hash/sha256.rs`: `Sha256 { state: [u32; 8], buffer: [u8; 64], buffer_len: u8, bytes_processed: u64 }`. The public API mirrors the familiar shape: `new() -> Self`, `update(&mut self, &[u8])`, `finalize(self) -> [u8; 32]`. The serialized form is the struct fields above (113 bytes; padded to 120 for alignment), with a stable wire format independent of struct layout.
- Wire the hasher into the source-byte path inside the extractor — every byte that flows through `decode_step`'s source `Read` is forwarded to the hasher. The hasher does not participate in the punching critical path; it processes bytes before they're punched.
- CLI: `--sha256 <hex>` (64 hex chars). On clean completion, finalize the hasher and compare; a mismatch returns a new `IntegrityError::HashMismatch { expected, got }`. Friendly message in `main.rs` ("the file you got is not the file you expected; this usually means the source changed during download or the expected hash is for a different file").
- Resume semantics — supported. The `Checkpoint` struct gains a `hash_state: Option<Sha256State>` field, written atomically with the rest of the checkpoint at every quiescent frame boundary (same `write_temp` + `fsync` + `rename` discipline as today; no new atomicity machinery). The invariant: at any frame boundary, the saved hash state is the SHA-256 state over source bytes `0..decoder_position`. On resume, deserialize the state and continue feeding bytes from the resume cursor. The whole-file hash at clean completion equals the SHA-256 of the original compressed source — byte-identical to `sha256sum` on the source file. Bumps `format_version`.
- NIST FIPS 180-4 test vectors (byte-string and bit-string varieties) for raw correctness.
- `proptest`-driven: 1000 random inputs, cross-checked against `sha2::Sha256` (dev-dependency) — catches divergence the NIST vectors miss.
- Round-trip: serialize the state mid-stream, deserialize, feed the remaining bytes; the resulting digest must equal the digest of the same input fed in one pass.
- Per-byte-boundary: feed N random inputs split at every possible boundary; all yield the same digest (chunking invariance).
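The serializable-state requirement can be sketched as a fixed little-endian wire codec over the plan's `Sha256` fields (the compression function itself is omitted). This minimal layout comes to 105 bytes; whatever version tag or padding gets the plan to its stated sizes is left out here:

```rust
/// Mirrors the plan's Sha256 fields; layout of the wire format below is
/// fixed little-endian, independent of Rust struct layout.
#[derive(Debug, PartialEq)]
struct Sha256State {
    state: [u32; 8],
    buffer: [u8; 64],
    buffer_len: u8,
    bytes_processed: u64,
}

impl Sha256State {
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(105);
        for w in self.state {
            out.extend_from_slice(&w.to_le_bytes());
        }
        out.extend_from_slice(&self.buffer);
        out.push(self.buffer_len);
        out.extend_from_slice(&self.bytes_processed.to_le_bytes());
        out
    }

    /// Rejects wrong lengths and impossible buffer fills (a 64-byte-full
    /// buffer would already have been compressed).
    fn from_bytes(b: &[u8]) -> Option<Self> {
        if b.len() != 105 || b[96] > 63 {
            return None;
        }
        let mut state = [0u32; 8];
        for (i, s) in state.iter_mut().enumerate() {
            *s = u32::from_le_bytes(b[i * 4..i * 4 + 4].try_into().ok()?);
        }
        let mut buffer = [0u8; 64];
        buffer.copy_from_slice(&b[32..96]);
        Some(Self {
            state,
            buffer,
            buffer_len: b[96],
            bytes_processed: u64::from_le_bytes(b[97..105].try_into().ok()?),
        })
    }
}
```

The round-trip test bullet above is exactly `from_bytes(&s.to_bytes()) == Some(s)` plus continued hashing.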
Demo: `peel --sha256 abc... URL` succeeds for a correct hash and
fails cleanly for an incorrect one with an exit code distinct from
generic IO failure (per `cli.rs` exit-code conventions). A
mid-download `kill -9` followed by resume produces a digest equal to
a clean-run digest (verified by the crash-test harness, extended to
cover this).
What: tighten the existing ETag/Last-Modified guard (`PLAN.md`
§5.3) and extend it to detect source drift in cases where the ETag is
weak, missing, or wrong. Builds on the streaming hasher from §10.
Why now: §10 added a per-byte hasher; with very little additional work we get a much stronger drift detector than ETag. Doing it in a separate phase keeps §10's diff small and the contract clear — §10 is "did we get what we expected?", §11 is "did the bytes we already have remain consistent with the bytes still coming?".
Sketch:
- Per-chunk fingerprints: as each chunk is downloaded, compute a small fingerprint (CRC32C is cheap and good enough — on the order of GiB/s on a single core). Persist the per-chunk fingerprints in the bitmap structure, expanded to `Vec<(chunk_state, crc32c)>`. Bumps `format_version`.
- Resume verification: on resume, before accepting any persisted chunk as complete, re-fetch a small overlap probe (a single byte range from a random already-complete chunk) and recompute its CRC32C. Mismatch ⇒ `SourceChangedSinceCheckpoint`; abort and force a clean restart.
- Mid-flight verification: every Nth chunk (configurable; default N = 32) is re-requested as a probe by a different worker; the re-fetched bytes must match the original CRC32C. Mismatch ⇒ `SourceChangedDuringDownload`; abort.
- CRC32C implementation: hand-rolled (~150 LOC, table-driven). No new dependency. Standards doc §2.1 prefers this for trivial primitives.
- Tighten the ETag check: today the response is validated against the ETag captured at HEAD. Add a Last-Modified comparison and a strong/weak ETag distinction (weak ETags can change without a content change; treat them as advisory only).
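A table-driven CRC32C in the shape §11 calls for might look like this (Castagnoli, reflected polynomial `0x82F63B78`; the in-tree version would precompute the table once rather than per call):

```rust
/// Build the 256-entry lookup table for the reflected Castagnoli polynomial.
fn crc32c_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    for (i, entry) in table.iter_mut().enumerate() {
        let mut crc = i as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 {
                (crc >> 1) ^ 0x82F6_3B78
            } else {
                crc >> 1
            };
        }
        *entry = crc;
    }
    table
}

/// One-shot CRC32C; a streaming version would carry `crc` across calls.
fn crc32c(data: &[u8]) -> u32 {
    let table = crc32c_table();
    let mut crc = !0u32; // initial value, all ones
    for &b in data {
        crc = (crc >> 8) ^ table[((crc ^ b as u32) & 0xFF) as usize];
    }
    !crc // final xor
}
```

The standard check value (`crc32c(b"123456789") == 0xE3069283`) makes a good first unit test for the hand-rolled table.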
Demo: a mock server that swaps the file under us mid-download
triggers `SourceChangedDuringDownload` within at most N chunks. A
mock that returns a different file on resume triggers
`SourceChangedSinceCheckpoint` on the very first probe.
What: implement the `PunchHole` trait for macOS (APFS supports
`F_PUNCHHOLE`).
Why now: independent of §6–§11. We do it in this position so multi-mirror and bandwidth limiting (which are independent of the OS) follow it; macOS users have something to test the streaming pipeline with sooner.
Sketch:
- `MacosPuncher` calling `fcntl(fd, F_PUNCHHOLE, &fpunchhole_arg)` per the existing Python prototype reference. Direct `libc::fcntl` call; the `F_PUNCHHOLE` constant is 99 on macOS but we look it up via `libc` to avoid hard-coding.
- Same ENOTSUP/EINVAL graceful-degrade pattern as `LinuxPuncher`.
- `default_puncher()` selects by `cfg(target_os)`.
- Tests: an integration test that exercises the macOS puncher on APFS (`tempfile` + `statx`-equivalent — macOS uses `fstat` and reports `st_blocks` like Linux). Skip on Linux runners.
- Unblock the rest of the binary on macOS: §7 (io_uring) is Linux-only and remains so; the macOS build uses the blocking backend and the macOS puncher.
Demo: full extraction round-trip on a macOS runner; on-disk source footprint shrinks during extraction same as on Linux.
What: accept multiple URLs for the same file (with the same expected hash); download chunks from whichever mirror responds fastest, fall back on per-mirror failure.
Why now: builds cleanly on §10 (we already have a hash to verify the resulting file regardless of mirror), §11 (we already verify cross-chunk consistency), and §8 (per-mirror chunk-size adaptation falls out of the same machinery). Doing multi-mirror without those would mean re-deriving consistency invariants twice.
Sketch:
- CLI: `--mirror URL`, repeatable; the positional `url` becomes the primary mirror.
- Scheduler: on start, run `HEAD` against every mirror in parallel, verify size + ETag agreement (or hash agreement if `--sha256` is set; ETag-disagreeing mirrors are dropped with a `warn!`). At least one mirror must agree with the primary.
- Worker assignment: each worker picks the fastest-responding live mirror for each new chunk request, tracks per-mirror health (success rate, latency p95), and avoids mirrors with elevated failure rates for a backoff window.
- Per-mirror failure ⇒ that mirror is excluded for 30 s and another mirror is used; if all mirrors are excluded, the scheduler stalls until one's backoff expires (don't fail the whole download just because every mirror has had one transient failure).
- Per-chunk consistency with §11's CRC32C still applies — but now between mirrors as well as across resume.
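The exclusion/backoff logic can be sketched with a couple of std types. Names here are illustrative; the real tracker would also keep the success rate and latency p95 mentioned above, not just a single latency estimate.

```rust
use std::time::{Duration, Instant};

const MIRROR_BACKOFF: Duration = Duration::from_secs(30);

/// Illustrative per-mirror state (hypothetical, not the real scheduler).
struct Mirror {
    url: &'static str,
    latency_ms: u64, // rolling latency estimate, updated per chunk
    excluded_until: Option<Instant>,
}

impl Mirror {
    fn live(&self, now: Instant) -> bool {
        self.excluded_until.map_or(true, |t| now >= t)
    }
    fn record_failure(&mut self, now: Instant) {
        self.excluded_until = Some(now + MIRROR_BACKOFF);
    }
}

/// Pick the fastest live mirror for the next chunk. `None` means every
/// mirror is in backoff: stall until the earliest exclusion expires,
/// rather than failing the whole download.
fn pick_mirror(mirrors: &[Mirror], now: Instant) -> Option<usize> {
    mirrors
        .iter()
        .enumerate()
        .filter(|(_, m)| m.live(now))
        .min_by_key(|&(_, m)| m.latency_ms)
        .map(|(i, _)| i)
}
```

Killing the fast mirror shows up as `record_failure` calls, after which `pick_mirror` routes to the slow one — the shape of the demo below.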
Demo: two mock servers, one fast and one slow; chunks preferentially route to the fast one; killing the fast mirror mid-download routes the rest to the slow one and the download completes.
What: `--max-bandwidth <RATE>` flag (e.g., `10MB/s`, `1.5GB/s`).
Why now: small phase; depends on nothing except the existing download path. Last in Phase B because it's the most "operational nicety" of the bunch.
Sketch:
- `download/rate_limit.rs`: a token-bucket rate limiter shared across workers via `Arc`. Tokens are bytes; the refill rate is the configured limit; bucket capacity is `max(1 MiB, rate × 250 ms)` to absorb bursts.
- Each worker calls `acquire(bytes_about_to_read).wait()` before each socket `recv`; the limiter blocks the calling thread (sync `Condvar`) until tokens are available.
- CLI: `--max-bandwidth` parses standard suffixes (`K`/`M`/`G` = 1000-based per network convention; `Ki`/`Mi`/`Gi` = 1024-based).
- Interaction with §7 (io_uring): the limiter sits above the IO backend — workers acquire tokens, then dispatch through whatever backend is configured. The same limiter works for both.
- Interaction with §13 (multi-mirror): the limit is aggregate across all mirrors, not per-mirror. Document this.
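The bucket math can be sketched deterministically. This is a single-threaded sketch: the real limiter wraps this state in a `Mutex` + `Condvar` behind an `Arc`, and `delay_for` here returns how long the caller would block instead of actually blocking, which keeps the sketch testable.

```rust
use std::time::{Duration, Instant};

/// Token bucket where tokens are bytes. Capacity is
/// max(1 MiB, rate × 250 ms), per the sketch above.
struct TokenBucket {
    rate: f64,     // bytes per second (the configured limit)
    capacity: f64, // burst absorption
    tokens: f64,   // may go negative: debt already claimed by a caller
    last_refill: Instant,
}

impl TokenBucket {
    fn new(rate_bytes_per_sec: u64, now: Instant) -> Self {
        let rate = rate_bytes_per_sec as f64;
        let capacity = (rate * 0.250).max(1024.0 * 1024.0);
        TokenBucket { rate, capacity, tokens: capacity, last_refill: now }
    }

    /// Claim `bytes` tokens and report how long the caller must wait
    /// before the refill actually covers the claim.
    fn delay_for(&mut self, bytes: u64, now: Instant) -> Duration {
        let dt = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + dt * self.rate).min(self.capacity);
        self.last_refill = now;
        self.tokens -= bytes as f64;
        if self.tokens >= 0.0 {
            Duration::ZERO
        } else {
            Duration::from_secs_f64(-self.tokens / self.rate)
        }
    }
}
```

Letting `tokens` go negative is what makes the limiter fair under contention: each caller claims its bytes immediately and sleeps off its own share of the debt.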
Demo: a 1 GiB download with `--max-bandwidth 10MB/s` completes in ≈ 100 s, with the download rate (from §6's progress UI) hovering at the configured limit ± 5 %. Removing the flag uses full bandwidth.
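The suffix convention (`K`/`M`/`G` decimal, `Ki`/`Mi`/`Gi` binary) parses to bytes per second along these lines. `parse_rate` is an illustrative name, and the accepted spellings (optional trailing `B` and `/s`) are an assumption, not the decided grammar.

```rust
/// Parse a rate like "10MB/s", "1.5GB/s", or "512KiB/s" into bytes/s.
/// K/M/G are 1000-based (network convention); Ki/Mi/Gi are 1024-based.
/// Illustrative sketch; the accepted spellings are an assumption.
fn parse_rate(input: &str) -> Result<u64, String> {
    let s = input.trim();
    let s = s.strip_suffix("/s").unwrap_or(s);
    let s = s.strip_suffix('B').unwrap_or(s);
    // Check binary suffixes first so "Ki" isn't misread as bare "K".
    let (num, mult): (&str, f64) = if let Some(n) = s.strip_suffix("Ki") {
        (n, 1024.0)
    } else if let Some(n) = s.strip_suffix("Mi") {
        (n, 1024.0 * 1024.0)
    } else if let Some(n) = s.strip_suffix("Gi") {
        (n, 1024.0 * 1024.0 * 1024.0)
    } else if let Some(n) = s.strip_suffix('K') {
        (n, 1e3)
    } else if let Some(n) = s.strip_suffix('M') {
        (n, 1e6)
    } else if let Some(n) = s.strip_suffix('G') {
        (n, 1e9)
    } else {
        (s, 1.0)
    };
    let v: f64 = num
        .trim()
        .parse()
        .map_err(|e| format!("bad rate {:?}: {}", input, e))?;
    if v <= 0.0 {
        return Err(format!("rate must be positive, got {}", v));
    }
    Ok((v * mult) as u64)
}
```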
All of the following are true:
- Each phase's demo has been recorded (screen capture or reproducible test) and reviewed.
- The crash-test harness (the random `kill -9` test from `PLAN.md` §10) has been extended to cover every new format and every new IO backend variant. Resumes still produce byte-identical output.
- CI gates listed in `ENGINEERING_STANDARDS.md` §CI remain green; coverage thresholds (80 % overall, 95 % on critical paths) continue to hold across the new modules.
- `OPTIMIZATIONS.md` has been amended to mark items `O.1`, `O.2` (delivered jointly by §7 + §7b), `O.3`, `O.6`, `O.7`, `O.8`, `O.10`, `O.11`, `O.13`, `O.14`, `O.19` as "delivered in PLAN_v2 §"; the rest stay deferred for a future round.
- README has been updated with the new format coverage matrix and the new flags.
There is still no schedule. The plan is sequenced; do it in order, do each phase completely. Phase A's five phases are gating dependencies for one another (§1 first; §5 last). Phase B's ten phases (§6, §7, §7b, §8–§14) are loosely dependent in the order written and should not be parallelized across sessions; in particular §7b builds on §7 and must not start until §7 has landed and demoed.
Resolved 2026-04-29 (this section is the audit trail; the substantive change lands in the relevant phase above):
- Crate approvals. `xz2` (§3), `lz4_flex` (§4), and `io-uring` (§7) approved and added to `ENGINEERING_STANDARDS.md` §2.2. `sha2` was not approved as a runtime dependency; §10 hand-rolls SHA-256 to make resumable hashing tractable, and `sha2` is added as a dev-dependency only, for cross-checking the hand-rolled implementation in tests.
- Magic-byte / suffix disagreement. Hard error by default (`FormatMismatch`); `--force-format-from-magic` opts in to trusting the magic, and `--format <name>` forces a specific decoder regardless of either signal. The two override flags are mutually exclusive.
- Resume + integrity verification. Resume preserves SHA-256 state. The `Checkpoint` format gains a `hash_state` field serialized atomically with the rest of the checkpoint; a resumed run produces a digest byte-identical to a clean run's. Hand-rolling SHA-256 (#1) is what makes this clean.
- xz frame granularity. Per-Stream for round one. Per-Block is filed as `O.6b` in `OPTIMIZATIONS.md` after §3 lands. The limitation is real but mostly affects multi-threaded / concatenated xz files, which are the minority of published `.tar.xz`; default-encoded xz is single-Block, and no implementation can checkpoint within it.
- Zip scope. STORED + DEFLATE + zstd entries for round one. AES / password-protected entries, Zip64, multi-disk, and other PKWARE extensions are filed as `O.8b` after §5 lands.
When §3 / §5 land, they should append the corresponding `O.6b` / `O.8b` entries to `OPTIMIZATIONS.md` before the phase is marked done.
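The override semantics recorded in the magic/suffix decision above can be sketched as a small resolution function. `Format`, `DetectError`, and `resolve_format` are illustrative names, not the existing code; the rule they encode is the one in the bullet: `--format` wins, then `--force-format-from-magic`, else magic and suffix must agree.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Format {
    TarGz,
    TarXz,
    TarLz4,
    Zip,
}

#[derive(Debug, PartialEq)]
enum DetectError {
    FormatMismatch { magic: Format, suffix: Format },
    // --format and --force-format-from-magic are mutually exclusive.
    ConflictingOverrides,
}

/// Resolve the decoder per the recorded decision (illustrative sketch).
fn resolve_format(
    magic: Format,
    suffix: Format,
    forced: Option<Format>, // --format <name>
    trust_magic: bool,      // --force-format-from-magic
) -> Result<Format, DetectError> {
    match (forced, trust_magic) {
        (Some(_), true) => Err(DetectError::ConflictingOverrides),
        (Some(f), false) => Ok(f),
        (None, true) => Ok(magic),
        (None, false) if magic == suffix => Ok(magic),
        (None, false) => Err(DetectError::FormatMismatch { magic, suffix }),
    }
}
```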