Skip to content

Latest commit

 

History

History
62 lines (37 loc) · 5.36 KB

File metadata and controls

62 lines (37 loc) · 5.36 KB

Design Philosophy

Bring your own dependencies

One of the ways that rawzip is designed to reach maturity is by forcing users to bring their own dependencies for compression.

A zip file can have entries compressed in a number of different formats. To support these different formats will require depending on an ecosystem of Rust crates. Even just a single compression method like deflate, have a half dozen implementations (#1, #2, #3, #4, #5, #6). And these are all high quality implementations where one might want to even use them interchangeably (eg: use zune-inflate for inflation and depending if the data needs to be streamed use libdeflater).

There'll always be a new deflate implementation or a new version release tomorrow, and one goal of this crate is to have releases slow as maturity is reached. Hence why control is inverted and pushed onto the user to provide.

This is not a novel idea, go's archive/zip allows one to bring their own decompressor too, and optionally not decompress at all by opening the zip file entry in raw mode.

Other dependencies have been omitted as they have been reimplemented in code. Calculating the CRC of decompressed files is a good example. A performant implementation can be done in 50 lines of code; an implementation that slows synthethic benchmarks by only a few percentage points.

Efficiency

Iterating over the central directory of a 171GB network attached zip file with over 200k entries.

rawzip:     267ms
rc-zip:     466ms
async-zip:  681ms
zip-rs:    1058ms

Iterating over the central directory of a 5GB zip file with 10k entries:

rawzip:     0.274ms
rc-zip:     5.300ms
async-zip: 17.300ms
zip-rs:    22.100ms

That's quite the spread! rawzip is 20x faster than the next fastest and 80x faster than zip-rs.

This is not to disparage the other zip implementations. While there is always room for efficiency improvements for them, most of the performance can be explained away by the amount of work each one does. Since rawzip favors lazy computations and zero copy parsing, it follows the do not pay for what you do not use philosophy.

The efficiency doesn't stop at the central directory. One area where rawzip is unique (and a natural consequence of "bring-your-own-dependencies") is that it allows the decoupling of IO and decompression. The raw bytes for a zip entry can be slurped up and ferried to worker threads for decompression and application processing. This maximizes the amount of time the IO thread is spent performing IO, with linear scalability until the IO ceiling is hit. In a benchmark, rawzip was able to outperform other zip extraction tools by 6x and best even those with parallel processing by 3x.

All this is achieved with a bog standard seqeuntial file I/O with some seeks. Nothing fancy with async, sans-io, or io-uring. This may seem short sighted but zip files are really driven by the central directory located at the end of the file. In benchmarks where zip entries are extracted from a stream, rawzip will be slower as it requires the file to be buffered into memory or disk, but the consistency it allows in processing can't be discounted.

Out of Scope

This Rust crate won't cover all use cases, but I encourage others to start a discussion if they feel strongly about it.

Streaming

While individual Zip archive entries can be decompressed in a streaming fashion, opening a Zip archive requires something seekable, as the source of truth for a Zip archive is often the central directory located at the end of the file. While it is possible to extract entries from a stream from a technical standpoint, I don't recommend it even for networked applications, otherwise discrepancies between the central directory file headers and the local file headers can rear their ugly head. Better to bear this pain and write to the file system temporarily as necessary.

Async

There are two forms of async workloads with Zip archives.

  • The data is hosted elsewhere and must be downloaded
  • Local file reads can be completed asynchronously with io_uring

For remote data, it's worth reiterating that we want to avoid parsing the Zip file in a streaming fashion. So if the entire archive will be read, the simplest solution would be to completely download the Zip archive into an ephemeral location.

What if there is not enough disk space for the Zip archive or a fraction of the Zip is required? Network attached storage and blob services like S3 support positioned I/O and can be mounted as a local file system. The read operations on this mounted file system will have higher latency and are more amenable to being offloaded to an I/O threadpool.

There is little need for reading local files asynchronously with io_uring. File APIs already provide offset reads without requiring exclusive access, so they can be done concurrently to maximize IO efficiency even though the individual file reads are synchronous. And with compression being purely CPU-bound, it's hard to see where async should be introduced, and thus not worth the extra complexity and dependencies.