Summary
I would like to quickly produce a diff file describing the changes from a binary file A to B (where B is a changed version of A). I managed to do this with zstd, but see the "What I tried" section below.
It could be said (from my naive viewpoint) that finding differences between files is somewhat the domain of dictionary-based compression programs. So why reinvent the wheel (and create yet another piece of software)?
Reasoning
For starters, there currently seems to be a lack of stand-alone utilities that do this. All of them seem to be tied to something else or have other drawbacks: zsync (unmaintained, as far as I can see) is tied to URLs and HTTP, and bsdiff takes nearly 30 seconds to generate the diff file (whereas zstd does this in under a second). Bigger tools such as casync require all-or-nothing adoption of their way of doing things.
Secondly, most(?) Linux distros provide package updates as entirely new files to be downloaded. There are major bandwidth (and monetary) savings to be had here if an efficient (and easy, stand-alone) binary diff tool were available.
And again, since zstd already needs to find repetitions and their positions in files, exposing functionality that uses this to produce and apply diff files (or at least adding it to the public API) could be a good fit here.
What I tried
I managed to use zstd to produce a very small diff file of the changes from binary file A to B, with a very fast creation time (less than 1 second): 1521 bytes for the "simple" test case and 742 bytes for the "complex" one. To create the diff, B was compressed with A given to zstd as the dictionary. To reconstruct B, the small diff file was given to zstd as the file to be decompressed, again with the original binary file (A) as the dictionary. This procedure reproduced binary file B.
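The procedure boils down to one compression call and one decompression call that share the same dictionary file. A minimal sketch in POSIX shell, under a few assumptions: the function names `make_diff` and `apply_diff` are hypothetical, the flags are a simplified subset of the ones used in the transcript (only `-D`, `--long=31` and `-3`), and passing `--long=31` on decompression is an assumption meant to permit the large window.

```sh
#!/bin/sh
# Hypothetical helpers; not part of zstd itself.

# make_diff REFERENCE TARGET DIFF
# Compress TARGET while using REFERENCE as the dictionary (-D).
make_diff() {
    zstd -q -f -3 --long=31 -D "$1" "$2" -o "$3"
}

# apply_diff REFERENCE DIFF OUTPUT
# Decompress DIFF with the same REFERENCE as the dictionary to rebuild TARGET.
apply_diff() {
    zstd -q -f -d --long=31 -D "$1" "$2" -o "$3"
}
```

For example, `make_diff bin binrev binrev.diff` followed by `apply_diff bin binrev.diff binrev.restored` should leave `binrev.restored` identical to `binrev`, which can be checked with `cmp binrev binrev.restored`.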
Transcript of the commands used:
preparing the file that's being used in this experiment:
```fish
$ cp /bin/qemu-system-x86_64 bin
```
splitting the binary file in two and showing that, when combined, the splits are equal to the original:
$ split -n 2 -d bin bin.split-
$ cat bin.split-00 bin.split-01 | diff -s bin /dev/stdin
Files bin and /dev/stdin are identical
putting the second half of the file in place (binrev has the two halves swapped, bintailhalf is only the second half):
$ cat bin.split-01 bin.split-00 > binrev
$ cat bin.split-01 > bintailhalf
listing current state of directory:
$ lf
16227960 ./bin
8113980 ./bin.split-00
8113980 ./bin.split-01
16227960 ./binrev
8113980 ./bintailhalf
test #1 ("simple"): compressing, decompressing and comparing (using the original binary file as dictionary)
# compressing
$ zstd -D bin --long=31 --zstd=ldmHashRateLog=25,chainLog=28 -vv -f -3 binrev -o binrev.bindict.zstd
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Loading bin as dictionary
binrev : 0.01% (16227960 => 1521 bytes, binrev.bindict.zstd)
binrev : Completed in 0.16 sec (cpu load : 98%)
# decompressing
$ zstd -D bin -vv -d -o binrev.bindict.zstd.decompressed binrev.bindict.zstd
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Loading bin as dictionary
binrev.bindict.zstd : 16227960 bytes
# comparing to original
$ diff -s binrev binrev.bindict.zstd.decompressed
Files binrev and binrev.bindict.zstd.decompressed are identical
test #2 ("complex"): compressing, decompressing and comparing (using the original binary file as dictionary)
# compressing
$ zstd -D bin --long=31 --zstd=ldmHashRateLog=25,chainLog=28 -vv -f -3 bintailhalf -o bintailhalf.bindict.zstd
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Loading bin as dictionary
bintailhalf : 0.01% (8113980 => 742 bytes, bintailhalf.bindict.zstd)
bintailhalf : Completed in 0.13 sec (cpu load : 100%)
# decompressing
$ zstd -D bin -vv -d -o bintailhalf.bindict.zstd.decompressed bintailhalf.bindict.zstd
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Loading bin as dictionary
bintailhalf.bindict.zstd: 8113980 bytes
# comparing to original
$ diff -s bintailhalf bintailhalf.bindict.zstd.decompressed
Files bintailhalf and bintailhalf.bindict.zstd.decompressed are identical
listing current state of directory:
$ lf
16227960 ./bin
8113980 ./bin.split-00
8113980 ./bin.split-01
16227960 ./binrev
8113980 ./bintailhalf
742 ./bintailhalf.bindict.zstd
8113980 ./bintailhalf.bindict.zstd.decompressed
1521 ./binrev.bindict.zstd
16227960 ./binrev.bindict.zstd.decompressed
Question #1:
As can be seen above, I can give zstd any file to use as a dictionary. zstd happily ingests it, even if the given dictionary file was not generated using zstd's own --train option.
Is this allowed? Can I rely on zstd allowing me to do this in the future?
Question #2:
Would it be reasonable to expose this functionality via the API in some way, so that the "otherwise unnecessary parts" (whatever they are) could be avoided?
Question #3:
I ran into a wall when trying this procedure on files bigger than 32 MB. zstd refuses to use dictionaries bigger than this and fails with the following error:
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Loading /usr/bin/docker as dictionary
zstd: error 32 : Dictionary file /usr/bin/docker is too large (> 32 MB)
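Until the limit is lifted, a script that uses this procedure could check the size of the reference file before handing it to zstd as a dictionary. A minimal sketch in POSIX shell; the function name `fits_as_dictionary` is hypothetical, and reading the "32 MB" in the error message as 32 * 1024 * 1024 bytes is an assumption.

```sh
#!/bin/sh
# Assumption: "32 MB" in the zstd error message means 32 * 1024 * 1024 bytes.
MAX_DICT_BYTES=$((32 * 1024 * 1024))

# fits_as_dictionary FILE
# Returns 0 if FILE is small enough to be passed to `zstd -D`,
# 1 if it is too large, and 2 if FILE is not a regular file.
fits_as_dictionary() {
    [ -f "$1" ] || return 2
    size=$(wc -c < "$1")
    size=$((size))    # normalise any whitespace padding from wc
    [ "$size" -le "$MAX_DICT_BYTES" ]
}
```

A caller could then run `fits_as_dictionary /usr/bin/docker || echo "reference too large for -D"` and fall back to a plain download of the new file.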