This is my own CLIP-based image deduplication toolkit, born from dissatisfaction with off-the-shelf solutions (most of them are either slow or unsuitable for processing my own image directories). It is purely command-line and batch-oriented (not interactive), and is designed to handle image datasets so large that running ls inside the directory takes more than 5 seconds to finish listing.
Because it is kept simple (less than 1k lines of Python), all processing is done in memory, which limits how many images it can handle on low-memory systems (each image takes about 5KiB of both VRAM and system RAM for a single FP32 embedding).
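For a rough idea of what that means in practice, here is a back-of-the-envelope sketch. It assumes 1280 FP32 values per embedding, which is simply one dimension that works out to the 5KiB quoted above; the actual model and dimension may differ, and the function names are made up for illustration.

```python
# Rough memory estimate for holding one FP32 embedding per image in memory.
# Assumption: 1280 values per embedding (1280 * 4 bytes = 5120 bytes = 5 KiB).
EMBEDDING_DIM = 1280  # hypothetical, chosen only to match the 5 KiB figure
BYTES_PER_FP32 = 4


def embedding_bytes(num_images: int, dim: int = EMBEDDING_DIM) -> int:
    """Bytes needed to store one FP32 embedding per image."""
    return num_images * dim * BYTES_PER_FP32


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


if __name__ == "__main__":
    for n in (60_000, 1_000_000):
        print(f"{n:>9,} images -> {to_gib(embedding_bytes(n)):.2f} GiB")
```

Under that assumption, 60k images need about 0.29 GiB and one million images about 4.77 GiB, per copy of the embedding matrix. Temporary buffers used during comparison come on top of that.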
- High performance (as fast as humanly possible on image encoding and matching)
- Incremental embedding DB (filesystem-as-db) with mtime-based updates and orphan cleanup
- Multiple duplicate-keeping strategies (newest, largest, highest-quality, pic-dir), which can be extended quite easily
- GPU support: image embedding comparison is more than 10x faster on GPU (it is memory-bandwidth bound)
- Async batched model inference (about 1.5x speedup) and multiprocessing DB loading (about 2x speedup)
Minimum requirements:
- Python 3.11+
- More than 8GiB of free RAM
- A working PyTorch install (CUDA optional but recommended if you have a GPU)
Git clone, uv pip install...you know the drill.
$ git clone https://github.com/NeoChen1024/clip-image-deduper
Then install it inside a venv. I recommend using uv to manage it (it's going to take quite a bit of space because of PyTorch):
$ cd clip-image-deduper
$ uv venv
$ source .venv/bin/activate
$ uv pip install -e .
If you don't use uv, a plain virtualenv + pip flow also works:
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -e .
It installs the following commands:
- clip-image-deduper: The default deduper implementation, for deduping an image directory against itself.
- clip-image-import-deduper: Alternative deduper implementation, for deduping an "importing" image directory against a "base" dir.
- clip-image-deduper-db-test: Test DB encoding and loading speed.
- clip-image-encoding-test: Test the pairwise Euclidean distances between a set of images.
Dedupe images in a directory:
$ clip-image-deduper -i pic-dir -d db-dir -t trash-dir
Dedupe images in an "importing" dir against a "base" dir (this removes images from "importing" when the same image is found in "base"):
$ clip-image-import-deduper -bi pic-dir -bd pic-db-dir -ii importing -id import-db-dir -t trash-dir
Only the most important flags are listed here; run clip-image-deduper --help for the full reference.
- -i, --image-dir: Directory containing images to process.
- -d, --db-dir: Directory to store the embedding database files (mirrors image-dir structure).
- -t, --trash-dir: Where duplicates are moved. If omitted, files are not moved.
- --threshold, -th: Euclidean distance threshold for considering images as duplicates. Default: 0.1 (lower = stricter).
- --keeping-logic, -kl: Which copy to keep among duplicates: newest, largest, highest-quality, or pic-dir.
- --device, -c: Device to run the CLIP model on, e.g. cuda or cpu. Defaults to cuda if available.
- --batch-size, -b: Batch size for image encoding. Adjust based on VRAM.
- --dry-run, -n: Show what would be moved without actually modifying any files.
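To illustrate what a keeping strategy boils down to, here is a minimal sketch and not the toolkit's actual code. The names KEEP_KEYS and pick_keeper are hypothetical, "largest" is assumed to mean file size on disk, and highest-quality and pic-dir are left out because their exact rules aren't spelled out here.

```python
from pathlib import Path
from typing import Callable

# Hypothetical strategy table: each name maps to a sort key, and the file
# with the highest key in a duplicate group is the one that gets kept.
# Assumption: "largest" compares file size in bytes.
KEEP_KEYS: dict[str, Callable[[Path], float]] = {
    "newest": lambda p: p.stat().st_mtime,
    "largest": lambda p: p.stat().st_size,
}


def pick_keeper(group: list[Path], logic: str = "newest") -> tuple[Path, list[Path]]:
    """Return (file to keep, files to move to trash) for one duplicate group."""
    keeper = max(group, key=KEEP_KEYS[logic])
    return keeper, [p for p in group if p != keeper]
```

With a table like this, adding another strategy is just one more entry with its own key function.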
The "db" is a directory containing image embeddings that mirrors the image directory structure.
For each image file, there is a corresponding .npz file containing the embedding, with key "clip_embedding". The .npz extension is added to the original image filename, e.g. picdir/dir-a/image.jpg -> dbdir/dir-a/image.jpg.npz.
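The layout above, together with the mtime-based updates and orphan cleanup, can be sketched roughly like this. The function names are hypothetical, and the staleness rule (re-encode when the .npz is missing or older than the image) is an assumption about how the mtime check might work, not a description of the actual code.

```python
from pathlib import Path


def embedding_path(image: Path, image_dir: Path, db_dir: Path) -> Path:
    """Map picdir/dir-a/image.jpg to dbdir/dir-a/image.jpg.npz."""
    rel = image.relative_to(image_dir)
    return db_dir / rel.parent / (rel.name + ".npz")


def is_stale(image: Path, entry: Path) -> bool:
    """Assumed rule: stale if the entry is missing or older than the image."""
    return not entry.exists() or entry.stat().st_mtime < image.stat().st_mtime


def find_orphans(image_dir: Path, db_dir: Path) -> list[Path]:
    """List DB entries whose source image no longer exists."""
    orphans = []
    for entry in db_dir.rglob("*.npz"):
        rel = entry.relative_to(db_dir)
        # Strip the trailing ".npz" to recover the original image filename.
        source = image_dir / rel.parent / rel.name.removesuffix(".npz")
        if not source.exists():
            orphans.append(entry)
    return orphans
```

Because the mapping is purely path-based, no index file is needed: the filesystem itself acts as the database.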
Similarity is calculated as the Euclidean distance between embeddings, in FP32. This has extremely low arithmetic intensity, so it is memory-bandwidth bound.
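A minimal NumPy sketch of that comparison follows. It is a CPU illustration under assumptions, not the toolkit's implementation: duplicate_pairs is a hypothetical name, and treating "distance strictly below the threshold" as a duplicate is assumed.

```python
import numpy as np


def duplicate_pairs(emb: np.ndarray, threshold: float = 0.1) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose FP32 Euclidean distance is below threshold."""
    emb = np.asarray(emb, dtype=np.float32)
    pairs = []
    for i in range(len(emb) - 1):
        # Stream over all later rows: only a few FLOPs per value read, which
        # is why this kind of comparison is limited by memory bandwidth.
        dist = np.linalg.norm(emb[i + 1:] - emb[i], axis=1)
        for offset in np.flatnonzero(dist < threshold):
            pairs.append((i, i + 1 + int(offset)))
    return pairs


if __name__ == "__main__":
    demo = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]])
    print(duplicate_pairs(demo))  # [(0, 1)]
```

Each row is compared only against the rows after it, so every pair is checked once.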
- Find more ways to save memory
- Switch to a more usable inference library to replace Open CLIP (it has almost no documentation and produces a ton of linter errors)
- Train a custom model to optimize for anime image comparison?
- Clean-up?
Test platform:
Python 3.12 on Arch Linux, AMD Ryzen 7 5700X3D + NVIDIA RTX4080
Image encoding: about 15 images/s
Dedupe: main.py: ~900 images/s for a 60k-image dataset