@korjavin korjavin commented Jan 17, 2026

Similar Images: CLIP-based visual similarity detection

Summary

Implements a Similar Images feature that helps users find and manage visually similar photos using CLIP embeddings and HNSW indexing for efficient nearest-neighbor search.

Key capabilities:

  • Automatic detection of visually similar images using AI embeddings
  • Category filtering (Close/Similar/Related) based on similarity distance
  • Smart deletion with "Best Photo" logic (protects favorites, captions, larger files)
  • Fast performance using HNSW index with IndexedDB persistence

UI Integration

Added to Sidebar → Free up space → Similar Images submenu, alongside:

  • Deduplicate Files (exact duplicates)
  • Large Files (storage cleanup)
  • Similar Images (visual similarity) ← NEW

This logical grouping makes it easy for users to discover cleanup tools in one place.

Technical Implementation

Similarity Detection

  • CLIP embeddings: Uses existing ML infrastructure for semantic image understanding
  • HNSW index: Hierarchical Navigable Small World graph for approximate nearest-neighbor search, roughly O(log n) per query
    • First load: ~7 minutes to build the index (large library, ~80k photos)
    • Subsequent loads: ~2-5 seconds (loaded from IndexedDB)
    • Incremental updates when files are added or removed
  • IndexedDB + IDBFS: Persists HNSW index for fast restarts

Category Thresholds

Matches mobile implementation:

  • Close: distance ≤ 0.001 (nearly identical)
  • Similar: 0.001 < distance ≤ 0.02 (visually similar)
  • Related: distance > 0.02 (related but distinct)

Based on CLIP cosine distance where 0 = identical, 2 = opposite.
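
The thresholds above can be sketched as a small classifier. This is a minimal illustration, not the code in this PR: the names `cosineDistance`, `categorize`, and the two constants are assumptions introduced here, and only the threshold values come from the list above.

```typescript
// Hypothetical names; only the threshold values are taken from the PR description.
type SimilarityCategory = "close" | "similar" | "related";

const CLOSE_MAX_DISTANCE = 0.001;
const SIMILAR_MAX_DISTANCE = 0.02;

/** Cosine distance (1 - cosine similarity): 0 = identical, 2 = opposite. */
function cosineDistance(a: number[], b: number[]): number {
    if (a.length !== b.length) throw new Error("dimension mismatch");
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        const x = a[i]!;
        const y = b[i]!;
        dot += x * y;
        normA += x * x;
        normB += y * y;
    }
    return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

/** Map a distance to one of the three categories (upper bounds are inclusive). */
function categorize(distance: number): SimilarityCategory {
    if (distance <= CLOSE_MAX_DISTANCE) return "close";
    if (distance <= SIMILAR_MAX_DISTANCE) return "similar";
    return "related";
}
```

Treating each upper bound as inclusive is a choice made for this sketch; the description does not say which side the 0.02 boundary falls on.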

Smart Deletion Logic

"Best Photo" selection prioritizes keeping:

  1. Favorited files (in favorites collection)
  2. Files with captions or edited metadata
  3. Larger file sizes (used as a proxy for better quality)
  4. Alphabetical (tie-breaker)

The first item in each group (best photo) is automatically protected from deletion. Users can also manually select/deselect individual items.
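
The priority order above maps naturally onto a sort comparator. The following is a sketch under assumptions: the `Candidate` shape and the function names are hypothetical, and the alphabetical tie-breaker is assumed to apply to the file name, which the description does not state.

```typescript
// Hypothetical shape; the real file type in the PR may differ.
interface Candidate {
    name: string;
    fileSize: number;
    isFavorite: boolean;
    hasCaptionOrEdits: boolean;
}

/** Negative result means `a` should be kept in preference to `b`. */
function compareForKeeping(a: Candidate, b: Candidate): number {
    // 1. Favorites first.
    if (a.isFavorite !== b.isFavorite) return a.isFavorite ? -1 : 1;
    // 2. Then files with captions or edited metadata.
    if (a.hasCaptionOrEdits !== b.hasCaptionOrEdits)
        return a.hasCaptionOrEdits ? -1 : 1;
    // 3. Then larger files.
    if (a.fileSize !== b.fileSize) return b.fileSize - a.fileSize;
    // 4. Alphabetical tie-breaker (assumed to be by file name).
    return a.name.localeCompare(b.name);
}

/** Returns a sorted copy; index 0 is the "best photo" protected from deletion. */
function orderGroup(group: Candidate[]): Candidate[] {
    return [...group].sort(compareForKeeping);
}
```

Sorting a copy keeps the caller's array untouched, which is convenient when the same group is also rendered in its original order.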

Collection Handling

Properly handles files in multiple collections:

  • Aggregates all collection IDs per file
  • Preserves all memberships during deletion
  • Shows collection name in UI for context
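
Aggregating collection IDs per file can be sketched as a single pass over (file, collection) pairs. The `FileEntry` shape and the function name below are assumptions for illustration, not the types used in the PR.

```typescript
// Hypothetical shape: one entry per (file, collection) membership.
interface FileEntry {
    fileID: number;
    collectionID: number;
}

/** Collect every collection ID that each file belongs to. */
function collectionIDsByFileID(
    entries: FileEntry[],
): Map<number, Set<number>> {
    const result = new Map<number, Set<number>>();
    for (const { fileID, collectionID } of entries) {
        let ids = result.get(fileID);
        if (!ids) {
            ids = new Set<number>();
            result.set(fileID, ids);
        }
        ids.add(collectionID);
    }
    return result;
}
```

Using a `Set` per file means repeated memberships are deduplicated, so a deletion step can iterate over each collection exactly once.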

Changes

New Files

  • web/packages/new/photos/services/similar-images.ts - Core similarity detection (855 lines)
  • web/packages/new/photos/services/similar-images-types.ts - Type definitions (89 lines)
  • web/packages/new/photos/services/similar-images-delete.ts - Deletion logic (173 lines)
  • web/packages/new/photos/pages/similar-images.tsx - UI page component (1203 lines)
  • web/packages/new/photos/services/ml/hnsw.ts - HNSW index wrapper (456 lines)
  • web/packages/new/photos/services/__tests__/similar-images.test.ts - Test suite (569 lines)
  • web/apps/photos/src/pages/similar-images.tsx - Next.js route (10 lines)

Modified Files

  • web/packages/new/photos/services/ml/db.ts - Added similar-images cache schema
  • web/packages/new/photos/services/ml/clip.ts - Exported getCLIPIndexes alias
  • web/packages/new/photos/components/Tiles.tsx - Added LargeFileTileOverlay export
  • web/apps/photos/src/components/Sidebar.tsx - Added submenu integration
  • web/packages/new/photos/services/search/types.ts - Added freeUpSpace.similarImages action

Performance Characteristics

Initial Analysis (first time):

  • Builds HNSW index from all CLIP embeddings
  • Time: ~7 minutes for large library (80k photos)
  • Saves index to IndexedDB for reuse

Subsequent Analyses:

  • Loads HNSW index from IndexedDB
  • Time: ~2-5 seconds
  • Applies incremental updates if files were added or removed

Search Performance:

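  • O(log n) per query with HNSW, so O(n log n) for the whole library, vs O(n²) with brute-force pairwise comparison
  • For 80k images: ~1.3M comparisons (80,000 × log₂ 80,000) vs ~6.4B (80,000²), roughly 5000x fewer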
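
The figures above follow from a simple back-of-the-envelope model, sketched here. It assumes one query per image costing about log₂ n comparisons; actual HNSW cost depends on its construction and search parameters, so treat this as an estimate rather than a measurement. The function name is hypothetical.

```typescript
/** Rough comparison counts for n images under the simplified model above. */
function estimateComparisons(n: number): {
    bruteForce: number;
    hnsw: number;
    ratio: number;
} {
    const bruteForce = n * n; // every image against every image
    const hnsw = n * Math.log2(n); // one ~log2(n) query per image
    return { bruteForce, hnsw, ratio: bruteForce / hnsw };
}

const estimate = estimateComparisons(80_000);
console.log(estimate.bruteForce); // 6.4 billion
console.log(Math.round(estimate.hnsw)); // about 1.3 million
console.log(Math.round(estimate.ratio)); // a little under 5000
```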

@korjavin korjavin closed this Jan 17, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c7ad2d210e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@korjavin korjavin reopened this Jan 17, 2026
@korjavin
Author

@codex review

@korjavin korjavin marked this pull request as draft January 17, 2026 11:37

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Reviewed commit: 5df9d5db48

@korjavin
Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Reviewed commit: 31d47fa1f9

Comment on lines +922 to +925
// Reset cache when expanded groups or data changes
React.useEffect(() => {
    listRef.current?.resetAfterIndex(0);
}, [expandedGroups, similarImageGroups]);

P3: Reset list size cache when layout params change

The row height calculation depends on layoutParams (columns/itemHeight/gap), but the cache reset only runs when expandedGroups or similarImageGroups change. If the container width changes (resize, split view, sidebar toggle), layoutParams changes without a reset, so VariableSizeList keeps stale item sizes and rows render with incorrect heights until another expansion/data change occurs. Consider adding layoutParams (or width/height) to the reset effect so resizing recalculates sizes.

}
}

onProgress?.(65);

P3: Avoid regressing progress after index build

When the index is rebuilt from scratch, progress is advanced to ~95% during the build/save path, but this unconditional onProgress(65) forces the UI backwards before search starts. Users will see the progress bar jump from near-complete back to 65% on cache misses. This should only run when the index was loaded (not rebuilt), or be replaced with a monotonic update.
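
One way to get the monotonic update suggested here is to wrap the callback so that lower values are dropped. This is a sketch, not the fix applied in the PR; `monotonicProgress` and `ProgressCallback` are hypothetical names.

```typescript
type ProgressCallback = (percent: number) => void;

/**
 * Wrap a progress callback so that reported values never go backwards.
 * Values less than or equal to the highest one seen so far are ignored.
 */
function monotonicProgress(onProgress?: ProgressCallback): ProgressCallback {
    let highest = Number.NEGATIVE_INFINITY;
    return (percent) => {
        if (percent <= highest) return;
        highest = percent;
        onProgress?.(percent);
    };
}

// Usage: the late report(65) after a rebuild is swallowed.
const seen: number[] = [];
const report = monotonicProgress((p) => seen.push(p));
report(10);
report(95); // index rebuilt and saved
report(65); // unconditional call before search; ignored
report(100);
console.log(seen); // [10, 95, 100]
```

With this wrapper the call site can stay unconditional, and the cached-index path still reports 65 because nothing higher has been seen yet.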

@korjavin
Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Reviewed commit: 5084dcbb9d

Comment on lines +214 to +218
const cacheKey = `si_${distanceThreshold.toFixed(3)}_${hashFileIDs(
    fileIDs,
)}`;
const cacheEntry: CachedSimilarImages = {
    id: cacheKey,

P2: Use cache key format expected by loader

The cache is saved under si_${distanceThreshold.toFixed(3)}_${hashFileIDs(fileIDs)}, but loadSimilarImagesCache() builds keys via getSimilarImagesCacheKey which appends _v${clipIndexingVersion} (see ml/db.ts). That mismatch means the saved entries are never retrieved, so every run is a cache miss and triggers full recomputation/rebuild even when nothing changed. Saving with the same key builder as the loader (including the version suffix) would make cache hits work.
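
The mismatch goes away if both the save and load paths call one key builder. The sketch below uses an in-memory `Map` in place of IndexedDB. The function signatures, the `groups` field, and the `hashFileIDs` implementation are assumptions; only the key format (`si_<threshold>_<hash>_v<version>`) follows the comment above.

```typescript
// Stand-in for the real cache entry type; only `id` is known from the diff.
interface CachedSimilarImages {
    id: string;
    groups: number[][];
}

/** Hypothetical order-insensitive hash (FNV-1a over the sorted IDs). */
function hashFileIDs(fileIDs: number[]): string {
    let hash = 0x811c9dc5;
    const text = [...fileIDs].sort((a, b) => a - b).join(",");
    for (let i = 0; i < text.length; i++) {
        hash ^= text.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193) >>> 0;
    }
    return hash.toString(16);
}

/** Single source of truth for the key, including the version suffix. */
function getSimilarImagesCacheKey(
    distanceThreshold: number,
    fileIDs: number[],
    clipIndexingVersion: number,
): string {
    const hash = hashFileIDs(fileIDs);
    return `si_${distanceThreshold.toFixed(3)}_${hash}_v${clipIndexingVersion}`;
}

const cacheStore = new Map<string, CachedSimilarImages>();

function saveSimilarImagesCache(
    distanceThreshold: number,
    fileIDs: number[],
    clipIndexingVersion: number,
    groups: number[][],
): void {
    const id = getSimilarImagesCacheKey(
        distanceThreshold,
        fileIDs,
        clipIndexingVersion,
    );
    cacheStore.set(id, { id, groups });
}

function loadSimilarImagesCache(
    distanceThreshold: number,
    fileIDs: number[],
    clipIndexingVersion: number,
): CachedSimilarImages | undefined {
    return cacheStore.get(
        getSimilarImagesCacheKey(distanceThreshold, fileIDs, clipIndexingVersion),
    );
}
```

Because the version is part of the key, bumping `clipIndexingVersion` invalidates old entries without an explicit migration.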

Comment on lines 552 to 556
if (!_clipHNSWIndex) {
    console.log(`[HNSW] Creating new index with capacity: ${capacity}`);
    _clipHNSWIndex = new HNSWIndex(
        512, // CLIP embedding dimension
        capacity,

P2: Recreate index when skipInit is requested

skipInit only takes effect when a new instance is created. If Similar Images is opened again in the same SPA session, _clipHNSWIndex is already initialized but callers still request getCLIPHNSWIndex(..., true) and then call loadIndex(). loadIndex() explicitly throws when the index is initialized, so the caller clears metadata and rebuilds every time. Consider destroying/recreating the index when skipInit is true on an existing instance, or bypassing loadIndex() when the in-memory index is already valid.
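
The second option, bypassing loadIndex() when the in-memory index is already valid, can be sketched with a stand-in class. `FakeHNSWIndex`, `getIndex`, and the `needsLoad` flag are hypothetical and only mimic the behaviour described in this comment; they are not the PR's `HNSWIndex` API.

```typescript
// Stand-in that mimics the described behaviour: loadIndex() throws if the
// index has already been initialized.
class FakeHNSWIndex {
    initialized: boolean;
    capacity: number;

    constructor(capacity: number, skipInit: boolean) {
        this.capacity = capacity;
        this.initialized = !skipInit;
    }

    loadIndex(): void {
        if (this.initialized) throw new Error("index already initialized");
        this.initialized = true;
    }
}

let cachedIndex: FakeHNSWIndex | undefined;

/**
 * Return the shared index and tell the caller whether it still needs to be
 * loaded from persistent storage. An existing, initialized instance is reused.
 */
function getIndex(
    capacity: number,
    skipInit: boolean,
): { index: FakeHNSWIndex; needsLoad: boolean } {
    if (!cachedIndex) cachedIndex = new FakeHNSWIndex(capacity, skipInit);
    return { index: cachedIndex, needsLoad: !cachedIndex.initialized };
}

// First visit in the session: the index is created uninitialized, then loaded.
const first = getIndex(1000, true);
if (first.needsLoad) first.index.loadIndex();

// Second visit: same instance, already valid, so loadIndex() is skipped.
const second = getIndex(1000, true);
if (second.needsLoad) second.index.loadIndex();
```

The caller decides based on `needsLoad` instead of catching the exception, so a repeat visit no longer clears metadata and rebuilds.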

@korjavin
Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Reviewed commit: 03ef360b69

@korjavin korjavin force-pushed the feature/deduplication-similar branch from f282ad6 to c9f4558 Compare January 18, 2026 13:21