Add image complexity filter to prune blank/simple images #59

Merged
nicpottier merged 1 commit into main from nicpottier/image-complexity-filter
Feb 17, 2026

@nicpottier
Contributor

Summary

  • Adds a grayscale standard deviation filter (min_stddev) that prunes visually simple images (blank backgrounds, solid fills) during image classification, running after existing size filters
  • Exposes the complexity threshold in the v2 Extract Settings UI under "Image Filters"
  • Fixes a brief "Loading pages..." spinner flash when starting a rerun from settings across all v2 step views

Details

Images from PDFs often include blank or near-blank artifacts (solid backgrounds, white boxes) that pass size filters but add no value. The new complexity filter mirrors the is_blank_image approach from adt-press.

Pipeline changes:

  • New grayscaleStdDev() function in image-complexity.ts — pure JS decoding via jpeg-js and pngjs
  • min_stddev added to ImageFilters Zod schema
  • classifyPageImages accepts an optional getImageBytes accessor; complexity check only runs on images that pass size filters
  • Default threshold of 2 added to all config presets (storybook, reference, textbook)
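
For readers unfamiliar with the metric, here is a minimal sketch of a grayscale standard deviation over already-decoded RGBA pixels. It is not the code in image-complexity.ts: the function names are hypothetical, the jpeg-js/pngjs decoding step is left out, and the BT.601 luma weights are an assumption about how color might be reduced to gray.

```typescript
// Hypothetical helper (not the PR's grayscaleStdDev): standard deviation of
// per-pixel luminance for a raw RGBA buffer (4 bytes per pixel).
function rgbaGrayscaleStdDev(rgba: Uint8Array): number {
  const pixelCount = Math.floor(rgba.length / 4);
  if (pixelCount === 0) return 0;

  // Pass 1: convert each pixel to gray and accumulate the mean.
  const gray = new Float64Array(pixelCount);
  let sum = 0;
  for (let i = 0; i < pixelCount; i++) {
    const o = i * 4;
    // Assumed BT.601 weights; alpha (rgba[o + 3]) is ignored.
    gray[i] = 0.299 * rgba[o] + 0.587 * rgba[o + 1] + 0.114 * rgba[o + 2];
    sum += gray[i];
  }
  const mean = sum / pixelCount;

  // Pass 2: population variance around the mean.
  let squaredDiffs = 0;
  for (let i = 0; i < pixelCount; i++) {
    const d = gray[i] - mean;
    squaredDiffs += d * d;
  }
  return Math.sqrt(squaredDiffs / pixelCount);
}

// An image counts as visually simple when its stddev falls below the threshold.
function isVisuallySimple(rgba: Uint8Array, minStdDev: number): boolean {
  return rgbaGrayscaleStdDev(rgba) < minStdDev;
}
```

A solid fill has a stddev of 0 and falls below the default threshold of 2, while any image with real tonal variation scores well above it.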

UI changes:

  • Complexity input added to v2 ExtractSettings under "Image Filters" heading
  • Loading guard fix (isLoading && !stepRunning) applied to all 7 v2 step views
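
The guard condition can be illustrated with a small pure function. This is a sketch only: the function name is hypothetical, and the real step views presumably evaluate the expression inline in their render logic.

```typescript
// Hypothetical helper illustrating the guard `isLoading && !stepRunning`.
// isLoading:   the page query is still fetching.
// stepRunning: a (re)run of the step has just been started from settings.
function shouldShowLoadingSpinner(isLoading: boolean, stepRunning: boolean): boolean {
  // While a step is running, the view shows run progress instead, so the
  // "Loading pages..." spinner is suppressed even if a refetch is in flight.
  return isLoading && !stepRunning;
}
```

With this condition the spinner appears only on a genuine initial load, not in the brief window where a rerun has started and the page data is being refetched.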

Test plan

  • New unit tests for grayscaleStdDev (solid PNGs, gradient, JPEG, unsupported format)
  • New integration tests for complexity filter in classifyPageImages (prune blank, keep complex, skip when unconfigured, skip when no bytes accessor, skip already-pruned)
  • Run pipeline on a PDF with known blank images, verify they're pruned with stddev reason
  • Verify no loading spinner flash when starting a rerun from settings
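
The integration cases listed above imply a particular control flow, sketched below. All types and names here are hypothetical stand-ins, including the `min_side` size filter and the `stdDevOf` accessor, which replaces `getImageBytes` plus decoding. The actual `classifyPageImages` signature and prune-reason format may differ.

```typescript
// Hypothetical shapes for illustration only.
interface ImageInfo { id: string; width: number; height: number; }
interface Filters { min_side?: number; min_stddev?: number; }
interface Classified { id: string; pruned: boolean; reason?: string; }

function classifySketch(
  images: ImageInfo[],
  filters: Filters,
  stdDevOf?: (id: string) => number | null, // null: format could not be decoded
): Classified[] {
  return images.map((img): Classified => {
    // Size filter runs first; images pruned here never reach the complexity check.
    if (filters.min_side !== undefined && Math.min(img.width, img.height) < filters.min_side) {
      return { id: img.id, pruned: true, reason: "size" };
    }
    // Complexity check is skipped when unconfigured or when no accessor is given.
    if (filters.min_stddev !== undefined && stdDevOf) {
      const sd = stdDevOf(img.id);
      if (sd !== null && sd < filters.min_stddev) {
        return { id: img.id, pruned: true, reason: `stddev ${sd.toFixed(2)} < ${filters.min_stddev}` };
      }
    }
    return { id: img.id, pruned: false };
  });
}
```

Running the cheaper size check first means image bytes are only fetched and decoded for images that would otherwise be kept.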

Adds a grayscale standard deviation filter that runs after the existing
size filters during image classification. Images with stddev below the
threshold are pruned as visually simple (blank backgrounds, solid fills).

- New `grayscaleStdDev()` in image-complexity.ts (pure JS: jpeg-js + pngjs)
- `min_stddev` field added to ImageFilters schema
- classifyPageImages accepts optional getImageBytes accessor
- Default threshold of 2 in all config presets
- Complexity setting exposed in v2 ExtractSettings UI
- Fix loading spinner flash on step rerun across all v2 step views
nicpottier merged commit edbfacb into main on Feb 17, 2026
1 check passed