Add image complexity filter to prune blank/simple images #59
Merged
nicpottier merged 1 commit into main on Feb 17, 2026
Adds a grayscale standard deviation filter that runs after the existing size filters during image classification. Images with stddev below the threshold are pruned as visually simple (blank backgrounds, solid fills).

- New `grayscaleStdDev()` in image-complexity.ts (pure JS: jpeg-js + pngjs)
- `min_stddev` field added to ImageFilters schema
- classifyPageImages accepts optional getImageBytes accessor
- Default threshold of 2 in all config presets
- Complexity setting exposed in v2 ExtractSettings UI
- Fix loading spinner flash on step rerun across all v2 step views
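To make the metric concrete, here is a minimal sketch of a grayscale standard deviation over already-decoded RGBA pixels. It is not the PR's `grayscaleStdDev()`, which also decodes JPEG/PNG bytes; the function name `rgbaGrayscaleStdDev` and the BT.601 luma weights are assumptions made for illustration.

```typescript
// Hypothetical helper, not the PR's implementation: computes the population
// standard deviation of grayscale values from a flat RGBA pixel buffer.
function rgbaGrayscaleStdDev(rgba: Uint8Array): number {
  const pixelCount = Math.floor(rgba.length / 4);
  if (pixelCount === 0) return 0;

  let sum = 0;
  let sumSq = 0;
  for (let i = 0; i < pixelCount; i++) {
    const o = i * 4;
    // BT.601 luma weights (assumed; the PR does not state which weights it uses).
    // The alpha channel at o + 3 is ignored.
    const gray = 0.299 * rgba[o] + 0.587 * rgba[o + 1] + 0.114 * rgba[o + 2];
    sum += gray;
    sumSq += gray * gray;
  }

  const mean = sum / pixelCount;
  // Clamp at zero to absorb floating-point cancellation on uniform images.
  const variance = Math.max(0, sumSq / pixelCount - mean * mean);
  return Math.sqrt(variance);
}
```

A solid fill yields a value at or very near 0, so it falls below a threshold of 2, while an image with real tonal variation scores well above it.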
Summary
- Adds a grayscale standard deviation filter (`min_stddev`) that prunes visually simple images (blank backgrounds, solid fills) during image classification, running after existing size filters

Details
Images from PDFs often include blank or near-blank artifacts (solid backgrounds, white boxes) that pass size filters but add no value. This mirrors the `is_blank_image` approach from adt-press.

Pipeline changes:
- New `grayscaleStdDev()` function in `image-complexity.ts`, with pure JS decoding via `jpeg-js` and `pngjs`
- `min_stddev` added to the `ImageFilters` Zod schema
- `classifyPageImages` accepts an optional `getImageBytes` accessor; the complexity check only runs on images that pass size filters
- Default threshold of `2` added to all config presets (storybook, reference, textbook)

UI changes:
- Complexity setting exposed in the v2 ExtractSettings UI
- Loading spinner flash on step rerun fixed (`isLoading && !stepRunning`), applied to all 7 v2 step views

Test plan
- `grayscaleStdDev`: solid PNGs, gradient, JPEG, unsupported format
- `classifyPageImages`: prune blank, keep complex, skip when unconfigured, skip when no bytes accessor, skip already-pruned
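
The filter ordering described under Pipeline changes, and the skip cases in the test plan, could be sketched roughly as below. This is an illustration under stated assumptions, not the actual `classifyPageImages`: every name here except `min_stddev` is hypothetical (including `min_side` as a stand-in for the existing size filters), and keeping an image whose stddev cannot be computed is an assumed behavior.

```typescript
// Hypothetical types for illustration; only min_stddev comes from the PR.
interface ImageInfo { id: string; width: number; height: number }
interface Filters { min_side?: number; min_stddev?: number }
interface Verdict { id: string; pruned: boolean; reason?: string }

function classifySketch(
  images: ImageInfo[],
  filters: Filters,
  // Returns null when the bytes cannot be decoded (assumed contract).
  stdDevOf: (bytes: Uint8Array) => number | null,
  getImageBytes?: (id: string) => Uint8Array | undefined,
): Verdict[] {
  return images.map((img): Verdict => {
    // 1. Size filter runs first; pruned images never reach the complexity check.
    if (filters.min_side !== undefined && Math.min(img.width, img.height) < filters.min_side) {
      return { id: img.id, pruned: true, reason: "too small" };
    }

    // 2. Complexity check runs only when a threshold is configured
    //    and a bytes accessor was supplied.
    if (filters.min_stddev !== undefined && getImageBytes) {
      const bytes = getImageBytes(img.id);
      const stdDev = bytes ? stdDevOf(bytes) : null;
      // Assumption: images with unknown stddev are kept rather than pruned.
      if (stdDev !== null && stdDev < filters.min_stddev) {
        return { id: img.id, pruned: true, reason: "low complexity" };
      }
    }

    return { id: img.id, pruned: false };
  });
}
```

Running the size filter first means image bytes are fetched and decoded only for images that could still be kept, which keeps the more expensive check off images that are already pruned.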