
fix(docx-core): add Jaccard fallback for word-level inplace changes #79

Merged
stevenobiajulu merged 1 commit into main from fix/inplace-word-level-lcs on Apr 4, 2026

@stevenobiajulu stevenobiajulu commented Apr 4, 2026

Summary

Fixes #78 — inplace mode showed full paragraph replacement instead of word-level inline changes when documents had few paragraphs.

Root cause: TF-IDF cosine similarity degenerates with few paragraphs. Each paragraph is a "document" in the IDF corpus. With 2 paragraphs, every shared word gets IDF = log(2/2) = 0, making cosine similarity ≈ 0 regardless of actual content overlap. The algorithm then skips atom-level LCS and marks entire paragraphs as deleted+inserted.
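The degenerate case can be reproduced with a small sketch. This is not the docx-core implementation: `tokenize` and `tfidfCosine` are hypothetical helpers, and the sketch assumes an unsmoothed IDF of log(N/df) with each paragraph treated as one document, as described above.

```typescript
// Illustrative sketch only; names here are hypothetical, not docx-core APIs.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

// TF-IDF cosine similarity where each paragraph in `corpus` is one "document".
function tfidfCosine(a: string, b: string, corpus: string[]): number {
  const docs = corpus.map((p) => new Set<string>(tokenize(p)));
  const n = docs.length;

  // Unsmoothed IDF = log(N / df). A word found in every paragraph gets 0.
  const idf = (word: string): number => {
    const df = docs.filter((d) => d.has(word)).length;
    return df === 0 ? 0 : Math.log(n / df);
  };

  // Weight vector: term frequency times IDF.
  const weigh = (text: string): Map<string, number> => {
    const vec = new Map<string, number>();
    for (const word of tokenize(text)) {
      vec.set(word, (vec.get(word) ?? 0) + idf(word));
    }
    return vec;
  };

  const va = weigh(a);
  const vb = weigh(b);
  let dot = 0;
  let normA = 0;
  let normB = 0;
  va.forEach((x, word) => {
    dot += x * (vb.get(word) ?? 0);
    normA += x * x;
  });
  vb.forEach((y) => {
    normB += y * y;
  });
  const denom = Math.sqrt(normA * normB);
  return denom === 0 ? 0 : dot / denom;
}

const original = "The tenant shall pay rent on the first day of each month";
const revised = "The tenant shall pay rent on the last day of each month";

// Two-paragraph corpus: every shared word has IDF = log(2/2) = 0.
const twoParagraphScore = tfidfCosine(original, revised, [original, revised]);

// Same pair inside a larger corpus of unrelated filler paragraphs.
const filler = ["alpha beta gamma", "delta epsilon zeta", "eta theta iota"];
const largerCorpusScore = tfidfCosine(original, revised, [original, revised, ...filler]);
```

In the two-paragraph corpus only "first" and "last" carry non-zero weight, and neither appears in the other paragraph, so the dot product is exactly 0. Once unrelated paragraphs are added, the shared words regain weight and the score becomes positive.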

Fix: Add gap-scoped Jaccard word-overlap fallback after TF-IDF. Jaccard uses absolute word overlap (immune to corpus size). Also remove the redundant TF-IDF similarity recheck on already-matched groups.
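A minimal sketch of a Jaccard word-overlap score, assuming distinct lowercase words as the set elements. The names `wordSet`, `jaccard`, and `JACCARD_THRESHOLD` are hypothetical and the tokenization is an assumption; only the 0.25 threshold comes from this PR.

```typescript
// Illustrative sketch only; `wordSet` and `jaccard` are hypothetical names.
function wordSet(text: string): Set<string> {
  const words: string[] = text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  return new Set<string>(words);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over distinct words.
// It depends only on the two paragraphs, never on the rest of the document.
function jaccard(a: string, b: string): number {
  const wa = wordSet(a);
  const wb = wordSet(b);
  let shared = 0;
  wa.forEach((w) => {
    if (wb.has(w)) shared++;
  });
  const union = wa.size + wb.size - shared;
  // Two empty paragraphs are treated as dissimilar here (an arbitrary choice).
  return union === 0 ? 0 : shared / union;
}

const JACCARD_THRESHOLD = 0.25; // threshold stated in this PR

const before = "The tenant shall pay rent on the first day of each month";
const after = "The tenant shall pay rent on the last day of each month";

// 10 shared distinct words out of 12 in the union.
const score = jaccard(before, after);
const runWordLevelDiff = score >= JACCARD_THRESHOLD;
```

Because the score uses only the two word sets, the same pair gets the same value whether the document has 2 paragraphs or 200.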

Design

Before / After

Before: [screenshot: pr77-after-spaces-preserved]

After: [screenshot: pr79-after]

Calibration data

| Pair | Jaccard | TF-IDF (2 paras) | TF-IDF (10 paras) |
| --- | --- | --- | --- |
| Minimal change (3 words) | 0.647 | 0.000 | 0.566 |
| Moderate change | 0.682 | 0.000 | 0.609 |
| Heavy change | 0.095 | 0.000 | 0.003 |
| Completely different | 0.000 | 0.000 | 0.000 |

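With 2 paragraphs, TF-IDF scores 0.000 for every pair. Jaccard correctly catches the first two pairs (minimal and moderate change), which score well above the 0.25 threshold, while the heavy-change and completely different pairs stay below it.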

Test plan

…DF degenerates

TF-IDF cosine similarity fails with few paragraphs: shared words get
IDF = log(N/N) = 0, making cosine similarity ≈ 0 even for paragraphs
that share most of their content. This caused entire paragraphs to be
marked as deleted+inserted instead of producing word-level inline
tracked changes.

Add gap-scoped Jaccard word-overlap fallback that runs after TF-IDF
for any unmatched groups. Jaccard uses absolute word overlap, immune
to corpus size. Also remove the redundant TF-IDF recheck gate on
matched groups — once groups are matched, always run atom-level LCS.

- TF-IDF first pass (handles 10+ paragraph documents well)
- Jaccard fallback for leftovers (catches few-paragraph degenerate case)
- Both gap-scoped (preserves document order, prevents #61 regression)
- Both at 0.25 threshold
- Add unit test for Jaccard fallback
- Add integration test asserting word-level granularity on multiple-changes

Closes #78
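The two-pass flow in the commit message can be sketched roughly as follows. This is one reading of "gap-scoped", not the code in hierarchicalLcs.ts: it assumes new pairs may only form between two neighbouring existing matches, and it takes the first candidate at or above the threshold rather than the best one. All names (`Match`, `matchWithinGaps`, `twoPassMatch`) are hypothetical.

```typescript
// Illustrative sketch only; every name below is hypothetical, not a docx-core API.
type Similarity = (a: string, b: string) => number;

interface Match {
  orig: number; // index into the original paragraphs
  rev: number; // index into the revised paragraphs
}

// Pair still-unmatched paragraphs, but only inside the gap between two
// neighbouring matches, scanning left to right so pairs never cross.
// Assumes `anchors` is already order-preserving on both sides.
function matchWithinGaps(
  orig: string[],
  rev: string[],
  anchors: Match[],
  similarity: Similarity,
  threshold: number
): Match[] {
  const sorted = [...anchors].sort((x, y) => x.orig - y.orig);
  const bounds: Match[] = [
    { orig: -1, rev: -1 },
    ...sorted,
    { orig: orig.length, rev: rev.length },
  ];
  const found: Match[] = [];
  for (let g = 0; g + 1 < bounds.length; g++) {
    const lo = bounds[g];
    const hi = bounds[g + 1];
    let nextRev = lo.rev + 1;
    for (let i = lo.orig + 1; i < hi.orig; i++) {
      for (let j = nextRev; j < hi.rev; j++) {
        // Greedy: accept the first candidate that clears the threshold.
        if (similarity(orig[i], rev[j]) >= threshold) {
          found.push({ orig: i, rev: j });
          nextRev = j + 1;
          break;
        }
      }
    }
  }
  return [...sorted, ...found].sort((x, y) => x.orig - y.orig);
}

// First pass with the primary measure, then the fallback on whatever is left.
function twoPassMatch(
  orig: string[],
  rev: string[],
  primary: Similarity,
  fallback: Similarity,
  threshold: number = 0.25
): Match[] {
  const firstPass = matchWithinGaps(orig, rev, [], primary, threshold);
  return matchWithinGaps(orig, rev, firstPass, fallback, threshold);
}
```

Passing a TF-IDF scorer as `primary` and a Jaccard scorer as `fallback` mirrors the order in the list above. Every pair returned would then go to atom-level LCS with no further similarity recheck.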

@github-actions github-actions bot added the fix label Apr 4, 2026

codecov bot commented Apr 4, 2026

Codecov Report

❌ Patch coverage is 97.61905% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ocx-core/src/baselines/atomizer/hierarchicalLcs.ts | 97.61% | 1 Missing ⚠️ |


@stevenobiajulu stevenobiajulu changed the title from "fix(docx-core): Jaccard fallback for word-level inplace changes when TF-IDF degenerates" to "fix(docx-core): Jaccard fallback for word-level inplace changes" Apr 4, 2026
@stevenobiajulu stevenobiajulu changed the title from "fix(docx-core): Jaccard fallback for word-level inplace changes" to "fix(docx-core): add Jaccard fallback for word-level inplace changes" Apr 4, 2026
@stevenobiajulu stevenobiajulu self-assigned this Apr 4, 2026
@stevenobiajulu stevenobiajulu merged commit 5757679 into main Apr 4, 2026
30 of 32 checks passed