Remap legacy FM Sinhala fonts to Unicode during PDF extraction#62
Merged
nicpottier merged 2 commits into main, Feb 17, 2026
PDFs using FM-family fonts (FMSamantha, FMAbhaya, etc.) map Sinhala glyphs to Latin/ASCII codepoints, producing garbled text on extraction. Detect FM fonts per-span via muPDF structured text and apply a 1585-entry longest-match-first conversion table to produce correct Sinhala Unicode. Non-FM pages are unaffected (unchanged code path).
Summary
FM-family fonts (FMSamantha, FMAbhaya, FM Malithi, etc.) are legacy Sri Lankan fonts that map Sinhala glyphs to Latin/ASCII codepoints. PDFs using these fonts render correctly but text extraction returns garbled Latin characters instead of proper Sinhala Unicode.
This PR detects FM fonts at the per-span level in muPDF's structured text and applies a 1585-entry longest-match-first conversion table from gavi-tharaka/sinhala_convertor to produce correct Unicode. Non-FM pages remain unaffected (zero behavior change).
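As a rough illustration of the approach described above, the following is a minimal sketch, not the code from this PR. The `Span` shape, the `isFmFont`, `convertFmText`, and `extractSpans` names, the "FM" prefix check, and the three table entries are all assumptions made for the example; the real table has 1585 entries and the real span data comes from the structured text.

```typescript
// Hypothetical span shape; the actual structured-text representation may differ.
interface Span {
  font: string;
  text: string;
}

// Placeholder entries for illustration only, not copied from the real table.
const SAMPLE_TABLE: Array<[string, string]> = [
  ["w", "අ"],
  ["wd", "ආ"],
  ["l", "ක"],
];

// Sort once so longer legacy sequences are tried before their own prefixes.
const SORTED_TABLE = [...SAMPLE_TABLE].sort(
  (a, b) => b[0].length - a[0].length
);

// Assumption: FM fonts are recognisable by an "FM" name prefix, after
// dropping an optional PDF subset tag such as "ABCDEF+".
function isFmFont(fontName: string): boolean {
  const base = fontName.replace(/^[A-Z]{6}\+/, "");
  return /^FM/.test(base);
}

// Longest-match-first conversion: at each position, take the longest table
// key that matches; characters with no match are copied through unchanged.
function convertFmText(text: string): string {
  let out = "";
  let i = 0;
  while (i < text.length) {
    let matched = false;
    for (const [legacy, unicode] of SORTED_TABLE) {
      if (text.startsWith(legacy, i)) {
        out += unicode;
        i += legacy.length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      out += text.charAt(i);
      i += 1;
    }
  }
  return out;
}

// Convert only spans set in an FM font; other spans pass through untouched.
function extractSpans(spans: Span[]): string {
  return spans
    .map((s) => (isFmFont(s.font) ? convertFmText(s.text) : s.text))
    .join("");
}
```

Sorting by key length matters because a short key such as `w` would otherwise consume the first character of `wd` and leave a stray `d` in the output. Deciding per span, rather than per page, is what lets mixed pages keep their non-FM text as extracted.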
Changes
extractTextFromStructuredText() instead of stext.asText()
All 522 tests pass.