Remap legacy FM Sinhala fonts to Unicode during PDF extraction#62

Merged
nicpottier merged 2 commits into main from nicpottier/sinhala-pdf-extract
Feb 17, 2026

@nicpottier
Contributor

Summary

FM-family fonts (FMSamantha, FMAbhaya, FM Malithi, etc.) are legacy Sri Lankan fonts that map Sinhala glyphs to Latin/ASCII codepoints. PDFs using these fonts render correctly, but text extraction returns garbled Latin characters instead of Sinhala Unicode.

This PR detects FM fonts at the per-span level in muPDF's structured text and applies a 1585-entry longest-match-first conversion table from gavi-tharaka/sinhala_convertor to produce correct Unicode. Non-FM pages remain unaffected (zero behavior change).
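For readers unfamiliar with longest-match-first conversion, here is a minimal sketch of the idea. It is not the code from fm-sinhala.ts: the function name `convertFmText`, the `SAMPLE_TABLE` constant, and its three entries are assumptions for illustration and should be checked against the real 1585-entry table before being relied on.

```typescript
// Illustrative entries only (assumed, not copied from fm-sinhala-data.ts).
// Note that "w" is a prefix of "wd", which is why match order matters.
const SAMPLE_TABLE: Array<[string, string]> = [
  ["w", "අ"],
  ["wd", "ආ"],
  ["l", "ක"],
];

// Sort once so longer source sequences are tried before their prefixes.
const SORTED_TABLE: Array<[string, string]> = [...SAMPLE_TABLE].sort(
  (a, b) => b[0].length - a[0].length,
);

// Hypothetical helper: scan left to right, always consuming the longest
// table key that matches at the current position.
function convertFmText(
  input: string,
  table: Array<[string, string]> = SORTED_TABLE,
): string {
  let out = "";
  let i = 0;
  while (i < input.length) {
    let matched = false;
    for (const [from, to] of table) {
      if (input.startsWith(from, i)) {
        out += to;
        i += from.length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Characters with no mapping pass through unchanged.
      out += input[i];
      i += 1;
    }
  }
  return out;
}
```

Without the sort, the shorter key `"w"` would match first and the `"d"` would be left behind as a stray Latin character, which is the failure that trying longer keys first avoids.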

Changes

  • fm-sinhala.ts: Font detection and conversion logic (walk structured text, remap per-span)
  • fm-sinhala-data.ts: 1585 FM→Unicode mappings sorted longest-first
  • extract.ts: Use extractTextFromStructuredText() instead of stext.asText()
  • fm-sinhala.test.ts: 7 unit tests
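The per-span detection described above could look roughly like the sketch below. The `TextSpan` shape, the `isFmFont` heuristic, and `remapSpans` are hypothetical names; the PR's actual detection rule and its traversal of the structured text are not shown on this page, so treat this as one plausible approach, not the merged implementation.

```typescript
// Hypothetical span shape; the real types in fm-sinhala.ts are not shown here.
interface TextSpan {
  fontName: string;
  text: string;
}

// Assumed heuristic: strip a PDF subset tag such as "ABCDEF+", then check
// for a leading "FM" followed by an optional separator and a capital letter.
function isFmFont(fontName: string): boolean {
  const base = fontName.replace(/^[A-Z]{6}\+/, "");
  return /^FM[ _-]?[A-Z]/.test(base);
}

// Convert only the spans set in an FM font; every other span is copied
// through untouched, so pages without FM fonts produce identical output.
function remapSpans(
  spans: TextSpan[],
  convert: (text: string) => string,
): string {
  return spans
    .map((span) => (isFmFont(span.fontName) ? convert(span.text) : span.text))
    .join("");
}
```

Passing the converter in as a parameter keeps font detection testable on its own, independent of the mapping table.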

All 522 tests pass.

PDFs using FM-family fonts (FMSamantha, FMAbhaya, etc.) map Sinhala
glyphs to Latin/ASCII codepoints, producing garbled text on extraction.
Detect FM fonts per-span via muPDF structured text and apply a 1585-entry
longest-match-first conversion table to produce correct Sinhala Unicode.
Non-FM pages are unaffected (unchanged code path).
@nicpottier nicpottier merged commit 907bdf6 into main Feb 17, 2026
1 check passed