Remap legacy FM Sinhala fonts to Unicode during PDF extraction by nicpottier · Pull Request #62 · unicef/adt-studio

nicpottier · 2026-02-17T23:16:04Z

Summary

FM-family fonts (FMSamantha, FMAbhaya, FM Malithi, etc.) are legacy Sri Lankan fonts that map Sinhala glyphs to Latin/ASCII codepoints. PDFs using these fonts render correctly but text extraction returns garbled Latin characters instead of proper Sinhala Unicode.

This PR detects FM fonts at the per-span level in MuPDF's structured text and applies a 1585-entry longest-match-first conversion table from gavi-tharaka/sinhala_convertor to produce correct Sinhala Unicode. Pages without FM fonts are unaffected (zero behavior change).
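For readers unfamiliar with the approach, here is a minimal sketch of one way to read "longest-match-first" conversion. It is not the code in this PR: the table entries are made-up placeholders (not real FM mappings), and `sortLongestFirst` and `convertLongestMatch` are hypothetical names chosen for illustration.

```typescript
// A mapping pairs a legacy (ASCII) sequence with its replacement.
type Mapping = readonly [string, string];

// Placeholder entries for illustration only. These are NOT real FM mappings.
const SAMPLE_TABLE: Mapping[] = [
  ["a", "<A>"],
  ["ab", "<AB>"],
  ["abc", "<ABC>"],
  ["b", "<B>"],
];

// Sort once so longer source sequences are tried before their own prefixes.
// Empty keys are dropped because they would never advance the scan.
function sortLongestFirst(table: readonly Mapping[]): Mapping[] {
  return table
    .filter(([from]) => from.length > 0)
    .sort((x, y) => y[0].length - x[0].length);
}

// Scan left to right; at each position take the first (longest) entry
// that matches, otherwise copy the character through unchanged.
function convertLongestMatch(input: string, sorted: readonly Mapping[]): string {
  let out = "";
  let i = 0;
  while (i < input.length) {
    const hit = sorted.find(([from]) => input.startsWith(from, i));
    if (hit) {
      out += hit[1];
      i += hit[0].length;
    } else {
      out += input[i];
      i += 1;
    }
  }
  return out;
}

const sortedSample = sortLongestFirst(SAMPLE_TABLE);
console.log(convertLongestMatch("abcab?b", sortedSample));
```

Ordering matters because many legacy sequences are prefixes of longer ones; if "a" were tried before "abc", the longer entry could never match.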

Changes

fm-sinhala.ts: Font detection and conversion logic (walk structured text, remap per-span)
fm-sinhala-data.ts: 1585 FM→Unicode mappings sorted longest-first
extract.ts: Use extractTextFromStructuredText() instead of stext.asText()
fm-sinhala.test.ts: 7 unit tests
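The per-span remapping described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of fm-sinhala.ts: the `Span` shape is a simplified stand-in for whatever MuPDF's structured text actually exposes, the font-name pattern (including the optional six-letter subset prefix such as `ABCDEF+`) is a guess at how FM fonts might be named in a PDF, and `isFmFont` and `remapSpans` are hypothetical names.

```typescript
// Hypothetical, simplified span: a run of text with a single font.
interface Span {
  fontName: string;
  text: string;
}

// Assumed naming pattern: optional subset prefix ("ABCDEF+"), then "FM",
// an optional space or hyphen, then a letter (FMAbhaya, FM Malithi, ...).
const FM_FONT_PATTERN = /^(?:[A-Z]{6}\+)?FM[\s-]?[A-Za-z]/;

function isFmFont(fontName: string): boolean {
  return FM_FONT_PATTERN.test(fontName);
}

// Convert only the spans set in an FM font; all other spans pass through
// untouched, so documents without FM fonts produce identical output.
function remapSpans(
  spans: readonly Span[],
  convert: (legacyText: string) => string,
): string {
  return spans
    .map((span) => (isFmFont(span.fontName) ? convert(span.text) : span.text))
    .join("");
}

// Usage with a stand-in converter (upper-casing) in place of the real table.
const sampleSpans: Span[] = [
  { fontName: "ABCDEF+FMAbhaya", text: "ab" },
  { fontName: "Helvetica", text: " ab" },
];
console.log(remapSpans(sampleSpans, (s) => s.toUpperCase()));
```

Working per span rather than per page means a page that mixes an FM font with a regular Latin font only has its FM runs converted.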

All 522 tests pass.

PDFs using FM-family fonts (FMSamantha, FMAbhaya, etc.) map Sinhala glyphs to Latin/ASCII codepoints, producing garbled text on extraction. Detect FM fonts per-span via MuPDF structured text and apply a 1585-entry longest-match-first conversion table to produce correct Sinhala Unicode. Non-FM pages are unaffected (unchanged code path).

nicpottier added 2 commits February 17, 2026 15:12

Normalize extracted whitespace in FM Sinhala text path

5bf0b16

nicpottier merged commit 907bdf6 into main Feb 17, 2026
1 check passed

nicpottier merged 2 commits into main from nicpottier/sinhala-pdf-extract
