Hebrew combining diacritics aren't positioned correctly #549

@marcstober

Thank you for keeping this open source project going!

I can't get Hebrew combining diacritics ("vowels") to appear correctly, even after trying some of the solutions proposed for similar issues.

For example, here is a Hebrew letter BET with a DAGESH (dot in the middle): בּ

And here is a screen shot from Word:

image

I've seen some proposed workarounds for similar issues in #490 and experimented with them, as shown in the code below. Here are the results, followed by why I think the workarounds don't work and why this should be tracked as a separate bug:

image

  1. One part of the solution in #490 is to use arabic_reshaper. I don't think this hurts, but I also don't think it affects Hebrew.
  2. Another part of the solution in #490 is to use bidi.algorithm.get_display. This reverses the order of the characters. I don't think it's actually correct to reverse the order of combining diacritics; they should still come after their base character in the string, even in RTL languages. (This might be something to fix in get_display.) This appears to be what causes the DAGESH to move from being misplaced on one side of the BET to being misplaced on the other.
  3. #490 also proposes using Unicode normalization. However, this doesn't work for Hebrew, because Hebrew is excluded from the Unicode composition algorithm (see here). Moreover, while the example of BET WITH DAGESH happens to have a composed character, there are only a handful of basic composed characters (my guess is only what's needed for Yiddish). Most of the combinations of Hebrew letters and diacritics needed for Biblical and other historic/literary/educational Hebrew purposes do not have composed characters. So there's still a need to render combining diacritics correctly, rather than relying on normalization to solve this.
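To illustrate item 3, here is a minimal standard-library check (the variable and helper names are just for illustration). It shows that NFC leaves BET + DAGESH as two code points, and that even the precomposed presentation form U+FB31 is decomposed by NFC rather than kept:

```python
import unicodedata

BET = "\u05D1"              # HEBREW LETTER BET
DAGESH = "\u05BC"           # HEBREW POINT DAGESH OR MAPIQ (combining mark)
BET_WITH_DAGESH = "\uFB31"  # HEBREW LETTER BET WITH DAGESH (presentation form)


def codepoints(s):
    """Return the code points of s as 'U+XXXX' strings for easy inspection."""
    return [f"U+{ord(c):04X}" for c in s]


decomposed = BET + DAGESH

# NFC does not compose the pair into U+FB31 ...
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)
# ... and it actually decomposes U+FB31 back into BET + DAGESH.
nfc_of_precomposed = unicodedata.normalize("NFC", BET_WITH_DAGESH)

print(codepoints(nfc_of_decomposed))   # ['U+05D1', 'U+05BC']
print(codepoints(nfc_of_precomposed))  # ['U+05D1', 'U+05BC']
```

So whatever the input form, the text handed to the PDF library after NFC still contains a separate combining mark that has to be positioned.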

In theory I'd love to contribute a fix for this, but I'm not sure I have the time or knowledge; maybe someone can point me in the right direction? In particular, I wonder whether this is an issue in FPDF2 itself or in the font subsetting from fonttools. From what I can tell, the PDF doesn't contain the X and Y position of each diacritic explicitly; rather, it contains the string and the font, and logic in the embedded font provides the exact position within the string. Is that correct?

Here's my sample code. Thanks in advance for your help!

import os
import unicodedata

from fpdf import FPDF

from arabic_reshaper import reshape
from bidi.algorithm import get_display

def debug_string(s, desc):
    print(f"*** {desc} ***")
    for c in s:
        print(c, ord(c), unicodedata.name(c))

def fix_text(some_text):
    debug_string(some_text, "original")

    # Try fixes from discussion on https://github.com/PyFPDF/fpdf2/pull/490
    some_text = unicodedata.normalize('NFC', some_text)

    debug_string(some_text, "normalized (NFC)")

    some_text = get_display(reshape(some_text))

    debug_string(some_text, "reshaper and bidi algorithm fixed")

    return some_text

pdf = FPDF(unit="in", format="Letter")
pdf.add_font("SBL_Hbrw", fname="SBL_Hbrw.ttf")
pdf.set_font("SBL_Hbrw", "", 30)

pdf.add_page()

some_text = "בּ"

pdf.set_xy(1, 1)
pdf.cell(1, 4, some_text)

some_text = fix_text(some_text)

pdf.set_xy(1, 2)
pdf.cell(1, 4, some_text)

filename = "hebrew.pdf"
pdf.output(filename)
os.startfile(filename)  # windows only
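For the reordering concern in item 2, a rough sketch of what I mean by keeping diacritics after their base character. The helper `reverse_keeping_marks` is hypothetical (it is not part of fpdf2 or python-bidi), and it only handles a string that is entirely right-to-left; it is not a replacement for the bidi algorithm, and I haven't verified that it changes what fpdf2 renders:

```python
import unicodedata


def reverse_keeping_marks(text):
    """Reverse text cluster by cluster, so that each combining mark
    stays directly after the base character it belongs to.

    Hypothetical helper; assumes the whole string is a single RTL run.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero combining class
        # for Hebrew points such as DAGESH.
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster with a base character
    return "".join(reversed(clusters))


# BET + DAGESH followed by ALEF, in logical order
logical = "\u05D1\u05BC\u05D0"
visual = reverse_keeping_marks(logical)
print([f"U+{ord(c):04X}" for c in visual])
```

A plain character-by-character reversal of the same string would put the DAGESH before the BET, which matches the misplacement I see after get_display.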

Environment

  • Windows
  • Python version 3.10.5
  • fpdf2 version 2.5.7
