fix: Preserve Unicode whitespace in inline formatting tags #148

Merged
Spenhouet merged 4 commits into Spenhouet:main from jhogstrom:fix/unicode-whitespace-upstream on Mar 8, 2026

@jhogstrom
Contributor

Problem

Unicode whitespace characters (like &nbsp; / \xa0) inside inline formatting tags are stripped during HTML-to-Markdown conversion, causing words to run together in the output.

Example

Confluence HTML:

property<em>&nbsp;JungerRoot</em> .

Expected markdown:

property *JungerRoot* .

Actual markdown (before fix):

property*JungerRoot* .

❌ Missing space - words run together!

Impact

  • 2,600+ instances in one medium-sized Confluence space alone (642 leading nbsp, 1991 trailing nbsp)
  • Affects all inline formatting tags: <em>, <strong>, <code>, <i>, <b>
  • Makes exported documentation harder to read
  • Confluence commonly uses &nbsp; for spacing control in formatted text

Root Cause

  1. BeautifulSoup correctly converts HTML entities: &nbsp; → \xa0 (the non-breaking space Unicode character)
  2. markdownify's chomp() function handles leading/trailing whitespace:
    • Regular space (' '): Moved to prefix/suffix ✅
    • Unicode whitespace (\xa0, \u2000-\u200a, etc.): Stripped entirely ❌
  3. Result: the space is lost completely

Evidence:

from markdownify import chomp

# Regular space - preserved in prefix
chomp(' text')  # → (' ', '', 'text')  ✅

# Non-breaking space - lost completely
chomp('\xa0text')  # → ('', '', 'text')  ❌
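
The two results above are what you would expect if the edges are checked only for a literal ASCII space before the text is stripped. The following is a minimal sketch under that assumption. `chomp_sketch` is a hypothetical stand-in written for illustration, not markdownify's source.

```python
from __future__ import annotations


def chomp_sketch(text: str) -> tuple[str, str, str]:
    """Simplified stand-in for chomp(): returns (prefix, suffix, stripped_text)."""
    # Only a literal ASCII space at either edge is carried over as prefix/suffix.
    prefix = " " if text and text[0] == " " else ""
    suffix = " " if text and text[-1] == " " else ""
    # str.strip() without arguments removes every Unicode whitespace character,
    # so an edge \xa0 is dropped here without ever reaching prefix/suffix.
    return prefix, suffix, text.strip()


print(chomp_sketch(" text"))     # ASCII space ends up in the prefix
print(chomp_sketch("\xa0text"))  # nbsp is stripped, prefix stays empty
```

Under this assumption the loss happens in two steps: the edge check does not recognise \xa0, and the strip removes it anyway.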

Solution

Normalize Unicode whitespace to regular ASCII spaces before passing text to parent converter methods. This allows markdownify's chomp() to handle the spaces correctly.

Implementation

  1. Added helper method _normalize_unicode_whitespace(text: str) -> str

    • Converts all Unicode whitespace to regular spaces
    • Preserves semantic whitespace (\n, \r, \t)
    • Handles: \xa0 (nbsp), \u2000-\u200a (various spaces), \u2028 (line separator), etc.
  2. Overrode 5 converter methods to normalize text parameter:

    • convert_em()
    • convert_strong()
    • convert_code()
    • convert_i()
    • convert_b()
  3. Comprehensive unit tests: 17 test cases covering:

    • Leading/trailing nbsp in all tag types
    • Multiple nbsp in sequence
    • Other Unicode whitespace (EM SPACE, THIN SPACE)
    • Preserves newlines and tabs
    • Real-world Confluence examples
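
Which characters the helper touches depends on Python's `str.isspace()`. The sketch below is for illustration only. `would_normalize` is a hypothetical name that mirrors the condition described above (whitespace, but not space, newline, carriage return, or tab), and the sample code points are chosen for this example.

```python
import unicodedata

KEPT = " \n\r\t"  # characters the helper is described as leaving alone


def would_normalize(ch: str) -> bool:
    """True if ch is whitespace per str.isspace() and not one of the kept characters."""
    return ch.isspace() and ch not in KEPT


for cp in (0x00A0, 0x2003, 0x2009, 0x2028, 0x200B, 0x000A, 0x0020):
    ch = chr(cp)
    # Control characters such as U+000A have no Unicode name, hence the default.
    name = unicodedata.name(ch, "<unnamed control>")
    print(f"U+{cp:04X} {name}: would_normalize={would_normalize(ch)}")
```

Note that U+200B (ZERO WIDTH SPACE) is not whitespace according to `str.isspace()`, so a check of this shape would leave it untouched.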

Code Changes

File: confluence_markdown_exporter/confluence.py

def _normalize_unicode_whitespace(self, text: str) -> str:
    """Normalize Unicode whitespace to regular spaces."""
    normalized = text
    for char in text:
        if char.isspace() and char not in " \n\r\t":
            normalized = normalized.replace(char, " ")
    return normalized

def convert_em(self, el: BeautifulSoup, text: str, parent_tags: list[str]) -> str:
    """Convert <em> tags, preserving spaces from Unicode whitespace entities."""
    text = self._normalize_unicode_whitespace(text)
    return super().convert_em(el, text, parent_tags)

# Similar overrides for convert_strong, convert_code, convert_i, convert_b
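
To see the effect of the override without installing markdownify, here is a self-contained sketch. Both function names are hypothetical, and `emphasize_sketch` only approximates the prefix/suffix handling described under Root Cause. It is not the parent class's `convert_em()`.

```python
def normalize_unicode_whitespace(text: str) -> str:
    """Single-pass variant of the helper: same result as the loop shown above."""
    return "".join(
        " " if ch.isspace() and ch not in " \n\r\t" else ch for ch in text
    )


def emphasize_sketch(text: str, normalize: bool) -> str:
    """Approximate an <em> conversion: keep ASCII edge spaces outside the markers."""
    if normalize:
        text = normalize_unicode_whitespace(text)
    prefix = " " if text.startswith(" ") else ""
    suffix = " " if text.endswith(" ") else ""
    body = text.strip()
    return f"{prefix}*{body}*{suffix}" if body else ""


# Inner text of <em>&nbsp;JungerRoot</em> after entity decoding.
inner = "\xa0JungerRoot"
before = "property" + emphasize_sketch(inner, normalize=False) + " ."
after = "property" + emphasize_sketch(inner, normalize=True) + " ."
print(repr(before))
print(repr(after))
```

Normalizing before delegating means the existing edge handling sees an ordinary space and moves it outside the emphasis markers, which is why the parent method needs no change.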

Testing

All 17 unit tests passing

pytest tests/unit/test_nbsp_fix.py -v
# 17 passed in 0.06s

Test coverage:

  • Leading nbsp: <em>&nbsp;text</em> → *text* (space preserved in context)
  • Trailing nbsp: <em>text&nbsp;</em> → *text* (space preserved in context)
  • Both sides: <em>&nbsp;text&nbsp;</em> → *text* (both preserved)
  • All tag types: <em>, <strong>, <code>, <i>, <b>
  • Real-world: property<em>&nbsp;JungerRoot</em> . → property *JungerRoot* .
  • Multiple nbsp: <em>&nbsp;&nbsp;text</em> → * text* (all spaces preserved)
  • Unicode spaces: EM SPACE (\u2003), THIN SPACE (\u2009) normalized correctly
  • Preserves newlines/tabs: <em>text\nmore</em> → unchanged
  • No modification when no Unicode whitespace present
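
The helper-level cases in this list can be expressed as a small table-driven check. This is a sketch only, not the contents of tests/unit/test_nbsp_fix.py. The `normalize` function and the `CASES` table are hypothetical and re-state the behaviour described in this PR.

```python
def normalize(text: str) -> str:
    """Replace Unicode whitespace with ASCII spaces; keep space, \\n, \\r and \\t."""
    return "".join(
        " " if ch.isspace() and ch not in " \n\r\t" else ch for ch in text
    )


# (input, expected) pairs mirroring the coverage list above.
CASES = [
    ("\xa0text", " text"),                  # leading nbsp
    ("text\xa0", "text "),                  # trailing nbsp
    ("\xa0\xa0text", "  text"),             # multiple nbsp in sequence
    ("\u2003text", " text"),                # EM SPACE
    ("text\u2009", "text "),                # THIN SPACE
    ("text\nmore\tend", "text\nmore\tend"), # newlines and tabs untouched
    ("plain text", "plain text"),           # nothing to normalize
]

for raw, expected in CASES:
    assert normalize(raw) == expected, (raw, expected)
print(f"{len(CASES)} cases passed")
```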

Performance Impact

Negligible - Simple character replacement, applied only to text content within inline formatting tags; cost grows roughly linearly with text length for typical input.

Backward Compatibility

  • ✅ No API changes
  • ✅ No configuration changes
  • ✅ Output format improved (spaces now preserved)
  • ✅ All existing functionality preserved
  • ✅ Only affects previously broken output

Files Changed

  • confluence_markdown_exporter/confluence.py (+57 lines)
  • tests/unit/test_nbsp_fix.py (+177 lines, new file)

Total: +234 lines

Alternative Approaches Considered

  1. Preprocess HTML globally - Would normalize all text, not just inline tags
  2. Post-process markdown - Would need complex regex to avoid false positives
  3. Override process_tag - Higher-level interception but more complex

Chosen approach (normalize text parameter in convert methods):

  • ✅ Minimal, surgical fix
  • ✅ Only affects inline formatting tags where issue occurs
  • ✅ Delegates to parent class (follows existing pattern)
  • ✅ Easy to test and maintain

jhogstrom and others added 4 commits February 22, 2026 18:17
Fixes issue where &nbsp; and other Unicode whitespace inside inline
tags (<em>, <strong>, <code>, <i>, <b>) were being stripped during
HTML-to-Markdown conversion, causing words to run together.

Root cause: markdownify's chomp() function strips Unicode whitespace
(\xa0, \u2000-\u200a, etc.) entirely instead of preserving it like
regular spaces. This affected 2600+ instances in MOSART space alone.

Solution: Normalize Unicode whitespace to regular ASCII spaces before
passing text to parent converter methods. This allows chomp() to
handle the spaces correctly.

Changes:
- Add _normalize_unicode_whitespace() helper method
- Override convert_em/strong/code/i/b to normalize text parameter
- Add comprehensive unit tests (17 test cases)

Example fix:
  Before: property<em>&nbsp;JungerRoot</em> . → property*JungerRoot* .
  After:  property<em>&nbsp;JungerRoot</em> . → property *JungerRoot* .

- Use r-string for docstring with backslashes (D301)
- Add from __future__ import annotations for forward references
- Add type annotations to all test functions (ANN201, ANN001)
- Add type annotation for MockPage.__init__ (ANN204)
- Fix Yoda condition (SIM300)
- Add F821 to test per-file-ignores for forward references

All 17 tests still passing.
@Spenhouet merged commit 21aed94 into Spenhouet:main on Mar 8, 2026
1 check passed
@Spenhouet
Owner

@jhogstrom Thanks for your work!
