fix: Preserve Unicode whitespace in inline formatting tags by jhogstrom · Pull Request #148 · Spenhouet/confluence-markdown-exporter

jhogstrom · 2026-02-22T17:18:14Z

Problem

Unicode whitespace characters (like &nbsp; / \xa0) inside inline formatting tags are stripped during HTML-to-Markdown conversion, causing words to run together in the output.

Example

Confluence HTML:

property<em>&nbsp;JungerRoot</em> .

Expected markdown:

property *JungerRoot* .

Actual markdown (before fix):

property*JungerRoot* .

❌ Missing space - words run together!

Impact

2,600+ instances in one medium-sized Confluence space alone (642 leading nbsp, 1991 trailing nbsp)
Affects all inline formatting tags: <em>, <strong>, <code>, <i>, <b>
Makes exported documentation harder to read
Confluence commonly uses &nbsp; for spacing control in formatted text

Root Cause

BeautifulSoup correctly converts HTML entities: &nbsp; → \xa0 (non-breaking space Unicode character)
markdownify's chomp() function handles leading/trailing whitespace:
- Regular space (' '): Moved to prefix/suffix ✅
- Unicode whitespace (\xa0, \u2000-\u200a, etc.): Stripped entirely ❌
Result: Space is completely lost

Evidence:

from markdownify import chomp

# Regular space - preserved in prefix
chomp(' text')  # → (' ', '', 'text')  ✅

# Non-breaking space - lost completely
chomp('\xa0text')  # → ('', '', 'text')  ❌
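The difference comes down to how Python's str.strip() treats whitespace. The sketch below is a simplified stand-in for chomp(), written on the assumption that the library compares the first and last character against a literal ' ' and then calls strip(). The name chomp_sketch is hypothetical and this is not the library source.

```python
def chomp_sketch(text: str) -> tuple[str, str, str]:
    """Simplified stand-in for markdownify's chomp() (an assumption, not the library code)."""
    # Only a literal ASCII space is remembered as prefix or suffix ...
    prefix = " " if text and text[0] == " " else ""
    suffix = " " if text and text[-1] == " " else ""
    # ... but str.strip() removes every Unicode whitespace character, \xa0 included.
    return prefix, suffix, text.strip()


print(chomp_sketch(" text"))      # (' ', '', 'text')
print(chomp_sketch("\xa0text"))   # ('', '', 'text')
```

Because the non-breaking space fails the literal ' ' comparison and is then removed by strip(), nothing is left to re-emit outside the emphasis markers.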

Solution

Normalize Unicode whitespace to regular ASCII spaces before passing text to parent converter methods. This allows markdownify's chomp() to handle the spaces correctly.

Implementation

Added helper method _normalize_unicode_whitespace(text: str) -> str
- Converts all Unicode whitespace to regular spaces
- Preserves semantic whitespace (\n, \r, \t)
- Handles: \xa0 (nbsp), \u2000-\u200a (various spaces), \u2028 (line separator), etc.
Overrode 5 converter methods to normalize text parameter:
- convert_em()
- convert_strong()
- convert_code()
- convert_i()
- convert_b()
Comprehensive unit tests: 17 test cases covering:
- Leading/trailing nbsp in all tag types
- Multiple nbsp in sequence
- Other Unicode whitespace (EM SPACE, THIN SPACE)
- Preserves newlines and tabs
- Real-world Confluence examples
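To give a feel for the kind of cases listed above, here is a minimal sketch of a few such tests against a standalone copy of the helper's logic. The function and test names are hypothetical and are not taken from tests/unit/test_nbsp_fix.py.

```python
def normalize_unicode_whitespace(text: str) -> str:
    """Standalone copy of the helper's logic, for illustration only."""
    normalized = text
    for char in text:
        # Replace any whitespace character other than space, newline, CR, and tab.
        if char.isspace() and char not in " \n\r\t":
            normalized = normalized.replace(char, " ")
    return normalized


def test_leading_and_trailing_nbsp() -> None:
    assert normalize_unicode_whitespace("\xa0text\xa0") == " text "


def test_multiple_nbsp_in_sequence() -> None:
    assert normalize_unicode_whitespace("\xa0\xa0text") == "  text"


def test_em_space_and_thin_space() -> None:
    assert normalize_unicode_whitespace("a\u2003b\u2009c") == "a b c"


def test_newlines_and_tabs_preserved() -> None:
    assert normalize_unicode_whitespace("text\nmore\tend") == "text\nmore\tend"


def test_plain_text_unchanged() -> None:
    assert normalize_unicode_whitespace("plain text") == "plain text"
```

These run under pytest as written; the real suite additionally exercises the converter methods themselves.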

Code Changes

File: confluence_markdown_exporter/confluence.py

def _normalize_unicode_whitespace(self, text: str) -> str:
    """Normalize Unicode whitespace to regular spaces."""
    normalized = text
    for char in text:
        if char.isspace() and char not in " \n\r\t":
            normalized = normalized.replace(char, " ")
    return normalized

def convert_em(self, el: BeautifulSoup, text: str, parent_tags: list[str]) -> str:
    """Convert <em> tags, preserving spaces from Unicode whitespace entities."""
    text = self._normalize_unicode_whitespace(text)
    return super().convert_em(el, text, parent_tags)

# Similar overrides for convert_strong, convert_code, convert_i, convert_b
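For illustration, the remaining overrides could follow the same shape as convert_em. The sketch below runs without markdownify installed by using a hypothetical StubParentConverter and a simplified chomp(); both are assumptions that only approximate the library's behavior, and SketchConverter is not the class from this PR.

```python
def chomp(text: str) -> tuple[str, str, str]:
    """Simplified stand-in for markdownify's chomp() (assumption)."""
    prefix = " " if text and text[0] == " " else ""
    suffix = " " if text and text[-1] == " " else ""
    return prefix, suffix, text.strip()


class StubParentConverter:
    """Hypothetical parent class; the real one comes from markdownify."""

    def _inline(self, text: str, markup: str) -> str:
        prefix, suffix, body = chomp(text)
        return f"{prefix}{markup}{body}{markup}{suffix}" if body else ""

    def convert_strong(self, el, text, parent_tags):
        return self._inline(text, "**")

    def convert_code(self, el, text, parent_tags):
        return self._inline(text, "`")

    def convert_i(self, el, text, parent_tags):
        return self._inline(text, "*")

    def convert_b(self, el, text, parent_tags):
        return self._inline(text, "**")


class SketchConverter(StubParentConverter):
    def _normalize_unicode_whitespace(self, text: str) -> str:
        # Same logic as the helper shown above.
        normalized = text
        for char in text:
            if char.isspace() and char not in " \n\r\t":
                normalized = normalized.replace(char, " ")
        return normalized

    # Each override normalizes first, then delegates to the parent.
    def convert_strong(self, el, text, parent_tags):
        text = self._normalize_unicode_whitespace(text)
        return super().convert_strong(el, text, parent_tags)

    def convert_code(self, el, text, parent_tags):
        text = self._normalize_unicode_whitespace(text)
        return super().convert_code(el, text, parent_tags)

    def convert_i(self, el, text, parent_tags):
        text = self._normalize_unicode_whitespace(text)
        return super().convert_i(el, text, parent_tags)

    def convert_b(self, el, text, parent_tags):
        text = self._normalize_unicode_whitespace(text)
        return super().convert_b(el, text, parent_tags)


converter = SketchConverter()
print(repr(converter.convert_strong(None, "\xa0bold", [])))  # ' **bold**'
print(repr(converter.convert_code(None, "code\xa0", [])))    # '`code` '
```

With the stub parent alone, the same inputs produce '**bold**' and '`code`', which is the run-together output described in the Problem section.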

Testing

✅ All 17 unit tests passing

pytest tests/unit/test_nbsp_fix.py -v
# 17 passed in 0.06s

Test coverage:

Leading nbsp: <em>&nbsp;text</em> → *text* (space preserved in context)
Trailing nbsp: <em>text&nbsp;</em> → *text* (space preserved in context)
Both sides: <em>&nbsp;text&nbsp;</em> → *text* (both preserved)
All tag types: <em>, <strong>, <code>, <i>, <b>
Real-world: property<em>&nbsp;JungerRoot</em> . → property *JungerRoot* . ✅
Multiple nbsp: <em>&nbsp;&nbsp;text</em> → * text* (all spaces preserved)
Unicode spaces: EM SPACE (\u2003), THIN SPACE (\u2009) normalized correctly
Preserves newlines/tabs: text\nmore → unchanged
No modification when no Unicode whitespace present

Performance Impact

Negligible - simple character replacement, applied only to text content within inline tags.
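One caveat: the helper under Code Changes calls str.replace once for each matching character it encounters, so its cost grows with both the text length and the number of Unicode whitespace characters. If strictly linear behavior were wanted, a single-pass variant could look like the sketch below. This is not part of the PR; the name normalize_single_pass is hypothetical, and the pattern is intended to match the same characters as the isspace() check.

```python
import re

# Any whitespace character except space, newline, carriage return, and tab.
# [^\S ...] reads as "not non-whitespace, and not one of the listed characters".
_UNICODE_WS = re.compile(r"[^\S \n\r\t]")


def normalize_single_pass(text: str) -> str:
    """Replace Unicode whitespace with ASCII spaces in one pass over the text."""
    return _UNICODE_WS.sub(" ", text)


print(repr(normalize_single_pass("\xa0a\u2003b\u2009c")))  # ' a b c'
print(repr(normalize_single_pass("a\n\tb\r")))             # 'a\n\tb\r'
```

For the short strings found inside inline tags the difference is unlikely to be measurable, so the simpler loop is a reasonable choice.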

Backward Compatibility

✅ No API changes
✅ No configuration changes
✅ Output format improved (spaces now preserved)
✅ All existing functionality preserved
✅ Only affects previously broken output

Files Changed

confluence_markdown_exporter/confluence.py (+57 lines)
tests/unit/test_nbsp_fix.py (+177 lines, new file)

Total: +234 lines

Alternative Approaches Considered

Preprocess HTML globally - Would normalize all text, not just inline tags
Post-process markdown - Would need complex regex to avoid false positives
Override process_tag - Higher-level interception but more complex

Chosen approach (normalize text parameter in convert methods):

✅ Minimal, surgical fix
✅ Only affects inline formatting tags where issue occurs
✅ Delegates to parent class (follows existing pattern)
✅ Easy to test and maintain

References

Confluence uses &nbsp; extensively for formatting control
markdownify chomp() source: https://github.com/matthewwithanm/python-markdownify/blob/develop/markdownify/__init__.py#L60-L70
Unicode whitespace characters: https://en.wikipedia.org/wiki/Whitespace_character#Unicode

Fixes issue where &nbsp; and other Unicode whitespace inside inline tags (<em>, <strong>, <code>, <i>, <b>) were being stripped during HTML-to-Markdown conversion, causing words to run together.

Root cause: markdownify's chomp() function strips Unicode whitespace (\xa0, \u2000-\u200a, etc.) entirely instead of preserving it like regular spaces. This affected 2600+ instances in MOSART space alone.

Solution: Normalize Unicode whitespace to regular ASCII spaces before passing text to parent converter methods. This allows chomp() to handle the spaces correctly.

Changes:
- Add _normalize_unicode_whitespace() helper method
- Override convert_em/strong/code/i/b to normalize text parameter
- Add comprehensive unit tests (17 test cases)

Example fix:
Before: property<em>&nbsp;JungerRoot</em> . → property*JungerRoot* .
After: property<em>&nbsp;JungerRoot</em> . → property *JungerRoot* .

- Use r-string for docstring with backslashes (D301)
- Add from __future__ import annotations for forward references
- Add type annotations to all test functions (ANN201, ANN001)
- Add type annotation for MockPage.__init__ (ANN204)
- Fix Yoda condition (SIM300)
- Add F821 to test per-file-ignores for forward references

All 17 tests still passing.

Spenhouet · 2026-03-08T21:05:18Z

@jhogstrom Thanks for your work!

jhogstrom and others added 4 commits February 22, 2026 18:17

Fix typing

962bb8c

Merge branch 'main' into pr/jhogstrom/148

51f6f07

Spenhouet merged commit 21aed94 into Spenhouet:main Mar 8, 2026
1 check passed

Spenhouet merged 4 commits into Spenhouet:main from jhogstrom:fix/unicode-whitespace-upstream
