fix: Preserve Unicode whitespace in inline formatting tags #148
Merged
Spenhouet merged 4 commits into Spenhouet:main on Mar 8, 2026
Fixes an issue where &nbsp; and other Unicode whitespace inside inline tags (<em>, <strong>, <code>, <i>, <b>) was stripped during HTML-to-Markdown conversion, causing words to run together.
Root cause: markdownify's chomp() function strips Unicode whitespace (\xa0, \u2000-\u200a, etc.) entirely instead of preserving it like regular spaces. This affected 2600+ instances in the MOSART space alone.
Solution: Normalize Unicode whitespace to regular ASCII spaces before passing text to parent converter methods. This allows chomp() to handle the spaces correctly.
Changes:
- Add _normalize_unicode_whitespace() helper method
- Override convert_em/strong/code/i/b to normalize text parameter
- Add comprehensive unit tests (17 test cases)
Example fix:
Before: property<em>&nbsp;JungerRoot</em> . → property*JungerRoot* .
After: property<em>&nbsp;JungerRoot</em> . → property *JungerRoot* .
- Use r-string for docstring with backslashes (D301)
- Add from __future__ import annotations for forward references
- Add type annotations to all test functions (ANN201, ANN001)
- Add type annotation for MockPage.__init__ (ANN204)
- Fix Yoda condition (SIM300)
- Add F821 to test per-file-ignores for forward references
All 17 tests still passing.
Owner commented:
@jhogstrom Thanks for your work!
Problem
Unicode whitespace characters (like &nbsp; / \xa0) inside inline formatting tags are being stripped during HTML-to-Markdown conversion, causing words to run together in the output.
Example
Confluence HTML: property<em>&nbsp;JungerRoot</em> .
Expected markdown: property *JungerRoot* .
Actual markdown (before fix): property*JungerRoot* .
❌ Missing space - words run together!
Impact
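- Affects inline formatting tags: <em>, <strong>, <code>, <i>, <b>
- &nbsp; is used for spacing control in formatted text
Root Cause
- &nbsp; → \xa0 (non-breaking space Unicode character)
- markdownify's chomp() function handles leading/trailing whitespace:
  - Regular spaces (' '): Moved to prefix/suffix ✅
  - Unicode whitespace (\xa0, \u2000-\u200a, etc.): Stripped entirely ❌
Evidence: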
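A minimal sketch of the behaviour described above, assuming chomp() works as stated: a leading or trailing ASCII space is moved to a prefix/suffix, and everything else is removed by str.strip(). The names chomp_like and render_em are hypothetical stand-ins written for illustration; nothing here is imported from markdownify.

```python
def chomp_like(text: str) -> tuple[str, str, str]:
    """Stand-in for the described chomp() behaviour (assumption, not library code)."""
    # Only a literal ASCII space is kept as prefix/suffix.
    prefix = " " if text and text[0] == " " else ""
    suffix = " " if text and text[-1] == " " else ""
    # str.strip() removes all Unicode whitespace, including \xa0.
    return prefix, suffix, text.strip()


def render_em(text: str) -> str:
    """Hypothetical emphasis renderer that relies on chomp_like()."""
    prefix, suffix, body = chomp_like(text)
    return f"{prefix}*{body}*{suffix}"


# ASCII space inside the tag: the space survives outside the markers.
ascii_case = "property" + render_em(" JungerRoot") + " ."
# Non-breaking space inside the tag: the space is lost.
nbsp_case = "property" + render_em("\xa0JungerRoot") + " ."

print(ascii_case)  # property *JungerRoot* .
print(nbsp_case)   # property*JungerRoot* .
```

The difference comes from the equality check against ' ': \xa0 never matches it, yet it still counts as whitespace for str.strip().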
Solution
Normalize Unicode whitespace to regular ASCII spaces before passing text to parent converter methods. This allows markdownify's chomp() to handle the spaces correctly.
Implementation
- Added helper method _normalize_unicode_whitespace(text: str) -> str
  - Leaves structural whitespace (\n, \r, \t) unchanged
  - Converts \xa0 (nbsp), \u2000-\u200a (various spaces), \u2028 (line separator), etc. to regular spaces
- Overrode 5 converter methods to normalize text parameter: convert_em(), convert_strong(), convert_code(), convert_i(), convert_b()
- Comprehensive unit tests: 17 test cases (listed under Test coverage)
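The implementation steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: BaseConverter is a hypothetical stand-in for the parent converter class, its convert_em signature is simplified, and the character set covers only the code points named in the description (the real helper may handle more).

```python
import re

# Assumption: only the code points named in the description are covered.
_UNICODE_SPACES = re.compile(r"[\u00a0\u2000-\u200a\u2028]")


def normalize_unicode_whitespace(text: str) -> str:
    r"""Replace the listed Unicode spaces with ASCII spaces; \n, \r, \t stay as-is."""
    return _UNICODE_SPACES.sub(" ", text)


class BaseConverter:
    """Hypothetical parent class mimicking the described chomp() handling."""

    def convert_em(self, el, text):
        prefix = " " if text.startswith(" ") else ""
        suffix = " " if text.endswith(" ") else ""
        body = text.strip()
        return f"{prefix}*{body}*{suffix}" if body else ""


class NormalizingConverter(BaseConverter):
    """Override pattern: normalize the text, then defer to the parent."""

    def convert_em(self, el, text):
        return super().convert_em(el, normalize_unicode_whitespace(text))


converter = NormalizingConverter()
result = "property" + converter.convert_em(None, "\xa0JungerRoot") + " ."
print(result)  # property *JungerRoot* .
```

The same one-line override would be repeated for strong, code, i and b. Normalizing before the parent call keeps the parent's own prefix/suffix logic untouched.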
Code Changes
File: confluence_markdown_exporter/confluence.py
Testing
✅ All 17 unit tests passing
pytest tests/unit/test_nbsp_fix.py -v # 17 passed in 0.06s
Test coverage:
- Leading: <em>&nbsp;text</em> → *text* (space preserved in context)
- Trailing: <em>text&nbsp;</em> → *text* (space preserved in context)
- Both: <em>&nbsp;text&nbsp;</em> → *text* (both preserved)
- All inline tags: <em>, <strong>, <code>, <i>, <b>
- property<em>&nbsp;JungerRoot</em> . → property *JungerRoot* . ✅
- <em> text</em> → * text* (all spaces preserved)
- EM SPACE (\u2003), THIN SPACE (\u2009) normalized correctly
- <em>text\nmore</em> → unchanged
Performance Impact
None - Simple character replacement, O(n) on text content within tags only.
Backward Compatibility
Files Changed
- confluence_markdown_exporter/confluence.py (+57 lines)
- tests/unit/test_nbsp_fix.py (+177 lines, new file)
Total: +234 lines
Alternative Approaches Considered
Chosen approach (normalize text parameter in convert methods):
References
- Confluence uses &nbsp; extensively for formatting control
- chomp() source: https://github.com/matthewwithanm/python-markdownify/blob/develop/markdownify/__init__.py#L60-L70