Description
The docx skill's validate.py and pack.py scripts fail on Windows with UnicodeEncodeError / UnicodeDecodeError because they rely on the default system encoding (cp1252 on Windows) instead of explicitly using UTF-8.
There are two independent failure modes:
1. Output encoding crash (print statements with Unicode characters)
validators/docx.py lines 249 and 431 use the → (U+2192) arrow character in print statements:
# Line 249
print(f"\nParagraphs: {original_count} → {new_count} ({diff_str})")
# Line 431
f" Repaired: {xml_file.name}: durableId {durable_id} → {new_id}"
On Windows (cp1252 default encoding), this crashes with:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 17: character maps to <undefined>
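The failure can be reproduced on any OS by printing to an in-memory stream that uses cp1252, which stands in for the Windows console here. This is a minimal sketch; `try_print` is a hypothetical helper written for illustration and is not part of the skill's scripts.

```python
import io


def try_print(text: str, encoding: str) -> str:
    """Print `text` to an in-memory text stream using `encoding`; report the outcome."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        print(text, file=stream)
        stream.flush()
    except UnicodeEncodeError as exc:
        return f"failed: {exc.reason}"
    return "ok"


# ASCII-only text encodes under cp1252; the U+2192 arrow does not.
print("ascii, cp1252:", try_print("Paragraphs: 3 -> 5", "cp1252"))
print("arrow, cp1252:", try_print("Paragraphs: 3 \u2192 5", "cp1252"))
print("arrow, utf-8: ", try_print("Paragraphs: 3 \u2192 5", "utf-8"))
```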
2. Input encoding crash (reading XML files)
validators/base.py line 763 opens XML files without specifying encoding:
with open(xml_file, "r") as f:
    xml_doc = lxml.etree.parse(f)
When a UTF-8 XML file contains a byte that cp1252 leaves undefined, such as 0x8f (e.g., in numbering.xml generated by docx-js), this crashes with:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1410: character maps to <undefined>
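A small reproduction, independent of Windows, is to write UTF-8 XML containing a character whose encoded form includes byte 0x8f and then read it back in text mode as cp1252. The sketch below uses U+00CF (UTF-8 bytes C3 8F) as an assumed example character; the actual content of the failing numbering.xml is not known, and `read_text` is a hypothetical helper.

```python
import os
import tempfile


def read_text(path: str, encoding: str) -> str:
    """Read `path` in text mode with `encoding`; report success or the decode failure."""
    try:
        with open(path, "r", encoding=encoding) as f:
            f.read()
    except UnicodeDecodeError as exc:
        return f"failed: {exc.reason}"
    return "ok"


# U+00CF encodes to the UTF-8 bytes C3 8F; cp1252 has no mapping for 0x8f.
xml_bytes = '<?xml version="1.0" encoding="UTF-8"?><root val="\u00cf"/>'.encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "numbering.xml")
    with open(sample_path, "wb") as out:
        out.write(xml_bytes)
    as_cp1252 = read_text(sample_path, "cp1252")
    as_utf8 = read_text(sample_path, "utf-8")

print("cp1252:", as_cp1252)
print("utf-8: ", as_utf8)
```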
Expected behavior
Scripts should work on Windows without requiring PYTHONIOENCODING to be set. XML files should be decoded as UTF-8, which the XML spec makes the default when a file declares no other encoding.
Suggested fix
- Line 763 of validators/base.py: Add encoding="utf-8" to the open() call:
  with open(xml_file, "r", encoding="utf-8") as f:
- Lines 249 and 431 of validators/docx.py: Either replace → with ->, or make sure the output stream is UTF-8, including adding encoding="utf-8" to any file-based output. (The PYTHONIOENCODING=utf-8 env var workaround fixes this for terminal output, but the root cause is using non-ASCII in print statements without ensuring the output stream supports it.)
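One possible shape for both fixes, as a sketch rather than the maintainers' intended patch: switch stdout/stderr to UTF-8 at startup, and hand the XML parser a binary stream so it decodes according to the XML declaration instead of the locale. The helper names are hypothetical, and the standard-library xml.etree.ElementTree is used as a stand-in so the example runs without lxml; lxml.etree.parse also accepts a binary file object or a path.

```python
import io
import sys
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree in this sketch


def force_utf8_stdio() -> None:
    """Switch stdout/stderr to UTF-8 when the stream supports reconfigure() (Python 3.7+)."""
    for stream in (sys.stdout, sys.stderr):
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is not None:
            reconfigure(encoding="utf-8")


def parse_xml_binary(source) -> ET.ElementTree:
    """Parse from a binary stream; the parser picks the encoding from the XML declaration."""
    return ET.parse(source)


force_utf8_stdio()

# In the real script this would be open(xml_file, "rb") rather than an in-memory buffer.
doc = '<?xml version="1.0" encoding="UTF-8"?><root val="\u2192"/>'.encode("utf-8")
tree = parse_xml_binary(io.BytesIO(doc))
print(f"val = {tree.getroot().get('val')}")
```

Reading in binary mode avoids hard-coding UTF-8, which matters if a file ever declares a different encoding; adding encoding="utf-8" to the existing open() call is the smaller change and should be enough for UTF-8 input.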
Workarounds
- Set PYTHONIOENCODING=utf-8 in the environment (fixes issue 1)
- Use --validate false on pack.py (bypasses issue 2)
Environment
- OS: Windows 11 Enterprise (Git Bash)
- Python: 3.x (Anaconda, skill-docx conda env)
- Skills repo commit: b0cbd3d
- Document created with: docx-js (Node.js)