[Bug] docx skill: validate.py and pack.py crash on Windows due to missing UTF-8 encoding in file I/O and print output #712

@akcd1

Description

The docx skill's validate.py and pack.py scripts fail on Windows with UnicodeEncodeError / UnicodeDecodeError because they rely on the default system encoding (cp1252 on Windows) instead of explicitly using UTF-8.

There are two independent failure modes:

1. Output encoding crash (print statements with Unicode characters)

validators/docx.py lines 249 and 431 use the → (U+2192) arrow character in print statements:

# Line 249
print(f"\nParagraphs: {original_count} → {new_count} ({diff_str})")
# Line 431
f"  Repaired: {xml_file.name}: durableId {durable_id} → {new_id}"

On Windows (cp1252 default encoding), this crashes with:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 17: character maps to <undefined>
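
The failure does not need Windows to reproduce. The sketch below writes through a text stream with an explicit encoding, which is roughly what print() does against stdout. The helper name `print_to_stream` is hypothetical and not part of the skill.

```python
import io


def print_to_stream(text: str, encoding: str) -> bytes:
    """Print `text` through a text stream using `encoding`; return the bytes written."""
    raw = io.BytesIO()
    # newline="\n" keeps the output identical across platforms.
    stream = io.TextIOWrapper(raw, encoding=encoding, errors="strict", newline="\n")
    print(text, file=stream)  # raises UnicodeEncodeError if `encoding` lacks a character
    stream.flush()
    stream.detach()  # release `raw` so it stays readable
    return raw.getvalue()


# ASCII "->" is encodable everywhere, including cp1252.
ascii_bytes = print_to_stream("Paragraphs: 3 -> 5", "cp1252")

# The arrow is fine on a UTF-8 stream ...
utf8_bytes = print_to_stream("Paragraphs: 3 \u2192 5", "utf-8")

# ... but not on a cp1252 stream.
try:
    print_to_stream("Paragraphs: 3 \u2192 5", "cp1252")
except UnicodeEncodeError as exc:
    cp1252_error = str(exc)
```

For the real stdout, calling `sys.stdout.reconfigure(encoding="utf-8")` at script start-up (Python 3.7+) is one way to get the UTF-8 behavior, assuming stdout is a regular text stream.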

2. Input encoding crash (reading XML files)

validators/base.py line 763 opens XML files without specifying encoding:

with open(xml_file, "r") as f:
    xml_doc = lxml.etree.parse(f)

When an XML file contains bytes outside the cp1252 range (e.g., numbering.xml generated by docx-js), this crashes with:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1410: character maps to <undefined>
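
Byte 0x8f has no mapping in Python's cp1252 codec, so any UTF-8 sequence containing it fails in text mode. The sketch below uses "●" (U+25CF), whose UTF-8 encoding is E2 97 8F, purely as an illustration. The XML fragment is made up and is not taken from the failing numbering.xml, and `decode_or_error` is a hypothetical helper.

```python
def decode_or_error(data: bytes, encoding: str) -> str:
    """Decode `data`, or describe the first byte the codec cannot map."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__}: byte 0x{data[exc.start]:02x}"


# Illustrative fragment only; the real file contents are not shown in this report.
sample = '<w:lvlText w:val="\u25cf"/>'.encode("utf-8")

as_cp1252 = decode_or_error(sample, "cp1252")  # what open(path, "r") does on Windows
as_utf8 = decode_or_error(sample, "utf-8")     # what open(path, "r", encoding="utf-8") does
```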

Expected behavior

Scripts should work on Windows without requiring PYTHONIOENCODING to be set. XML files should be read as UTF-8 (the XML spec's default when no other encoding is declared), not in the system code page.

Suggested fix

  1. Line 763 of validators/base.py: Add encoding="utf-8" to the open() call:

    with open(xml_file, "r", encoding="utf-8") as f:

  2. Lines 249 and 431 of validators/docx.py: Either replace → with ->, or add encoding="utf-8" to any file-based output. (The PYTHONIOENCODING=utf-8 environment variable works around this for terminal output, but the root cause is printing non-ASCII characters without ensuring the output stream supports them.)
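
A possible variant of fix 1 is to open the file in binary mode and let the XML parser pick the encoding from the declaration or byte order mark, rather than hard-coding it. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in so it runs without lxml installed; lxml.etree.parse also accepts a file object opened in binary mode. `parse_xml_file` and `demo` are hypothetical names, and the XML content is illustrative.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_xml_file(path: Path) -> ET.ElementTree:
    """Parse XML from bytes so the parser, not the OS code page, decides the encoding."""
    with open(path, "rb") as f:
        return ET.parse(f)


def demo() -> str:
    """Round-trip a non-cp1252 character through a temporary UTF-8 XML file."""
    with tempfile.TemporaryDirectory() as tmp:
        xml_path = Path(tmp) / "numbering.xml"
        content = '<?xml version="1.0" encoding="UTF-8"?><lvlText val="\u25cf"/>'
        xml_path.write_bytes(content.encode("utf-8"))
        root = parse_xml_file(xml_path).getroot()
        return root.get("val")


bullet = demo()
```

Binary mode sidesteps the default-encoding question entirely, at the cost of a slightly larger change than adding one keyword argument.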

Workarounds

  • Set PYTHONIOENCODING=utf-8 in the environment (fixes issue 1)
  • Use --validate false on pack.py (bypasses issue 2)
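
The first workaround can be checked without touching the skill's scripts. The sketch below runs a child interpreter that prints the arrow under two PYTHONIOENCODING values; `run_child` is a hypothetical helper, and the child command is a stand-in for validate.py or pack.py.

```python
import os
import subprocess
import sys


def run_child(io_encoding: str) -> subprocess.CompletedProcess:
    """Run a child Python that prints U+2192 with stdio forced to `io_encoding`."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    return subprocess.run(
        [sys.executable, "-c", "print('\\u2192')"],
        env=env,
        capture_output=True,  # stdout/stderr are returned as bytes
    )


failing = run_child("cp1252")  # mimics the Windows default described above
working = run_child("utf-8")   # the workaround
```

PYTHONIOENCODING only affects stdin, stdout, and stderr, which is why it addresses issue 1 but not the open() call in issue 2. Python's UTF-8 mode (PYTHONUTF8=1, available since Python 3.7) also changes the default encoding used by open(), so it may cover both, though that was not tested in this report.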

Environment

  • OS: Windows 11 Enterprise (Git Bash)
  • Python: 3.x (Anaconda, skill-docx conda env)
  • Skills repo commit: b0cbd3d
  • Document created with: docx-js (Node.js)
