16 changes: 12 additions & 4 deletions packages/markitdown/src/markitdown/converters/_docx_converter.py
@@ -84,7 +84,15 @@ def convert(

         style_map = kwargs.get("style_map", None)
         pre_process_stream = pre_process_docx(file_stream)
-        return self._html_converter.convert_string(
-            mammoth.convert_to_html(pre_process_stream, style_map=style_map).value,
-            **kwargs,
-        )
+
+        # Patch: handle missing styleId safely
+        try:
+            html = mammoth.convert_to_html(pre_process_stream, style_map=style_map).value
+        except KeyError as e:
+            if str(e) == "'w:styleId'":
+                # Ignore missing style IDs and convert anyway
+                html = mammoth.convert_to_html(pre_process_stream, style_map=style_map, ignore_empty_styles=True).value
+            else:
+                raise


        # Patch: handle missing styleId safely
        try:
            html = mammoth.convert_to_html(pre_process_stream, style_map=style_map).value
        except KeyError as e:
            if str(e) == "'w:styleId'":
                # Ignore missing style IDs and convert anyway
                html = mammoth.convert_to_html(pre_process_stream, style_map=style_map, ignore_empty_styles=True).value
            else:
                raise

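This code uses a try/except block to handle the KeyError that can be raised when w:styleId is missing. That is good defensive programming. A few improvements are worth considering:

  1. More explicit exception handling: checking whether the exception message equals 'w:styleId' may stop working if the locale or the mammoth version changes. If possible, determine whether the style ID is missing by inspecting the element's attributes instead.
  2. Logging: consider adding some logging so that it is possible to trace what happened when this exceptional case is handled.
  3. Documentation: the code has comments, but describing how this special case is handled in a docstring or in the developer documentation would be more helpful.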
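
To make points 1 and 2 concrete, here is a minimal sketch, not the PR's actual code. It compares the key stored on the exception rather than its string form, logs the fallback, and rewinds the stream before retrying. The names `is_missing_style_id`, `convert_with_fallback`, `primary`, and `fallback` are hypothetical; the two converters are passed in as callables so the sketch makes no assumption about which mammoth options exist.

```python
import logging

logger = logging.getLogger(__name__)

MISSING_STYLE_ID_KEY = "w:styleId"


def is_missing_style_id(exc: KeyError) -> bool:
    """Return True if the KeyError was raised for the w:styleId key."""
    # KeyError keeps the missing key in args[0]; comparing it directly
    # avoids depending on how str(exc) happens to quote the key.
    return bool(exc.args) and exc.args[0] == MISSING_STYLE_ID_KEY


def convert_with_fallback(primary, fallback, stream):
    """Run primary(stream); on a missing w:styleId, log and run fallback(stream).

    Both callables are hypothetical stand-ins for the real conversion calls.
    """
    try:
        return primary(stream)
    except KeyError as exc:
        if not is_missing_style_id(exc):
            raise
        logger.warning(
            "DOCX style is missing w:styleId; retrying with fallback conversion",
            exc_info=True,
        )
        # The first attempt may have consumed part of the stream.
        stream.seek(0)
        return fallback(stream)
```

Any other KeyError is re-raised unchanged, which matches the `else: raise` branch in the patch.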

Overall, this fix is effective and makes the code more robust.
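
If inspecting the element attributes (point 1) turns out to be practical, one possible approach is to look at `word/styles.xml` before converting. The sketch below uses only the standard library and assumes the input is a seekable binary stream containing a DOCX (ZIP) archive; `count_styles_missing_id` is a hypothetical helper, not part of markitdown or mammoth.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace, bound to the "w:" prefix in DOCX files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_styles_missing_id(docx_stream) -> int:
    """Count <w:style> elements in word/styles.xml that lack a w:styleId."""
    with zipfile.ZipFile(docx_stream) as archive:
        if "word/styles.xml" not in archive.namelist():
            missing = 0
        else:
            root = ET.fromstring(archive.read("word/styles.xml"))
            # ElementTree exposes namespaced names as "{namespace}local".
            missing = sum(
                1
                for style in root.iter(f"{{{W_NS}}}style")
                if f"{{{W_NS}}}styleId" not in style.attrib
            )
    # Rewind so the caller can hand the same stream to the converter.
    docx_stream.seek(0)
    return missing
```

A caller could log a warning when the count is non-zero and then decide how to convert, instead of reacting to the exception text after the fact.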


+        return self._html_converter.convert_string(html, **kwargs)