Collaborator

@skytin1004 skytin1004 commented Sep 15, 2025

Purpose

Improve detection and handling of fenced code blocks in Markdown so that:

  • Code chunks are never split during chunking.
  • Placeholder insertion/restoration is stable and order-preserving.
  • Single/unmatched fences (e.g., a lone ``` in text) do not corrupt placeholder order or chunk boundaries.
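To make the second goal concrete, here is a minimal sketch of order-preserving placeholder insertion and restoration. It assumes code-block line spans have already been detected; the function names and the `@@CODE_BLOCK_n@@` placeholder format are invented for illustration and are not taken from the Co-op Translator code.

```python
def replace_spans(lines, spans):
    """Swap each (start, end) line span for a single placeholder line.

    `spans` are half-open, sorted, non-overlapping line ranges.
    The placeholder format is hypothetical, chosen for this sketch.
    """
    out, mapping, cursor = [], {}, 0
    for index, (start, end) in enumerate(spans):
        out.extend(lines[cursor:start])      # text before the block
        key = f"@@CODE_BLOCK_{index}@@"
        mapping[key] = lines[start:end]      # remember the original block
        out.append(key)
        cursor = end
    out.extend(lines[cursor:])               # text after the last block
    return out, mapping


def restore_spans(lines, mapping):
    """Put the original code lines back wherever a placeholder appears."""
    restored = []
    for line in lines:
        restored.extend(mapping.get(line.strip(), [line]))
    return restored


doc = [
    "intro",
    "```python", "print('a')", "```",
    "a lone ``` in text",
    "~~~", "x", "~~~",
    "end",
]
masked, mapping = replace_spans(doc, [(1, 4), (5, 8)])
assert restore_spans(masked, mapping) == doc
```

Because placeholders are numbered in document order and the lone fence is never part of a span, the round trip returns the original lines unchanged.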

Description

This PR replaces regex-based code-block detection with a spec-compliant parser using markdown-it-py.

  • The regex r"```[\s\S]*?```", previously used in replace_code_blocks() and indirectly in chunking, fails on unmatched fences and on edge cases such as ~~~ fences, variable fence lengths, and info strings.
  • Using markdown-it-py yields consistent, spec-conformant behavior and fixes placeholder order issues and accidental code splits.
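As a rough illustration of the parser-based approach, the sketch below collects the line spans of top-level fenced blocks from markdown-it-py tokens. The function name and the closing-fence check are assumptions, not the PR's actual code. The check is there because CommonMark lets an unclosed fence run to the end of the document, so the parser still emits a `fence` token for it.

```python
import re

from markdown_it import MarkdownIt


def fenced_block_spans(text):
    """Return (start, end, info) for each closed, top-level fenced block.

    Line numbers are 0-based and half-open, taken from token.map.
    Hypothetical helper, not the implementation merged in this PR.
    """
    lines = re.split(r"\r\n?|\n", text)
    spans = []
    for token in MarkdownIt("commonmark").parse(text):
        # Nested fences (in lists or blockquotes) are skipped in this sketch.
        if token.type != "fence" or token.level != 0 or token.map is None:
            continue
        start, end = token.map
        # An unclosed fence is still a fence token, so confirm that the last
        # mapped line is a closing fence of the same character and length.
        last = lines[end - 1].strip()
        closed = (
            end - start >= 2
            and len(last) >= len(token.markup)
            and set(last) == {token.markup[0]}
        )
        if closed:
            spans.append((start, end, token.info.strip()))
    return spans


sample = (
    "Intro\n\n"
    "```python\nprint('hi')\n```\n\n"
    "A lone fence:\n\n"
    "```\nnever closed\n"
)
spans = fenced_block_spans(sample)
```

With this input only the closed `python` block is reported; the trailing unclosed fence is left out of `spans` and would stay in the text as-is.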

Before:

  • A single unmatched ``` in the document could derail placeholder ordering and lead to broken chunking.
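The failure mode can be reproduced with the old regex alone. In this small sketch (the sample text is made up), a lone ``` in prose pairs with the opening fence of the real code block, so the regex captures prose instead of code:

```python
import re

# The pattern quoted in the Description above.
OLD_PATTERN = re.compile(r"```[\s\S]*?```")

text = (
    "Type a lone ``` to open a fence.\n"
    "\n"
    "```python\n"
    "print('hello')\n"
    "```\n"
    "\n"
    "Closing prose.\n"
)

matches = OLD_PATTERN.findall(text)
# The only match runs from the lone fence to the real block's opening fence.
# The code itself is never matched, and its closing fence is left dangling.
assert matches == ["``` to open a fence.\n\n```"]
```

Every placeholder derived from matches like this one is off by one fence, which is consistent with the ordering problems described above.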

After:

  • Only well-formed fenced blocks are replaced and treated as atomic during chunking.
  • Unmatched/partial fences remain as plain text and no longer cause placeholder desynchronization.

Related Issue

Does this introduce a breaking change?

  • Yes
  • No

Notes:

  • Behavior changes are strictly bug fixes and robustness improvements.
  • Consumers should reinstall the project dependencies to pick up markdown-it-py.

Type of change

  • Bugfix
  • Refactoring (no functional API changes)
  • Feature
  • Code style update
  • Documentation content changes
  • Other... Please describe:

Checklist

  • I have thoroughly tested my changes
  • All existing tests pass
  • I have added new tests (if applicable)
  • I have followed the Co-op Translator coding conventions
  • I have documented my changes (if applicable)

Additional context

  • The parser-based approach also unlocks future enhancements:
    • Language-aware handling via token.info (e.g., mode-specific rules for python, bash, etc.).
    • Optional handling for ~~~ fences and variable-length fences without extra complexity.
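A short sketch of what the language-aware handling could build on: reading the first word of `token.info` for both ``` and ~~~ fences. The helper name is hypothetical, and how the languages would then be used is left open.

```python
from markdown_it import MarkdownIt


def fence_languages(text):
    """Return the first word of each fence's info string ("" if absent)."""
    languages = []
    for token in MarkdownIt("commonmark").parse(text):
        if token.type == "fence":
            words = token.info.split()
            languages.append(words[0].lower() if words else "")
    return languages


doc = (
    "```python\nx = 1\n```\n\n"
    "~~~~ bash title=demo\nls\n~~~~\n\n"
    "```\nplain\n```\n"
)
languages = fence_languages(doc)
```

The parser handles the backtick fence, the four-tilde fence, and the fence with no info string the same way, so no extra cases are needed in the caller.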
  • Suggested follow-up: add targeted tests for:
    • Single unmatched fence,
    • Mixed ``` and ~~~ fences,
    • Variable fence lengths,
    • Very long code blocks to ensure atomic chunking.
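The last suggested test could look roughly like the following. `chunk_lines` is a stand-in greedy chunker written for this sketch, not Co-op Translator's chunker; the point is only the assertion that an oversized code block ends up whole in a single chunk.

```python
def chunk_lines(blocks, max_lines):
    """Greedily pack blocks into chunks of at most max_lines lines.

    Each block is a list of lines that must stay together. A block longer
    than max_lines gets a chunk of its own instead of being split.
    """
    chunks, current = [], []
    for block in blocks:
        if current and len(current) + len(block) > max_lines:
            chunks.append(current)
            current = []
        current.extend(block)
    if current:
        chunks.append(current)
    return chunks


long_code = ["```"] + [f"line {i}" for i in range(50)] + ["```"]
blocks = [["intro"], long_code, ["outro"]]
chunks = chunk_lines(blocks, max_lines=10)

# The 52-line block exceeds the limit but is still kept in one piece.
assert long_code in chunks
```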

@github-actions github-actions bot added the "build" label (related to the build process, dependency management, and CI/CD configurations) on Sep 15, 2025
@skytin1004 skytin1004 marked this pull request as ready for review September 15, 2025 09:21
@skytin1004
Collaborator Author

I have reviewed the changes and everything looks good.

@skytin1004 skytin1004 merged commit c588663 into Azure:main Sep 15, 2025
2 checks passed


Development

Successfully merging this pull request may close these issues.

``` not handled correctly (en→ja)