Fix snippet generation failure in some cases with Chinese characters #447
base: master
Pull Request Overview
This PR fixes snippet generation failures for text files containing Chinese characters and other special characters. The issue was caused by the \R regex pattern in preg_split corrupting characters during line splitting, which prevented proper snippet generation for affected files.
- Replaces the \R regex pattern with explicit line break patterns (/\r\n|\n|\r/) in the text preview formatter
- Resolves character corruption issues that prevented snippet generation for files with Chinese characters
Could this be a case where I don't see why a regular split by newline characters could break a multibyte UTF string, unless some Unicode characters use one of the newline codes as their second byte? I'm not sure where the full pattern definition can be found, but if my duck.ai answer is correct it seems like |
Also, while I think about it, can we get some example text that causes the issue, in case someone ever writes tests for this?
I tested it just now, and mb_split('\R', ...) works fine. Should I apply this to the PR?
Here's an example to reproduce the issue: |
Yes, please use |
Changed as requested.
Fixes #0000
Changes proposed in this pull request:
In some cases that I have not been able to pin down, uploaded text files containing Chinese characters (and possibly other special characters) fail to generate snippets. Tracing the code, I found that preg_split with \R leaves some corrupted characters in $lines.
The reason is not clear to me, but changing the pattern to /\r\n|\n|\r/ makes it work as intended.
Reviewers should focus on:
I'm not sure why this fix works, since the new expression should be equivalent to \R.
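One plausible explanation, offered as a sketch rather than a confirmed diagnosis: without the u modifier, PCRE matches on bytes, and in that mode \R also matches the single byte 0x85 (NEL) in addition to \r\n, \n, and \r. Some Chinese characters contain 0x85 as a UTF-8 continuation byte, for example 内 (U+5185, bytes E5 86 85), so a byte-mode \R can split in the middle of such a character. The sample string, variable names, and the isValidUtf8 helper below are illustrative assumptions and are not taken from the PR or the codebase.

```php
<?php
// Hypothetical sample text: "内" (U+5185) is encoded in UTF-8 as E5 86 85,
// so its last byte is 0x85, the same value as NEL in single-byte mode.
$text = "第一行内容\n第二行";

// Illustrative helper: an empty pattern with /u only matches when the
// subject is valid UTF-8, so this doubles as a validity check.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Byte mode (no /u): \R may treat the lone byte 0x85 as a line break.
$byteMode = preg_split('/\R/', $text);

// UTF-8 mode (/u): the subject is read as code points, so the 0x85 byte
// inside "内" is not seen as a separate character.
$utf8Mode = preg_split('/\R/u', $text);

// Explicit alternation, as proposed in this PR: 0x85 is never a separator.
$explicit = preg_split('/\r\n|\n|\r/', $text);

foreach (['byteMode' => $byteMode, 'utf8Mode' => $utf8Mode, 'explicit' => $explicit] as $name => $lines) {
    $allValid = count(array_filter($lines, 'isValidUtf8')) === count($lines);
    echo $name, ': ', count($lines), ' lines, all valid UTF-8: ', $allValid ? 'yes' : 'no', "\n";
}
```

If this is the cause, the explicit pattern, the u modifier, and a multibyte-aware split should all avoid the corruption. The behaviour of \R depends on how the PCRE library was built, so results may differ on unusual builds.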
Screenshot


Before fix:
After fix:
Confirmed
composer test).
Required changes: