Skip to content

Conversation

@edgeinfinity1
Copy link
Contributor

@edgeinfinity1 edgeinfinity1 commented Sep 10, 2025

Fixes #0000

Changes proposed in this pull request:
In some uncertain cases, uploaded text files containing Chinese(and possibly other special characters) will fail to generate snippets. I tried to track the procedure and found that the preg_split \R will result in some corrupted characters in $lines.
The reason is not clear to me, but by changing it to /\r\n|\n|\r/ it works well as intended.

Reviewers should focus on:
I'm not sure why this fix could work, since the new expression should be equal with \R.

Screenshot
Before fix:
OX9O) ~2XP~4Z}I4ILY((HG
After fix:
S2UZRIQ~(V69W3DGO(BEK64

Confirmed

  • Frontend changes: tested on a local Flarum installation.
  • Backend changes: tests are green (run composer test).

Required changes:

@edgeinfinity1 edgeinfinity1 requested a review from a team as a code owner September 10, 2025 17:59
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes snippet generation failures for text files containing Chinese characters and other special characters. The issue was caused by the \R regex pattern in preg_split corrupting characters during line splitting, which prevented proper snippet generation for affected files.

  • Replaces the \R regex pattern with explicit line break patterns (/\r\n|\n|\r/) in the text preview formatter
  • Resolves character corruption issues that prevented snippet generation for files with Chinese characters

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@clarkwinkelmann
Copy link
Member

Could this be a case where mb_split should be used?

I don't see why a regular split by newline characters could break a multibyte UTF string, unless some Unicode characters use one of the newline codes as their second byte?

I'm not sure where the full pattern definition can be found, but if my duck.ai answer is correct it seems like \R also matches \x0B, \f, \u{2028} and \u{2029}. Could one of those codes specifically be causing issue?

@clarkwinkelmann
Copy link
Member

Also, while I think about it, can we get some example text that causes issues in case someone ever writes tests for this?

@edgeinfinity1
Copy link
Contributor Author

edgeinfinity1 commented Sep 13, 2025

Could this be a case where mb_split should be used?

I tested it just now, and mb_split('\R', ...) works fine. Should I apply this to the PR?

Also, while I think about it, can we get some example text that causes issues in case someone ever writes tests for this?

Here's an example to reproduce the issue:
ftbe.txt

@imorland
Copy link
Member

Could this be a case where mb_split should be used?

I tested it just now, and mb_split('\R', ...) works fine. Should I apply this to the PR?

Also, while I think about it, can we get some example text that causes issues in case someone ever writes tests for this?

Here's an example to reproduce the issue: ftbe.txt

Yes, please use mb_split - once that's done this is good to merge. Thanks for your contribution!

@edgeinfinity1
Copy link
Contributor Author

Changed as requested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants