HtmlHelper.toPlainText: correctly decode character references #1604
Open
bschwarzent wants to merge 1 commit into releases/25.2 from
Both named and decimal character references should be decoded to their original character when converting HTML to plain text. While numeric references can be decoded automatically, the list of named references is fixed and specified in an RFC. The previously supported list of characters was incomplete. This is now fixed.

Additionally, the special handling of non-breaking spaces was removed. If they are present in the HTML, they should remain in the plain text as their Unicode equivalent \u00A0, because they are not collapsed when the HTML is rendered.

413114
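To illustrate the behaviour described above, here is a minimal TypeScript sketch. It is not the code from this PR: `decodeCharacterReferences`, `NAMED_REFERENCES` and `fromCodePoint` are hypothetical names, and the table holds only a handful of entries, whereas the real list of named references is far longer. The sketch assumes that references which cannot be decoded are left untouched.

```typescript
// Hypothetical, deliberately tiny table; the real list of named references is far longer.
const NAMED_REFERENCES: Record<string, string> = {
  amp: '&',
  lt: '<',
  gt: '>',
  quot: '"',
  nbsp: '\u00A0', // kept as a real non-breaking space, not replaced by ' '
  auml: '\u00E4',
  euro: '\u20AC'
};

// Builds a string from a code point, using a surrogate pair above U+FFFF.
function fromCodePoint(codePoint: number): string {
  if (codePoint <= 0xFFFF) {
    return String.fromCharCode(codePoint);
  }
  const offset = codePoint - 0x10000;
  return String.fromCharCode(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
}

// Decodes named (&auml;), decimal (&#228;) and hexadecimal (&#xE4;) references.
// Anything that cannot be decoded is returned unchanged.
function decodeCharacterReferences(html: string): string {
  const pattern = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g;
  return html.replace(pattern, (match: string, body: string): string => {
    if (body.charAt(0) === '#') {
      const isHex = body.charAt(1) === 'x' || body.charAt(1) === 'X';
      const codePoint = isHex ? parseInt(body.substring(2), 16) : parseInt(body.substring(1), 10);
      // Values outside the Unicode range are not decodable: keep the original text.
      return codePoint > 0x10FFFF ? match : fromCodePoint(codePoint);
    }
    // hasOwnProperty guards against inherited keys such as "constructor".
    return Object.prototype.hasOwnProperty.call(NAMED_REFERENCES, body) ? NAMED_REFERENCES[body] : match;
  });
}
```

Numeric references need no table at all, which is why only the named ones depend on the completeness of the list. A non-breaking space comes out as \u00A0, so a later whitespace-collapsing step would have to treat it differently from an ordinary space.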
a506348 to 9da4d1a
nsteger reviewed on Jul 29, 2025
expect(encoder.encode('one\ntwo')).toBe('one\ntwo');
expect(encoder.encode('one\r\ntwo')).toBe('one\ntwo');
expect(encoder.encode('one\rtwo')).toBe('one\ntwo');
expect(encoder.encode('one two')).toBe('one two');
Are these special spaces? If so, maybe add a comment. Otherwise I would expect duplicated spaces to be removed.
    decoded = decodeNumericCharacterReference(encoded); // Numeric character reference
  }
  if (decoded == null) {
    start = end + 1;
What if the string is, for example, "&&auml;"? I would expect the result "&ä", but I think the algorithm will return "&&auml;".
=> Yes. I added assertEquals("&ä", entities.unescapeAll("&&auml;")); to HtmlEntitiesTest, and it fails.
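One way to cope with an ampersand that does not start a valid reference is to resume scanning directly after that `&` when the candidate cannot be decoded, rather than after the next `;`. The TypeScript sketch below illustrates only that idea. `unescapeAllSketch` and `decodeOne` are hypothetical names, the lookup knows just three references, and the input `&&auml;` is an assumed example, so this is not the implementation under review.

```typescript
// Hypothetical stand-in for the real lookup: knows only three named references.
function decodeOne(name: string): string | null {
  switch (name) {
    case 'amp': return '&';
    case 'auml': return '\u00E4';
    case 'nbsp': return '\u00A0';
    default: return null;
  }
}

function unescapeAllSketch(text: string): string {
  let result = '';
  let pos = 0;
  while (pos < text.length) {
    const start = text.indexOf('&', pos);
    if (start < 0) {
      break; // no further candidates
    }
    const end = text.indexOf(';', start + 1);
    if (end < 0) {
      break; // an '&' without a terminating ';' cannot be a reference
    }
    const decoded = decodeOne(text.substring(start + 1, end));
    if (decoded === null) {
      // Not a reference: copy the text up to and including this '&',
      // then rescan from the very next character instead of skipping to end + 1.
      result += text.substring(pos, start + 1);
      pos = start + 1;
    } else {
      // Valid reference: copy the text before it, then the decoded character.
      result += text.substring(pos, start) + decoded;
      pos = end + 1;
    }
  }
  // Append whatever remains after the last processed position.
  return result + text.substring(pos);
}
```

With this approach `&&auml;` becomes `&ä`, and because decoded output is never rescanned, `&amp;auml;` becomes `&auml;` rather than `ä`.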
@SuppressWarnings({"ConcatenationWithEmptyString", "SpellCheckingInspection", "TextBlockMigration"})
@RunWith(PlatformTestRunner.class)
public class HtmlEntitiesTest {
Maybe add some of these tests to the TS spec as well? The plain text encoder should perform the same transformations.