HtmlHelper.toPlainText: correctly decode character references#1604

Open
bschwarzent wants to merge 1 commit into releases/25.2 from features/bsh/25.2/413114_html_entities

@bschwarzent

Both named and decimal character references should be decoded to their original character when converting HTML to plain text. While the numeric references can be decoded automatically, the list of named references is fixed and specified in an RFC. The previously supported list of characters was not complete. This is now fixed.

Additionally, the special handling of non-breaking spaces was removed. If they are present in the HTML, they should remain in the plain text as their Unicode equivalent \u00A0, because they are not collapsed when the HTML is rendered.
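
The decoding rules described above can be illustrated with a minimal sketch. It is not the implementation from this PR: `CharRefSketch`, `NAMED` and `decode` are hypothetical names, the lookup table is a tiny subset of the full named-reference list, and support for hexadecimal references (`&#x...;`) is an assumption, since the description only mentions decimal ones. Anything that is not a recognized reference is copied literally, and `&nbsp;` stays a real U+00A0 instead of being turned into a regular space.

```java
import java.util.Map;

final class CharRefSketch {

  // Illustrative subset only; the complete named-reference list is much longer.
  private static final Map<String, String> NAMED = Map.of(
      "amp", "&",
      "lt", "<",
      "gt", ">",
      "quot", "\"",
      "nbsp", "\u00A0",
      "auml", "\u00E4");

  /** Decodes character references in a single pass, so decoded output is never decoded again. */
  public static String decode(String html) {
    StringBuilder out = new StringBuilder();
    int i = 0;
    while (i < html.length()) {
      char c = html.charAt(i);
      int end = (c == '&') ? html.indexOf(';', i + 1) : -1;
      String decoded = (end > i + 1) ? decodeReference(html.substring(i + 1, end)) : null;
      if (decoded == null) {
        out.append(c); // not a recognized reference: keep the character as-is
        i++;
      } else {
        out.append(decoded);
        i = end + 1; // continue after the terminating ';'
      }
    }
    return out.toString();
  }

  private static String decodeReference(String body) {
    if (body.startsWith("#")) {
      return decodeNumeric(body.substring(1));
    }
    return NAMED.get(body); // null for unknown names
  }

  // Numeric references need no table: the digits are the code point itself.
  private static String decodeNumeric(String body) {
    boolean hex = body.startsWith("x") || body.startsWith("X"); // assumption: hex is supported too
    String digits = hex ? body.substring(1) : body;
    if (!digits.matches(hex ? "[0-9a-fA-F]{1,6}" : "[0-9]{1,7}")) {
      return null;
    }
    int codePoint = Integer.parseInt(digits, hex ? 16 : 10);
    return Character.isValidCodePoint(codePoint) ? new String(Character.toChars(codePoint)) : null;
  }
}
```

With this sketch, `CharRefSketch.decode("a&nbsp;b")` keeps the non-breaking space as `\u00A0`, and a lone ampersand as in `"Tom & Jerry"` passes through unchanged.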

413114

@bschwarzent bschwarzent self-assigned this Jul 21, 2025
@bschwarzent bschwarzent force-pushed the features/bsh/25.2/413114_html_entities branch from a506348 to 9da4d1a Compare July 21, 2025 16:14
@bschwarzent bschwarzent requested a review from nsteger July 21, 2025 16:36
expect(encoder.encode('one\ntwo')).toBe('one\ntwo');
expect(encoder.encode('one\r\ntwo')).toBe('one\ntwo');
expect(encoder.encode('one\rtwo')).toBe('one\ntwo');
expect(encoder.encode('one two')).toBe('one two');
Are these special spaces? If so, maybe add a comment? Otherwise I would expect duplicated spaces to be removed.
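
To make the question concrete, here is a minimal sketch of the two behaviours under discussion. `WhitespaceSketch` and `normalize` are hypothetical names, and the sketch is in Java rather than the TypeScript of the spec. The line-break handling mirrors the expectations quoted above; collapsing runs of regular spaces is only what the reviewer says they would expect, not something this PR confirms. Non-breaking spaces (U+00A0) are deliberately left alone.

```java
final class WhitespaceSketch {

  public static String normalize(String text) {
    return text
        .replace("\r\n", "\n")      // Windows line breaks become \n
        .replace('\r', '\n')        // remaining lone carriage returns become \n
        .replaceAll(" {2,}", " ");  // assumption: collapse runs of U+0020 only, never U+00A0
  }
}
```

Under these assumptions, `"one\u00A0\u00A0two"` is returned unchanged, while a run of several regular spaces between the two words is reduced to one.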

decoded = decodeNumericCharacterReference(encoded); // Numeric character reference
}
if (decoded == null) {
start = end + 1;
What if the string is, for example, "&ä"? I would expect the result "&ä", but I think the algorithm will return "&ä"?

=> yes, I added assertEquals("&ä", entities.unescapeAll("&ä")); in HtmlEntitiesTest, which fails


@SuppressWarnings({"ConcatenationWithEmptyString", "SpellCheckingInspection", "TextBlockMigration"})
@RunWith(PlatformTestRunner.class)
public class HtmlEntitiesTest {
Maybe add some of these tests to the TS spec? (The plain text encoder should do the same transformations.)
