HtmlHelper.toPlainText: correctly decode character references by bschwarzent · Pull Request #1604 · eclipse-scout/scout.rt

bschwarzent · 2025-07-21T15:45:05Z

Both named and decimal character references should be decoded to their original character when converting HTML to plain text. While the numeric references can be decoded automatically, the list of named references is fixed and specified in an RFC. The previously supported list of characters was not complete. This is now fixed.

Additionally, the special handling of non-breaking spaces was removed. If they are present in the HTML, they should remain in the plain text as their Unicode equivalent \u00A0, because they are not collapsed when the HTML is rendered.

413114
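As an illustration of the distinction made above, here is a minimal sketch, not the code from this PR. It assumes a hypothetical class `CharRefSketch` whose `decode` method receives the text between `&` and `;`. Numeric references are computed from the code point, while named references need a lookup table; the table here is a deliberately tiny, hypothetical subset.

```java
import java.util.Map;

final class CharRefSketch {
  // Hypothetical, tiny subset of the fixed list of named references.
  static final Map<String, String> NAMED = Map.of(
      "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "nbsp", "\u00A0", "auml", "\u00E4");

  /** Decodes the text between '&' and ';'. Returns null if it is not a recognized reference. */
  static String decode(String body) {
    boolean hex = body.matches("#[xX][0-9a-fA-F]+");
    if (hex || body.matches("#[0-9]+")) {
      try {
        // Numeric references need no table: parse the digits as a code point.
        int codePoint = Integer.parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
        return Character.isValidCodePoint(codePoint)
            ? new String(Character.toChars(codePoint))
            : null;
      } catch (NumberFormatException e) {
        return null; // too many digits to fit into an int
      }
    }
    // Named references can only be resolved through the fixed list.
    return NAMED.get(body);
  }

  public static void main(String[] args) {
    System.out.println(decode("#228").equals(decode("auml"))); // both yield U+00E4
  }
}
```

In this sketch `nbsp` maps to U+00A0 like any other named reference, so no special case for non-breaking spaces is needed.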

nsteger · 2025-07-29T07:24:58Z

eclipse-scout-core/test/encoder/PlainTextEncoderSpec.ts

+    expect(encoder.encode('one\ntwo')).toBe('one\ntwo');
+    expect(encoder.encode('one\r\ntwo')).toBe('one\ntwo');
+    expect(encoder.encode('one\rtwo')).toBe('one\ntwo');
+    expect(encoder.encode('one   two')).toBe('one   two');


Are these special spaces (e.g. non-breaking spaces)? If so, maybe add a comment. Otherwise I would expect duplicated spaces to be collapsed.
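To make the difference visible, here is a small sketch under assumptions: the class `WhitespaceSketch` and its `collapse` method are hypothetical, and whether the encoder collapses ordinary spaces at all is not confirmed by this PR. It shows that a collapse rule restricted to ASCII spaces and tabs leaves U+00A0 alone, and that writing the character as an escape keeps a test readable.

```java
final class WhitespaceSketch {
  /** Collapses runs of ASCII spaces and tabs into one space. U+00A0 is not in the character class. */
  static String collapse(String s) {
    return s.replaceAll("[ \\t]+", " ");
  }

  public static void main(String[] args) {
    // Ordinary spaces: the run is collapsed.
    System.out.println(collapse("one   two").equals("one two"));
    // Non-breaking spaces, written as escapes so they are visible in the source: kept as-is.
    System.out.println(collapse("one\u00A0\u00A0two").equals("one\u00A0\u00A0two"));
  }
}
```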

nsteger · 2025-07-29T07:57:03Z

...eclipse.scout.rt.platform/src/main/java/org/eclipse/scout/rt/platform/html/HtmlEntities.java

+        decoded = decodeNumericCharacterReference(encoded); // Numeric character reference
+      }
+      if (decoded == null) {
+        start = end + 1;


What if the string is, for example, "&ä"? I would expect the result "&ä", but I think the algorithm will return "&ä"?

=> Yes, I added assertEquals("&ä", entities.unescapeAll("&ä")); to HtmlEntitiesTest, and it fails.
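One way to keep a stray ampersand intact is a single left-to-right pass that copies the `&` unchanged whenever the text up to the next `;` is not a known reference. This is a sketch with a hypothetical class `UnescapeSketch`, not the algorithm in HtmlEntities.java. It handles only a tiny, assumed subset of named references and omits numeric ones for brevity.

```java
import java.util.Map;

final class UnescapeSketch {
  // Hypothetical subset of named references, for illustration only.
  static final Map<String, String> NAMED = Map.of("amp", "&", "lt", "<", "auml", "\u00E4");

  static String unescapeAll(String s) {
    StringBuilder out = new StringBuilder(s.length());
    int i = 0;
    while (i < s.length()) {
      char c = s.charAt(i);
      // Only look for a terminating ';' if we are at an ampersand.
      int end = c == '&' ? s.indexOf(';', i + 1) : -1;
      String decoded = end > 0 ? NAMED.get(s.substring(i + 1, end)) : null;
      if (decoded == null) {
        out.append(c); // not a known reference: keep the character and move on by one
        i++;
      } else {
        out.append(decoded); // decoded text goes to the output and is never rescanned
        i = end + 1;
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(unescapeAll("&\u00E4"));   // ampersand followed by a plain character
    System.out.println(unescapeAll("a & b; &lt;")); // ';' later in the text, but no valid name
  }
}
```

Because decoded text is appended to the output rather than written back into the input, an escaped ampersand followed by a name and a semicolon is decoded only once.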

nsteger · 2025-07-29T08:50:05Z

...cout.rt.platform.test/src/test/java/org/eclipse/scout/rt/platform/html/HtmlEntitiesTest.java

+
+@SuppressWarnings({"ConcatenationWithEmptyString", "SpellCheckingInspection", "TextBlockMigration"})
+@RunWith(PlatformTestRunner.class)
+public class HtmlEntitiesTest {


Maybe add some of these tests to the TS spec? (The plain text encoder should do the same transformations.)

bschwarzent self-assigned this Jul 21, 2025

bschwarzent force-pushed the features/bsh/25.2/413114_html_entities branch from a506348 to 9da4d1a Compare July 21, 2025 16:14

bschwarzent requested a review from nsteger July 21, 2025 16:36

nsteger reviewed Jul 29, 2025

bschwarzent wants to merge 1 commit into releases/25.2 from features/bsh/25.2/413114_html_entities
