@jmecn
Contributor

@jmecn jmecn commented Nov 27, 2020

Fix the code error when checking UTF-8 data

Let content = [0xE4, 0x8A, 0xBC] and b = 0xE4 (binary 1110 0100).

See this part:

                if (b < 0x80) {
                    // good
                }
                else if ((b & 0xC0) == 0xC0) {//   (0xE4 & 0xC0) == 0xC0     =====>  true
                    utf8State = UTF8_2BYTE;
                }
                else if ((b & 0xE0) == 0xE0) {//   (0xE4 & 0xE0) == 0xE0      =====>  true
                    utf8State = UTF8_3BYTE_1;
                }
                else {
                    utf8State = UTF8_ILLEGAL;
                }

Because the first test already matches, a 3-byte UTF-8 sequence will always be treated as a 2-byte sequence.
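
To illustrate the point, here is a minimal sketch. It is not the code from this PR: the class and method names (`Utf8LeadByte`, `lengthAsChecked`, `lengthWithExclusiveMasks`) are hypothetical, and the second method shows only one possible way to make the tests mutually exclusive, by also checking the zero bit that terminates each prefix. Like the snippet above, it stops at 3-byte sequences.

```java
final class Utf8LeadByte {
    private Utf8LeadByte() {}

    /** Same test order as the snippet above; expects b in the range 0..255. */
    public static int lengthAsChecked(int b) {
        if (b < 0x80) {
            return 1;
        } else if ((b & 0xC0) == 0xC0) { // also true for 1110xxxx
            return 2;
        } else if ((b & 0xE0) == 0xE0) { // never reached for a valid lead byte
            return 3;
        } else {
            return -1; // stands in for UTF8_ILLEGAL
        }
    }

    /** Hypothetical alternative: each mask includes the 0 bit that ends the prefix. */
    public static int lengthWithExclusiveMasks(int b) {
        if (b < 0x80) {
            return 1; // 0xxxxxxx
        } else if ((b & 0xE0) == 0xC0) {
            return 2; // 110xxxxx
        } else if ((b & 0xF0) == 0xE0) {
            return 3; // 1110xxxx
        } else {
            return -1; // continuation byte or a lead byte not handled here
        }
    }

    public static void main(String[] args) {
        int b = 0xE4;
        System.out.println(lengthAsChecked(b));          // prints 2
        System.out.println(lengthWithExclusiveMasks(b)); // prints 3
    }
}
```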

@stephengold
Member

stephengold commented Nov 28, 2020

Thank you for providing the fix. Please change "is" to "are" in 3 places. Other than that, this looks great.

@jmecn
Contributor Author

jmecn commented Nov 30, 2020

It seems better to always treat String data as UTF-8: write it as UTF-8, read it as UTF-8.

See https://hub.jmonkeyengine.org/t/code-error-on-checking-utf-8-data/43909

Take the 3-byte UTF-8 sequence [0xE4, 0x8A, 0xBC]. When b = 0xE4 (binary 1110 0100), it is treated as the lead byte of a 2-byte sequence.

See this part:

```java
            if (b < 0x80) {
                // good
            }
            else if ((b & 0xC0) == 0xC0) {//   (0xE4 & 0xC0) == 0xC0     =====>  true
                utf8State = UTF8_2BYTE;
            }
            else if ((b & 0xE0) == 0xE0) {//   (0xE4 & 0xE0) == 0xE0      =====>  true
                utf8State = UTF8_3BYTE_1;
            }
            else {
                utf8State = UTF8_ILLEGAL;
            }
```

Because the first test already matches, a 3-byte UTF-8 sequence will always be treated as a 2-byte sequence.

It is better to always treat String data as UTF-8 now.

See https://hub.jmonkeyengine.org/t/code-error-on-checking-utf-8-data/43909
@riccardobl
Member

Can you use StandardCharsets.UTF_8 for the charset?

@riccardobl riccardobl self-requested a review December 2, 2020 13:33
@jmecn
Contributor Author

jmecn commented Dec 3, 2020

> Can you use StandardCharsets.UTF_8 for the charset?

OK, but it's better to use it in both input and output.

https://github.com/jMonkeyEngine/jmonkeyengine/blob/2196e4c/jme3-core/src/plugins/java/com/jme3/export/binary/BinaryOutputCapsule.java#L688
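
As a rough illustration of using the same charset on both sides, here is a minimal round-trip sketch. It is not the BinaryInputCapsule/BinaryOutputCapsule code: `Utf8RoundTrip`, `encode`, and `decode` are hypothetical names, and only the standard `String.getBytes(Charset)` and `new String(byte[], Charset)` calls are assumed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8RoundTrip {
    private Utf8RoundTrip() {}

    /** Output side: always encode as UTF-8. */
    public static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Input side: always decode as UTF-8, with no guessing from the byte pattern. */
    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The 3-byte sequence from the example above; it encodes U+42BC.
        byte[] original = {(byte) 0xE4, (byte) 0x8A, (byte) 0xBC};

        String s = decode(original);
        byte[] written = encode(s);

        System.out.println(Integer.toHexString(s.codePointAt(0))); // prints 42bc
        System.out.println(Arrays.equals(original, written));      // prints true
    }
}
```

With the charset fixed on both sides, no per-byte state machine is needed to decide how to decode the data.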

Member

@riccardobl riccardobl left a comment

Perfect.

@stephengold stephengold merged commit d4a7ad7 into jMonkeyEngine:master Dec 5, 2020
@stephengold stephengold modified the milestones: Future Release, v3.4.0 Mar 13, 2021
@stephengold stephengold linked an issue Mar 16, 2021 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

Code error on checking UTF-8 data