@LilithSilver
Contributor

@LilithSilver LilithSilver commented Jul 29, 2022

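Currently, there is a bug when parsing UTF-8 or extended ASCII characters: the C function isspace() doesn't accept negative values (other than EOF), and on platforms where char is signed, every byte at or above 0x80 becomes negative. The simple fix is to cast the value to unsigned char before the call. This is safe because all ASCII whitespace characters are below 0x80, so none of them can ever appear as a negative char anyway.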
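To illustrate the cast, here is a minimal sketch. The helper names `is_space_byte`, `skip_whitespace`, and `whitespace_demo` are hypothetical and are not the functions changed in this PR; the sketch also assumes the default "C" locale, where no byte at or above 0x80 is classified as whitespace.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not the actual inkcpp code): whitespace test that is
// safe for any byte value, including the non-ASCII bytes of UTF-8 text.
inline bool is_space_byte(char c) {
    // Cast first: passing a negative value other than EOF to std::isspace
    // is undefined behavior.
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: number of leading whitespace bytes in s.
inline std::size_t skip_whitespace(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size() && is_space_byte(s[i])) {
        ++i;
    }
    return i;
}

// Usage sketch: two spaces and a tab, followed by the UTF-8 bytes of U+00E9.
inline void whitespace_demo() {
    const std::string text = "  \t\xC3\xA9";
    assert(skip_whitespace(text) == 3);  // stops at the first non-ASCII byte
    assert(!is_space_byte(text[3]));     // 0xC3 is not whitespace
}
```

Without the cast, the `0xC3` byte would reach `std::isspace` as a negative value on a signed-char platform, which is where debug runtimes tend to assert.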

This PR also adds a test based on a modified version of Markus Kuhn's UTF-8 Demo Page, to ensure that it can parse a variety of characters. The demo is under the CC BY license which allows unrestricted use with attribution, and the attribution is at the top of the file, so we should be good there.

@JBenda
Owner

JBenda commented Jul 29, 2022

Oh, is this really the only thing that breaks with UTF-8? Quite handy.

I will try it myself this weekend, but it looks promising.

Thanks for the input!

@LilithSilver
Contributor Author

Yep, I was surprised as well, but it makes sense considering that UTF-8 was designed for full ASCII compatibility!
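As a rough illustration of that compatibility, the sketch below encodes a code point by hand and checks that every byte of a multi-byte sequence has its high bit set, so none of them can be mistaken for an ASCII character such as a space. `encode_utf8` and `all_bytes_non_ascii` are hypothetical names written for this example, not part of ink or inkcpp, and the encoder assumes a valid code point (no surrogate or range checks).

```cpp
#include <cassert>
#include <string>

// Hypothetical encoder for illustration only; assumes cp is a valid
// Unicode scalar value (no validation is performed).
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);  // ASCII is encoded as itself
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// True if no byte in s falls in the ASCII range 0x00..0x7F.
inline bool all_bytes_non_ascii(const std::string& s) {
    for (char c : s) {
        if (static_cast<unsigned char>(c) < 0x80) {
            return false;
        }
    }
    return true;
}

// Usage sketch: ASCII stays one byte; everything else is all high-bit bytes.
inline void utf8_demo() {
    assert(encode_utf8(U'A') == "A");
    assert(encode_utf8(0x20AC) == "\xE2\x82\xAC");  // euro sign
    assert(all_bytes_non_ascii(encode_utf8(0x20AC)));
}
```

This is why byte-oriented parsing keeps working: an ASCII delimiter can only ever match a real ASCII character, never part of a multi-byte sequence.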

Note that if you want the UTF-8 to display properly, you'll have to interpret the byte data as UTF-8. Visual Studio, for example, doesn't treat the strings as UTF-8 by default and shows them as garbled extended ASCII. But the test confirms that the byte data produced by ink is indeed correct.

@JBenda JBenda merged commit b8b36b8 into JBenda:master Jul 31, 2022