-
Notifications
You must be signed in to change notification settings - Fork 50
Unicode correctness #119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode correctness #119
Conversation
b8291d8 to
0b605a7
Compare
|
|
||
| -- | Combinator which returns both the result of a parse and the portion of | ||
| -- | the input that was consumed while it was being parsed. | ||
| match :: forall m a. Monad m => ParserT String m a -> ParserT String m (Tuple String a) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Attoparsec and Megaparsec this combinator is named match, but I've always felt it was misnamed and that it should be named capture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems similar to JS in that matching returns captures as part of the result. Maybe a capture helper would discard the result and only return the consumed portion.
39b3a10 to
f948996
Compare
f948996 to
2c6d0bc
Compare
2c6d0bc to
9392607
Compare
9392607 to
9846f24
Compare
9846f24 to
37f2619
Compare
60ade0c to
ab104c3
Compare
| ParseState remainder | ||
| (updatePosString position str) | ||
| true | ||
| put $ ParseState remainder (updatePosString position str) true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This applies to the previous implementation as well, but if passed the empty string won't this mark the input as consumed without actually consuming anything?
This is not feedback, just a discussion point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point, I'm not sure what the correct behavior should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a new issue for this.
| updatePosString :: Position -> String -> Position | ||
| updatePosString pos str = case uncons str of | ||
| Nothing -> pos | ||
| Just {head,tail} -> updatePosString (updatePosSingle pos head) tail -- tail recursive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Just {head,tail} -> updatePosString (updatePosSingle pos head) tail -- tail recursive | |
| Just { head, tail } -> updatePosString (updatePosSingle pos head) tail |
Since it's optimised by the compiler I don't think the comment is necessary.
src/Text/Parsing/Parser/String.purs
Outdated
| -- The CodePoint newtype constructor is not exported, so here's a helper. | ||
| -- This will break at runtime if the definition of CodePoint ever changes | ||
| -- to something other than `newtype CodePoint = CodePoint Int`. | ||
| deconstructCodePoint :: CodePoint -> Int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about unCodePoint, since it strips the type-level abstraction to expose the runtime representation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
|
||
| simpleSpace :: forall m . Monad m => ParserT String m Unit | ||
| simpleSpace = skipMany1 (satisfyCP isSpace) | ||
| simpleSpace = skipMany1 (satisfyCodePoint isSpace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can satisfyCP be removed, or is it still being used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still being used.
Correctly handle UTF-16 surrogate pairs in `String`s. We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Split dev dependencies into spago-dev.dhall. Add benchmark suite. Add astral UTF-16 test. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
ab104c3 to
a1413b1
Compare
|
Some more historical context: the |
Resolves #109
Correctly handle UTF-16 surrogate pairs in
Strings. This is intended to be a conservative change to the package. We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail.All prior tests pass with no modifications. Add a few new tests.
If merged, this PR will allow easy solutions for #110 and others.
Non-breaking changes
Add primitive parsers
anyCodePointandsatisfyCodePointfor parsingCodePoints.Add the
matchcombinator.Breaking changes
Move
updatePosStringto theText.Parsing.Parser.Stringmodule and don'texport it.Change the definition of
whiteSpaceandskipSpacestoData.CodePoint.Unicode.isSpace.To make this library handle Unicode correctly, it is necessary to either alter the
StringLikeclass or delete it. We decided to delete it. TheStringmodule will now operate only on inputs of the concreteStringtype.StringLikehas no laws, and during the five years of its life, no-one on Github has ever written another instance ofStringLike.https://github.com/search?l=&q=StringLike+language%3APureScript&type=code
Breaking changes which won’t be caught by the compiler
Fundamentally, we change the way we consume the next input character from
Data.String.CodeUnits.unconstoData.String.CodePoints.uncons.anyCharwill no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parseranyCodePointwill always succeed.We are not quite “making the default
CodePoint”, as was discussed in#76 (comment) .
Rather we are keeping most of the current API and making it work properly with astral Unicode.
We keep the
Charparsers for backward compatibility. We also keep theCharparsers for ergonomic reasons. For example the parserchar :: forall m. Monad m => Char -> ParserT String m Char. This parser is usually called with a literal likechar 'a'. It would be annoying to call this parser withchar (codePointFromChar 'a').Benchmarks
For Unicode correctness, we're now consuming characters with
Data.String.CodePoints.unconsinstead ofData.String.CodeUnits.uncons. If that were going to effect performance, then the effect would show up in therunParser parse23benchmark, but it doesn’t.Before
After