Commit ad6dc17

Fix some bugs in the diffWords regex (and errors & ambiguities in the comment above it) (#635)
1 parent 3e1774a commit ad6dc17

3 files changed

Lines changed: 103 additions & 17 deletions

release-notes.md

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@
 ## Future 8.0.3 release
 
 - [#631](https://github.com/kpdecker/jsdiff/pull/631) - **fix support for using an `Intl.Segmenter` with `diffWords`**. This has been almost completely broken since the feature was added in v6.0.0, since it would outright crash on any text that featured two consecutive newlines between a pair of words (a very common case).
+- [#635](https://github.com/kpdecker/jsdiff/pull/635) - **small tweaks to tokenization behaviour of `diffWords`** when used *without* an `Intl.Segmenter`. Specifically, the soft hyphen (U+00AD) is no longer considered to be a word break, and the multiplication and division signs (`×` and `÷`) are now treated as punctuation instead of as letters / word characters.
 
 ## 8.0.2
 

src/diff/word.ts

Lines changed: 19 additions & 17 deletions
@@ -4,23 +4,25 @@ import { longestCommonPrefix, longestCommonSuffix, replacePrefix, replaceSuffix,
 
 // Based on https://en.wikipedia.org/wiki/Latin_script_in_Unicode
 //
-// Ranges and exceptions:
-// Latin-1 Supplement, 0080–00FF
-//  - U+00D7 × Multiplication sign
-//  - U+00F7 ÷ Division sign
-// Latin Extended-A, 0100–017F
-// Latin Extended-B, 0180–024F
-// IPA Extensions, 0250–02AF
-// Spacing Modifier Letters, 02B0–02FF
-//  - U+02C7 ˇ ˇ Caron
-//  - U+02D8 ˘ ˘ Breve
-//  - U+02D9 ˙ ˙ Dot Above
-//  - U+02DA ˚ ˚ Ring Above
-//  - U+02DB ˛ ˛ Ogonek
-//  - U+02DC ˜ ˜ Small Tilde
-//  - U+02DD ˝ ˝ Double Acute Accent
-// Latin Extended Additional, 1E00–1EFF
-const extendedWordChars = 'a-zA-Z0-9_\\u{C0}-\\u{FF}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';
+// Chars/ranges counted as "word" characters by this regex are as follows:
+//
+// + U+00AD Soft hyphen
+// + 00C0–00FF (letters with diacritics from the Latin-1 Supplement), except:
+//   - U+00D7 × Multiplication sign
+//   - U+00F7 ÷ Division sign
+// + Latin Extended-A, 0100–017F
+// + Latin Extended-B, 0180–024F
+// + IPA Extensions, 0250–02AF
+// + Spacing Modifier Letters, 02B0–02FF, except:
+//   - U+02C7 ˇ ˇ Caron
+//   - U+02D8 ˘ ˘ Breve
+//   - U+02D9 ˙ ˙ Dot Above
+//   - U+02DA ˚ ˚ Ring Above
+//   - U+02DB ˛ ˛ Ogonek
+//   - U+02DC ˜ ˜ Small Tilde
+//   - U+02DD ˝ ˝ Double Acute Accent
+// + Latin Extended Additional, 1E00–1EFF
+const extendedWordChars = 'a-zA-Z0-9_\\u{AD}\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';
 
 // Each token is one of the following:
 // - A punctuation mark plus the surrounding whitespace
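
As a rough check of the comment against the character class, the new `extendedWordChars` string can be dropped into a one-character regex and probed with the code points the comment lists. This is only a sketch: `isWordChar` and `wordCharRegex` are hypothetical names that do not exist in jsdiff, and the way the string is wrapped in a `RegExp` here is an assumption about how such a class body would typically be used.

```javascript
// Character-class body copied from the new side of the src/diff/word.ts diff.
const extendedWordChars =
  'a-zA-Z0-9_\\u{AD}\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}' +
  '\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';

// Hypothetical helper regex (not part of jsdiff): matches exactly one
// character from the class. The `u` flag is needed for \u{...} escapes.
const wordCharRegex = new RegExp(`^[${extendedWordChars}]$`, 'u');

function isWordChar(ch) {
  return wordCharRegex.test(ch);
}

console.log(isWordChar('\u00AD')); // soft hyphen: true
console.log(isWordChar('\u00E9')); // é: true
console.log(isWordChar('\u00D7')); // ×: false
console.log(isWordChar('\u00F7')); // ÷: false
console.log(isWordChar('\u02C7')); // ˇ (caron): false
console.log(isWordChar('\u00BD')); // ½: false
```

The two gaps at U+00D7 and U+00F7 are what the old class failed to produce: its leading `\u{C0}-\u{FF}` range already covered both signs, so the narrower ranges that followed had no effect.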

test/diff/word.js

Lines changed: 83 additions & 0 deletions
@@ -59,6 +59,89 @@ describe('WordDiff', function() {
         '.'
       ]);
     });
+
+    // Test for various behaviours discussed at
+    // https://github.com/kpdecker/jsdiff/issues/634#issuecomment-3381707327
+    // In particular we are testing that:
+    // 1. single code points representing accented characters (most of range
+    //    U+00C0 thru U+00FF) are treated as word characters
+    // 2. soft hyphens are treated as part of the word they appear in
+    // 3. the multiplication and division signs are punctuation
+    // 4. currency signs are punctuation
+    // 5. section symbol is punctuation
+    // 6. reserved trademark symbol is punctuation
+    // 7. fractions are punctuation
+    // The behaviour being tested for in points 4 thru 7 above is of debatable
+    // correctness; it is not totally obvious whether we SHOULD treat those
+    // things as punctuation characters or as word characters. Nonetheless, we
+    // have this test to help document the current behaviour.
+    it('should handle the 0080-00FF range the way we expect', () => {
+      expect(
+        wordDiff.tokenize(
+          'My daugh\u00adter, Am\u00E9lie, is 1½ years old and works for ' +
+          'Google® for £6 per hour (equivalently £6÷60=£0.10 per minute, or ' +
+          '£6×8=£48 per day), in violation of § 123 of the Child Labour Act.'
+        )
+      ).to.deep.equal([
+        'My ',
+        ' daugh\u00adter',
+        ', ',
+        ' Am\u00E9lie',
+        ', ',
+        ' is ',
+        ' 1',
+        '½ ',
+        ' years ',
+        ' old ',
+        ' and ',
+        ' works ',
+        ' for ',
+        ' Google',
+        '® ',
+        ' for ',
+        ' £',
+        '6 ',
+        ' per ',
+        ' hour ',
+        ' (',
+        'equivalently ',
+        ' £',
+        '6',
+        '÷',
+        '60',
+        '=',
+        '£',
+        '0',
+        '.',
+        '10 ',
+        ' per ',
+        ' minute',
+        ', ',
+        ' or ',
+        ' £',
+        '6',
+        '×',
+        '8',
+        '=',
+        '£',
+        '48 ',
+        ' per ',
+        ' day',
+        ')',
+        ', ',
+        ' in ',
+        ' violation ',
+        ' of ',
+        ' § ',
+        ' 123 ',
+        ' of ',
+        ' the ',
+        ' Child ',
+        ' Labour ',
+        ' Act',
+        '.'
+      ]);
+    });
   });
 
   describe('#diffWords', function() {
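
The behaviour change this test pins down can be sketched by running the old and new character classes through the same simplified splitter. `roughTokens` is a hypothetical helper, not jsdiff's `tokenize`: it only splits text into word runs, whitespace runs, and single other characters, and, unlike the real tokenizer exercised in the test above, it does not attach surrounding whitespace to tokens. The shape of its regex is an assumption made for illustration.

```javascript
// Character-class bodies copied from the old and new sides of the word.ts diff.
const oldWordChars =
  'a-zA-Z0-9_\\u{C0}-\\u{FF}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}' +
  '\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';
const newWordChars =
  'a-zA-Z0-9_\\u{AD}\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}' +
  '\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';

// Hypothetical simplified splitter (an assumption, not jsdiff's tokenizer):
// a run of word characters, a run of whitespace, or one other character.
function roughTokens(text, wordChars) {
  const re = new RegExp(`[${wordChars}]+|\\s+|[^${wordChars}]`, 'ug');
  return text.match(re) || [];
}

const sample = 'daugh\u00adter £6÷60';

console.log(roughTokens(sample, oldWordChars));
// old class: ['daugh', '\u00ad', 'ter', ' ', '£', '6÷60']

console.log(roughTokens(sample, newWordChars));
// new class: ['daugh\u00adter', ' ', '£', '6', '÷', '60']
```

With the old class the soft hyphen splits `daughter` into three pieces and `6÷60` stays glued together as one "word". With the new class both cases come out the way the release note describes.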
