
Commit 8b1171c

Clarify the slicing operation.
1 parent ee0def2 commit 8b1171c

1 file changed

Lines changed: 80 additions & 5 deletions

text/0000-os-str-pattern.md

@@ -227,8 +227,7 @@ assert_eq!(parts.next(), None);
227227
[reference-level-explanation]: #reference-level-explanation
228228

229229
It is trivial to apply the pattern API to `OsStr` on platforms where it is just an `[u8]`. The main
230-
difficulty is on Windows where it is an `[u16]` encoded as WTF-8. This RFC thus focuses on Windows
231-
only.
230+
difficulty is on Windows where it is an `[u16]` encoded as WTF-8. This RFC thus focuses on Windows.
232231

233232
We will generalize the encoding of `OsStr` to specify these two capabilities:
234233

@@ -262,13 +261,44 @@ representing the high surrogate by the first 3 bytes, and the low surrogate by t
262261
"\u{10000}"[2..] = 90 80 80
263262
```
264263
264+
The index splitting the surrogate pair will be positioned at the middle of the 4-byte sequence
265+
(index "2" in the above example).
266+
265267
Note that this means:
266268
267269
1. `x[..i]` and `x[i..]` will have overlapping parts. This makes `OsStr::split_at_mut` (if it exists)
268270
unable to split a surrogate pair in half. This also means `Pattern<&mut OsStr>` cannot be
269271
implemented for `&OsStr`.
270272
2. The length of `x[..n]` may be longer than `n`.
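
The overlap in item 1 and the length surprise in item 2 can be illustrated on raw bytes. The sketch below is not the proposed implementation: `is_mid_four_byte`, `slice_to` and `slice_from` are hypothetical helpers that operate on a plain `&[u8]` assumed to hold well-formed WTF-8, and they perform no validity checks beyond detecting the middle of a 4-byte sequence.

```rust
/// Hypothetical helper: is `i` the middle (byte 2) of a 4-byte sequence?
/// Assumes `bytes` is well-formed WTF-8, so any byte >= 0xf0 is a 4-byte lead.
fn is_mid_four_byte(bytes: &[u8], i: usize) -> bool {
    i >= 2 && i + 2 <= bytes.len() && bytes[i - 2] >= 0xf0
}

/// Sketch of `x[..i]`: keep one extra byte when `i` splits a surrogate pair.
fn slice_to(bytes: &[u8], i: usize) -> &[u8] {
    if is_mid_four_byte(bytes, i) { &bytes[..i + 1] } else { &bytes[..i] }
}

/// Sketch of `x[i..]`: start one byte earlier when `i` splits a surrogate pair.
fn slice_from(bytes: &[u8], i: usize) -> &[u8] {
    if is_mid_four_byte(bytes, i) { &bytes[i - 1..] } else { &bytes[i..] }
}
```

For `f0 90 80 80`, `slice_to(s, 2)` yields `f0 90 80` and `slice_from(s, 2)` yields `90 80 80`. The two halves share the bytes `90 80`, and their lengths sum to 6 rather than 4.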
271273
274+
### Platform-agnostic guarantees
275+
276+
If an index points to an invalid position (e.g. `"\u{1000}"[1..]` or `"\u{10000}"[1..]` or
277+
`"\u{10000}"[3..]`), a panic will be raised, similar to that of `str`. The following are guaranteed
278+
to be valid positions on all platforms:
279+
280+
* `0`.
281+
* `self.len()`.
282+
* The returned indices from `find()`, `rfind()`, `match_indices()` and `rmatch_indices()`.
283+
* The returned ranges from `find_range()`, `rfind_range()`, `match_ranges()` and `rmatch_ranges()`.
284+
285+
Index arithmetic is wrong for `OsStr`, i.e. `i + n` may not produce the correct index (see
286+
[Drawbacks](#drawbacks)).
287+
288+
For WTF-8 encoding on Windows, we define:
289+
290+
* boundary of a character or surrogate byte sequence is Valid.
291+
* middle (byte 2) of a 4-byte sequence is Valid.
292+
* interior of a 2- or 3-byte sequence is Invalid.
293+
* byte 1 or 3 of a 4-byte sequence is Invalid.
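
The four rules above can be sketched as a single predicate. This is only an illustration under stated assumptions: `is_valid_index` and `is_continuation` are hypothetical names, the input is a plain byte slice assumed to contain well-formed WTF-8, and an out-of-range index is simply reported as not valid.

```rust
/// True for UTF-8/WTF-8 continuation bytes (bit pattern 10xxxxxx).
fn is_continuation(b: u8) -> bool {
    b & 0xc0 == 0x80
}

/// Hypothetical check of whether `i` is a Valid index into `bytes`.
/// Assumes `bytes` is well-formed WTF-8.
fn is_valid_index(bytes: &[u8], i: usize) -> bool {
    if i == 0 || i == bytes.len() {
        return true; // both ends are always Valid
    }
    if i > bytes.len() {
        return false;
    }
    if !is_continuation(bytes[i]) {
        return true; // boundary of a character or surrogate byte sequence
    }
    // Inside a multi-byte sequence: only byte 2 of a 4-byte sequence is Valid,
    // which is the case exactly when the byte two positions back is a 4-byte lead.
    i >= 2 && bytes[i - 2] >= 0xf0
}
```

With this predicate, indices 0, 2 and 4 of `"\u{10000}"` are Valid while 1 and 3 are Invalid, and both interior indices of the 3-byte `"\u{1000}"` are Invalid.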
294+
295+
Outside of Windows, where the `OsStr` consists of arbitrary bytes, every number within
296+
`0 ..= self.len()` is considered a valid index. This is because we want to allow
297+
`os_str.find(OsStr::from_bytes(b"\xff"))`, and thus cannot use UTF-8 to reason about a Unix `OsStr`.
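
To see why UTF-8 boundaries cannot be assumed here, consider a plain byte-level search. The sketch below uses a hypothetical `find_bytes` function as a stand-in for the proposed `find()` on a Unix `OsStr`; it is a naive scan over a `&[u8]`, not the matcher described in this RFC.

```rust
/// Hypothetical byte-level search: index of the first occurrence of `needle`.
/// Every index in `0 ..= haystack.len()` is a legal result position.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0); // `windows(0)` would panic, so handle this case first
    }
    haystack
        .windows(needle.len())
        .position(|window| window == needle)
}
```

Searching `b"ab\xffcd"` for `b"\xff"` returns index 2, even though the haystack is not valid UTF-8 and no character boundary exists in the UTF-8 sense.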
298+
299+
Note that we have never guaranteed the actual `OsStr` encoding; these should only be considered an
300+
implementation detail.
301+
272302
## Comparison and storage
273303
274304
All `OsStr` strings with sliced 4-byte sequence can be converted back to proper WTF-8 with an O(1)
@@ -284,7 +314,9 @@ We call this transformation “*canonicalization*”.
284314
All owned `OsStr` should be canonicalized to contain well-formed WTF-8 only: `Box<OsStr>`,
285315
`Rc<OsStr>`, `Arc<OsStr>` and `OsString`.
286316
287-
Two `OsStr` are compared equal if they have the same canonicalization.
317+
Two `OsStr` are compared equal if they have the same canonicalization. This may add a small constant
318+
overhead to each comparison, since the first and last three bytes of each string need extra
319+
checks.
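
Comparison by canonical form can be sketched on raw bytes as follows. This is an assumption-laden illustration, not the RFC's definition: `canonicalize`, `push_surrogate` and `os_eq` are hypothetical names, the input is assumed to be WTF-8 that is well-formed apart from a sliced half at either end, and the bit manipulation is derived from the standard UTF-8 and UTF-16 layouts.

```rust
fn is_cont(b: u8) -> bool {
    b & 0xc0 == 0x80
}

/// Append the 3-byte WTF-8 encoding of surrogate `s` (0xD800..=0xDFFF).
fn push_surrogate(out: &mut Vec<u8>, s: u32) {
    out.extend_from_slice(&[0xed, 0x80 | ((s >> 6) & 0x3f) as u8, 0x80 | (s & 0x3f) as u8]);
}

/// Hypothetical canonicalization: only the first and last three bytes are inspected.
fn canonicalize(bytes: &[u8]) -> Vec<u8> {
    let n = bytes.len();
    // Leading low half: the last three bytes of a sliced 4-byte sequence.
    let has_low = n >= 3 && bytes[..3].iter().all(|&b| is_cont(b));
    // Trailing high half: the first three bytes of a sliced 4-byte sequence.
    let has_high = n >= 3 && bytes[n - 3] >= 0xf0 && is_cont(bytes[n - 2]) && is_cont(bytes[n - 1]);
    let (start, end) = (if has_low { 3 } else { 0 }, if has_high { n - 3 } else { n });
    let mut out = Vec::with_capacity(n);
    if has_low {
        // The low surrogate carries the low 10 bits of the code point.
        let low = 0xdc00 + (((bytes[1] as u32 & 0x0f) << 6) | (bytes[2] as u32 & 0x3f));
        push_surrogate(&mut out, low);
    }
    out.extend_from_slice(&bytes[start..end]);
    if has_high {
        // The high surrogate carries the remaining upper bits of the code point.
        let (b0, b1, b2) = (bytes[n - 3] as u32, bytes[n - 2] as u32, bytes[n - 1] as u32);
        let upper = ((b0 & 0x07) << 18) | ((b1 & 0x3f) << 12) | ((b2 & 0x3f) << 6);
        push_surrogate(&mut out, 0xd800 + ((upper - 0x10000) >> 10));
    }
    out
}

/// Two byte strings are equal when their canonical forms match.
fn os_eq(a: &[u8], b: &[u8]) -> bool {
    canonicalize(a) == canonicalize(b)
}
```

Under these assumptions, the sliced half `f0 90 80` canonicalizes to `ed a0 80` (U+D800) and `90 80 80` to `ed b0 80` (U+DC00), so each compares equal to the corresponding unpaired surrogate.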
288320
289321
## Matching
290322
@@ -423,7 +455,9 @@ match self.matcher.next_match() {
423455
# Rationale and alternatives
424456
[alternatives]: #alternatives
425457

426-
This is the only design which allows borrowing a sub-slice of a surrogate code point from a
458+
## Indivisible surrogate pair
459+
460+
This RFC is the only design which allows borrowing a sub-slice of a surrogate code point from a
427461
surrogate pair.
428462

429463
An alternative is to keep using the vanilla WTF-8, and treat a surrogate pair as an atomic entity:
@@ -446,7 +480,48 @@ There are two potential implementations when we want to match with an unpaired s
446480
Note that, for consistency, we need to make `"\u{10000}".starts_with("\u{d800}")` return `false` or
447481
panic.
448482

483+
## Slicing at real byte offset
484+
485+
The current RFC defines the index that splits a surrogate pair into half at byte 2 of the 4-byte
486+
sequence. This has the drawback that `"\u{10000}"[..2].len() == 3`, and it causes index arithmetic to be
487+
wrong.
488+
489+
```
490+
"\u{10000}" = f0 90 80 80
491+
"\u{10000}"[..2] = f0 90 80
492+
"\u{10000}"[2..] = 90 80 80
493+
```
494+
495+
The main advantage of this scheme is that we can use the same number as both the start and end index.
496+
497+
```rust
498+
let s = OsStr::new("\u{10000}");
499+
assert_eq!(s.len(), 4);
500+
let index = s.find('\u{dc00}').unwrap();
501+
let right = &s[index..]; // [90 80 80]
502+
let left = &s[..index]; // [f0 90 80]
503+
```
504+
505+
An alternative is to make the index refer to the real byte offsets:
506+
507+
```
508+
"\u{10000}" = f0 90 80 80
509+
"\u{10000}"[..3] = f0 90 80
510+
"\u{10000}"[1..] = 90 80 80
511+
```
512+
513+
However, the question would then be: what should `s[..1]` do?
514+
515+
* **Panic** — But this means we cannot get `left`. We could inspect the raw bytes of `s` itself and
516+
perform `&s[..(index + 2)]`, but we never explicitly exposed the encoding of `OsStr`, so we
517+
cannot read a single byte, which makes this impossible.
518+
519+
* **Treat as same as `s[..3]`** — But then this inherits all the disadvantages of using 2 as valid
520+
index, plus we need to consider whether `s[1..3]` and `s[3..1]` should be valid.
521+
522+
Given these, we decided not to treat the real byte offsets as valid indices.
523+
449524
# Unresolved questions
450525
[unresolved]: #unresolved-questions
451526

452-
None yet.
527+
None yet.
