-
-
Notifications
You must be signed in to change notification settings - Fork 14.2k
perf: remove loop from str::floor_char_boundary
#149466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -15,6 +15,7 @@ mod validations; | |
|
|
||
| use self::pattern::{DoubleEndedSearcher, Pattern, ReverseSearcher, Searcher}; | ||
| use crate::char::{self, EscapeDebugExtArgs}; | ||
| use crate::hint::assert_unchecked; | ||
| use crate::ops::Range; | ||
| use crate::slice::{self, SliceIndex}; | ||
| use crate::ub_checks::assert_unsafe_precondition; | ||
|
|
@@ -409,21 +410,44 @@ impl str { | |
| #[inline] | ||
| pub const fn floor_char_boundary(&self, index: usize) -> usize { | ||
| if index >= self.len() { | ||
| self.len() | ||
| return self.len(); | ||
| } | ||
|
|
||
| let bytes = self.as_bytes(); | ||
| let boundary_index = if bytes[index].is_utf8_char_boundary() { | ||
| index | ||
| } else { | ||
| let mut i = index; | ||
| while i > 0 { | ||
| if self.as_bytes()[i].is_utf8_char_boundary() { | ||
| break; | ||
| // SAFETY: `bytes[index]` is a UTF-8 continuation byte, therefore there must be a byte before it. | ||
| // Note: `index.unchecked_sub(1)` would be preferable, but it doesn't the remove bounds check | ||
| // from `bytes[previous_index]`. | ||
| let previous_index = unsafe { index.checked_sub(1).unwrap_unchecked() }; | ||
| if bytes[previous_index].is_utf8_char_boundary() { | ||
| previous_index | ||
| } else { | ||
| // SAFETY: `bytes[index - 1]` is a UTF-8 continuation byte, therefore there must be a byte before it | ||
| let previous_previous_index = unsafe { index.checked_sub(2).unwrap_unchecked() }; | ||
| if bytes[previous_previous_index].is_utf8_char_boundary() { | ||
| previous_previous_index | ||
| } else { | ||
| // UTF-8 character sequences are at most 4 bytes long. | ||
| // `bytes[index - 2]`, `bytes[index - 1]`, and `bytes[index]` are all continuation bytes, | ||
| // therefore `bytes[index - 3]` must be the 1st byte in a 4-byte UTF-8 character sequence. | ||
| debug_assert!(bytes[index - 3].is_utf8_char_boundary()); | ||
| index - 3 | ||
| } | ||
| i -= 1; | ||
| } | ||
| }; | ||
|
Comment on lines
+416
to
+439
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The deep nesting isn't ideal. You could try matching slice patterns: let bytes = &self.as_bytes()[..=index];
let boundary_index = match bytes {
[.., b] if b.is_utf8_char_boundary() => index,
[.., b, _] if b.is_utf8_char_boundary() => index - 1,
[.., b, _, _] if b.is_utf8_char_boundary() => index - 2,
[.., b, _, _, _] if b.is_utf8_char_boundary() => index - 3,
// SAFETY: TODO
_ => unsafe {
debug_assert!(bytes[index.saturating_sub(3)].is_utf8_char_boundary());
unreachable_unchecked()
}
};
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is much nicer. I'll check if it optimizes equally as well.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's definitely more elegant, but it's definitely non-obvious to me that it could optimize well, since each arm has bounds check requirements that cannot be deduced from the previous.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. However, a failed bounds check implies the following ones also fail. That means that any bounds check that fails leads to the fallback arm, which is
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ...oh, no, I was wrong. The (The scratch pad I was using: https://godbolt.org/z/46Msq6G85) |
||
|
|
||
| // The character boundary will be within four bytes of the index | ||
| debug_assert!(i >= index.saturating_sub(3)); | ||
|
|
||
| i | ||
| // Inform compiler that returned index is `<= index` and on a char boundary. | ||
| // This removes bounds check from e.g. `String::truncate` call after this. | ||
| // SAFETY: Calculations above only deduct from `index`, and cannot wrap around. | ||
| // `boundary_index` is a char boundary. | ||
| unsafe { | ||
| assert_unchecked(boundary_index <= index); | ||
| assert_unchecked(bytes[boundary_index].is_utf8_char_boundary()); | ||
| } | ||
|
|
||
| boundary_index | ||
| } | ||
|
|
||
| /// Finds the closest `x` not below `index` where [`is_char_boundary(x)`] is `true`. | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not obvious to me that it's better to do an
unwrap_uncheckedon the subtraction than just to do aget_uncheckedon the array, if the goal is to remove the bounds checking.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, this https://godbolt.org/z/d155x5hqj seems to work too:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember right, I tried that first, but
get_uncheckedis not const.