Optimize Fw::StringUtils::string_length with SWAR algorithm #4789
vsoulgard wants to merge 3 commits into nasa:devel
Conversation
Pull request overview
This PR updates Fw::StringUtils::string_length to use a SWAR (word-at-a-time) NUL detection approach to reduce iterations and improve performance for longer strings, and adds unit tests intended to cover additional edge cases.
Changes:
- Replaced the byte-by-byte string_length with an aligned, word-at-a-time SWAR implementation (with a short-string fast path).
- Introduced a NO_ASAN macro and applied it to string_length.
- Added new unit tests for empty strings, various lengths, large strings, and unaligned pointers.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| Fw/Types/StringUtils.cpp | Implements the SWAR-based string_length and applies NO_ASAN. |
| Fw/Types/Assert.hpp | Adds the NO_ASAN macro definition (compiler-attribute based). |
| Fw/Types/test/ut/TypesTest.cpp | Adds new unit tests for string_length edge cases. |
```cpp
NO_ASAN
FwSizeType Fw::StringUtils::string_length(const CHAR* source, FwSizeType buffer_size) {
    FwSizeType length = 0;
    FW_ASSERT(source != nullptr);
    for (length = 0; length < buffer_size; length++) {
    ...
    // Fast path for short strings - avoids alignment overhead
    if (buffer_size <= sizeof(std::size_t))
    {
```
string_length is annotated with NO_ASAN, which will suppress AddressSanitizer reports for this function. Since this implementation bounds all reads to buffer_size, ASan findings here are likely to indicate a real caller bug (e.g., passing an incorrect buffer_size) rather than a false positive; disabling ASan could hide those issues. Consider removing NO_ASAN or limiting it to a narrowly-scoped build/config where intentional over-reads actually occur (and document why).
That is a correct statement, but as far as I know, even if the passed buffer_size is incorrect the over-read is safe, since aligned reads do not cross memory page boundaries.
The reason I added the NO_ASAN macro is that an incorrect buffer_size passed to Fw::StringUtils::string_copy causes an ASan finding when running unit tests:
```cpp
char* Fw::StringUtils::string_copy(char* destination, const char* source, FwSizeType num) {
    // Handle self-copy and 0 bytes copy
    if (destination == source || num == 0) {
        return destination;
    }
    FW_ASSERT(source != nullptr);
    FW_ASSERT(destination != nullptr);
    // Copying an overlapping range is undefined
    FwSizeType source_len = string_length(source, num) + 1;  // <-- HERE
    FW_ASSERT(source + source_len <= destination || destination + num <= source);
    char* returned = strncpy(destination, source, static_cast<size_t>(num));
    destination[num - 1] = '\0';
    return returned;
}
```

The destination length (num) is passed as the source length into Fw::StringUtils::string_length, which causes an ASan finding in cases where the destination length (num) is greater than the length of the source buffer.
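To make the caller bug concrete, here is a minimal, self-contained illustration (not F Prime code; `bounded_length` and `demo` are hypothetical helpers) of why bounding the scan of `source` by the destination capacity `num` can over-read:

```cpp
#include <cassert>
#include <cstddef>

// A bounds-to-limit scan in the style of string_length:
// reads at most `limit` bytes of p.
static std::size_t bounded_length(const char* p, std::size_t limit) {
    std::size_t i = 0;
    while (i < limit && p[i] != '\0') {
        ++i;
    }
    return i;
}

static int demo() {
    char source[4] = "abc";  // source buffer holds exactly 4 bytes
    char dest[16];           // destination buffer is larger
    (void)dest;
    // Correct: bound the scan by the SOURCE capacity -> stays in bounds.
    std::size_t ok = bounded_length(source, sizeof(source));  // 3
    // Buggy pattern from string_copy: bound by num = sizeof(dest) = 16.
    // Here the scan would stop at source[3] only because the string is
    // NUL-terminated; with num > sizeof(source) the permitted read range
    // extends past `source`, which is what ASan reports in the unit tests.
    return static_cast<int>(ok);
}
```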
It seems to me the sanitizer is finding a bug in the unit test code here, and we should fix the unit test, rather than disabling the sanitizer. If I understand what is going on, num should never be greater than the minimum of the sizes of the buffers pointed to by destination and source. The unit test code seems to violate this assumption. In general, if we allow this assumption to be violated by disabling the sanitizer, then we are opening the door to a buffer overrun.
In general, disabling the sanitizer for an entire function is something we should avoid -- especially for a function like this that does low-level memory manipulation.
For example, what if someone calls this function with an actual buffer size of 100 for destination and 1000 for source, and provides 1000 as num? We definitely don't want to disable sanitation in that case. I agree with Copilot here:
Since this implementation bounds all reads to buffer_size, ASan findings here are likely to indicate a real caller bug (e.g., passing an incorrect buffer_size) rather than a false positive; disabling ASan could hide those issues.
Thank you for the review! You are right, ASan only triggers if we pass buffer_length greater than the actual buffer size. But that happens not only in Fw::StringUtils::string_copy, but also in StringBase::operator+=:
(fprime/Fw/Types/StringBase.cpp, lines 25 to 28 in ffa5629)
It worked well with the original byte-by-byte comparison, but the SWAR implementation causes an over-read here (optimized strnlen implementations behave similarly). To avoid breaking things, I suggest sticking with the original version and adding the SWAR implementation as a second, high-performance (for long strings) option for cases where the buffer length is known in advance, if you'd be interested in that approach.
Like:

```cpp
FwSizeType Fw::StringUtils::string_length_swar(const CHAR* source, FwSizeType max_buffer_size);
```
OK, thanks. It seems that the SWAR approach is pushing on an issue that has always been in the framework. In the case of the operator+= code cited above, the operation is not safe against a buffer overrun in all cases (even without SWAR), because (1) no buffer size is provided for src, and (2) there is no guarantee that the string pointed to by src is null terminated.
The F Prime team will need to decide how to proceed here.
Change Description
This PR implements the performance optimization proposed in #4788. It replaces the naive byte-by-byte string length calculation with a SWAR (SIMD Within A Register) approach that uses word-sized loads and bitwise NUL detection.
Details:
Rationale
SWAR reduces the number of loop iterations and memory load instructions, thus improving performance. This technique is widely used in high-performance standard libraries.
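The bitwise NUL detection at the heart of SWAR is the classic zero-byte test. As a hedged sketch (the name `has_zero_byte` is illustrative, not part of the PR), the trick works on one machine word at a time:

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kOnes  = 0x0101010101010101ULL;
constexpr std::uint64_t kHighs = 0x8080808080808080ULL;

// True iff some byte of v is 0x00.
// v - kOnes sets a byte's high bit when that byte was 0x00 (borrow);
// masking with ~v discards bytes whose high bit was already set in v,
// so the result is nonzero exactly when v contains a zero byte.
constexpr bool has_zero_byte(std::uint64_t v) {
    return ((v - kOnes) & ~v & kHighs) != 0;
}
```

One such test replaces eight byte comparisons, which is where the reduction in loop iterations and load instructions comes from.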
Benchmarks:
(Tested on Linux x86_64 with GCC 14.2.0)
Benchmark Results (aligned)
The new version is slightly slower for very short strings, but much faster for long ones (old vs. new):

| N | Old (ns) | New (ns) |
|---|---|---|
| 2 | 1.28 | 2.01 |
| 4 | 1.76 | 2.27 |
| 8 | 3.06 | 3.01 |
| 16 | 6.05 | 3.43 |
| 32 | 19.30 | 4.31 |
| 64 | 27.97 | 6.37 |
| 128 | 53.45 | 10.36 (x5 faster) |
| 256 | 100.27 | 22.29 |
| 512 | 191.22 | 42.47 |
Benchmark Results (unaligned)
Unaligned calculations are slower than aligned ones because of the byte-by-byte head loop, but still faster than the old version (old vs. new):

| N | Old (ns) | New (ns) |
|---|---|---|
| 2 | 1.33 | 1.91 |
| 4 | 2.03 | 2.26 |
| 8 | 3.55 | 3.02 |
| 16 | 6.65 | 5.18 |
| 32 | 17.12 | 6.10 |
| 64 | 30.34 | 8.44 |
| 128 | 56.18 | 12.60 |
| 256 | 103.61 | 24.40 |
| 512 | 199.45 | 44.11 |
Testing/Review Recommendations
Benchmark Source
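The actual benchmark source is not included in this excerpt. As a rough sketch of how per-call timings like those above can be collected (the function name `bench_strlen`, the use of `std::strlen` as the measured routine, and the iteration count are illustrative assumptions, not the PR's harness):

```cpp
#include <chrono>
#include <cstring>
#include <string>

// Returns the average nanoseconds per call of std::strlen
// on a NUL-terminated string of length n.
static double bench_strlen(std::size_t n, std::size_t iters = 200000) {
    std::string s(n, 'a');           // n non-NUL bytes plus a terminator
    volatile std::size_t sink = 0;   // keep the calls from being optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) {
        sink = sink + std::strlen(s.c_str());
    }
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count()
           / static_cast<double>(iters);
}
```

A real harness would additionally pin the CPU frequency, warm up caches, and repeat each measurement to reduce noise.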
AI Usage
AI was used for research and testing of ideas. All code implementation, algorithm design, benchmarking, and technical decisions were done by me.