common: add partial regex support#12808
Conversation
ngxson
left a comment
There was a problem hiding this comment.
I don't yet have time to look into details, but the code gives a good spirit 👍
Seems like this can be an opportunity to development some kind of fuzzing tool
| auto it = pattern.begin(); | ||
| const auto end = pattern.end(); | ||
|
|
||
| std::function<std::string()> process = [&]() { |
There was a problem hiding this comment.
Btw I think we should prevent using lambda function when possible (especially long functions), the bad thing is that it can increase compilation time by a lot.
A cleaner solution could be do have a struct reversed_partial_regex_builder for it, move lambda into struct member, and then having this regex_to_reversed_partial_regex construct a new instance of the struct each time you use it
common/common.h
Outdated
| } | ||
| // While we wait for C++20's std::string::ends_with... | ||
| bool string_ends_with(const std::string & str, const std::string & suffix); | ||
| size_t string_find_partial_stop(const std::string &str, const std::string &stop); |
There was a problem hiding this comment.
| size_t string_find_partial_stop(const std::string &str, const std::string &stop); | |
| size_t string_find_partial_stop(const std::string & str, const std::string & stop); |
common/regex-partial.cpp
Outdated
| while (it != end) { | ||
| if (*it == '\\' && (++it != end)) { | ||
| ++it; | ||
| } else if (*it == ']') { |
There was a problem hiding this comment.
If ++it == end in the previous branch, here we will dereference the end() iterator which is not allowed.
| } else if (*it == '\\' && (++it != end)) { | ||
| auto str = std::string("\\") + *it; | ||
| sequence->push_back(str); | ||
| ++it; | ||
| } else { | ||
| sequence->push_back(std::string(1, *it)); | ||
| ++it; | ||
| } | ||
| } |
There was a problem hiding this comment.
Similar problem with potentially dereferencing end()
| if (stop[char_index] == text_last_char) { | ||
| const std::string current_partial = stop.substr(0, char_index + 1); | ||
| if (string_ends_with(str, current_partial)) { | ||
| return str.size() - char_index - 1; | ||
| } | ||
| } |
There was a problem hiding this comment.
I think these sub-string copies can be avoided via a std::string_view overload of string_ends_with.
common/regex-partial.h
Outdated
|
|
||
| #include <regex> | ||
| #include <string> | ||
| #include "ggml.h" |
There was a problem hiding this comment.
Instead of including ggml.h, replace the GGML_ASSERT with throw.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This is a crude replacement for boost::match_partial, allowing to do partial find / match of regexes (if partial match is at the end of the tested string).
This was extracted from #12379 (streaming tool call support), in which it's used for partial parsers.
Rather than reimplementing an entire regex engine, I decided to take the lazy route of turning any pattern to a reverse pattern that matches from the string end, using the normal regex logic for the actual matching. The new
common_regexmay return full or partial matches. Full matches have their normal captures, while partial matches only have one capture group (the partial match).For instance, given the pattern
/abc/, the reverse pattern is/((?:(?:c)?b)?a)[\s\S]*/so it can match"fooo ab"(and capture"ab"; empty capture means no match at all).