Add support for tagged NFA; Use uint32_t to replace int for IDs. #42
LinZhihao-723 merged 146 commits into y-scope:main
… tag to matching_variable_id; Use full names for vars (r->rule); Clarify if states are NFA or DFA
…for clarity that nothing is shared b/w tests
…egexASTgroup with min = 1 OR'd with RegexASTEmpty
…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()
…(); Add docstring to RegexDFAStatePair
…NFA; Made add to nfa functions const
LinZhihao-723
left a comment
For PR title, how about:
Add support for tagged NFA; Use `uint32_t` to replace `int` for IDs.
Sorry mb, let's fix the latest comments from coderabbit
…dcoding to use bytes; Take state type instead of transition type as template for transition classes.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
src/log_surgeon/finite_automata/RegexNFA.hpp (2)
47-54: Consider adding `@throws` documentation for serialize methods.
The `serialize` methods' documentation should indicate whether they can throw exceptions (e.g., from `fmt::format`). This helps users handle potential exceptions appropriately.
Also applies to: 71-78
Line range hint 177-361: Document thread safety guarantees.
The class handles complex state management but lacks documentation about thread safety. Consider adding thread safety documentation to help users properly synchronize access in multi-threaded contexts.
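As an illustration of the `@throws` nitpick above, here is a minimal sketch of a documented serialize-style helper. `DocState` and `serialize_id` are hypothetical names invented for this example, not part of log-surgeon; the only behaviour relied on is that `std::unordered_map::at` throws `std::out_of_range` for a missing key, which is the same lookup pattern as `state_ids.at(this)` in the reviewed code.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical placeholder type; stands in for an NFA state.
struct DocState {};

/**
 * Serializes a state as its numeric id.
 * @param state The state to serialize.
 * @param state_ids Map from state pointers to their ids.
 * @return The id of `state` as a string.
 * @throws std::out_of_range if `state` is not a key of `state_ids`.
 */
std::string serialize_id(
        DocState const* state,
        std::unordered_map<DocState const*, std::uint32_t> const& state_ids
) {
    // `at` throws std::out_of_range when the key is missing.
    return std::to_string(state_ids.at(state));
}
```

Documenting the exception this way lets a caller decide whether to pre-populate the id map or to catch `std::out_of_range` at the call site.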
std::vector<PositiveTaggedTransition<RegexNFAState>> m_positive_tagged_transitions;
std::vector<NegativeTaggedTransition<RegexNFAState>> m_negative_tagged_transitions;
🛠️ Refactor suggestion
Consider reserving capacity for transition vectors.
The vectors m_positive_tagged_transitions and m_negative_tagged_transitions might benefit from capacity reservation if you have an estimate of the typical number of transitions.
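A minimal sketch of what the reservation suggestion could look like, assuming the caller has some estimate of the transition count. `TaggedTransition` and `build_transitions` are hypothetical stand-ins for illustration only; they are not the `PositiveTaggedTransition` or `NegativeTaggedTransition` classes from this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a tagged transition; not the log-surgeon type.
struct TaggedTransition {
    std::uint32_t tag_id;
    std::uint32_t dest_state_id;
};

// Builds a transition list of `expected_count` entries.
// `expected_count` is an assumed, caller-supplied estimate.
std::vector<TaggedTransition> build_transitions(std::size_t expected_count) {
    std::vector<TaggedTransition> transitions;
    // One up-front allocation instead of repeated growth during push_back.
    transitions.reserve(expected_count);
    for (std::size_t i = 0; i < expected_count; ++i) {
        transitions.push_back(
                {static_cast<std::uint32_t>(i), static_cast<std::uint32_t>(i + 1)}
        );
    }
    return transitions;
}
```

Whether this pays off depends on how many tagged transitions a state typically holds; for states with only one or two, the default growth strategy is likely good enough.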
return fmt::format(
        "{}:{}byte_transitions={{{}}},epsilon_transitions={{{}}},positive_tagged_transitions={{"
        "{}}},negative_tagged_transitions={{{}}}",
        state_ids.at(this),
        accepting_tag_string,
        fmt::join(byte_transitions, ","),
        fmt::join(epsilon_transitions, ","),
        fmt::join(positive_tagged_transitions, ","),
        fmt::join(negative_tagged_transitions, ",")
);
🛠️ Refactor suggestion
Consider using fmt::join with a custom formatter.
Instead of building multiple intermediate strings with fmt::join and then combining them, consider creating a custom formatter to handle the entire state serialization in one pass. This could improve performance by reducing memory allocations.
Example approach:
struct StateFormatter {
    template<typename ParseContext>
    constexpr auto parse(ParseContext& ctx) { return ctx.begin(); }

    template<typename FormatContext>
    auto format(const State& state, FormatContext& ctx) {
        // Format all transitions in one pass
        return fmt::format_to(ctx.out(), "{}:{}", ...);
    }
};
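The one-pass idea can also be sketched with the standard library alone. Note that this swaps in a different technique from the reviewer's fmt formatter: it appends directly to a single `std::string` buffer instead of joining each transition list into its own intermediate string. `SketchState`, `append_joined`, and `serialize` are hypothetical names, the state is reduced to two transition lists of destination ids, and the output format is a simplified version of the one in the PR.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified state: transitions are stored as destination ids only.
struct SketchState {
    std::uint32_t id;
    std::vector<std::uint32_t> byte_transitions;
    std::vector<std::uint32_t> epsilon_transitions;
};

// Appends the ids to `out` as "a,b,c" without building a separate joined string.
void append_joined(std::string& out, std::vector<std::uint32_t> const& ids) {
    bool first = true;
    for (auto const dest_id : ids) {
        if (false == first) {
            out += ',';
        }
        out += std::to_string(dest_id);
        first = false;
    }
}

// Serializes the whole state into one output buffer in a single pass.
std::string serialize(SketchState const& state) {
    std::string out;
    out += std::to_string(state.id);
    out += ":byte_transitions={";
    append_joined(out, state.byte_transitions);
    out += "},epsilon_transitions={";
    append_joined(out, state.epsilon_transitions);
    out += '}';
    return out;
}
```

Small temporaries from `std::to_string` remain, so any gain over the `fmt::join` version would need to be measured rather than assumed.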
References
Description
Tagged NFAs work as follows:
Changes to implement tagged-NFA:
- `add_ast()` updated to add tags to each rule's regex and to build the NFA with tags.
- `add_with_negative_tags()` implemented to add a rule to the NFA while considering negative tags. This requires two passes over the AST: positive tags are added first, then negative tags, because negative tags depend on knowing the positive tags of alternate paths in the AST.
- `add()` functions call `add_with_negative_tags()` so that the NFA recursively adds negative tags when traversing the AST.
- `add_negative_tagged_transition()` adds negative tags. Called at whichever AST node has negative tags.
- `add_positive_tagged_transitions()` adds a positive tag. Called for every capture group AST node.
Changes as a result of tagged-NFA:
- `generate_reverse()` commented out as it is currently unused (at least internally and in CLP) until it is fixed to work with tags.
Validation
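To make the two-pass scheme from the `add_with_negative_tags()` description more concrete, here is a minimal sketch of one plausible reading of it. Everything in it (`Node`, `NodeType`, `make_node`, `collect_positive_tags`, `assign_negative_tags`) is hypothetical and is not the log-surgeon AST or API. The first pass records which capture tags each subtree can set; the second pass gives each branch of an alternation the positive tags of its sibling branches as negative tags.

```cpp
#include <cstdint>
#include <memory>
#include <set>
#include <vector>

enum class NodeType { Literal, Capture, Or, Cat };

// Hypothetical AST node; real AST classes will differ.
struct Node {
    NodeType type{NodeType::Literal};
    std::uint32_t tag_id{0};                // meaningful for Capture nodes only
    std::vector<std::unique_ptr<Node>> children;
    std::set<std::uint32_t> positive_tags;  // filled by pass 1
    std::set<std::uint32_t> negative_tags;  // filled by pass 2
};

std::unique_ptr<Node> make_node(NodeType type, std::uint32_t tag_id = 0) {
    auto node = std::make_unique<Node>();
    node->type = type;
    node->tag_id = tag_id;
    return node;
}

// Pass 1: each node collects the capture tags found anywhere in its subtree.
void collect_positive_tags(Node& node) {
    if (NodeType::Capture == node.type) {
        node.positive_tags.insert(node.tag_id);
    }
    for (auto& child : node.children) {
        collect_positive_tags(*child);
        node.positive_tags.insert(child->positive_tags.begin(), child->positive_tags.end());
    }
}

// Pass 2: each alternation branch is marked with the positive tags of the
// other branches, i.e. the captures that are skipped if this branch is taken.
void assign_negative_tags(Node& node) {
    if (NodeType::Or == node.type) {
        for (auto& branch : node.children) {
            for (auto const& sibling : node.children) {
                if (branch.get() == sibling.get()) {
                    continue;
                }
                branch->negative_tags.insert(
                        sibling->positive_tags.begin(),
                        sibling->positive_tags.end()
                );
            }
        }
    }
    for (auto& child : node.children) {
        assign_negative_tags(*child);
    }
}
```

The second pass cannot be folded into the first because, when a branch is visited, the positive tags of the branches after it have not been collected yet. That ordering constraint is the reason this sketch, like the description, uses two traversals.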
Summary by CodeRabbit
New Features
- `LexicalRule` class to manage lexical rules in finite automata.
Bug Fixes