Add support for tagged NFA; Use `uint32_t` to replace `int` for IDs. by SharafMohamed · Pull Request #42 · y-scope/log-surgeon

SharafMohamed · 2024-09-16T18:46:09Z

References

Depends on PR#41

Description

A tagged-NFA consists of the following:

A set of states, each of which either matches nothing or matches a variable ID (same as untagged-NFA).
A start state (same as untagged-NFA).
An outgoing transition from each state for every possible input character (same as untagged-NFA).
Outgoing epsilon-transitions that are always taken when in the state (same as untagged-NFA).
Positive-transitions, each corresponding to a single capture group. A positive-transition is treated as an epsilon-transition, except that taking it indicates the corresponding capture group has been matched (unique to tagged-NFAs).
Negative-transitions, each corresponding to several capture groups. A negative-transition is treated as an epsilon-transition, except that taking it indicates the corresponding capture groups are guaranteed to be unmatched (unique to tagged-NFAs).
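
The structure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the log-surgeon implementation: `State`, `TagOutcome`, and `run` are hypothetical names, and the search assumes there are no cycles of non-consuming (epsilon or tagged) transitions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <set>
#include <string_view>
#include <vector>

// Hypothetical, simplified types for illustration only.
struct State;

struct PositiveTaggedTransition {
    uint32_t tag;       // the single capture group marked as matched
    State const* dest;
};

struct NegativeTaggedTransition {
    std::vector<uint32_t> tags;  // capture groups guaranteed to be unmatched
    State const* dest;
};

struct State {
    std::optional<uint32_t> matching_variable_id;  // empty: the state matches nothing
    std::array<std::vector<State const*>, 256> byte_transitions;
    std::vector<State const*> epsilon_transitions;
    std::vector<PositiveTaggedTransition> positive_tagged_transitions;
    std::vector<NegativeTaggedTransition> negative_tagged_transitions;
};

struct TagOutcome {
    std::set<uint32_t> matched;
    std::set<uint32_t> unmatched;
};

// Depth-first search for one accepting run over `input`, recording which
// tags were matched or unmatched along the way. Tagged transitions consume
// no input, exactly like epsilon transitions.
std::optional<TagOutcome>
run(State const* state, std::string_view input, std::size_t pos, TagOutcome outcome) {
    assert(nullptr != state);
    if (pos == input.size() && state->matching_variable_id.has_value()) {
        return outcome;
    }
    for (auto const* dest : state->epsilon_transitions) {
        if (auto result = run(dest, input, pos, outcome)) { return result; }
    }
    for (auto const& transition : state->positive_tagged_transitions) {
        auto next = outcome;
        next.matched.insert(transition.tag);
        if (auto result = run(transition.dest, input, pos, next)) { return result; }
    }
    for (auto const& transition : state->negative_tagged_transitions) {
        auto next = outcome;
        next.unmatched.insert(transition.tags.begin(), transition.tags.end());
        if (auto result = run(transition.dest, input, pos, next)) { return result; }
    }
    if (pos < input.size()) {
        auto const byte = static_cast<unsigned char>(input[pos]);
        for (auto const* dest : state->byte_transitions[byte]) {
            if (auto result = run(dest, input, pos + 1, outcome)) { return result; }
        }
    }
    return std::nullopt;
}
```

For a hand-built NFA of `(a)|(b)` with tags 0 and 1, running this on the input `a` would report tag 0 as matched and tag 1 as unmatched.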

Changes to implement tagged-NFA:

Lexer no longer ignores capture group rules in the AST when building the NFA.
add_ast() updated to add tags to each rule's regex and to build the NFA with tags.
add_with_negative_tags() implemented to add a rule to the NFA while considering negative tags. This requires two passes over the AST: positive tags are added first, then negative tags, because negative tags depend on knowing the positive tags of alternate paths in the AST.
All AST node add() functions call add_with_negative_tags() such that the NFA recursively adds negative tags when traversing the AST.
add_negative_tagged_transition() adds negative tags. Called at whichever AST node has negative tags.
add_positive_tagged_transitions() adds a positive tag. Called for every capture group AST node.
Add structs to represent negative and positive tagged transitions.
Add negative and positive tagged transitions to the NFA, with setters and getters.
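
The two-pass tagging idea can be illustrated with a small sketch. The `Node` type and both functions below are hypothetical and are not the PR's AST classes. The sketch also assumes that "alternate paths" means the sibling branches of an OR node, which is one reading of the description rather than something the PR text confirms.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <set>
#include <utility>
#include <vector>

// Hypothetical AST node for illustration only.
struct Node {
    enum class Kind { Literal, Capture, Or, Cat };

    Kind kind{Kind::Literal};
    uint32_t tag{0};  // meaningful only when kind == Kind::Capture
    std::vector<std::unique_ptr<Node>> children;
    std::set<uint32_t> positive_tags;  // capture groups inside this subtree
    std::set<uint32_t> negative_tags;  // capture groups unmatched if this subtree is taken
};

std::unique_ptr<Node> make_node(Node::Kind kind, uint32_t tag = 0) {
    auto node = std::make_unique<Node>();
    node->kind = kind;
    node->tag = tag;
    return node;
}

// Pass 1: every subtree learns which capture groups it contains.
void collect_positive_tags(Node& node) {
    if (Node::Kind::Capture == node.kind) {
        node.positive_tags.insert(node.tag);
    }
    for (auto& child : node.children) {
        assert(nullptr != child);
        collect_positive_tags(*child);
        node.positive_tags.insert(child->positive_tags.begin(), child->positive_tags.end());
    }
}

// Pass 2: each branch of an OR receives the positive tags of its sibling
// branches as negative tags, since taking one branch rules out the others.
void assign_negative_tags(Node& node) {
    if (Node::Kind::Or == node.kind) {
        for (auto& branch : node.children) {
            for (auto const& sibling : node.children) {
                if (branch.get() == sibling.get()) { continue; }
                branch->negative_tags.insert(
                        sibling->positive_tags.begin(),
                        sibling->positive_tags.end()
                );
            }
        }
    }
    for (auto& child : node.children) {
        assign_negative_tags(*child);
    }
}
```

The second pass cannot run before the first finishes, because a branch's negative tags are read from its siblings' completed positive-tag sets.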

Changes as a result of tagged-NFA:

DFA currently treats tagged transitions as epsilon transitions (ignores the capture aspect of capture groups).
generate_reverse() commented out as it is currently unused (at least internally and in CLP) until it is fixed to work with tags.
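
As a rough illustration of what "treats tagged transitions as epsilon transitions" could mean during DFA construction, the sketch below computes an epsilon closure that follows all three kinds of non-consuming transitions and discards the tag information. The types and the `epsilon_closure` function are hypothetical stand-ins, not the code in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <stack>
#include <vector>

// Hypothetical, simplified types for illustration only.
struct State;

struct TaggedTransition {
    std::vector<uint32_t> tags;
    State const* dest;
};

struct State {
    std::vector<State const*> epsilon_transitions;
    std::vector<TaggedTransition> positive_tagged_transitions;
    std::vector<TaggedTransition> negative_tagged_transitions;
};

// Collects every state reachable from `start` without consuming input.
// Tagged transitions are followed like epsilon transitions; their tags are
// ignored, so the capture information is lost at this stage.
std::set<State const*> epsilon_closure(State const* start) {
    assert(nullptr != start);
    std::set<State const*> closure;
    std::stack<State const*> pending;
    pending.push(start);
    while (false == pending.empty()) {
        auto const* state = pending.top();
        pending.pop();
        if (false == closure.insert(state).second) {
            continue;  // already visited; also guards against cycles
        }
        for (auto const* dest : state->epsilon_transitions) {
            pending.push(dest);
        }
        for (auto const& transition : state->positive_tagged_transitions) {
            pending.push(transition.dest);
        }
        for (auto const& transition : state->negative_tagged_transitions) {
            pending.push(transition.dest);
        }
    }
    return closure;
}
```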

Validation

Add NFA unit-tests.

Summary by CodeRabbit

New Features
- Introduced a new method for handling negative transitions during NFA construction.
- Updated token ID handling to use consistent data types across various components.
- Enhanced state management capabilities with new tagged transition classes.
- Added a LexicalRule class to manage lexical rules in finite automata.
- Simplified NFA construction by collecting lexical rules in a vector.
- Introduced a comprehensive test suite for NFA implementation, validating various transition types and state properties.
Bug Fixes
- Improved error reporting for regex patterns, providing detailed messages for invalid patterns.

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c3fb16d and 24a0efd.

📒 Files selected for processing (1)

src/log_surgeon/finite_automata/RegexNFA.hpp (6 hunks)

LinZhihao-723

For PR title, how about:

Add support for tagged NFA; Use `uint32_t` to replace `int` for IDs.

LinZhihao-723 · 2024-10-30T19:25:07Z

Sorry mb, let's fix the latest comments from coderabbit

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)

src/log_surgeon/finite_automata/RegexNFA.hpp (2)

47-54: Consider adding @throws documentation for serialize methods.

The serialize methods' documentation should indicate whether they can throw exceptions (e.g., from fmt::format). This helps users handle potential exceptions appropriately.

Also applies to: 71-78

Line range hint 177-361: Document thread safety guarantees.

The class handles complex state management but lacks documentation about thread safety. Consider adding thread safety documentation to help users properly synchronize access in multi-threaded contexts.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 24a0efd and edbcdb4.

📒 Files selected for processing (1)

src/log_surgeon/finite_automata/RegexNFA.hpp (6 hunks)

coderabbitai · 2024-10-30T20:06:17Z

src/log_surgeon/finite_automata/RegexNFA.hpp

+    std::vector<PositiveTaggedTransition<RegexNFAState>> m_positive_tagged_transitions;
+    std::vector<NegativeTaggedTransition<RegexNFAState>> m_negative_tagged_transitions;


🛠️ Refactor suggestion

Consider reserving capacity for transition vectors.

The vectors m_positive_tagged_transitions and m_negative_tagged_transitions might benefit from capacity reservation if you have an estimate of the typical number of transitions.
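
A minimal sketch of the suggestion, assuming the number of transitions is known (or can be estimated) before they are added. `Transition` and `build_transitions` are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a tagged transition.
struct Transition {
    uint32_t tag;
};

// Reserving up front means the push_back calls below never reallocate.
std::vector<Transition> build_transitions(std::size_t expected_count) {
    std::vector<Transition> transitions;
    transitions.reserve(expected_count);
    assert(transitions.capacity() >= expected_count);
    for (std::size_t i = 0; i < expected_count; ++i) {
        transitions.push_back({static_cast<uint32_t>(i)});
    }
    return transitions;
}
```

Whether this pays off depends on how many tagged transitions a state typically holds; for states with zero or one transition the reservation is unlikely to matter.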

coderabbitai · 2024-10-30T20:06:17Z

src/log_surgeon/finite_automata/RegexNFA.hpp

+    return fmt::format(
+            "{}:{}byte_transitions={{{}}},epsilon_transitions={{{}}},positive_tagged_transitions={{"
+            "{}}},negative_tagged_transitions={{{}}}",
+            state_ids.at(this),
+            accepting_tag_string,
+            fmt::join(byte_transitions, ","),
+            fmt::join(epsilon_transitions, ","),
+            fmt::join(positive_tagged_transitions, ","),
+            fmt::join(negative_tagged_transitions, ",")
+    );


🛠️ Refactor suggestion

Consider using fmt::join with a custom formatter.

Instead of building multiple intermediate strings with fmt::join and then combining them, consider creating a custom formatter to handle the entire state serialization in one pass. This could improve performance by reducing memory allocations.

Example approach:

struct StateFormatter {
    template <typename ParseContext>
    constexpr auto parse(ParseContext& ctx) {
        return ctx.begin();
    }

    template <typename FormatContext>
    auto format(const State& state, FormatContext& ctx) {
        // Format all transitions in one pass
        return fmt::format_to(ctx.out(), "{}:{}", ...);
    }
};

SharafMohamed added 28 commits September 11, 2024 20:07

Bug-fix for unicode array sizes

a6274ec

Merge remote-tracking branch 'upstream/main' into nfa-cleanup-pr

186d239

Move LexicalRule to its own class; Change name to variable_id; Change…

4f122c6

… tag to matching_variable_id; Use full names for vars (r->rule); Clarify if states are NFA or DFA

Additional fix for swapping meaning of tag

c24f6e1

Another additional fix for swapping meaning of tag

33582da

Fix up some comments

3338ec7

Fix comment grammar

3cd3c0f

Add tags to AST; Serialize AST for testing; Add unit-test for testing…

e05acbb

… added tags

Use using to condense code; Use a unique schema object for each test …

5e61e83

…for clairty that nothing is shared b/w tests

Add has_capture_groups(); Add unit-test for has_capture_groups()

082090d

Create and use RegexASTEmpty to split RegexASTgroup with min=0 into R…

2c6d94e

…egexASTgroup with min = 1 OR'd with RegexASTEmpty

Add unit-test for 0 repetition regex

4e02f24

Add more tests for repetition regex

bb3c543

Return by value in literal getters; Use const instead of const& for l…

54027ad

…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()

Refactor new_state()

e58274f

Rename get_first_matching_variable_ids() to get_matching_variable_ids…

1321871

…(); Add docstrign to RegexDFAStatePair

Remove redundant docstrings

c904755

Remove has_capture_groups()

ffe9a0f

Const and auto changes

913ed1a

Add tagged-nfa

795add3

Clarify that the add functions are adding to the nfa; Make add to nfa…

6e45657

… functions const

Changed AST add functions to indicate the AST are being added to the …

7aa8a92

…NFA; Made add to nfa functions const

Merged with previous PR

d1d87e7

Merge branch 'tagged-ast' into pre-tagged-nfa-cleanup

f386a3b

Merge branch 'pre-tagged-nfa-cleanup' into regex-ast-empty

0c600d7

Change add in RegexASTEmpty to add_to_nfa

bedad75

Merge with previous PR

c78f79c

Fix and refactor NFA unit-test

cd54e64

SharafMohamed marked this pull request as draft September 26, 2024 17:12

SharafMohamed marked this pull request as ready for review October 8, 2024 07:56

coderabbitai bot reviewed Oct 30, 2024

src/log_surgeon/finite_automata/RegexNFA.hpp Outdated

src/log_surgeon/finite_automata/RegexNFA.hpp Outdated

LinZhihao-723 previously approved these changes Oct 30, 2024

Use template type everywhere internal to RegexNFA.hpp, instead of har…

2567708

…dcoding to use bytes; Take state type instaed of transition type as template for transition classes.

SharafMohamed dismissed LinZhihao-723’s stale review via 2567708 October 30, 2024 19:54

SharafMohamed changed the title ~~Add tagged-NFA; Replace usage of int with uint32_t.~~ Add support for tagged NFA; Use uint32_t to replace int for IDs. Oct 30, 2024

Auto linter.

edbcdb4

coderabbitai bot reviewed Oct 30, 2024

LinZhihao-723 approved these changes Oct 30, 2024

LinZhihao-723 merged commit 124b18d into y-scope:main Oct 30, 2024

This was referenced Oct 31, 2024

Constrain NFA states to contain only one negative transition. #46

Merged

Extract RegexNFAState and tagged transition classes into their own files. #47

Merged

Replace the integer capture group tag ID with a dedicated 'Tag' class. #48

Merged

coderabbitai bot mentioned this pull request Nov 18, 2024

feat: Split NFA positive tags into start and end transitions to encapsulate a capture group. #50

Merged

coderabbitai bot mentioned this pull request Jan 23, 2025

Address fmtlib GCC compatibility issues #78

Closed

coderabbitai bot mentioned this pull request Aug 25, 2025

feat: Add Query class to compute QueryInterpretations from a user specified query. #152

Merged

LinZhihao-723 merged 146 commits into y-scope:main from SharafMohamed:tagged-nfa-new


3 participants
