Skip to content

refactor: Move epsilon_closure from Lexer to NfaState; Refactor nfa_to_dfa as Dfa's constructor.#71

Merged
SharafMohamed merged 480 commits intoy-scope:mainfrom
SharafMohamed:refactor-nfa-to-dfa
Jan 11, 2025
Merged

refactor: Move epsilon_closure from Lexer to NfaState; Refactor nfa_to_dfa as Dfa's constructor.#71
SharafMohamed merged 480 commits intoy-scope:mainfrom
SharafMohamed:refactor-nfa-to-dfa

Conversation

@SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Jan 8, 2025

References

  • Depends on PR#61.
  • To review in parallel with PR#61, diff against PR#61 locally. In the repo run:
git fetch upstream pull/61/head:pr-61
git fetch upstream pull/62/head:pr-71
git diff pr-61 pr-71

Description

  • Move epsilon_closure from Lexer class to NfaState class.
  • Move functionality from nfa_to_dfa in Lexer class to Dfa class constructor.

Validation performed

Previously existing tests succeed.

Summary by CodeRabbit

Release Notes

  • Refactor

    • Simplified DFA object handling and memory management
    • Updated method signatures to use modern C++ practices with auto type inference
    • Removed deprecated NFA to DFA conversion methods
  • New Features

    • Added epsilon_closure() method to compute reachable states via epsilon transitions
    • Introduced new constructor for DFA creation from NFA
  • Performance

    • Improved type management and smart pointer usage in finite automata processing

SharafMohamed and others added 30 commits November 14, 2024 10:08
Co-authored-by: Lin Zhihao <[email protected]>
Co-authored-by: Lin Zhihao <[email protected]>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/log_surgeon/finite_automata/NfaState.hpp (1)

97-100: Documentation could be more descriptive.

While the documentation indicates the return value, it would be beneficial to include:

  • The method's purpose
  • Time complexity
  • Whether the method modifies the state

Consider expanding the documentation:

 /**
+ * Computes the epsilon closure of the current state using depth-first traversal.
+ * The epsilon closure includes all states reachable through epsilon transitions and
+ * tagged transitions (as per TODO comment in implementation).
+ * 
+ * Time complexity: O(V + E) where V is the number of states and E is the number
+ * of transitions.
+ * 
+ * @note This method does not modify the current state.
  * @return The set of all states reachable from the current state via epsilon transitions.
  */
src/log_surgeon/Lexer.tpp (1)

382-382: Consider documenting the DFA construction behaviour.

The TODO comment indicates that the DFA ignores tags, which is an important implementation detail that should be documented more formally.

Consider adding a formal documentation comment:

+    /**
+     * Generates a DFA from the NFA, ignoring tags.
+     * @note Currently, tags are ignored in the DFA construction. For example,
+     * "capture:user=(?<user_id>\d+)" is treated as "capture:user=\d+"
+     */
     m_dfa = std::make_unique<finite_automata::Dfa<TypedDfaState>>(std::move(nfa));
src/log_surgeon/finite_automata/Dfa.hpp (2)

17-17: Consider accepting nfa by rvalue reference to improve performance

Currently, the constructor accepts Nfa<NfaStateType> nfa by value, which may introduce unnecessary copying. If the intention is to transfer ownership of nfa, consider accepting it by rvalue reference (Nfa<NfaStateType>&& nfa) and moving it to avoid unnecessary copies.


46-46: Use using instead of typedef for type aliasing

To adhere to modern C++ conventions, consider replacing typedef with using for type aliasing:

-        typedef std::set<TypedNfaState const*> StateSet;
+        using StateSet = std::set<TypedNfaState const*>;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51d11cc and 239872a.

📒 Files selected for processing (13)
  • examples/intersect-test.cpp (3 hunks)
  • src/log_surgeon/Buffer.hpp (1 hunks)
  • src/log_surgeon/Lalr1Parser.tpp (26 hunks)
  • src/log_surgeon/Lexer.hpp (0 hunks)
  • src/log_surgeon/Lexer.tpp (10 hunks)
  • src/log_surgeon/LogEvent.cpp (2 hunks)
  • src/log_surgeon/LogParser.cpp (1 hunks)
  • src/log_surgeon/Parser.tpp (2 hunks)
  • src/log_surgeon/ParserInputBuffer.cpp (2 hunks)
  • src/log_surgeon/Schema.cpp (1 hunks)
  • src/log_surgeon/SchemaParser.cpp (2 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (2 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (3 hunks)
💤 Files with no reviewable changes (1)
  • src/log_surgeon/Lexer.hpp
✅ Files skipped from review due to trivial changes (6)
  • src/log_surgeon/LogEvent.cpp
  • src/log_surgeon/Schema.cpp
  • src/log_surgeon/SchemaParser.cpp
  • src/log_surgeon/LogParser.cpp
  • src/log_surgeon/Buffer.hpp
  • src/log_surgeon/ParserInputBuffer.cpp
🧰 Additional context used
📓 Path-based instructions (3)
src/log_surgeon/finite_automata/NfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Dfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

examples/intersect-test.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build (ubuntu-latest, Release)
  • GitHub Check: build (ubuntu-latest, Debug)
🔇 Additional comments (7)
src/log_surgeon/finite_automata/NfaState.hpp (2)

8-8: LGTM!

The addition of the <stack> header is necessary for the new epsilon_closure() method implementation.


174-207: Review the TODO comment and tagged transitions handling.

The implementation treats tagged transitions as epsilon transitions (line 189), but this is marked as a TODO. This design decision should be reviewed as it might affect the correctness of the epsilon closure computation.

Let's verify the usage of tagged transitions:

Additionally:

  1. The implementation correctly handles cycles using a visited set (closure_set).
  2. The use of false == stack.empty() and false == closure_set.insert(current_state).second follows the coding guidelines for boolean comparisons.
✅ Verification successful

The treatment of tagged transitions as epsilon transitions is intentional and correct.

The TODO comment appears to be a documentation note rather than a pending task. This behaviour is:

  • Part of the regex capture group implementation design
  • Well-documented in RegexAST.hpp
  • Consistently implemented across the codebase
  • Validated by tests
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other implementations or tests that rely on this behaviour
rg -A 5 "tagged.*transition.*epsilon" --type cpp

Length of output: 619


Script:

#!/bin/bash
# Search for tests involving tagged transitions
rg -l "TaggedTransition" test/
# Search for documentation about tagged transitions
fd README.md | xargs rg "tagged.*transition" -A 3
# Search for related discussions in comments
rg "tagged.*transition" --type cpp -g "!test/*" -A 3

Length of output: 25622

src/log_surgeon/Lexer.tpp (2)

22-22: LGTM! Consistent use of trailing return types.

The change to use auto -> void for method return types is consistent with modern C++ practices and improves readability.

Also applies to: 69-72, 78-81, 86-89, 153-153, 163-163, 213-214, 270-270, 302-302, 379-380, 416-416, 422-422, 428-428, 645-645


45-45: LGTM! Consistent use of auto for variable declarations.

The change to use auto for variable declarations improves maintainability and follows modern C++ type inference practices.

Also applies to: 69-71, 85-85, 177-177, 201-201, 241-241, 256-256, 277-278, 283-286, 303-304, 321-322, 335-338, 348-351, 366-368, 382-383, 390-393, 407-408, 429-432, 477-477, 519-519, 559-559, 575-579, 582-583, 594-594, 612-614, 629-629, 638-638, 661-661, 666-666, 691-692, 714-714, 735-736

src/log_surgeon/Lalr1Parser.tpp (1)

69-72: LGTM! Consistent modernization of C++ syntax.

The changes to use trailing return types and auto for variable declarations are consistent with modern C++ practices and improve code maintainability.

Also applies to: 78-81, 86-89, 153-153, 163-166, 173-177, 182-184, 213-214, 220-220, 241-243, 270-270, 277-278, 283-286, 302-304, 308-308, 321-322, 335-338, 348-351, 366-368, 379-380, 383-383, 390-393, 407-408, 416-416, 422-422, 428-430, 432-432, 440-440, 477-477, 519-519, 530-530, 559-559, 575-579, 582-583, 594-594, 612-614, 629-629, 638-638, 645-645, 661-661, 666-666, 691-692, 714-714, 735-736

src/log_surgeon/Parser.tpp (1)

33-36: Use of auto with trailing return type -> void is acceptable

While using auto with a trailing return type for functions returning void may be unconventional, it's acceptable if it aligns with the project's coding style. Consistency across the codebase can enhance readability.

examples/intersect-test.cpp (1)

44-44: Confirm safety of moving nfa into dfa2

At line 44, nfa is moved into dfa2 using std::move(nfa). Please ensure that nfa is not used after this operation to prevent undefined behaviour.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise looks good

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/log_surgeon/finite_automata/Dfa.hpp (3)

50-56: Consider optimizing lambda captures.

The lambda create_dfa_state captures this, dfa_states, and unmarked_sets by reference. Since this is only used for calling new_state, consider capturing it by value for better clarity of ownership.

-    auto create_dfa_state
-            = [this, &dfa_states, &unmarked_sets](StateSet const& set) -> TypedDfaState* {
+    auto create_dfa_state
+            = [this = this, &dfa_states, &unmarked_sets](StateSet const& set) -> TypedDfaState* {

64-74: Consider extracting transition mapping logic.

The nested loops for building the transitions map could be extracted into a separate method for better readability and maintainability, especially when UTF-8 support is added.

Consider creating a helper method:

template <typename TypedNfaState>
auto build_transitions_map(StateSet const& set) 
    -> std::map<uint32_t, StateSet>
{
    std::map<uint32_t, StateSet> ascii_transitions_map;
    for (auto const* s0 : set) {
        for (uint32_t i = 0; i < cSizeOfByte; i++) {
            for (auto* const s1 : s0->get_byte_transitions(i)) {
                StateSet closure = s1->epsilon_closure();
                ascii_transitions_map[i].insert(closure.begin(), closure.end());
            }
        }
    }
    return ascii_transitions_map;
}

73-73: Track UTF-8 support requirements.

Multiple TODO comments indicate pending UTF-8 support implementation. These should be tracked in the issue tracker.

Would you like me to create GitHub issues to track the UTF-8 support implementation requirements?

Also applies to: 90-90

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 239872a and f7e1718.

📒 Files selected for processing (1)
  • src/log_surgeon/finite_automata/Dfa.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/log_surgeon/finite_automata/Dfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🔇 Additional comments (1)
src/log_surgeon/finite_automata/Dfa.hpp (1)

10-10: LGTM! Well-structured constructor declaration.

The constructor is properly templated and takes advantage of move semantics by accepting the NFA by value. The include statement is appropriately placed.

Also applies to: 16-17

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

refactor: Move `epsilon_closure` from `Lexer` to `NfaState`; Refactor `nfa_to_dfa` as `Dfa`'s constructor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants