Skip to content

feat: Add RegisterOperation for TDFA into DFA transitions.#89

Merged
SharafMohamed merged 32 commits intoy-scope:mainfrom
SharafMohamed:add-register-operations
Mar 10, 2025
Merged

feat: Add RegisterOperation for TDFA into DFA transitions.#89
SharafMohamed merged 32 commits intoy-scope:mainfrom
SharafMohamed:add-register-operations

Conversation

@SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Feb 3, 2025

Description

  • Register operations introduced to the DFA as the parallel to tag operations in the NFA.
  • DFA transitions introduced to tie register operations to destination states.
  • Update the Lexer, DFA, and DFA states to use the register operations and DFA transitions.

Validation performed

  • Add test-dfa to test untagged DFAs.

Summary by CodeRabbit

  • New Features

    • Enhanced finite automata functionality with improved state management, acceptance operations, and DFA serialization using breadth-first traversal.
    • Introduced new classes and methods for managing transitions and register operations, including serialization capabilities.
    • Added new header files for transition and register operations, expanding the finite automata module.
  • Tests

    • Added comprehensive test cases validating DFA construction and serialization to ensure dependable outcomes.
    • Included new test scenarios for both simple and complex DFAs to verify correct serialization outputs.

@coderabbitai
Copy link

coderabbitai bot commented Feb 3, 2025

Walkthrough

This pull request enhances the finite automata module by adding new header files to support DFA transitions and register operations, modifying DFA state representations to include serialization and improved transition handling, and updating mapping structures in the Lexer module. The RegisterHandler has been updated to use a new register ID type for consistency. Additionally, the build configuration and test suite have been modified to incorporate these new sources and tests for DFA functionality.

Changes

File(s) Change Summary
CMakeLists.txt, tests/CMakeLists.txt Updated build configuration to include new finite automata sources and tests; added headers like DfaTransition.hpp and RegisterOperation.hpp, included test-dfa.cpp, and removed redundant test-capture.cpp.
src/log_surgeon/Lexer.hpp Modified tag ID handling by replacing m_tag_to_reg_id (using std::unordered_map) with m_tag_to_final_reg_id (using std::map) and updated the associated retrieval method.
src/log_surgeon/finite_automata/Dfa.hpp, src/log_surgeon/finite_automata/DfaState.hpp Added serialization methods (serialize, get_bfs_traversal_order), updated transition representation in constructors, renamed get_dest_state to get_transition, and introduced add_accepting_op in DFA state management.
src/log_surgeon/finite_automata/RegisterHandler.hpp Updated method signatures to use reg_id_t instead of register_id_t to ensure type consistency across register operations.
src/log_surgeon/finite_automata/DfaTransition.hpp, src/log_surgeon/finite_automata/RegisterOperation.hpp Introduced new headers: DfaTransition.hpp defines a template class for DFA transitions with serialization and accessor methods; RegisterOperation.hpp defines an enum and class for register operations including constructors and a serialization method.
tests/test-dfa.cpp Added new Catch2 test cases for simple and complex DFA serialization; tests validate the serialized output of the DFA against expected string representations.

Possibly related PRs

Suggested reviewers

  • LinZhihao-723

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b397e2 and 11cb2a7.

📒 Files selected for processing (1)
  • src/log_surgeon/Lexer.hpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/log_surgeon/Lexer.hpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@SharafMohamed SharafMohamed changed the title feature: Add register operations and DFA transitions. feat: Add register operations and DFA transitions. Feb 3, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (20)
src/log_surgeon/finite_automata/DfaState.hpp (3)

34-38: Consider simplifying your constructor’s loop.
You could use std::fill in an initializer list or directly in the constructor body to initialize m_bytes_transition with the same default object, improving clarity.

 DfaState() {
-    for (auto& transition : m_bytes_transition) {
-        transition = DfaTransition<state_type>{{}, nullptr};
-    }
+    std::fill(std::begin(m_bytes_transition),
+              std::end(m_bytes_transition),
+              DfaTransition<state_type>{{}, nullptr});
 }

61-67: Documenting the new serialize method is helpful.
The docstring clarifies usage by devs working in parallel on DFAs.


101-141: Consider caching the accepting state check in serialize.
You call is_accepting() multiple times; consider assigning the result to a local boolean for minor performance and clarity benefits. Otherwise, this method is elegantly structured for debugging or logging.

+ bool accepting = is_accepting();
  auto const accepting_tags_string = 
-     is_accepting()
+     accepting
      ? fmt::format("accepting_tags={{{}}},", fmt::join(m_matching_variable_ids, ","))
      : "";
src/log_surgeon/Lexer.hpp (2)

128-130: Returning a const reference to a unique_ptr.
Although returning a const reference prevents modification of the pointer, consider documenting that callers must check for null in usage scenarios.


151-157: Minimal overhead approach.
Using contains followed by at is concise. Should performance become a concern, consider retrieving an iterator directly to avoid a double lookup.

src/log_surgeon/finite_automata/Dfa.hpp (2)

28-31: Documenting the new serialize method is beneficial.
This helps readers quickly understand that a string-based snapshot of DFA structure is provided.


128-146: Future introspection in get_intersect.
The method currently handles ASCII transitions. You have a TODO for UTF-8 support; consider scheduling it soon to avoid partial coverage for multi-byte encodings.

Would you like help with a follow-up patch to handle UTF-8 transitions?

src/log_surgeon/UniqueIdGenerator.hpp (1)

1-14: Straightforward unique ID generation.
The incrementing logic is correct for single-threaded use; however, consider a synchronization mechanism or atomic operations if generating IDs concurrently.

tests/test-capture.cpp (1)

9-36: Test coverage could be enhanced.

While the current test cases cover basic functionality, consider adding tests for:

  • String validation (if any restrictions exist on capture names)
  • Error cases (if any)
  • Comparison operations (if implemented)
src/log_surgeon/LexicalRule.hpp (1)

28-30: Consider adding documentation for get_captures().

Add documentation to describe:

  • The purpose of the method
  • The return value semantics (ownership, lifetime)
src/log_surgeon/finite_automata/RegisterOperation.hpp (1)

36-38: Fix typo in documentation.

"opertion" should be "operation".

     /**
-     * @return A string representation of the register opertion.
+     * @return A string representation of the register operation.
      */
src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1)

31-31: Consider returning const reference for better performance.

The getter get_tag_ops() returns a copy of the vector. For better performance, consider returning a const reference if the caller doesn't need a copy.

-    [[nodiscard]] auto get_tag_ops() const -> std::vector<TagOperation> { return m_tag_ops; }
+    [[nodiscard]] auto get_tag_ops() const -> std::vector<TagOperation> const& { return m_tag_ops; }
src/log_surgeon/finite_automata/DfaTransition.hpp (2)

30-30: Consider returning const reference for better performance.

Similar to SpontaneousTransition, consider returning a const reference from get_reg_ops() if the caller doesn't need a copy.

-    [[nodiscard]] auto get_reg_ops() const -> std::vector<RegisterOperation> { return m_reg_ops; }
+    [[nodiscard]] auto get_reg_ops() const -> std::vector<RegisterOperation> const& { return m_reg_ops; }

57-64: Consider reserving vector capacity for better performance.

Pre-allocate the vector capacity to avoid potential reallocations during push_back operations.

     std::vector<std::string> transformed_ops;
+    transformed_ops.reserve(m_reg_ops.size());
     for (auto const& reg_op : m_reg_ops) {
tests/test-dfa.cpp (1)

28-69: Consider using SECTION for better test organization.

The test case could benefit from using Catch2's SECTION macro to organize different aspects of the test (e.g., setup, serialization, comparison).

 TEST_CASE("Test Untagged DFA", "[DFA]") {
-    Schema schema;
-    string const var_name{"capture"};
-    string const var_schema{var_name + ":" + "Z|(A[abcd]B\\d+C)"};
-    schema.add_variable(var_schema, -1);
+    SECTION("Test DFA serialization") {
+        Schema schema;
+        string const var_name{"capture"};
+        string const var_schema{var_name + ":" + "Z|(A[abcd]B\\d+C)"};
+        schema.add_variable(var_schema, -1);
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)

21-27: Consider reserving vector capacity for better performance.

Pre-allocate the vector capacity to avoid reallocations during emplace_back operations.

     std::vector<uint32_t> added_registers;
+    added_registers.reserve(num_reg_to_add);
     for (uint32_t i{0}; i < num_reg_to_add; ++i) {
tests/test-register-handler.cpp (1)

23-27: Consider simplifying the register addition logic.

The current implementation can be made more concise while maintaining the same functionality.

-        if (0 == i) {
-            handler.add_register();
-        } else {
-            handler.add_register(i);
-        }
+        handler.add_register(i > 0 ? i : std::nullopt);
tests/test-nfa.cpp (1)

38-38: Consider using std::move for rules.

The rules vector could be moved into the ByteNfa constructor to avoid unnecessary copying.

-    ByteNfa const nfa{rules};
+    ByteNfa const nfa{std::move(rules)};
tests/test-lexer.cpp (2)

93-126: Consider optimizing delimiter collection.

The loop to collect delimiters could be simplified by using a vector comprehension or std::copy_if.

-    vector<uint32_t> delimiters;
-    for (uint32_t i{0}; i < log_surgeon::cSizeOfByte; i++) {
-        if (lexer.is_delimiter(i)) {
-            delimiters.push_back(i);
-        }
-    }
+    vector<uint32_t> delimiters;
+    delimiters.reserve(log_surgeon::cSizeOfByte);
+    std::copy_if(
+        std::begin(cDelimiters), std::end(cDelimiters),
+        std::back_inserter(delimiters),
+        [&lexer](uint32_t i) { return lexer.is_delimiter(i); }
+    );

296-350: Consider adding edge cases to test suite.

While the current test cases cover basic functionality, consider adding tests for:

  • Empty input strings
  • Maximum length inputs
  • Invalid capture group names
  • Nested capture groups
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a513c5f and de5c324.

📒 Files selected for processing (31)
  • .github/workflows/build.yaml (3 hunks)
  • CMakeLists.txt (2 hunks)
  • examples/intersect-test.cpp (3 hunks)
  • lint-tasks.yml (2 hunks)
  • src/log_surgeon/Lexer.hpp (4 hunks)
  • src/log_surgeon/Lexer.tpp (10 hunks)
  • src/log_surgeon/LexicalRule.hpp (3 hunks)
  • src/log_surgeon/SchemaParser.cpp (4 hunks)
  • src/log_surgeon/UniqueIdGenerator.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Capture.hpp (2 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (6 hunks)
  • src/log_surgeon/finite_automata/DfaState.hpp (3 hunks)
  • src/log_surgeon/finite_automata/DfaStatePair.hpp (1 hunks)
  • src/log_surgeon/finite_automata/DfaTransition.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (6 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (6 hunks)
  • src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (19 hunks)
  • src/log_surgeon/finite_automata/Register.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegisterHandler.hpp (2 hunks)
  • src/log_surgeon/finite_automata/RegisterOperation.hpp (1 hunks)
  • src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1 hunks)
  • src/log_surgeon/finite_automata/TagOperation.hpp (1 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (0 hunks)
  • tests/CMakeLists.txt (2 hunks)
  • tests/test-capture.cpp (1 hunks)
  • tests/test-dfa.cpp (1 hunks)
  • tests/test-lexer.cpp (5 hunks)
  • tests/test-nfa.cpp (2 hunks)
  • tests/test-register-handler.cpp (2 hunks)
  • tests/test-tag.cpp (0 hunks)
💤 Files with no reviewable changes (2)
  • tests/test-tag.cpp
  • src/log_surgeon/finite_automata/TaggedTransition.hpp
✅ Files skipped from review due to trivial changes (1)
  • src/log_surgeon/finite_automata/Register.hpp
🧰 Additional context used
📓 Path-based instructions (23)
src/log_surgeon/UniqueIdGenerator.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/PrefixTree.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-capture.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-register-handler.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/LexicalRule.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DfaStatePair.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/SchemaParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-lexer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

examples/intersect-test.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-nfa.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Capture.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/TagOperation.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DfaTransition.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegisterOperation.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/NfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Dfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-dfa.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegisterHandler.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/Lexer.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (5)
src/log_surgeon/finite_automata/Capture.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
src/log_surgeon/finite_automata/NfaState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegisterHandler.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T21:56:13.425Z
Learning: In the `log_surgeon` project, header guards in C++ header files should include `_HPP` at the end to match the filename. For example, in `RegisterHandler.hpp`, the header guard should be `LOG_SURGEON_FINITE_AUTOMATA_REGISTER_HANDLER_HPP`.
src/log_surgeon/finite_automata/Nfa.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegexAST.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
🔇 Additional comments (136)
src/log_surgeon/finite_automata/DfaState.hpp (8)

7-10: Includes look appropriate.
These added headers align with the new functionalities (e.g. string manipulation and unordered maps).


13-14: Good choice of fmt library.
Using fmt for string formatting and logging is a solid approach that improves readability and performance.


17-18: New includes for transitions and register operations are coherent.
The newly included headers resolve dependencies for the transition logic and the register-based operations.


52-55: Updated transition signatures enhance clarity.
Shifting from pointer-type arguments to strong typed DfaTransition objects promotes type safety and maintainability.


57-58: Adding accepting operations is a good extension.
This approach cleanly decouples the register logic from the state’s acceptance condition.


70-72: Clear doc for get_dest_state.
Documenting the difference between byte vs. UTF-8 transitions helps future maintainers.


76-77: New data members for transition ops are well-placed.
Storing m_accepting_ops alongside m_bytes_transition ensures transitions and acceptance logic remain cohesive.


84-99: Verify indexing in get_dest_state.
Confirm that character will never exceed cSizeOfByte when state_type == StateType::Byte; otherwise, this could invoke out-of-bounds indexing.

Would you like to add upper boundary checks or rely on upstream validation?

src/log_surgeon/Lexer.hpp (8)

7-7: New includes establish the foundation for register/tag logic.
Including <map>, NfaState.hpp, Register.hpp, and TagOperation.hpp is consistent with the added capturing and register functionalities.

Also applies to: 16-16, 18-18, 19-19


25-26: Defining custom IDs helps maintain clarity.
Using register_id_t, tag_id_t, and symbol_id_t clarifies domain-specific identifiers rather than relying on bare integer types.

Also applies to: 28-28


133-140: Ensure coverage for get_capture_ids_for_var_id.
This function is crucial for linking variables to captures. Please confirm it is tested, especially for edge cases with no captures.


142-149: Check the mapping consistency in get_tag_ids_for_capture_id.
Ensure no collisions or missing data for capture IDs, as returning std::nullopt can lead to subtle bugs if unhandled downstream.


159-170: Graceful composition in get_registers_for_capture.
Composing your get_* methods (tags → regs) is a neat cascade. Ensure thorough unit tests for all paths that might return std::nullopt.


172-173: Symbol ID mappings are clearly placed.
Exposing m_symbol_id and m_id_symbol fosters straightforward lookups. They seem consistent with the rest of the design.


195-195: m_dfa pointer usage.
Confirmed that using a unique_ptr is safe and ensures ownership remains within the lexer. This is a solid practice for resource management.


198-200: Capture-to-tag and tag-to-reg mappings.
The new data structures (m_var_id_to_capture_ids, m_capture_id_to_tag_ids, m_tag_to_final_reg_id) form a consistent chain for retrieving register ops.

src/log_surgeon/finite_automata/Dfa.hpp (11)

5-5: Appropriate new includes for extended features.
Including map, stack, Register, RegisterHandler, and TagOperation supports the new template parameters, transitions, and register handling logic.

Also applies to: 8-8, 11-12, 14-14, 17-20


23-23: Additional template parameter.
Introducing a second template parameter for TypedNfaState aligns the DFA with NFA expansions and boosts type safety.


26-26: Constructor taking a const reference.
Passing nfa by const reference is an efficient choice, preventing unnecessary copying of potentially large structures.


38-38: Improved new_state method.
Accepting a set of NFA states ensures clarity around the relationship between the newly created DFA state and its source NFA states.


51-51: Intersect documentation.
The doc clarifies how the method identifies which schema types are reachable in both DFAs. This is a nice explicit approach.


54-62: BFS traversal for prospective usage.
The doc describing a BFS order is relevant for debugging or advanced transformations. Good to have.


65-112: Methodical approach constructing the DFA.
Using a stack of sets and computing epsilon closures systematically is a robust technique. The usage of false == unmarked_sets.empty() also matches the style guidelines.


95-95: Transition object usage.
Passing {{}, dest_state} clarifies that no register ops are immediately triggered on this transition, reinforcing the new typed transition structure.


114-126: Populating matching_variable_ids.
Collecting variable IDs from each accepting NFA state ensures that the final DFA state knows the correct schema types. This is straightforward and effective.


150-179: BFS ordering is well-structured.
The queue-based approach systematically visits states. The block properly updates visited states to avoid infinite loops.


181-195: serialize method logic is clean.
Concise BFS-based ordering plus using a map of state IDs ensures consistent output across runs, which aids debugging and testing.

src/log_surgeon/finite_automata/NfaState.hpp (17)

4-4: New includes look appropriate.
No immediate concerns.

Also applies to: 12-12, 14-14, 17-17, 20-21


37-37: Explicit default constructor is clear.
No issues found.


39-41: Accepting constructor is concise.
The constructor eagerly marks the state as accepting and sets the matching variable ID. This looks intentional and correct.


43-49: Constructor chaining logic.
Invoking add_spontaneous_transition in the constructor ensures consistent initialization. This is a good pattern.


51-52: Spontaneous transition addition (no tag operations).
Function is straightforward, adding a transition without tag operations. Good clarity.


55-66: Spontaneous transition addition (with tag operations).
Collecting tag_id_t into TagOperation objects is concise. The logic properly reserves space based on tag_ids.size().


88-88: Accurate method documentation.
The updated docstring references forwarding SpontaneousTransition::serialize return value. Good clarity on failure conditions.


93-97: Accessors for acceptance & variable ID.
These accessors are clear and maintain good const-correctness.


98-98: Empty line insertion.
No functional impact.


99-102: Getter for spontaneous transitions.
Exposes internal vector as a constant reference. This is acceptable for read-only access.


104-106: Getter for byte transitions.
Returns a reference to a vector of NfaState*. Straightforward approach.


108-109: Getter for tree transitions.
Returns a Tree const& for UTF-8 states. This remains flexible for the Byte variant.


113-113: Spontaneous transitions member.
Storing them in a vector is consistent with usage.


170-180: epsilon_closure uses false == stack.empty().
Matches the coding guidelines and ensures we gather all reachable states from spontaneous transitions.


189-191: Optional acceptance string.
Constructing accepting_tag_string only when m_accepting is true is an efficient approach.


201-208: Serializing spontaneous transitions with guard checks.
Good handling of std::nullopt to indicate serialization issues.


210-211: Format string updated for spontaneous transitions.
The string now includes the set of compiled transitions, aligning with the new design.

Also applies to: 215-215

src/log_surgeon/finite_automata/Nfa.hpp (16)

4-4: New includes & optional usage.
Including <cstddef> and <optional> is consistent with changes in constructor arguments.

Also applies to: 7-7


19-21: Included Capture headers & TagOperation.
These additions reflect the new approach for capturing transitions.


23-23: UniqueIdGenerator usage.
Including <log_surgeon/UniqueIdGenerator.hpp> clarifies how unique IDs for tags are created.


26-33: Class-level docstring is thorough.
Explains NFA usage, acceptance of TypedNfaState, and capture group uniqueness assumptions.


39-39: Constructor accepts const& for rules.
Avoids unnecessary copies, which is beneficial for performance.


48-50: new_accepting_state creation.
Declares a state with matching_variable_id, enabling acceptance logic. Clear naming.

Also applies to: 52-52


57-57: new_state_for_negative_captures
Accepts captures, acquires respective tags, then constructs a negation transition. Clean approach to handle negative captures.

Also applies to: 61-63


67-68: new_start_and_end_states_for_capture
Sets start/end tags for captures. This approach ensures separation of concerns for capture-specific transitions.

Also applies to: 69-69


87-88: serialize docstring clarifies std::nullopt.
Matches the shift to optional returns in the NFA states.

Also applies to: 89-89


120-126: Constructor sets m_root.
Begins the NFA with a new blank root state. This sets a consistent baseline for building the automaton.


128-140: get_or_create_capture_tags method.
Ensures consistent creation/storage of tag IDs for each capture. The logic is well encapsulated.


149-153: new_accepting_state usage.
Confirms acceptance state creation is specialized with a matching variable ID.


156-171: new_state_for_negative_captures
Moves tags to the constructor of TypedNfaState with TagOperationType::Negate. Clear approach.


174-178: new_start_and_end_states_for_capture
Initializes the start tag on the root transition and the end tag on a dedicated new state. Good structuring.


188-189: serialize returns std::nullopt on failure.
Properly aligns with NFA states' optional serialization.

Also applies to: 222-223


232-237: Retaining BFS order in serialization.
Appending to serialized_states ensures a stable ordering of states in output.

src/log_surgeon/Lexer.tpp (12)

7-7: Runtime error usage.
Including <stdexcept> suggests robust exception handling as needed.


86-87: Re-checking state after newline.
Resetting to the root and matching at line breaks is consistent with the existence of delimiters.

Also applies to: 90-91, 92-92


98-98: Null destination state check.
Ends scanning if no transition exists. Straightforward.


217-218: Re-checking newlines in wildcard scan.
Matches the approach in scan(), re-initializing root transitions where needed.

Also applies to: 221-222


245-246: Wildcard expansions.
Ensures that ? and * behave consistently by traversing potential states or verifying acceptance. Implementation is thorough.

Also applies to: 280-282


337-340: prepend_start_of_file_char
Sets m_prev_state to the root’s transition on utf8::cCharStartOfFile, allowing scanning from a synthetic start token.


361-366: add_rule uses symbol_id_t.
Changing the function signature to accept symbol_id_t clarifies the variable ID usage.


369-370: get_rule matches by var_id.
Maintains readability and ensures the correct rule is retrieved.

Also applies to: 372-372


381-395: Looping over captures to ensure uniqueness.
Throws an exception if a duplicate capture name is discovered. This is crucial for correctness.


397-403: Populating m_capture_id_to_tag_ids.
Bridges the NFA’s captured tag IDs with the lexer’s symbol IDs. Good separation of concerns.


405-405: Creating the DFA from the updated NFA.
Note that captures are not fully utilized in the DFA yet, as included in the TODO.


408-414: Setting m_is_first_char.
Initializes state checks for potential transitions from the root, facilitating the scanning logic.

src/log_surgeon/finite_automata/RegexAST.hpp (30)

23-23: Includes are logically arranged.
No concerns. The inclusion of "Capture.hpp" aligns with the capture-based refactoring.


33-37: Clear explanation of capture usage.
The updated documentation clarifies how captures are maintained in each node’s subtree. This is a good improvement for future maintainers.


85-99: Ensure consistent capture aggregation.
Collecting subtree captures from child nodes and merging them is essential for correct AST composition. The approach here looks consistent. Might be worth confirming that older references to tags have been removed entirely.


101-103: Negative capture handling.
Setting negative captures is consistent with your broader design, ensuring AST nodes can indicate captures that must remain unmatched along a certain path.


110-122: Good use of negative-capture logic.
This helper ensures that negative captures are handled seamlessly before or in parallel with normal transitions. It's also using the recommended comparison style: if (false == m_negative_captures.empty()) {...}, matching your guidelines.


130-148: Serialization approach is well-structured.
Converting negative captures into a user-friendly string is straightforward, and making use of fmt simplifies the operation.


151-153: Subtree and negative captures stored clearly.
This layout of m_subtree_captures and m_negative_captures is intuitive. The m_subtree_capture_ids usage is not immediately evident, so ensure it’s properly maintained if used downstream.


630-630: Maintaining a non-null capture.
The documentation properly stresses that m_capture must never be null. This requirement helps maintain consistent semantics throughout the codebase.


641-642: Constructor argument clarity.
Making the capture argument explicit highlights its importance. Throwing an exception on a null pointer is a solid defensive design choice.


653-659: Building capture sets.
You accumulate existing subtree captures from m_group_regex_ast and then add the new m_capture. This logic ensures correct capture inheritance. Make sure any derived classes do the same if extended.


666-668: Copy constructor is straightforward.
Cloning both the group regex AST and capture works as intended, preserving the AST subtree fully.


707-707: Getter method is sufficiently clear.
Returning std::string_view is a good choice for performance, and naming the method get_group_name() is descriptive.


720-722: Serialization of empty node.
Yielding only negative-capture info is well handled, ensuring RegexASTEmpty aligns with the new approach.


738-739: Literal serialization.
Appending negative captures after the literal is consistent with the rest of the design. Looks good.


769-770: Integer serialization.
Excellent approach to ensure the integer digits and negative captures are merged into the final serialized output.


774-783: OR constructor merges capture information.
Ensuring each side is marked with the other’s subtree captures as negative captures is sophisticated but correct for capturing exclusive paths.


789-790: Adding to NFA with negative captures.
Both sides of the OR use add_to_nfa_with_negative_captures. This is consistent with the design.


794-794: Safe fallback on serialization.
Handles null pointers gracefully for each side of the OR.


799-800: Pub/sub on negative captures in string form.
Appended at the end of the format string to reflect captures that do not apply.


810-811: Concatenation constructor merges captures.
Setting the subtree captures by combining left and right is essential for consecutive expressions.


819-822: Concatenation NFA transitions.
Inserting an intermediate state for the left half, then resetting root, is a clear technique for combining subexpressions.


828-832: Concatenation serialization.
Combining each child’s serialization plus negative captures ensures a standard output format.


837-840: Consistent approach in Multiplication constructor.
Propagating subtree captures from the operand ensures repetition logic remains correct.


844-844: Subtree capture assignment.
This single-line approach is neat. No issues.


854-879: NFA construction for repeated expressions.
The logic properly handles edge cases for m_min == 0 and is_infinite(), ensuring robust transitions for repeated patterns. Good usage of negative captures in each iteration.


893-894: Multiplication serialization boundary.
Appending "{min, max}" with 'inf' when appropriate is a clean representation.


900-921: Capturing docstring improvements.
Fully annotated diagram clarifies the start and end capture transitions, especially with negative captures in mind. Good documentation.


932-932: Creation of start/end states.
Constructing distinct start/end states for the capture is a strong design, isolating the capture’s boundaries.


936-936: Delegating.
Calling m_group_regex_ast->add_to_nfa_with_negative_captures from within capture logic is consistent with the rest of the design.


942-950: Capture serialization format.
Inserting <capture_name> into the output is user-friendly, and the usage of std::u32string for wide support is wise.

src/log_surgeon/finite_automata/Capture.hpp (1)

1-20: Renaming from Tag to Capture is coherent.
The rename aligns with the new capture-based approach. The Capture class is succinct, and the constructor’s move semantics are standard.

src/log_surgeon/LexicalRule.hpp (2)

34-37: Address the TODO comment about making the returned pointer constant.

The comment indicates a need to make the returned pointer constant. This should be addressed to prevent unintended modifications.

Would you like me to propose a solution for making the returned pointer constant?


44-48: LGTM! Clean refactoring of add_to_nfa.

The changes simplify the code by combining state creation and configuration into a single operation.

src/log_surgeon/finite_automata/TagOperation.hpp (1)

24-30: LGTM! Efficient comparison implementation.

Good use of std::tie for implementing both operators.

src/log_surgeon/finite_automata/RegisterOperation.hpp (1)

42-44: Follow coding guidelines for boolean comparison.

As per coding guidelines, prefer false == <expression> over !<expression>.

The current implementation already follows this guideline correctly.

src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1)

56-62: Great use of C++20 ranges for transforming operations!

The use of ranges for transforming tag operations is elegant and modern.

src/log_surgeon/finite_automata/DfaStatePair.hpp (1)

72-76: Improved method naming and condition syntax!

The changes enhance readability through:

  1. More descriptive method name get_dest_state instead of next
  2. Clearer null pointer checks following the coding guideline of nullptr != x
src/log_surgeon/finite_automata/PrefixTree.hpp (2)

15-20: Documentation terminology updated appropriately!

The change from "tag" to "capture" terminology maintains clarity while aligning with the broader codebase changes.


28-28: Good addition of a named constant!

The cDefaultPos constant improves code clarity by providing a meaningful name for the default position value.

tests/test-register-handler.cpp (1)

40-40: Good improvement in const correctness!

Adding the const qualifier to empty_handler properly indicates that it shouldn't be modified after initialization.

examples/intersect-test.cpp (3)

24-24: Type signature properly updated for the new DFA implementation!

The addition of ByteNfaState as a template parameter improves type safety and clarity.


44-44: DFA construction updated consistently!

The template parameters in the DFA construction match the function signature changes.


82-82: DFA construction in main function properly updated!

The template parameters in the DFA construction are consistent with the other changes in the file.

tests/test-nfa.cpp (3)

8-10: LGTM! Header changes align with component restructuring.

The header changes appropriately reflect the shift from RegexAST to more specific NFA state and lexical rule components.


41-77: LGTM! Serialization format updated correctly.

The changes properly reflect the transition from tag-based to capture-based transitions, with appropriate documentation updates.


80-87: LGTM! Improved error handling with optional values.

The changes enhance robustness by properly handling optional values in the serialization process.

tests/test-lexer.cpp (1)

128-153: Address TODO comment for register value verification.

The test function should be updated to include register value verification once simulation is implemented.

Would you like me to help implement the register value verification once the simulation functionality is available?

src/log_surgeon/SchemaParser.cpp (3)

3-23: LGTM! Enhanced error handling and type safety.

The addition of headers like , , and improves error handling and type safety.


177-177: LGTM! Updated to use Capture instead of Tag.

The change correctly implements the transition from tag-based to capture-based system.


212-213: LGTM! Updated terminology in comments.

Comments correctly reflect the change from "negative tags" to "negative captures".

Also applies to: 258-259

.github/workflows/build.yaml (3)

31-31: LGTM! Improved build reproducibility.

Using a specific Ubuntu version (22.04) instead of latest ensures consistent build environments.


42-42: LGTM! Updated condition matches OS matrix.

The condition correctly reflects the change to ubuntu-22.04 in the OS matrix.


56-58: LGTM! Enhanced CI debugging capabilities.

The addition of test log printing on failure will help diagnose CI issues more effectively.

tests/CMakeLists.txt (5)

5-8: Finite Automata Header Files Addition:
The inclusion of new finite automata headers (i.e. Capture.hpp, Dfa.hpp, DfaState.hpp, and DfaTransition.hpp) is well organised and clearly supports the new DFA features.


14-14: Register and Transition Header Files Update:
The additions of Register.hpp, RegisterOperation.hpp, SpontaneousTransition.hpp, and TagOperation.hpp are consistent with the PR’s objectives of introducing register operations and updating DFA transitions.

Also applies to: 16-17, 19-19


31-31: Unique Identifier Support:
Adding UniqueIdGenerator.hpp ensures that unique ID functionality is accessible, supporting the broader finite automata enhancements.


34-42: Enhanced Test Suite Configuration:
The update to the SOURCES_TESTS variable to include test-capture.cpp and test-dfa.cpp (in addition to the other test files) effectively aligns with the new features introduced in this PR.


44-50: Executable Target Configuration:
The target definition for unit-test and the associated compile features and include directories are correctly set up.

lint-tasks.yml (4)

1-16: Lint Task Configuration Overview:
The initial setup, including the definition of variables and the organization of tasks, is clean and clearly structured. There are no issues identified in these initial sections.


29-32: Expanded Source Patterns for C++ Linting:
The addition of source file patterns for the tests directory (i.e. tests/**/*.cpp, tests/**/*.h, tests/**/*.hpp, and tests/**/*.tpp) enhances linting coverage to include all relevant test files.


69-72: Exclusion of Build Directories:
The updated find command now excludes files in paths matching */*build*/*, which is a good practice to avoid unnecessary linting of generated files.


61-68: YAML Linting Task Details:
The configuration for the yml-check and yml-fix tasks (including activation of the virtual environment and the use of a strict configuration) is appropriate and well-implemented.

CMakeLists.txt (2)

114-115: UniqueIdGenerator Header Inclusion:
The inclusion of UniqueIdGenerator.hpp at the end of the SOURCE_FILES list correctly supports the new functionality for generating unique identifiers.


125-239: Build and Installation Configuration Integrity:
The remainder of the CMake configuration—including external dependencies, target linking, compile options, and installation directives—is set up correctly and follows best practices. No issues were identified in these sections.

@SharafMohamed SharafMohamed changed the title feat: Add register operations and DFA transitions. feat: Add register operations and DFA transitions. Feb 4, 2025
@SharafMohamed SharafMohamed changed the title feat: Add register operations and DFA transitions. feat: Add register operations and DFA transitions. Feb 4, 2025
@SharafMohamed SharafMohamed force-pushed the add-register-operations branch from 78cac0b to 61ad3cb Compare February 14, 2025 13:02
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
src/log_surgeon/finite_automata/RegisterHandler.hpp (3)

21-27: ⚠️ Potential issue

Add overflow check for multiple register additions.

The method should validate that adding num_reg_to_add registers won't cause an overflow of the registers vector.

Apply this diff to add the overflow check:

 auto add_registers(uint32_t const num_reg_to_add) -> std::vector<uint32_t> {
+    if (m_registers.size() > std::numeric_limits<size_t>::max() - num_reg_to_add) {
+        throw std::overflow_error("Adding registers would cause overflow");
+    }
     std::vector<uint32_t> added_registers;
     for (uint32_t i{0}; i < num_reg_to_add; ++i) {
         added_registers.emplace_back(add_register());
     }
     return added_registers;
 }

29-35: ⚠️ Potential issue

Add overflow check for register ID.

The method should validate that adding a new register won't cause an overflow of the registers vector.

Apply this diff to add the overflow check:

 auto add_register() -> reg_id_t {
+    if (m_registers.size() >= std::numeric_limits<size_t>::max()) {
+        throw std::overflow_error("Adding register would cause overflow");
+    }
     auto const prefix_tree_node_id{
             m_prefix_tree.insert(PrefixTree::cRootId, PrefixTree::cDefaultPos)
     };
     m_registers.emplace_back(prefix_tree_node_id);
     return m_registers.size() - 1;
 }

37-43: ⚠️ Potential issue

Add overflow check for register ID.

The method should validate that adding a new register won't cause an overflow of the registers vector.

Apply this diff to add the overflow check:

 auto add_register(PrefixTree::id_t const prefix_tree_parent_node_id) -> reg_id_t {
+    if (m_registers.size() >= std::numeric_limits<size_t>::max()) {
+        throw std::overflow_error("Adding register would cause overflow");
+    }
     auto const prefix_tree_node_id{
             m_prefix_tree.insert(prefix_tree_parent_node_id, PrefixTree::cDefaultPos)
     };
     m_registers.emplace_back(prefix_tree_node_id);
     return m_registers.size() - 1;
 }
🧹 Nitpick comments (4)
tests/test-nfa.cpp (1)

42-62: Consider adding state transition documentation.

While the expected output correctly represents the NFA states and transitions, adding comments to explain the purpose of key states would improve maintainability. For example:

  • State 0: Initial state with branches for 'Z' and 'A'
  • States 5-8: Letter capture states
  • States 14-15: Container ID digit capture states
src/log_surgeon/finite_automata/Dfa.hpp (3)

59-107: Verify the TODO comment about UTF-8 case.

The implementation is well-structured, but there's a TODO comment about handling the UTF-8 case. Please ensure this is tracked.

Would you like me to create an issue to track the implementation of UTF-8 case handling?


123-142: Verify the TODO comment about UTF-8 handling.

The implementation is well-structured, but there's a TODO comment about handling UTF-8 multi-byte transitions. Please ensure this is tracked.

Would you like me to create an issue to track the implementation of UTF-8 multi-byte transitions?


144-174: Verify the TODO comment about UTF-8 case.

The implementation is efficient with proper container reservations, but there's a TODO comment about handling the UTF-8 case. Please ensure this is tracked.

Would you like me to create an issue to track the implementation of UTF-8 case handling in BFS traversal?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between de5c324 and e3d766b.

📒 Files selected for processing (15)
  • CMakeLists.txt (1 hunks)
  • src/log_surgeon/Lexer.hpp (5 hunks)
  • src/log_surgeon/Lexer.tpp (8 hunks)
  • src/log_surgeon/LexicalRule.hpp (2 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (5 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (7 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (6 hunks)
  • src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (4 hunks)
  • src/log_surgeon/finite_automata/RegisterHandler.hpp (2 hunks)
  • src/log_surgeon/finite_automata/RegisterOperation.hpp (1 hunks)
  • src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1 hunks)
  • src/log_surgeon/finite_automata/TagOperation.hpp (1 hunks)
  • tests/CMakeLists.txt (2 hunks)
  • tests/test-nfa.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/log_surgeon/finite_automata/PrefixTree.hpp
  • CMakeLists.txt
  • src/log_surgeon/finite_automata/RegisterOperation.hpp
  • tests/CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: - Prefer `false ==

**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/LexicalRule.hpp
  • tests/test-nfa.cpp
  • src/log_surgeon/finite_automata/Dfa.hpp
  • src/log_surgeon/finite_automata/TagOperation.hpp
  • src/log_surgeon/finite_automata/SpontaneousTransition.hpp
  • src/log_surgeon/finite_automata/RegexAST.hpp
  • src/log_surgeon/finite_automata/RegisterHandler.hpp
  • src/log_surgeon/Lexer.hpp
  • src/log_surgeon/finite_automata/Nfa.hpp
  • src/log_surgeon/finite_automata/NfaState.hpp
🧠 Learnings (5)
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegisterHandler.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T21:56:13.425Z
Learning: In the `log_surgeon` project, header guards in C++ header files should include `_HPP` at the end to match the filename. For example, in `RegisterHandler.hpp`, the header guard should be `LOG_SURGEON_FINITE_AUTOMATA_REGISTER_HANDLER_HPP`.
src/log_surgeon/Lexer.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
src/log_surgeon/finite_automata/Nfa.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/NfaState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
🔇 Additional comments (37)
tests/test-nfa.cpp (2)

9-10: LGTM! Header additions are well-organized.

The new headers align with the PR's focus on enhancing NFA state management and lexical rule handling.


65-67: LGTM! Robust error handling and test diagnostics.

The changes demonstrate good practices:

  • Using std::optional for safe serialization result handling
  • Adding explicit value checks
  • Improving test failure diagnostics with CAPTURE

Also applies to: 72-72

src/log_surgeon/finite_automata/RegisterHandler.hpp (4)

45-47: LGTM!

The method correctly uses at() for bounds checking and properly delegates to the prefix tree's set operation.


49-51: LGTM!

The method correctly uses at() for bounds checking and properly copies the prefix tree node ID.


53-56: LGTM!

The method correctly uses at() for bounds checking and properly delegates to the prefix tree's insert operation.


58-61: LGTM!

The method correctly uses at() for bounds checking and properly delegates to the prefix tree's get_reversed_positions operation.

src/log_surgeon/finite_automata/RegexAST.hpp (4)

24-24: Include of TagOperation.hpp is consistent with the new architecture
No issues observed with this addition.


863-866: Spontaneous transition for optional repetitions
This approach for handling m_min == 0 via a spontaneous transition is logical and aligns with zero-times repetition semantics.


911-943:

Details

✅ Verification successful

Capture group documentation and implementation
The expanded comments on the capture group NFA structure are very clear. Please verify that tests thoroughly exercise these transitions and capture boundaries.


🏁 Script executed:

#!/bin/bash
# Searching for references to RegexASTCapture usage in test files to confirm coverage
rg -A 10 "RegexASTCapture" tests

Length of output: 1809


Verified: Capture Group NFA Testing Coverage is Sufficient

The tests in tests/test-lexer.cpp confirm that RegexASTCapture (and its specialised instantiation, e.g. RegexASTCaptureByte) are being exercised. The dynamic casts and assertions—including checks for capture names and properties of the nested capture regex—demonstrate that the transitions and capture boundaries documented in src/log_surgeon/finite_automata/RegexAST.hpp (lines 911–943) are thoroughly tested.


117-122:

Details

❓ Verification inconclusive

Negative captures logic looks correct
This usage of new_state_from_negative_captures is consistent with the existing NFA flow. Ensure corresponding tests confirm that negative captures work as intended.


🏁 Script executed:

#!/bin/bash
# Searching for tests referencing "new_state_from_negative_captures" to confirm coverage
rg -A 5 "new_state_from_negative_captures" tests

Length of output: 48


Negative Captures Logic Verification

The code at lines 117–122 in src/log_surgeon/finite_automata/RegexAST.hpp appears to correctly invoke new_state_from_negative_captures in a manner that fits the existing NFA flow. However, the search for tests referencing new_state_from_negative_captures in the tests directory returned no results. Could you please verify manually that there are tests covering negative captures? If not, please consider adding tests to ensure that the negative captures functionality is working as intended.

src/log_surgeon/LexicalRule.hpp (3)

5-7: Headers imported for captures and vectors
These headers are correct and necessary for the new capture functionality.


28-30: Validate lifetime of returned captures vector
The get_captures() method returns a reference to data owned by m_regex. Ensure that users of this method do not outlive m_regex.


45-48: Integration with new_accepting_state and negative captures
Calling new_accepting_state(m_variable_id) followed by add_to_nfa_with_negative_captures is consistent with the updated negative captures approach.

src/log_surgeon/finite_automata/TagOperation.hpp (2)

1-58: Well-structured implementation of TagOperation
The enum, constructor, and serialization logic follow a clear design.


1-58: Consider handling unforeseen TagOperationType cases
If new enum values are introduced in the future, the switch in serialize() may silently omit them. A fallback case or exception can help detect missing cases at compile time.

src/log_surgeon/finite_automata/SpontaneousTransition.hpp (4)

26-30: LGTM! Well-designed constructors.

The constructors are well-implemented, with efficient use of std::move for tag operations.


32-34: LGTM! Well-designed getter methods.

The getters follow best practices with const references and proper use of [[nodiscard]].


36-42: LGTM! Well-documented method declaration.

The serialize method is well-documented with clear @param and @return descriptions.


49-67: LGTM! Modern and efficient implementation.

The implementation makes good use of modern C++ features:

  • Proper use of ranges for transforming tag operations
  • Efficient string formatting with fmt
  • Follows coding guidelines with false == state_ids.contains(m_dest_state)
src/log_surgeon/finite_automata/Dfa.hpp (2)

109-121: LGTM! Efficient state creation.

The implementation efficiently creates new DFA states using smart pointers and proper state management.


176-190: LGTM! Clean serialization implementation.

The implementation efficiently serializes the DFA using BFS traversal order and proper string formatting.

src/log_surgeon/finite_automata/NfaState.hpp (5)

40-52: LGTM! Well-designed constructors.

The constructors are well-implemented with proper initialization and delegation.


54-65: LGTM! Efficient transition addition.

The implementation efficiently adds spontaneous transitions with proper memory management.


92-108: LGTM! Well-designed getter methods.

The getters follow best practices with proper use of [[nodiscard]] and const references.


168-183: LGTM! Well-implemented epsilon closure.

The implementation correctly handles epsilon closure using spontaneous transitions and follows coding guidelines.


185-217: LGTM! Clean serialization implementation.

The implementation efficiently serializes the NFA state and follows coding guidelines.

src/log_surgeon/Lexer.hpp (3)

121-124: LGTM! Well-typed DFA accessor.

The method correctly returns a const reference to the DFA with proper template parameters.


159-165: LGTM! Well-implemented register ID lookup.

The method correctly handles optional values and follows coding guidelines.


216-221: Verify the choice of std::map over std::unordered_map.

The implementation uses std::map for m_tag_to_final_reg_id instead of std::unordered_map. Consider using std::unordered_map if ordered iteration is not required, as it provides better average-case performance.

Would you like me to measure the performance impact of using std::unordered_map instead of std::map?

src/log_surgeon/finite_automata/Nfa.hpp (4)

147-152: LGTM!

The function correctly creates a new accepting state with the provided matching variable ID.


154-170: LGTM!

The function correctly creates a new state with negative tag operations for the provided captures.


172-185: LGTM!

The function correctly creates start and end states with appropriate tag operations for positive captures.


220-238: LGTM!

The function now properly handles serialization failures by returning std::nullopt instead of potentially invalid data.

src/log_surgeon/Lexer.tpp (4)

87-88: LGTM!

The function now uses the more descriptive get_dest_state method and has improved error handling with explicit nullptr checks.

Also applies to: 91-92, 93-94, 99-100


218-219: LGTM!

The function now uses the more descriptive get_dest_state method and has improved error handling with explicit nullptr checks for wildcard scanning.

Also applies to: 222-223, 224-225, 230-231, 246-247, 280-283


340-341: LGTM!

The function now uses the more descriptive get_dest_state method for handling start-of-file characters.


407-408: LGTM!

The function now correctly uses the updated DFA constructor with both TypedDfaState and TypedNfaState template parameters, and has improved error handling with explicit nullptr checks.

Also applies to: 410-411

@SharafMohamed SharafMohamed force-pushed the add-register-operations branch 2 times, most recently from 69de670 to ed7ef25 Compare February 18, 2025 16:21
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (1)
src/log_surgeon/LexicalRule.hpp (1)

34-37: 🛠️ Refactor suggestion

Address the TODO comment by making the returned pointer constant.

The TODO comment indicates that the returned pointer should be constant. This would improve const-correctness and prevent unintended modifications.

Apply this diff to fix the issue:

-    [[nodiscard]] auto get_regex() const -> finite_automata::RegexAST<TypedNfaState>* {
-        // TODO: make the returned pointer constant
-        return m_regex.get();
+    [[nodiscard]] auto get_regex() const -> finite_automata::RegexAST<TypedNfaState> const* {
+        return m_regex.get();
♻️ Duplicate comments (1)
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)

29-43: ⚠️ Potential issue

Add overflow protection when calculating register IDs.

The methods return size() - 1 without checking for vector overflow.

 auto add_register() -> reg_id_t {
+    if (m_registers.size() >= std::numeric_limits<reg_id_t>::max()) {
+        throw std::runtime_error("Register limit exceeded");
+    }
     auto const prefix_tree_node_id{
             m_prefix_tree.insert(PrefixTree::cRootId, PrefixTree::cDefaultPos)
     };
     m_registers.emplace_back(prefix_tree_node_id);
     return m_registers.size() - 1;
 }
🧹 Nitpick comments (4)
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)

21-27: Consider pre-allocating the vector size for better performance.

The vector could grow multiple times during the loop. Pre-allocating the size would prevent reallocations.

 auto add_registers(uint32_t const num_reg_to_add) -> std::vector<uint32_t> {
     std::vector<uint32_t> added_registers;
+    added_registers.reserve(num_reg_to_add);
     for (uint32_t i{0}; i < num_reg_to_add; ++i) {
         added_registers.emplace_back(add_register());
     }
     return added_registers;
 }
src/log_surgeon/finite_automata/Dfa.hpp (1)

144-174: LGTM, but address the UTF-8 TODO comment.

The BFS implementation is efficient with proper container usage and capacity reservations. However, the TODO comment about handling the UTF-8 case needs to be addressed.

Would you like me to help implement the UTF-8 case handling?

src/log_surgeon/Lexer.tpp (2)

246-247: Consider using early returns to simplify the wildcard handling logic.

The code could be more concise and easier to follow by using early returns.

Apply this diff to improve readability:

-                        auto* dest_state{state->get_dest_state(byte)};
-                        if (false == dest_state->is_accepting()) {
+                        auto* dest_state = state->get_dest_state(byte);
+                        if (nullptr == dest_state || false == dest_state->is_accepting()) {
                             token = Token{m_last_match_pos,
                                     input_buffer.storage().pos(),
                                     input_buffer.storage().get_active_buffer(),

Also applies to: 280-282


406-407: Address the TODO comment about DFA ignoring captures.

The comment indicates that the DFA is not properly handling captures, which could lead to incorrect parsing.

Would you like me to help implement proper capture handling in the DFA?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e3d766b and ed7ef25.

📒 Files selected for processing (14)
  • CMakeLists.txt (1 hunks)
  • examples/intersect-test.cpp (3 hunks)
  • src/log_surgeon/Lexer.hpp (4 hunks)
  • src/log_surgeon/Lexer.tpp (8 hunks)
  • src/log_surgeon/LexicalRule.hpp (1 hunks)
  • src/log_surgeon/SchemaParser.cpp (1 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (5 hunks)
  • src/log_surgeon/finite_automata/DfaState.hpp (3 hunks)
  • src/log_surgeon/finite_automata/DfaStatePair.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (2 hunks)
  • src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegisterHandler.hpp (2 hunks)
  • tests/CMakeLists.txt (2 hunks)
  • tests/test-register-handler.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • src/log_surgeon/finite_automata/PrefixTree.hpp
  • CMakeLists.txt
  • tests/test-register-handler.cpp
  • src/log_surgeon/SchemaParser.cpp
  • tests/CMakeLists.txt
  • src/log_surgeon/finite_automata/Nfa.hpp
  • src/log_surgeon/finite_automata/DfaStatePair.hpp
  • examples/intersect-test.cpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/LexicalRule.hpp
  • src/log_surgeon/Lexer.tpp
  • src/log_surgeon/finite_automata/Dfa.hpp
  • src/log_surgeon/Lexer.hpp
  • src/log_surgeon/finite_automata/RegisterHandler.hpp
  • src/log_surgeon/finite_automata/DfaState.hpp
🧠 Learnings (1)
src/log_surgeon/finite_automata/RegisterHandler.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T21:56:13.425Z
Learning: In the `log_surgeon` project, header guards in C++ header files should include `_HPP` at the end to match the filename. For example, in `RegisterHandler.hpp`, the header guard should be `LOG_SURGEON_FINITE_AUTOMATA_REGISTER_HANDLER_HPP`.
🔇 Additional comments (18)
src/log_surgeon/finite_automata/RegisterHandler.hpp (3)

1-9: LGTM! Header guards and includes are well-structured.

The header guard follows the correct naming convention, and the includes are appropriate for the functionality.


11-18: LGTM! Documentation is clear and informative.

The class documentation effectively explains the purpose and responsibilities, with a helpful note about register initialization.


45-61: LGTM! Register operations are well-implemented.

The methods use at() for bounds checking and have been correctly updated to use reg_id_t.

src/log_surgeon/finite_automata/DfaState.hpp (5)

34-38: LGTM! Proper initialization of transitions array.

The constructor correctly initializes all transitions following RAII principles, which prevents potential null pointer issues.


52-55: LGTM! Improved type safety with DfaTransition.

The method now accepts a proper DfaTransition object instead of a raw pointer, which enhances type safety.


57-59: LGTM! Clean implementation of accepting operations.

The method provides a clear interface for adding register operations.


101-141: LGTM! Well-structured serialization implementation.

The serialization logic is thorough and makes good use of the fmt library for string formatting. The code handles all aspects of the DFA state: accepting tags, operations, and transitions.


84-99: 🛠️ Refactor suggestion

Fix boolean expression to follow coding guidelines.

The implementation looks good, but there's a minor style issue.

Update line 94 to follow the coding guideline preferring 'false ==' over '!':

-        if (false == result->empty()) {
+        if (false == result->empty()) {

Likely an incorrect or invalid review comment.

src/log_surgeon/LexicalRule.hpp (3)

5-5: LGTM!

The added includes are necessary for the new functionality and follow standard include ordering.

Also applies to: 7-7


28-30: LGTM!

The method is well-designed with appropriate const-correctness and [[nodiscard]] attribute.


44-50: LGTM!

The method follows a clear sequence of operations and correctly uses negative captures.

src/log_surgeon/finite_automata/Dfa.hpp (3)

11-13: LGTM!

The added includes support the new serialization functionality, and the additional template parameter enhances type safety.

Also applies to: 14-14, 19-19


22-22: LGTM!

The changes improve efficiency and const-correctness while adding useful serialization functionality.

Also applies to: 27-27, 34-34


103-103:

Details

❓ Verification inconclusive

Verify the empty set usage in byte transitions.

The empty set passed to add_byte_transition suggests potential missing functionality or an incomplete implementation.

Run the following script to check for similar patterns:


🏁 Script executed:

#!/bin/bash
# Description: Search for other instances of empty set usage in transitions.
rg -A 5 'add_byte_transition.*\{\{\},'

Length of output: 415


Review of Empty Set Usage in Byte Transitions

The usage at line 103 in src/log_surgeon/finite_automata/Dfa.hpp—where an empty set is passed in the call to dfa_state->add_byte_transition—appears to be intentional, particularly given the nearby comment referencing a future UTF‑8 case implementation. However, please verify that this placeholder behaviour aligns with our design goals and consider adding further documentation or inline comments to clarify its purpose for future maintainability.

src/log_surgeon/Lexer.hpp (3)

6-6: LGTM!

The added map include supports the new tag-to-register mapping functionality.


216-216: LGTM!

The DFA type update improves type safety by making the NFA state type explicit.


221-221:

Details

❓ Verification inconclusive

Verify the need for ordered map over unordered_map.

The change from unordered_map to map suggests iteration order is important. Please confirm if this is necessary, as unordered_map generally provides better performance.

Run the following script to check if iteration order is used:


🏁 Script executed:

#!/bin/bash
# Description: Search for iterations over m_tag_to_final_reg_id
rg -A 5 'for.*m_tag_to_final_reg_id'

Length of output: 38


Verify the need for an ordered map.

The change from using an unordered_map to a map suggests that the iteration order might be relevant. However, our automated search for loops iterating over m_tag_to_final_reg_id returned no results. Could you please manually verify whether preserving order is actually required in subsequent code? This additional check would help confirm if the ordered container is necessary or if reverting to unordered_map for potential performance gains would be acceptable.

src/log_surgeon/Lexer.tpp (1)

87-88: LGTM!

The state transition changes improve clarity and type consistency.

Also applies to: 91-92, 218-219, 222-223

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
src/log_surgeon/finite_automata/RegisterOperation.hpp (1)

21-34: Consider adding input validation.

While the implementation is correct, consider validating reg_id and copy_reg_id to ensure they are within valid ranges.

 RegisterOperation(reg_id_t const reg_id, RegisterOperationType const type)
         : m_reg_id{reg_id},
-          m_type{type} {}
+          m_type{type} {
+    validate_reg_id(reg_id);
+}

 RegisterOperation(reg_id_t const reg_id, reg_id_t const copy_reg_id)
         : m_reg_id{reg_id},
           m_type{RegisterOperationType::Copy},
-          m_copy_reg_id{copy_reg_id} {}
+          m_copy_reg_id{copy_reg_id} {
+    validate_reg_id(reg_id);
+    validate_reg_id(copy_reg_id);
+}

+private:
+    static void validate_reg_id(reg_id_t const id) {
+        if (id > MAX_REG_ID) {
+            throw std::invalid_argument("Register ID out of valid range");
+        }
+    }
src/log_surgeon/finite_automata/DfaTransition.hpp (2)

21-29: Add nullptr validation for dest_state.

Consider validating that dest_state is not nullptr in the constructor to fail fast.

 DfaTransition(std::vector<RegisterOperation> reg_ops, DfaState<state_type>* dest_state)
         : m_reg_ops{std::move(reg_ops)},
-          m_dest_state{dest_state} {}
+          m_dest_state{dest_state} {
+    if (dest_state == nullptr) {
+        throw std::invalid_argument("dest_state cannot be nullptr");
+    }
+}

40-67: Optimize vector allocation in serialize().

Consider reserving space for transformed_ops to avoid reallocations.

     std::vector<std::string> transformed_ops;
+    transformed_ops.reserve(m_reg_ops.size());
     for (auto const& reg_op : m_reg_ops) {
tests/test-dfa.cpp (2)

28-40: Consider adding more test cases.

While this test case is comprehensive, consider adding tests for:

  • Invalid input handling
  • Edge cases (empty regex, complex patterns)
  • Error conditions

53-69: Consider extracting comparison logic into a helper function.

The line-by-line comparison logic could be made more readable by extracting it into a helper function.

+static void compare_serialized_output(string const& actual, string const& expected) {
+    stringstream ss_actual{actual};
+    stringstream ss_expected{expected};
+    string actual_line;
+    string expected_line;
+
+    CAPTURE(actual);
+    CAPTURE(expected);
+    while (getline(ss_actual, actual_line) && getline(ss_expected, expected_line)) {
+        REQUIRE(actual_line == expected_line);
+    }
+    getline(ss_actual, actual_line);
+    REQUIRE(actual_line.empty());
+    getline(ss_expected, expected_line);
+    REQUIRE(expected_line.empty());
+}
+
-    // Compare expected and actual line-by-line
-    auto const actual_serialized_dfa{dfa.serialize()};
-    stringstream ss_actual{actual_serialized_dfa};
-    stringstream ss_expected{expected_serialized_dfa};
-    string actual_line;
-    string expected_line;
-
-    CAPTURE(actual_serialized_dfa);
-    CAPTURE(expected_serialized_dfa);
-    while (getline(ss_actual, actual_line) && getline(ss_expected, expected_line)) {
-        REQUIRE(actual_line == expected_line);
-    }
-    getline(ss_actual, actual_line);
-    REQUIRE(actual_line.empty());
-    getline(ss_expected, expected_line);
-    REQUIRE(expected_line.empty());
+    compare_serialized_output(dfa.serialize(), expected_serialized_dfa);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed7ef25 and 32443c7.

📒 Files selected for processing (3)
  • src/log_surgeon/finite_automata/DfaTransition.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegisterOperation.hpp (1 hunks)
  • tests/test-dfa.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/finite_automata/DfaTransition.hpp
  • src/log_surgeon/finite_automata/RegisterOperation.hpp
  • tests/test-dfa.cpp
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
  • GitHub Check: build (macos-latest, Release)
  • GitHub Check: build (macos-latest, Debug)
🔇 Additional comments (4)
src/log_surgeon/finite_automata/RegisterOperation.hpp (3)

13-20: LGTM! Well-structured enum class and class declaration.

The enum class is properly scoped with an appropriate underlying type, and the class interface is clean with good encapsulation.


39-53: Improve error handling in serialize().

The default case should be replaced with proper error handling as suggested in the previous review.


55-59: LGTM! Well-structured private members.

Good use of encapsulation and std::optional for the copy register ID.

src/log_surgeon/finite_automata/DfaTransition.hpp (1)

30-33: LGTM! Well-designed accessor methods.

Good use of [[nodiscard]] attribute and proper return type specifications.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
tests/test-dfa.cpp (3)

28-40: Consider improving test setup readability.

The test setup could be more readable by extracting the schema creation into a helper function and adding comments explaining the expected DFA structure.

+// Helper function to create schema with a variable
+Schema create_schema_with_var(const string& var_name, const string& pattern) {
+    Schema schema;
+    schema.add_variable(var_name + ":" + pattern, -1);
+    return schema;
+}

 TEST_CASE("Test Simple Untagged DFA", "[DFA]") {
-    Schema schema;
-    string const var_name{"capture"};
-    string const var_schema{var_name + ":" + "userID=123"};
-    schema.add_variable(var_schema, -1);
+    // Create DFA for pattern "userID=123"
+    // Expected structure: u->s->e->r->I->D->=->[1-3]
+    auto schema = create_schema_with_var("capture", "userID=123");

55-71: Extract common verification code.

The line-by-line comparison logic is duplicated in both test cases. Consider extracting it into a helper function.

+// Helper function to verify DFA serialization
+void verify_dfa_serialization(const string& actual, const string& expected) {
+    stringstream ss_actual{actual};
+    stringstream ss_expected{expected};
+    string actual_line;
+    string expected_line;
+
+    CAPTURE(actual);
+    CAPTURE(expected);
+    while (getline(ss_actual, actual_line) && getline(ss_expected, expected_line)) {
+        REQUIRE(actual_line == expected_line);
+    }
+    getline(ss_actual, actual_line);
+    REQUIRE(actual_line.empty());
+    getline(ss_expected, expected_line);
+    REQUIRE(expected_line.empty());
+}

-    // Compare expected and actual line-by-line
-    auto const actual_serialized_dfa{dfa.serialize()};
-    stringstream ss_actual{actual_serialized_dfa};
-    stringstream ss_expected{expected_serialized_dfa};
-    string actual_line;
-    string expected_line;
-
-    CAPTURE(actual_serialized_dfa);
-    CAPTURE(expected_serialized_dfa);
-    while (getline(ss_actual, actual_line) && getline(ss_expected, expected_line)) {
-        REQUIRE(actual_line == expected_line);
-    }
-    getline(ss_actual, actual_line);
-    REQUIRE(actual_line.empty());
-    getline(ss_expected, expected_line);
-    REQUIRE(expected_line.empty());
+    verify_dfa_serialization(dfa.serialize(), expected_serialized_dfa);

73-113: Add test cases for edge cases and error conditions.

The current test cases cover basic and complex patterns but could be expanded to include:

  • Empty patterns
  • Invalid patterns
  • Patterns with escaped special characters
  • Patterns with boundary conditions (^, $)
  • Patterns with character classes ([[:digit:]], [[:space:]], etc.)

Would you like me to generate additional test cases for these scenarios?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32443c7 and eea8ad5.

📒 Files selected for processing (1)
  • tests/test-dfa.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • tests/test-dfa.cpp
🔇 Additional comments (1)
tests/test-dfa.cpp (1)

1-27: LGTM! Well-organized headers and type declarations.

The headers are properly organized, and the type aliases improve code readability.

@SharafMohamed SharafMohamed force-pushed the add-register-operations branch from eea8ad5 to c42a214 Compare February 25, 2025 22:20
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (8)
src/log_surgeon/Lexer.hpp (1)

221-221: Ensure naming consistency.
Since the previous member was m_tag_to_reg_id, confirm that this new name m_tag_to_final_reg_id is consistently used and explained, avoiding confusion about legacy references.

tests/test-dfa.cpp (2)

28-71: Commendable thoroughness in verifying DFA serialization line-by-line.
These checks ensure that even small changes in state ordering or transitions are caught. However, consider adding negative tests (e.g., malformed inputs) to boost confidence in error-handling scenarios.


73-114: Add variety to the tested transitions.
This second test case focuses primarily on branches and simple character classes. Consider adding more complex transitions (like repeated patterns) or verifying that backslashes and special characters are handled as expected by the DFA.

src/log_surgeon/finite_automata/Dfa.hpp (3)

11-12: Check for minimal fmt includes.
If only fmt::format is used, consider including just <fmt/format.h>. Redundant headers can marginally slow down builds or clutter the codebase.


50-55: Ensure BFS logic accounts for multi-byte transitions if necessary.
Currently, the BFS logic loops over single-byte transitions. If future expansions include multi-byte characters, confirm that this function can adapt or be extended seamlessly.


103-103: Validate the usage of empty set in byte transitions.
Here, transitions use {{}, dest_state} instead of a more descriptive structure for register ops. If registers are not applicable, consider a clearer naming or comment to avoid confusion later.

src/log_surgeon/finite_automata/DfaTransition.hpp (2)

30-32: Public getters appear consistent.

Returning copies of m_reg_ops is acceptable for an immutable pattern; however, consider returning a reference for performance if data sets are large.


49-67: Pointer ownership caution.

m_dest_state is a raw pointer, which can lead to dangling references if the pointed DfaState is destructed first. Consider using references or a smart pointer if feasible.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eea8ad5 and 228e8b7.

📒 Files selected for processing (9)
  • CMakeLists.txt (1 hunks)
  • src/log_surgeon/Lexer.hpp (2 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (5 hunks)
  • src/log_surgeon/finite_automata/DfaState.hpp (4 hunks)
  • src/log_surgeon/finite_automata/DfaTransition.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegisterHandler.hpp (2 hunks)
  • src/log_surgeon/finite_automata/RegisterOperation.hpp (1 hunks)
  • tests/CMakeLists.txt (2 hunks)
  • tests/test-dfa.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • CMakeLists.txt
  • tests/CMakeLists.txt
  • src/log_surgeon/finite_automata/RegisterHandler.hpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • tests/test-dfa.cpp
  • src/log_surgeon/Lexer.hpp
  • src/log_surgeon/finite_automata/RegisterOperation.hpp
  • src/log_surgeon/finite_automata/DfaState.hpp
  • src/log_surgeon/finite_automata/DfaTransition.hpp
  • src/log_surgeon/finite_automata/Dfa.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
  • GitHub Check: build (macos-latest, Release)
🔇 Additional comments (13)
src/log_surgeon/Lexer.hpp (1)

161-163: Consider verifying performance trade-off of using std::map instead of std::unordered_map.
This switch might be intentional for consistent iteration order, yet for frequent look-ups, std::unordered_map can be quicker depending on usage patterns.

src/log_surgeon/finite_automata/Dfa.hpp (2)

24-27: Documentation is clear and concise.
The overview for the serialize() method is well-explained, increasing code clarity.


144-190: Excellent addition of a BFS-based serialize method.
This BFS traversal ensures a stable ordering of states. For large DFAs, confirm that the queue-based approach remains efficient.

src/log_surgeon/finite_automata/RegisterOperation.hpp (3)

1-17: Looks good overall.

The enumeration and includes appear correct. The usage of false == expression complies with the stated coding guidelines.


21-29: Constructors are clear and expressive.

They properly initialise m_type and optionally set m_copy_reg_id. No concerns here.


39-53: Consider throwing an exception for unknown operation types.

Returning "?" in the default case may mask developer errors and complicate debugging. Throwing or using an exhaustive switch helps avoid unexpected states.

src/log_surgeon/finite_automata/DfaState.hpp (5)

34-38: Clear default initialisation.

Initialising each transition with an empty set of register operations is a sound approach.


52-56: Transitions stored in an array are well-organised.

Assigning the transition object at [byte] clarifies how transitions are managed for each input character.


57-59: Straightforward acceptance operation tracking.

Storing the RegisterOperation by value is acceptable here. Ensure consumers handle required ownership semantics.


101-141: Serialization is robust and flexible.

It neatly handles both accepting tags and transitions. This design fosters easy debugging and logging of the DFA’s structure.


85-87:

Details

❓ Verification inconclusive

Verify out-of-bounds risk in get_dest_state.

If character is greater than 255 for StateType::Byte, this could cause undefined behaviour. Please confirm that character is never outside [0, 255].


🏁 Script executed:

#!/bin/bash
# Searching for call sites of get_dest_state to ensure no usage beyond [0..255] for Byte states
rg -A 5 'get_dest_state\s*\('

Length of output: 9851


Action Required: Confirm and Safeguard the Byte Range Invariant

The grep output indicates that call sites—such as those in Lexer.tpp—appear to pass values that are typically within the [0, 255] range. However, in the if constexpr (StateType::Byte == state_type) branch of DfaState::get_dest_state, there is no explicit range check for character. This means that if an out-of-bound value is ever passed, it could lead to undefined behaviour due to an out-of-bound array access.

Please verify that all call sites guarantee that the character argument is within the [0, 255] range. If this invariant might be violated in any scenario, consider one or more of the following actions:

  • Enforce the invariant by adding an assertion (e.g., assert(character < cSizeOfByte)) right before the array access.
  • Document the precondition in the function’s comments to clearly state that for StateType::Byte, the character must be between 0 and 255.
  • Review usage in all call sites for potential risks if non-byte values are passed into this function.
src/log_surgeon/finite_automata/DfaTransition.hpp (2)

24-28: Constructors are concise.

Default construction and move-initialising m_reg_ops are straightforward; no immediate issues arise.


40-42: Serializable transitions.

Allowing an optional result is helpful for error or missing-state scenarios, preventing partial data from creeping in.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed Dfa.hpp.

SharafMohamed and others added 2 commits February 26, 2025 17:43
Co-authored-by: Lin Zhihao <[email protected]>
Co-authored-by: Lin Zhihao <[email protected]>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/log_surgeon/finite_automata/Dfa.hpp (1)

153-158: Lambda function could be simplified in the context of its usage.

The lambda function is only used once in the loop below. Consider inlining the logic where it's used or simplifying if any new change as per the previous review is accepted.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 765598f and 34a7f5f.

📒 Files selected for processing (1)
  • src/log_surgeon/finite_automata/Dfa.hpp (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/finite_automata/Dfa.hpp
🪛 GitHub Actions: Build
src/log_surgeon/finite_automata/Dfa.hpp

[error] 181-181: use of undeclared identifier 'traversal_oerder'; did you mean 'traversal_order'?


[error] 178-178: 'traversal_order' declared here

🔇 Additional comments (7)
src/log_surgeon/finite_automata/Dfa.hpp (7)

11-13: Good addition of the fmt library for formatted output.

The inclusion of fmt library headers is appropriate for the new serialization functionality.


24-27: Well-documented and properly declared new serialize method.

The method is correctly declared with [[nodiscard]] attribute and trailing return type syntax, which is consistent with modern C++ style.


50-54: Good implementation of BFS traversal helper method.

The method is properly documented and uses the same style conventions as other methods in the class.


103-103: Correct adaptation to the new DfaTransition interface.

The call correctly passes a pair with empty register operations and the destination state.


166-166: Maintain consistency in TODO comments.

        // TODO: Handle the utf8 case

168-170: Simplify checking and adding states to BFS traversal.

This function doesn't need to reuse add_to_queue_and_visited, so we can simplify the code:

-            if (nullptr != dest_state) {
-                add_to_queue_and_visited(dest_state);
-            }
+            if (nullptr == dest_state || visited_states.contains(dest_state)) {
+                continue;
+            }
+            visited_states.emplace(dest_state);
+            state_queue.push(dest_state);

190-190: Well-formatted serialization output using fmt library.

The use of fmt::format and fmt::join is a clean way to format the serialized output with newlines.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed DfaState.hpp

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/log_surgeon/finite_automata/Dfa.hpp (1)

79-79: Consider using a data structure optimised for 256 possible keys.
As ascii_transitions_map might only handle up to 256 entries, consider using an array or an std::unordered_map<uint8_t, StateSet> with a reserve of 256 for potential performance benefits.

src/log_surgeon/finite_automata/DfaState.hpp (1)

49-51: Method signature fits the new transition design.
Optionally, consider passing dfa_transition by const reference if it is large, but for small objects this is acceptable.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d64cc5 and 37aa5f8.

📒 Files selected for processing (3)
  • src/log_surgeon/finite_automata/Dfa.hpp (6 hunks)
  • src/log_surgeon/finite_automata/DfaState.hpp (4 hunks)
  • src/log_surgeon/finite_automata/RegisterOperation.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/finite_automata/Dfa.hpp
  • src/log_surgeon/finite_automata/RegisterOperation.hpp
  • src/log_surgeon/finite_automata/DfaState.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
  • GitHub Check: build (macos-latest, Release)
🔇 Additional comments (18)
src/log_surgeon/finite_automata/Dfa.hpp (8)

11-12: All necessary includes appear correctly added.
No issues noted in these newly introduced headers.

Also applies to: 14-14


24-28: Good documentation for the new serialize method.
The docstring clearly indicates the return value and usage.


50-55: BFS documentation is concise and clear.
No issues identified in this comment block or signature.


82-82: Loop with uint16_t adequately covers cSizeOfByte.
No issues found in this loop iteration approach.


101-103: Transition creation logic is correct.
The use of {{}, dest_state} for register operations is coherent with the updated transition structure.


143-145: BFS helper function is well-named and straightforward.
No issues noted in these lines.


147-174: BFS state visitation logic is properly implemented.
Reserve calls and the insertion checks look correct, and the lambda usage is neat.


176-191: serialize() method implementation appears correct and efficient.
The BFS traversal approach, state ID mapping, and final joined output are neatly executed.

src/log_surgeon/finite_automata/RegisterOperation.hpp (1)

1-85: Overall structure and implementation look solid.

  • The factory methods (create_set_operation, create_negate_operation, create_copy_operation) help prevent misuse of constructors.
  • Returning std::nullopt for invalid states in serialize() is a sensible approach.
  • Use of false == m_copy_reg_id.has_value() aligns with the coding guideline.

No further issues noted.

src/log_surgeon/finite_automata/DfaState.hpp (9)

4-4: Headers imported look appropriate.
These additions seem necessary for the new functionality (array, fmt, etc.).

Also applies to: 8-8, 11-11, 14-15, 18-19


35-35: Constructor usage of fill(...) is a clean way to initialise transitions.
No concerns noted.


53-53: Accepting operations integration is straightforward.
Storing them in a vector is clear and maintainable.


57-62: serialize(...) docstring is clearly explaining the parameters and result.
Implementation details are consistent with the design.


66-66: Minor doc snippet for character transitions.
No issues to report.


72-72: Addition of m_accepting_ops brings register-based logic into DfaState.
No further concerns here.


80-80: Definition of get_dest_state is well-structured.
The if-constexpr handles Byte vs. Utf8 logic effectively.


82-82: Transition returns from m_bytes_transition and tree look correct.
No issues found in these return statements.

Also applies to: 85-85, 91-91


97-138: serialize(...) method thoroughly documents the state's structure.

  • Accepting tags and operations are neatly enumerated and appended.
  • Byte transitions are appended in an easily readable format.

Overall, no issues noted.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed DfaTransition.hpp.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit tests lgtm. Two more comments before we can merge.

Co-authored-by: Lin Zhihao <[email protected]>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/log_surgeon/finite_automata/DfaTransition.hpp (1)

45-50: Fix minor docstring typo.
In line 49, replace “statd_ids” with “state_ids” to match the method parameter name.

- * @return std::nullopt if `m_dest_state` is not in `statd_ids`.
+ * @return std::nullopt if `m_dest_state` is not in `state_ids`.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32ad10d and 0586f90.

📒 Files selected for processing (1)
  • src/log_surgeon/finite_automata/DfaTransition.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/finite_automata/DfaTransition.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
  • GitHub Check: build (macos-latest, Release)
🔇 Additional comments (6)
src/log_surgeon/finite_automata/DfaTransition.hpp (6)

1-3: Good use of include guards.
No issues found here. This prevents multiple inclusions and helps maintain consistent declarations.


21-29: Well-structured documentation and class definition.
Thank you for providing clear bullet points describing each aspect of the transition. This directly addresses the previous request for more descriptive docstrings.


31-35: Constructors are concise and efficient.
The default constructor and the parameterized constructor both look good. Using std::move(reg_ops) is effective for avoiding unnecessary copies.


37-39: Returning a const reference for register operations is aligned with the previous suggestion.
This choice helps prevent unnecessary copies and is consistent with earlier feedback on returning a const reference.


41-43: Getter for destination state looks fine.
Accessing a const pointer to the destination state is straightforward and efficient.


60-78: Serialization method follows coding guidelines.
The checks use “false == <expression>” consistently, adhering to the stated style convention. The optional-based approach for handling serialization failures is clear and well structured.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (5)
tests/test-dfa.cpp (1)

28-117: Consider adding tests for register operations.

While these tests thoroughly validate DFA construction and serialization, there are no explicit tests for the register operations mentioned in the PR description. Consider adding tests that verify the correct implementation of register operations to ensure complete test coverage.

src/log_surgeon/finite_automata/Dfa.hpp (2)

84-84: Consider using uint8_t for byte-related loop variable.

The loop iterates over byte values (0-255) but uses uint16_t, which can represent values up to 65535. While this works correctly, using uint8_t would better match the intended range.

-            for (uint16_t i{0}; i < cSizeOfByte; ++i) {
+            for (uint8_t i{0}; i < cSizeOfByte; ++i) {

168-168: Consider using uint8_t for byte-related loop variable.

Since the loop is iterating from 0 to 255 (cSizeOfByte), using uint8_t would be more appropriate than uint32_t.

-        for (uint32_t idx{0}; idx < cSizeOfByte; ++idx) {
+        for (uint8_t idx{0}; idx < cSizeOfByte; ++idx) {
src/log_surgeon/finite_automata/DfaState.hpp (2)

54-56: Consider taking RegisterOperation by value for more efficient moves

The method takes a RegisterOperation by const reference but then copies it into the vector. It would be more efficient to take the parameter by value and use std::move to avoid an unnecessary copy.

-    auto add_accepting_op(RegisterOperation const reg_op) -> void {
-        m_accepting_ops.push_back(reg_op);
+    auto add_accepting_op(RegisterOperation reg_op) -> void {
+        m_accepting_ops.push_back(std::move(reg_op));

102-108: Inconsistent error handling for accepting operations

The serialization of accepting operations (lines 102-108) silently continues when serialization fails, unlike the byte transitions handling (lines 119-129) which returns std::nullopt on failure. This inconsistency might be intentional but should be documented or adjusted for consistency.

         auto const serialized_accepting_op{accepting_op.serialize()};
-        if (serialized_accepting_op.has_value()) {
+        if (false == serialized_accepting_op.has_value()) {
+            return std::nullopt;
+        }
-            accepting_op_strings.push_back(serialized_accepting_op.value());
-        }
+        accepting_op_strings.push_back(serialized_accepting_op.value());
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ffa78b1 and e5b1cce.

📒 Files selected for processing (6)
  • src/log_surgeon/Lexer.tpp (8 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (6 hunks)
  • src/log_surgeon/finite_automata/DfaState.hpp (3 hunks)
  • src/log_surgeon/finite_automata/DfaStatePair.hpp (1 hunks)
  • src/log_surgeon/finite_automata/DfaTransition.hpp (1 hunks)
  • tests/test-dfa.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/finite_automata/DfaStatePair.hpp
  • src/log_surgeon/finite_automata/Dfa.hpp
  • tests/test-dfa.cpp
  • src/log_surgeon/finite_automata/DfaState.hpp
  • src/log_surgeon/Lexer.tpp
  • src/log_surgeon/finite_automata/DfaTransition.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
  • GitHub Check: build (macos-latest, Debug)
🔇 Additional comments (29)
src/log_surgeon/finite_automata/DfaStatePair.hpp (1)

72-78: The transition handling is correctly implemented.

The code properly retrieves transitions using get_transition() and checks if they have values with has_value() before accessing destination states through get_dest_state(). This approach prevents potential null pointer dereferences.

tests/test-dfa.cpp (2)

28-73: Well-structured test case for simple DFA.

The test case properly sets up a schema, constructs a DFA from lexical rules, and validates the serialized output. The line-by-line comparison approach is thorough and the checks for empty lines ensure exact matching.


75-117: Complex DFA test case properly validates serialization.

This test validates a more complex regex pattern involving alternation, character classes, and quantifiers. The serialized output verification is comprehensive.

src/log_surgeon/finite_automata/Dfa.hpp (5)

26-29: Clear documentation for serialize method.

The method documentation clearly explains the purpose and return value behavior.


56-56: BFS traversal method is properly documented.

The documentation clearly describes the purpose and return value of the method.


103-106: Transition addition is correctly implemented.

The code properly creates and adds transitions with empty register operations, adapting to the new DFA transition architecture.


146-177: BFS traversal implementation is robust.

The implementation correctly uses a queue for BFS traversal, tracks visited states to avoid cycles, and maintains the order in which states are visited. The lambda function to add states to the queue and visited set is well-structured and reused effectively.


179-198: Serialization implementation is comprehensive.

The serialization correctly:

  1. Gets states in BFS traversal order
  2. Creates a mapping of states to IDs
  3. Serializes each state with proper error handling
  4. Joins the serialized states with newlines

The error handling with std::nullopt is particularly important for robustness.

src/log_surgeon/Lexer.tpp (5)

87-87: Transition handling is correctly updated.

The code properly retrieves transitions using get_transition() instead of directly accessing destination states.


100-100: Proper null check with has_value().

The code correctly checks if the transition exists before trying to access its destination state.


169-169: Destination state access is correct.

The code properly accesses the destination state from the transition after checking if it exists.


282-289: Well-implemented transition handling for wildcard logic.

This section correctly handles transitions by:

  1. Getting the optional transition
  2. Checking if it has a value using has_value()
  3. Accessing the destination state only if it exists

This is a good example of proper transition handling.


419-419: Correct usage of has_value() check.

The code correctly uses has_value() to check if transitions exist when populating the m_is_first_char array.

src/log_surgeon/finite_automata/DfaTransition.hpp (9)

17-56: Well-structured DfaTransition implementation with const correctness

The DfaTransition class is well-designed with appropriate const correctness in the accessor methods and member variables. The class properly encapsulates the destination state and register operations that define a DFA transition.


31-33: Good use of std::move for efficient parameter transfer

Using std::move with the reg_ops parameter efficiently transfers ownership of the vector without unnecessary copying, which is a good practice for non-trivial types.


35-37: LGTM: Properly returning const reference

Returning a const reference to the vector prevents unnecessary copying and ensures the caller cannot modify the internal state.


43-51: Well-documented serialize method with proper error handling

The documentation for the serialize method clearly explains the return values in different scenarios, providing good guidance for users of this class.


54-55: LGTM: Using const pointer for destination state

Using a const pointer for m_dest_state ensures the destination state cannot be modified through this transition, maintaining the immutability of the DFA structure once built.


58-76: Robust implementation of serialize with proper error handling

The serialize method properly handles error cases by returning std::nullopt when the destination state isn't in the provided map or when any register operation fails to serialize. The use of fmt::format for string formatting is clean and readable.


62-64: Good adherence to coding guidelines with condition ordering

The condition false == state_ids.contains(m_dest_state) follows the coding guideline that prefers false == <expression> rather than !<expression>.


69-71: Consistent error handling pattern

The pattern of checking for nullopt and early return is consistent with the error handling approach used throughout the class.


72-73: LGTM: Efficient use of emplace_back

Using emplace_back is more efficient than push_back for non-trivial types like std::string.

src/log_surgeon/finite_automata/DfaState.hpp (7)

36-36: LGTM: Default constructor meets requirements

The default constructor is appropriate here since std::array<std::optional<DfaTransition<state_type>>, cSizeOfByte> will initialize all elements with std::nullopt.


50-52: Good transition from raw pointers to DfaTransition objects

The method now accepts a DfaTransition object rather than a raw pointer, which encapsulates both the destination state and register operations. This improves the design by making the relationship between transitions and their operations explicit.


69-72: Improved state transition access with better error handling

The get_transition method now returns an optional DfaTransition, which provides more information and better error handling than just returning the destination state pointer.


75-76: LGTM: Good use of std::array and std::optional

Using std::array with std::optional for transitions is cleaner and safer than the previous approach using raw pointers.


117-129: Thorough serialization of transitions with proper error handling

The transition serialization code is well-implemented with proper error checking. It correctly skips transitions that don't exist and returns std::nullopt if any transition serialization fails.


131-138: Clean string formatting using fmt library

The format string is well-structured and correctly incorporates all the components of the DFA state, including accepting tags, operations, and transitions.


83-91:

Details

❓ Verification inconclusive

Incomplete UTF-8 support in get_transition method

The UTF-8 case for get_transition simply returns std::nullopt without using m_tree_transitions. This might be intentional if UTF-8 support is still in progress, but should be documented or addressed.


🏁 Script executed:

#!/bin/bash
# Description: Check if UTF-8 support is implemented elsewhere or if this is a known limitation

# Look for usages of m_tree_transitions to see if it's used elsewhere
rg "m_tree_transitions" --type cpp

# Look for comments or TODOs related to UTF-8 support
rg "UTF-8|utf8|unicode" --type cpp -C 3

Length of output: 21188


Action Required: Clarify UTF-8 Handling in get_transition Method

The UTF-8 branch in the get_transition method (in file src/log_surgeon/finite_automata/DfaState.hpp, lines 83–91) currently returns std::nullopt rather than utilising m_tree_transitions. Our search indicates that while m_tree_transitions is actively used elsewhere (e.g. in NfaState.hpp), the corresponding UTF-8 implementation for DFA transitions is still missing and is noted as a TODO in other parts of the code (see Dfa.hpp).

  • Location: src/log_surgeon/finite_automata/DfaState.hpp (lines 83–91)
  • Observation: The UTF-8 branch (for StateType::Utf8) directly returns std::nullopt without any further processing.
  • Recommendation:
    • If UTF-8 support is intended to be implemented later, include a descriptive comment or TODO note in this branch to indicate that the current behaviour is an acknowledged limitation.
    • Otherwise, consider implementing the proper handling by utilising m_tree_transitions as done for similar cases elsewhere in the codebase.

Comment on lines +167 to +168
// TODO: Handle the utf8 case
for (uint32_t idx{0}; idx < cSizeOfByte; ++idx) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

TODO comment should be addressed.

The TODO comment about handling the UTF-8 case should be addressed in this implementation since the PR focuses on DFA transitions.


🏁 Script executed:

#!/bin/bash
# Check if there are other TODO comments related to UTF-8 in DFA-related files
rg "TODO.*utf8" src/log_surgeon/finite_automata/

Length of output: 378


Action Required: Address UTF-8 Handling TODOs

The DFA transition implementation still contains several TODO comments related to UTF-8 support. In particular, note these instances:

  • In src/log_surgeon/finite_automata/Dfa.hpp, the lines with:
    • // TODO: Handle the utf8 case
    • // TODO: add this for the utf8 case (appearing twice)
  • In src/log_surgeon/finite_automata/Nfa.hpp:
    • // TODO: handle the utf8 case

Given that this PR focuses on DFA transitions, please either implement the actual UTF-8 handling logic or provide a clear plan (e.g. a follow-up ticket) to address these comments consistently. This will help ensure that incomplete logic does not propagate in the codebase.

void Lexer<TypedNfaState, TypedDfaState>::prepend_start_of_file_char(ParserInputBuffer& input_buffer
) {
m_prev_state = m_dfa->get_root()->get_dest_state(utf8::cCharStartOfFile);
m_prev_state = m_dfa->get_root()->get_transition(utf8::cCharStartOfFile)->get_dest_state();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Missing transition validity check.

The code directly accesses the destination state from a transition without first checking if the transition exists. This could lead to undefined behavior if there's no transition for the start of file character.

-    m_prev_state = m_dfa->get_root()->get_transition(utf8::cCharStartOfFile)->get_dest_state();
+    auto const& transition{m_dfa->get_root()->get_transition(utf8::cCharStartOfFile)};
+    if (false == transition.has_value()) {
+        // Handle the case where there's no transition for start of file
+        m_prev_state = m_dfa->get_root();
+    } else {
+        m_prev_state = transition->get_dest_state();
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
m_prev_state = m_dfa->get_root()->get_transition(utf8::cCharStartOfFile)->get_dest_state();
auto const& transition{m_dfa->get_root()->get_transition(utf8::cCharStartOfFile)};
if (false == transition.has_value()) {
// Handle the case where there's no transition for start of file
m_prev_state = m_dfa->get_root();
} else {
m_prev_state = transition->get_dest_state();
}

Comment on lines +91 to +92
auto const* dest_state{m_dfa->get_root()->get_transition(next_char)->get_dest_state(
)};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Missing transition validity check.

The code directly accesses the destination state from a transition without first checking if the transition exists using has_value(). This could lead to undefined behavior if there's no transition for the newline character.

-                auto const* dest_state{m_dfa->get_root()->get_transition(next_char)->get_dest_state(
-                )};
+                auto const& root_transition{m_dfa->get_root()->get_transition(next_char)};
+                if (false == root_transition.has_value()) {
+                    // Handle the case where there's no transition for newline
+                    continue;
+                }
+                auto const* dest_state{root_transition->get_dest_state()};
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto const* dest_state{m_dfa->get_root()->get_transition(next_char)->get_dest_state(
)};
auto const& root_transition{m_dfa->get_root()->get_transition(next_char)};
if (false == root_transition.has_value()) {
// Handle the case where there's no transition for newline
continue;
}
auto const* dest_state{root_transition->get_dest_state()};

if (wildcard == '?') {
for (uint32_t byte = 0; byte < cSizeOfByte; byte++) {
auto* dest_state{state->get_dest_state(byte)};
auto const* dest_state{state->get_transition(byte)->get_dest_state()};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Missing transition validity check in wildcard code.

The code directly accesses the destination state from a transition without checking if it exists using has_value(). This could cause a runtime error if there's no transition for a specific byte.

-                        auto const* dest_state{state->get_transition(byte)->get_dest_state()};
+                        auto const& transition{state->get_transition(byte)};
+                        if (false == transition.has_value()) {
+                            token = Token{m_last_match_pos,
+                                    input_buffer.storage().pos(),
+                                    input_buffer.storage().get_active_buffer(),
+                                    input_buffer.storage().size(),
+                                    m_last_match_line,
+                                    &cTokenUncaughtStringTypes};
+                            return ErrorCode::Success;
+                        }
+                        auto const* dest_state{transition->get_dest_state()};
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto const* dest_state{state->get_transition(byte)->get_dest_state()};
auto const& transition{state->get_transition(byte)};
if (false == transition.has_value()) {
token = Token{m_last_match_pos,
input_buffer.storage().pos(),
input_buffer.storage().get_active_buffer(),
input_buffer.storage().size(),
m_last_match_line,
&cTokenUncaughtStringTypes};
return ErrorCode::Success;
}
auto const* dest_state{transition->get_dest_state()};

Comment on lines +223 to +224
auto const* dest_state{m_dfa->get_root()->get_transition(next_char)->get_dest_state(
)};
Copy link

@coderabbitai coderabbitai bot Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Missing transition validity check.

Similar to the earlier instance, the code directly accesses the destination state from a transition without checking if it exists, which could lead to undefined behavior.

-                auto const* dest_state{m_dfa->get_root()->get_transition(next_char)->get_dest_state(
-                )};
+                auto const& root_transition{m_dfa->get_root()->get_transition(next_char)};
+                if (false == root_transition.has_value()) {
+                    // Handle the case where there's no transition for newline
+                    continue;
+                }
+                auto const* dest_state{root_transition->get_dest_state()};
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto const* dest_state{m_dfa->get_root()->get_transition(next_char)->get_dest_state(
)};
auto const& root_transition{m_dfa->get_root()->get_transition(next_char)};
if (false == root_transition.has_value()) {
// Handle the case where there's no transition for newline
continue;
}
auto const* dest_state{root_transition->get_dest_state()};

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I guess \n and ? always exist? @SharafMohamed

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea the transition should always exist for those two cases.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/log_surgeon/finite_automata/DfaState.hpp (4)

36-36: Consider explicit initialization.

Relying on default construction for m_bytes_transition is acceptable. However, if you want to ensure that each element is initially set to an empty optional, you can use an in-class initializer or explicit constructor initialization.


82-82: Address the UTF-8 TODO.

Since there's a placeholder comment to handle UTF-8, confirm if you intend to implement a dedicated logic branch for StateType::Utf8. If so, note that this method currently only handles m_bytes_transition.


84-87: Revisit return type for get_transition.

Returning an std::optional<DfaTransition<state_type>> const& can cause confusion if the reference is used beyond this method’s scope, especially once the underlying optional is destroyed. Consider returning an std::optional<DfaTransition<state_type>> by value for safer usage.


90-134: Ensure correct handling of non-printable characters when serializing transitions.

Using static_cast<char>(idx) is sufficient for ASCII symbols but might produce malformed or invisible characters for some transition indexes. If readability beyond standard ASCII is required, consider encoding these values (e.g., with an escape mechanism) before emitting them.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5b1cce and 3b1dafc.

📒 Files selected for processing (1)
  • src/log_surgeon/finite_automata/DfaState.hpp (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/finite_automata/DfaState.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: build (macos-latest, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
  • GitHub Check: build (ubuntu-22.04, Release)
🔇 Additional comments (2)
src/log_surgeon/finite_automata/DfaState.hpp (2)

4-4: Include directives look fine.

These added headers (<array>, <optional>, <string>, <unordered_map>, <fmt/core.h>, <fmt/format.h>, DfaTransition.hpp, and RegisterOperation.hpp) are suitable for implementing the new transition and register operations while retaining clarity and modularity.

Also applies to: 8-9, 12-12, 15-16, 19-20


50-52: Validate the transition index.

When assigning to m_bytes_transition[byte], consider whether the calling code may pass values outside the expected range (0–255). If so, a boundary check would be prudent to avoid undefined behaviour.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
src/log_surgeon/finite_automata/DfaState.hpp (3)

66-71: Returning a const reference to an optional may risk lifetime misuses.
Though returning a reference can avoid copy overhead, it exposes the possibility that the underlying optional can be invalidated. Consider returning an optional by value for clarity and safety.


82-82: UTF-8 handling remains a TODO.
Implementation of this feature is critical for consistent string handling in multi-byte contexts.

Would you like support creating a new issue or drafting an approach for UTF-8 transitions?


90-133: Potential unprintable string output in serialization.
When transitioning on non-printable characters, the serialized format may contain invisible or control characters. If readability or debuggability is paramount, consider escaping or expressing these as numeric codes.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b1dafc and 0b397e2.

📒 Files selected for processing (1)
  • src/log_surgeon/finite_automata/DfaState.hpp (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • src/log_surgeon/finite_automata/DfaState.hpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build (ubuntu-22.04, Release)
  • GitHub Check: build (ubuntu-22.04, Debug)
🔇 Additional comments (7)
src/log_surgeon/finite_automata/DfaState.hpp (7)

4-20: All included headers appear valid and relevant.
No major concerns. The added headers (array, optional, string, unordered_map, fmt libraries, DfaTransition.hpp, RegisterOperation.hpp) are logically consistent with the file’s functionality and match the needs of the updated code.


36-36: Default constructor looks appropriate.
Providing a default constructor ensures that all array and optional members are automatically value-initialized, simplifying usage of the class.


50-52: Validate bounds when assigning byte transitions.
Although using a uint8_t as the key inherently constrains the range to [0..255], verify that any external calls cannot supply invalid data. Otherwise, you could encounter unexpected behaviour.


54-55: Adding accepting register operations is straightforward.
The straightforward approach of pushing the operation into the vector is clear and efficient for storing heterogeneous register actions.


59-65: Documentation comments appropriately describe the serialize function’s return value.
This clarifies that std::nullopt might be forwarded if any nested serialization fails, which aids readers and maintainers in understanding potential failure conditions.


75-76: New fields for accepting operations and byte transitions appear well-structured.
Storing RegisterOperation vectors and leveraging std::array<std::optional<DfaTransition<state_type>>> ensure clarity around which transitions exist. No immediate concerns.


83-87: Byte-specialized get_transition method is clear.
Returning the optional transition from the array for a single byte index is straightforward and predictable.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

feat: Add `RegisterOperation` for TDFA into DFA transitions.

@SharafMohamed SharafMohamed changed the title feat: Add register operations and DFA transitions. feat: Add RegisterOperation for TDFA into DFA transitions. Mar 10, 2025
@SharafMohamed SharafMohamed merged commit ffb8196 into y-scope:main Mar 10, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants