feat: Add `RegisterOperation` for TDFA into DFA transitions. by SharafMohamed · Pull Request #89 · y-scope/log-surgeon

SharafMohamed · 2025-02-03T18:08:09Z

Description

Register operations are introduced to the DFA as the counterpart to tag operations in the NFA.
DFA transitions are introduced to tie register operations to destination states.
The Lexer, DFA, and DFA states are updated to use register operations and DFA transitions.
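
To make the relationship concrete, here is a minimal sketch of how a transition might bundle register operations with a destination state. Every name in it (`RegOpType`, `RegOp`, `Transition`, `State`, and the `p`/`n` serialization symbols) is a hypothetical stand-in for illustration, not the PR's actual `RegisterOperation` or `DfaTransition` API.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical names throughout; the PR's real headers may differ.
using reg_id_t = std::uint32_t;

enum class RegOpType : std::uint8_t {
    Set,    // record the current input position in the register
    Negate  // mark the register as unmatched
};

struct RegOp {
    reg_id_t reg_id;
    RegOpType type;

    // Assumed text form: register ID followed by 'p' (set) or 'n' (negate).
    [[nodiscard]] auto serialize() const -> std::string {
        char const symbol = (RegOpType::Set == type) ? 'p' : 'n';
        return std::to_string(reg_id) + symbol;
    }
};

struct State {
    int id{0};
};

// A transition carries the operations to apply and the state to move to.
struct Transition {
    std::vector<RegOp> reg_ops;
    State const* dest{nullptr};
};

[[nodiscard]] inline auto serialize(Transition const& transition) -> std::string {
    std::string ops;
    for (auto const& op : transition.reg_ops) {
        if (!ops.empty()) {
            ops += ',';
        }
        ops += op.serialize();
    }
    std::string out = "{" + ops + "}->";
    out += (nullptr == transition.dest) ? "none" : std::to_string(transition.dest->id);
    return out;
}
```

Under these assumptions, a transition that sets register 4, negates register 5, and leads to state 2 serializes as `{4p,5n}->2`.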

Validation performed

Add test-dfa to test untagged DFAs.

Summary by CodeRabbit

New Features
- Enhanced finite automata functionality with improved state management, acceptance operations, and DFA serialization using breadth-first traversal.
- Introduced new classes and methods for managing transitions and register operations, including serialization capabilities.
- Added new header files for transition and register operations, expanding the finite automata module.
Tests
- Added comprehensive test cases validating DFA construction and serialization to ensure dependable outcomes.
- Included new test scenarios for both simple and complex DFAs to verify correct serialization outputs.

coderabbitai · 2025-02-03T18:08:18Z

Walkthrough

This pull request enhances the finite automata module by adding new header files to support DFA transitions and register operations, modifying DFA state representations to include serialization and improved transition handling, and updating mapping structures in the Lexer module. The RegisterHandler has been updated to use a new register ID type for consistency. Additionally, the build configuration and test suite have been modified to incorporate these new sources and tests for DFA functionality.

Changes

| File(s) | Change Summary |
| --- | --- |
| `CMakeLists.txt`, `tests/CMakeLists.txt` | Updated build configuration to include new finite automata sources and tests; added headers like `DfaTransition.hpp` and `RegisterOperation.hpp`, included `test-dfa.cpp`, and removed redundant `test-capture.cpp`. |
| `src/log_surgeon/Lexer.hpp` | Modified tag ID handling by replacing `m_tag_to_reg_id` (using `std::unordered_map`) with `m_tag_to_final_reg_id` (using `std::map`) and updated the associated retrieval method. |
| `src/log_surgeon/finite_automata/Dfa.hpp`, `src/log_surgeon/finite_automata/DfaState.hpp` | Added serialization methods (`serialize`, `get_bfs_traversal_order`), updated transition representation in constructors, renamed `get_dest_state` to `get_transition`, and introduced `add_accepting_op` in DFA state management. |
| `src/log_surgeon/finite_automata/RegisterHandler.hpp` | Updated method signatures to use `reg_id_t` instead of `register_id_t` to ensure type consistency across register operations. |
| `src/log_surgeon/finite_automata/DfaTransition.hpp`, `src/log_surgeon/finite_automata/RegisterOperation.hpp` | Introduced new headers: `DfaTransition.hpp` defines a template class for DFA transitions with serialization and accessor methods; `RegisterOperation.hpp` defines an enum and class for register operations including constructors and a serialization method. |
| `tests/test-dfa.cpp` | Added new Catch2 test cases for simple and complex DFA serialization; tests validate the serialized output of the DFA against expected string representations. |

Possibly related PRs

feat: Separate concept tags from captures and store capture-to-tag mapping in the lexer. #72: The changes in the main PR, which involve adding new source files for the finite automata module, are related to the retrieved PR as both include modifications to the Dfa.hpp and DfaState.hpp files, indicating a shared focus on the finite automata components.
feat: Add PrefixTree and RegisterHandler to support TDFA simulation. #56: The changes in the main PR, which add new source files for the finite automata module, are related to the retrieved PR as both involve modifications to the CMakeLists.txt file and the addition of new header files that enhance the finite automata functionality, specifically including RegisterOperation.hpp and DfaTransition.hpp.
refactor: Use default register position in PrefixTree; Add RegisterHandler constructor for multiple registers; Add register_id_t type alias. #88: The changes in the main PR are related to the addition of new header files for the finite automata module, which includes RegisterOperation.hpp, a file that is also modified in the retrieved PR, indicating a direct connection through the use of the RegisterOperation class.

Suggested reviewers

LinZhihao-723

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b397e2 and 11cb2a7.

📒 Files selected for processing (1)

src/log_surgeon/Lexer.hpp (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/log_surgeon/Lexer.hpp

⏰ Context from checks skipped due to timeout of 90000ms (2)

GitHub Check: build (ubuntu-22.04, Release)
GitHub Check: build (ubuntu-22.04, Debug)

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (20)

src/log_surgeon/finite_automata/DfaState.hpp (3)
34-38: Consider simplifying your constructor’s loop.
You could use std::fill (from <algorithm>) in the constructor body to initialize every element of m_bytes_transition with the same default object, which states the intent more directly than the loop.
 DfaState() {
-    for (auto& transition : m_bytes_transition) {
-        transition = DfaTransition<state_type>{{}, nullptr};
-    }
+    std::fill(std::begin(m_bytes_transition),
+              std::end(m_bytes_transition),
+              DfaTransition<state_type>{{}, nullptr});
 }
61-67: Documenting the new serialize method is helpful.
The docstring clarifies usage for other developers working on the DFA code.

101-141: Consider caching the accepting state check in serialize.
You call is_accepting() multiple times; consider assigning the result to a local boolean for minor performance and clarity benefits. Otherwise, this method is elegantly structured for debugging or logging.
+ bool accepting = is_accepting();
  auto const accepting_tags_string = 
-     is_accepting()
+     accepting
      ? fmt::format("accepting_tags={{{}}},", fmt::join(m_matching_variable_ids, ","))
      : "";
src/log_surgeon/Lexer.hpp (2)

128-130: Returning a const reference to a unique_ptr.
Returning a const reference prevents callers from reseating or resetting the pointer, but it can still be null; consider documenting that callers must check for null before dereferencing.

151-157: Minimal overhead approach.
Using contains followed by at is concise. Should performance become a concern, consider retrieving an iterator directly to avoid a double lookup.
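
As an illustration of the single-lookup alternative, the sketch below uses `find` and reuses the returned iterator. The function name `find_reg_id` and the two type aliases are assumptions made for this example, not the Lexer's actual interface.

```cpp
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical aliases for illustration only.
using tag_id_t = std::uint32_t;
using reg_id_t = std::uint32_t;

// contains() followed by at() searches the map twice.
// find() searches once and hands back an iterator that can be reused.
[[nodiscard]] inline auto
find_reg_id(std::map<tag_id_t, reg_id_t> const& tag_to_reg, tag_id_t tag_id)
        -> std::optional<reg_id_t> {
    if (auto const it = tag_to_reg.find(tag_id); it != tag_to_reg.end()) {
        return it->second;
    }
    return std::nullopt;
}
```

For a small `std::map` the difference is unlikely to be measurable, so this only matters if the lookup sits on a hot path.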

src/log_surgeon/finite_automata/Dfa.hpp (2)

28-31: Documenting the new serialize method is beneficial.
This helps readers quickly understand that a string-based snapshot of DFA structure is provided.
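
For readers unfamiliar with why a breadth-first order is useful here: numbering states in BFS order gives a serialization that does not depend on memory addresses or allocation order. The sketch below shows the general technique on a plain adjacency list; `bfs_order` and the adjacency representation are assumptions for illustration and are not the PR's `get_bfs_traversal_order` implementation.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// adjacency[i] lists the destination state indices reachable from state i.
// Returns the states reachable from `root` in breadth-first order.
[[nodiscard]] inline auto
bfs_order(std::vector<std::vector<std::size_t>> const& adjacency, std::size_t root)
        -> std::vector<std::size_t> {
    std::vector<std::size_t> order;
    if (root >= adjacency.size()) {
        return order;
    }
    std::vector<bool> visited(adjacency.size(), false);
    std::queue<std::size_t> pending;
    visited[root] = true;
    pending.push(root);
    while (!pending.empty()) {
        auto const current = pending.front();
        pending.pop();
        order.push_back(current);
        for (auto const next : adjacency[current]) {
            // Skip out-of-range indices and states that were already queued.
            if (next < adjacency.size() && !visited[next]) {
                visited[next] = true;
                pending.push(next);
            }
        }
    }
    return order;
}
```

A serializer can then assign each state its position in the returned vector as a stable ID before printing transitions.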

128-146: Future introspection in get_intersect.
The method currently handles ASCII transitions. You have a TODO for UTF-8 support; consider scheduling it soon to avoid partial coverage for multi-byte encodings.

Would you like help with a follow-up patch to handle UTF-8 transitions?

src/log_surgeon/UniqueIdGenerator.hpp (1)

1-14: Straightforward unique ID generation.
The incrementing logic is correct for single-threaded use; however, consider a synchronization mechanism or atomic operations if generating IDs concurrently.
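
If concurrent ID generation is ever needed, one option is an atomic counter. The class below is a hypothetical sketch (`AtomicIdGenerator` is not part of the PR) showing the idea with `std::atomic::fetch_add`.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical thread-safe variant of an incrementing ID generator.
class AtomicIdGenerator {
public:
    // fetch_add returns the value held before the increment, so IDs start
    // at 0 and no two callers can receive the same value.
    [[nodiscard]] auto generate_id() -> std::uint32_t {
        return m_next_id.fetch_add(1, std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint32_t> m_next_id{0};
};
```

Relaxed ordering is enough here because the counter only needs uniqueness, not ordering relative to other memory operations.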

tests/test-capture.cpp (1)

9-36: Test coverage could be enhanced.

While the current test cases cover basic functionality, consider adding tests for:
- String validation (if any restrictions exist on capture names)
- Error cases (if any)
- Comparison operations (if implemented)

src/log_surgeon/LexicalRule.hpp (1)

28-30: Consider adding documentation for get_captures().

Add documentation to describe:
- The purpose of the method
- The return value semantics (ownership, lifetime)
src/log_surgeon/finite_automata/RegisterOperation.hpp (1)
36-38: Fix typo in documentation.

"opertion" should be "operation".
     /**
-     * @return A string representation of the register opertion.
+     * @return A string representation of the register operation.
      */
src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1)
31-31: Consider returning const reference for better performance.

The getter get_tag_ops() returns a copy of the vector. For better performance, consider returning a const reference if the caller doesn't need a copy.
-    [[nodiscard]] auto get_tag_ops() const -> std::vector<TagOperation> { return m_tag_ops; }
+    [[nodiscard]] auto get_tag_ops() const -> std::vector<TagOperation> const& { return m_tag_ops; }
src/log_surgeon/finite_automata/DfaTransition.hpp (2)
30-30: Consider returning const reference for better performance.

Similar to SpontaneousTransition, consider returning a const reference from get_reg_ops() if the caller doesn't need a copy.
-    [[nodiscard]] auto get_reg_ops() const -> std::vector<RegisterOperation> { return m_reg_ops; }
+    [[nodiscard]] auto get_reg_ops() const -> std::vector<RegisterOperation> const& { return m_reg_ops; }
57-64: Consider reserving vector capacity for better performance.

Pre-allocate the vector capacity to avoid potential reallocations during push_back operations.
     std::vector<std::string> transformed_ops;
+    transformed_ops.reserve(m_reg_ops.size());
     for (auto const& reg_op : m_reg_ops) {
tests/test-dfa.cpp (1)
28-69: Consider using SECTION for better test organization.

The test case could benefit from using Catch2's SECTION macro to organize different aspects of the test (e.g., setup, serialization, comparison).
 TEST_CASE("Test Untagged DFA", "[DFA]") {
-    Schema schema;
-    string const var_name{"capture"};
-    string const var_schema{var_name + ":" + "Z|(A[abcd]B\\d+C)"};
-    schema.add_variable(var_schema, -1);
+    SECTION("Test DFA serialization") {
+        Schema schema;
+        string const var_name{"capture"};
+        string const var_schema{var_name + ":" + "Z|(A[abcd]B\\d+C)"};
+        schema.add_variable(var_schema, -1);
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)
21-27: Consider reserving vector capacity for better performance.

Pre-allocate the vector capacity to avoid reallocations during emplace_back operations.
     std::vector<uint32_t> added_registers;
+    added_registers.reserve(num_reg_to_add);
     for (uint32_t i{0}; i < num_reg_to_add; ++i) {
tests/test-register-handler.cpp (1)
23-27: Consider simplifying the register addition logic.

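The current implementation can be made more concise while maintaining the same functionality, assuming add_register accepts a std::optional argument. Both branches of the conditional must share a type, so the index is wrapped explicitly; a bare `i > 0 ? i : std::nullopt` does not compile.
-        if (0 == i) {
-            handler.add_register();
-        } else {
-            handler.add_register(i);
-        }
+        handler.add_register(i > 0 ? std::make_optional(i) : std::nullopt);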
tests/test-nfa.cpp (1)
38-38: Consider using std::move for rules.

The rules vector could be moved into the ByteNfa constructor to avoid unnecessary copying.
-    ByteNfa const nfa{rules};
+    ByteNfa const nfa{std::move(rules)};
tests/test-lexer.cpp (2)
93-126: Consider optimizing delimiter collection.

The loop that collects delimiters could be expressed with std::ranges::copy_if over the range of byte values (C++20; requires <algorithm>, <iterator>, and <ranges>).
-    vector<uint32_t> delimiters;
-    for (uint32_t i{0}; i < log_surgeon::cSizeOfByte; i++) {
-        if (lexer.is_delimiter(i)) {
-            delimiters.push_back(i);
-        }
-    }
+    vector<uint32_t> delimiters;
+    delimiters.reserve(log_surgeon::cSizeOfByte);
+    std::ranges::copy_if(
+        std::views::iota(uint32_t{0}, static_cast<uint32_t>(log_surgeon::cSizeOfByte)),
+        std::back_inserter(delimiters),
+        [&lexer](uint32_t i) { return lexer.is_delimiter(i); }
+    );
296-350: Consider adding edge cases to test suite.

While the current test cases cover basic functionality, consider adding tests for:
- Empty input strings
- Maximum length inputs
- Invalid capture group names
- Nested capture groups

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between a513c5f and de5c324.

📒 Files selected for processing (31)

.github/workflows/build.yaml (3 hunks)
CMakeLists.txt (2 hunks)
examples/intersect-test.cpp (3 hunks)
lint-tasks.yml (2 hunks)
src/log_surgeon/Lexer.hpp (4 hunks)
src/log_surgeon/Lexer.tpp (10 hunks)
src/log_surgeon/LexicalRule.hpp (3 hunks)
src/log_surgeon/SchemaParser.cpp (4 hunks)
src/log_surgeon/UniqueIdGenerator.hpp (1 hunks)
src/log_surgeon/finite_automata/Capture.hpp (2 hunks)
src/log_surgeon/finite_automata/Dfa.hpp (6 hunks)
src/log_surgeon/finite_automata/DfaState.hpp (3 hunks)
src/log_surgeon/finite_automata/DfaStatePair.hpp (1 hunks)
src/log_surgeon/finite_automata/DfaTransition.hpp (1 hunks)
src/log_surgeon/finite_automata/Nfa.hpp (6 hunks)
src/log_surgeon/finite_automata/NfaState.hpp (6 hunks)
src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (19 hunks)
src/log_surgeon/finite_automata/Register.hpp (1 hunks)
src/log_surgeon/finite_automata/RegisterHandler.hpp (2 hunks)
src/log_surgeon/finite_automata/RegisterOperation.hpp (1 hunks)
src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1 hunks)
src/log_surgeon/finite_automata/TagOperation.hpp (1 hunks)
src/log_surgeon/finite_automata/TaggedTransition.hpp (0 hunks)
tests/CMakeLists.txt (2 hunks)
tests/test-capture.cpp (1 hunks)
tests/test-dfa.cpp (1 hunks)
tests/test-lexer.cpp (5 hunks)
tests/test-nfa.cpp (2 hunks)
tests/test-register-handler.cpp (2 hunks)
tests/test-tag.cpp (0 hunks)

💤 Files with no reviewable changes (2)

tests/test-tag.cpp
src/log_surgeon/finite_automata/TaggedTransition.hpp

✅ Files skipped from review due to trivial changes (1)

src/log_surgeon/finite_automata/Register.hpp

🧰 Additional context used

📓 Path-based instructions (23)

src/log_surgeon/UniqueIdGenerator.hpp (1)