feat: Add RegisterOperation for TDFA into DFA transitions.#89
feat: Add RegisterOperation for TDFA into DFA transitions.#89SharafMohamed merged 32 commits intoy-scope:mainfrom
RegisterOperation for TDFA into DFA transitions.#89Conversation
WalkthroughThis pull request enhances the finite automata module by adding new header files to support DFA transitions and register operations, modifying DFA state representations to include serialization and improved transition handling, and updating mapping structures in the Lexer module. The RegisterHandler has been updated to use a new register ID type for consistency. Additionally, the build configuration and test suite have been modified to incorporate these new sources and tests for DFA functionality. Changes
Possibly related PRs
Suggested reviewers
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (2)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (20)
src/log_surgeon/finite_automata/DfaState.hpp (3)
34-38: Consider simplifying your constructor’s loop.
You could usestd::fillin an initializer list or directly in the constructor body to initializem_bytes_transitionwith the same default object, improving clarity.DfaState() { - for (auto& transition : m_bytes_transition) { - transition = DfaTransition<state_type>{{}, nullptr}; - } + std::fill(std::begin(m_bytes_transition), + std::end(m_bytes_transition), + DfaTransition<state_type>{{}, nullptr}); }
61-67: Documenting the new serialize method is helpful.
The docstring clarifies usage by devs working in parallel on DFAs.
101-141: Consider caching the accepting state check in serialize.
You callis_accepting()multiple times; consider assigning the result to a local boolean for minor performance and clarity benefits. Otherwise, this method is elegantly structured for debugging or logging.+ bool accepting = is_accepting(); auto const accepting_tags_string = - is_accepting() + accepting ? fmt::format("accepting_tags={{{}}},", fmt::join(m_matching_variable_ids, ",")) : "";src/log_surgeon/Lexer.hpp (2)
128-130: Returning a const reference to a unique_ptr.
Although returning a const reference prevents modification of the pointer, consider documenting that callers must check for null in usage scenarios.
151-157: Minimal overhead approach.
Usingcontainsfollowed byatis concise. Should performance become a concern, consider retrieving an iterator directly to avoid a double lookup.src/log_surgeon/finite_automata/Dfa.hpp (2)
28-31: Documenting the new serialize method is beneficial.
This helps readers quickly understand that a string-based snapshot of DFA structure is provided.
128-146: Future introspection in get_intersect.
The method currently handles ASCII transitions. You have a TODO for UTF-8 support; consider scheduling it soon to avoid partial coverage for multi-byte encodings.Would you like help with a follow-up patch to handle UTF-8 transitions?
src/log_surgeon/UniqueIdGenerator.hpp (1)
1-14: Straightforward unique ID generation.
The incrementing logic is correct for single-threaded use; however, consider a synchronization mechanism or atomic operations if generating IDs concurrently.tests/test-capture.cpp (1)
9-36: Test coverage could be enhanced.While the current test cases cover basic functionality, consider adding tests for:
- String validation (if any restrictions exist on capture names)
- Error cases (if any)
- Comparison operations (if implemented)
src/log_surgeon/LexicalRule.hpp (1)
28-30: Consider adding documentation for get_captures().Add documentation to describe:
- The purpose of the method
- The return value semantics (ownership, lifetime)
src/log_surgeon/finite_automata/RegisterOperation.hpp (1)
36-38: Fix typo in documentation."opertion" should be "operation".
/** - * @return A string representation of the register opertion. + * @return A string representation of the register operation. */src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1)
31-31: Consider returning const reference for better performance.The getter
get_tag_ops()returns a copy of the vector. For better performance, consider returning a const reference if the caller doesn't need a copy.- [[nodiscard]] auto get_tag_ops() const -> std::vector<TagOperation> { return m_tag_ops; } + [[nodiscard]] auto get_tag_ops() const -> std::vector<TagOperation> const& { return m_tag_ops; }src/log_surgeon/finite_automata/DfaTransition.hpp (2)
30-30: Consider returning const reference for better performance.Similar to SpontaneousTransition, consider returning a const reference from
get_reg_ops()if the caller doesn't need a copy.- [[nodiscard]] auto get_reg_ops() const -> std::vector<RegisterOperation> { return m_reg_ops; } + [[nodiscard]] auto get_reg_ops() const -> std::vector<RegisterOperation> const& { return m_reg_ops; }
57-64: Consider reserving vector capacity for better performance.Pre-allocate the vector capacity to avoid potential reallocations during push_back operations.
std::vector<std::string> transformed_ops; + transformed_ops.reserve(m_reg_ops.size()); for (auto const& reg_op : m_reg_ops) {tests/test-dfa.cpp (1)
28-69: Consider using SECTION for better test organization.The test case could benefit from using Catch2's SECTION macro to organize different aspects of the test (e.g., setup, serialization, comparison).
TEST_CASE("Test Untagged DFA", "[DFA]") { - Schema schema; - string const var_name{"capture"}; - string const var_schema{var_name + ":" + "Z|(A[abcd]B\\d+C)"}; - schema.add_variable(var_schema, -1); + SECTION("Test DFA serialization") { + Schema schema; + string const var_name{"capture"}; + string const var_schema{var_name + ":" + "Z|(A[abcd]B\\d+C)"}; + schema.add_variable(var_schema, -1);src/log_surgeon/finite_automata/RegisterHandler.hpp (1)
21-27: Consider reserving vector capacity for better performance.Pre-allocate the vector capacity to avoid reallocations during emplace_back operations.
std::vector<uint32_t> added_registers; + added_registers.reserve(num_reg_to_add); for (uint32_t i{0}; i < num_reg_to_add; ++i) {tests/test-register-handler.cpp (1)
23-27: Consider simplifying the register addition logic.The current implementation can be made more concise while maintaining the same functionality.
- if (0 == i) { - handler.add_register(); - } else { - handler.add_register(i); - } + handler.add_register(i > 0 ? i : std::nullopt);tests/test-nfa.cpp (1)
38-38: Consider using std::move for rules.The rules vector could be moved into the ByteNfa constructor to avoid unnecessary copying.
- ByteNfa const nfa{rules}; + ByteNfa const nfa{std::move(rules)};tests/test-lexer.cpp (2)
93-126: Consider optimizing delimiter collection.The loop to collect delimiters could be simplified by using a vector comprehension or std::copy_if.
- vector<uint32_t> delimiters; - for (uint32_t i{0}; i < log_surgeon::cSizeOfByte; i++) { - if (lexer.is_delimiter(i)) { - delimiters.push_back(i); - } - } + vector<uint32_t> delimiters; + delimiters.reserve(log_surgeon::cSizeOfByte); + std::copy_if( + std::begin(cDelimiters), std::end(cDelimiters), + std::back_inserter(delimiters), + [&lexer](uint32_t i) { return lexer.is_delimiter(i); } + );
296-350: Consider adding edge cases to test suite.While the current test cases cover basic functionality, consider adding tests for:
- Empty input strings
- Maximum length inputs
- Invalid capture group names
- Nested capture groups
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (31)
.github/workflows/build.yaml(3 hunks)CMakeLists.txt(2 hunks)examples/intersect-test.cpp(3 hunks)lint-tasks.yml(2 hunks)src/log_surgeon/Lexer.hpp(4 hunks)src/log_surgeon/Lexer.tpp(10 hunks)src/log_surgeon/LexicalRule.hpp(3 hunks)src/log_surgeon/SchemaParser.cpp(4 hunks)src/log_surgeon/UniqueIdGenerator.hpp(1 hunks)src/log_surgeon/finite_automata/Capture.hpp(2 hunks)src/log_surgeon/finite_automata/Dfa.hpp(6 hunks)src/log_surgeon/finite_automata/DfaState.hpp(3 hunks)src/log_surgeon/finite_automata/DfaStatePair.hpp(1 hunks)src/log_surgeon/finite_automata/DfaTransition.hpp(1 hunks)src/log_surgeon/finite_automata/Nfa.hpp(6 hunks)src/log_surgeon/finite_automata/NfaState.hpp(6 hunks)src/log_surgeon/finite_automata/PrefixTree.hpp(1 hunks)src/log_surgeon/finite_automata/RegexAST.hpp(19 hunks)src/log_surgeon/finite_automata/Register.hpp(1 hunks)src/log_surgeon/finite_automata/RegisterHandler.hpp(2 hunks)src/log_surgeon/finite_automata/RegisterOperation.hpp(1 hunks)src/log_surgeon/finite_automata/SpontaneousTransition.hpp(1 hunks)src/log_surgeon/finite_automata/TagOperation.hpp(1 hunks)src/log_surgeon/finite_automata/TaggedTransition.hpp(0 hunks)tests/CMakeLists.txt(2 hunks)tests/test-capture.cpp(1 hunks)tests/test-dfa.cpp(1 hunks)tests/test-lexer.cpp(5 hunks)tests/test-nfa.cpp(2 hunks)tests/test-register-handler.cpp(2 hunks)tests/test-tag.cpp(0 hunks)
💤 Files with no reviewable changes (2)
- tests/test-tag.cpp
- src/log_surgeon/finite_automata/TaggedTransition.hpp
✅ Files skipped from review due to trivial changes (1)
- src/log_surgeon/finite_automata/Register.hpp
🧰 Additional context used
📓 Path-based instructions (23)
src/log_surgeon/UniqueIdGenerator.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/finite_automata/PrefixTree.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
tests/test-capture.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
tests/test-register-handler.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/LexicalRule.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/finite_automata/DfaStatePair.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/SchemaParser.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
tests/test-lexer.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
examples/intersect-test.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
tests/test-nfa.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/finite_automata/Capture.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/finite_automata/TagOperation.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/finite_automata/DfaTransition.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/finite_automata/DfaState.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/finite_automata/RegisterOperation.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/finite_automata/NfaState.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/finite_automata/Dfa.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
tests/test-dfa.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/Lexer.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/finite_automata/Nfa.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/finite_automata/RegexAST.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
📓 Learnings (5)
src/log_surgeon/finite_automata/Capture.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
src/log_surgeon/finite_automata/NfaState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegisterHandler.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T21:56:13.425Z
Learning: In the `log_surgeon` project, header guards in C++ header files should include `_HPP` at the end to match the filename. For example, in `RegisterHandler.hpp`, the header guard should be `LOG_SURGEON_FINITE_AUTOMATA_REGISTER_HANDLER_HPP`.
src/log_surgeon/finite_automata/Nfa.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegexAST.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
🔇 Additional comments (136)
src/log_surgeon/finite_automata/DfaState.hpp (8)
7-10: Includes look appropriate.
These added headers align with the new functionalities (e.g. string manipulation and unordered maps).
13-14: Good choice of fmt library.
Using fmt for string formatting and logging is a solid approach that improves readability and performance.
17-18: New includes for transitions and register operations are coherent.
The newly included headers resolve dependencies for the transition logic and the register-based operations.
52-55: Updated transition signatures enhance clarity.
Shifting from pointer-type arguments to strong typedDfaTransitionobjects promotes type safety and maintainability.
57-58: Adding accepting operations is a good extension.
This approach cleanly decouples the register logic from the state’s acceptance condition.
70-72: Clear doc for get_dest_state.
Documenting the difference between byte vs. UTF-8 transitions helps future maintainers.
76-77: New data members for transition ops are well-placed.
Storingm_accepting_opsalongsidem_bytes_transitionensures transitions and acceptance logic remain cohesive.
84-99: Verify indexing in get_dest_state.
Confirm thatcharacterwill never exceedcSizeOfBytewhenstate_type == StateType::Byte; otherwise, this could invoke out-of-bounds indexing.Would you like to add upper boundary checks or rely on upstream validation?
src/log_surgeon/Lexer.hpp (8)
7-7: New includes establish the foundation for register/tag logic.
Including<map>,NfaState.hpp,Register.hpp, andTagOperation.hppis consistent with the added capturing and register functionalities.Also applies to: 16-16, 18-18, 19-19
25-26: Defining custom IDs helps maintain clarity.
Usingregister_id_t,tag_id_t, andsymbol_id_tclarifies domain-specific identifiers rather than relying on bare integer types.Also applies to: 28-28
133-140: Ensure coverage for get_capture_ids_for_var_id.
This function is crucial for linking variables to captures. Please confirm it is tested, especially for edge cases with no captures.
142-149: Check the mapping consistency in get_tag_ids_for_capture_id.
Ensure no collisions or missing data for capture IDs, as returningstd::nulloptcan lead to subtle bugs if unhandled downstream.
159-170: Graceful composition in get_registers_for_capture.
Composing your get_* methods (tags → regs) is a neat cascade. Ensure thorough unit tests for all paths that might returnstd::nullopt.
172-173: Symbol ID mappings are clearly placed.
Exposingm_symbol_idandm_id_symbolfosters straightforward lookups. They seem consistent with the rest of the design.
195-195: m_dfa pointer usage.
Confirmed that using aunique_ptris safe and ensures ownership remains within the lexer. This is a solid practice for resource management.
198-200: Capture-to-tag and tag-to-reg mappings.
The new data structures (m_var_id_to_capture_ids,m_capture_id_to_tag_ids,m_tag_to_final_reg_id) form a consistent chain for retrieving register ops.src/log_surgeon/finite_automata/Dfa.hpp (11)
5-5: Appropriate new includes for extended features.
Including map, stack, Register, RegisterHandler, and TagOperation supports the new template parameters, transitions, and register handling logic.Also applies to: 8-8, 11-12, 14-14, 17-20
23-23: Additional template parameter.
Introducing a second template parameter forTypedNfaStatealigns the DFA with NFA expansions and boosts type safety.
26-26: Constructor taking a const reference.
Passingnfaby const reference is an efficient choice, preventing unnecessary copying of potentially large structures.
38-38: Improved new_state method.
Accepting a set of NFA states ensures clarity around the relationship between the newly created DFA state and its source NFA states.
51-51: Intersect documentation.
The doc clarifies how the method identifies which schema types are reachable in both DFAs. This is a nice explicit approach.
54-62: BFS traversal for prospective usage.
The doc describing a BFS order is relevant for debugging or advanced transformations. Good to have.
65-112: Methodical approach constructing the DFA.
Using a stack of sets and computing epsilon closures systematically is a robust technique. The usage offalse == unmarked_sets.empty()also matches the style guidelines.
95-95: Transition object usage.
Passing{{}, dest_state}clarifies that no register ops are immediately triggered on this transition, reinforcing the new typed transition structure.
114-126: Populating matching_variable_ids.
Collecting variable IDs from each accepting NFA state ensures that the final DFA state knows the correct schema types. This is straightforward and effective.
150-179: BFS ordering is well-structured.
The queue-based approach systematically visits states. The block properly updates visited states to avoid infinite loops.
181-195: serialize method logic is clean.
Concise BFS-based ordering plus using a map of state IDs ensures consistent output across runs, which aids debugging and testing.src/log_surgeon/finite_automata/NfaState.hpp (17)
4-4: New includes look appropriate.
No immediate concerns.Also applies to: 12-12, 14-14, 17-17, 20-21
37-37: Explicit default constructor is clear.
No issues found.
39-41: Accepting constructor is concise.
The constructor eagerly marks the state as accepting and sets the matching variable ID. This looks intentional and correct.
43-49: Constructor chaining logic.
Invokingadd_spontaneous_transitionin the constructor ensures consistent initialization. This is a good pattern.
51-52: Spontaneous transition addition (no tag operations).
Function is straightforward, adding a transition without tag operations. Good clarity.
55-66: Spontaneous transition addition (with tag operations).
Collectingtag_id_tintoTagOperationobjects is concise. The logic properly reserves space based ontag_ids.size().
88-88: Accurate method documentation.
The updated docstring references forwardingSpontaneousTransition::serializereturn value. Good clarity on failure conditions.
93-97: Accessors for acceptance & variable ID.
These accessors are clear and maintain good const-correctness.
98-98: Empty line insertion.
No functional impact.
99-102: Getter for spontaneous transitions.
Exposes internal vector as a constant reference. This is acceptable for read-only access.
104-106: Getter for byte transitions.
Returns a reference to a vector ofNfaState*. Straightforward approach.
108-109: Getter for tree transitions.
Returns aTree const&for UTF-8 states. This remains flexible for the Byte variant.
113-113: Spontaneous transitions member.
Storing them in a vector is consistent with usage.
170-180:epsilon_closureusesfalse == stack.empty().
Matches the coding guidelines and ensures we gather all reachable states from spontaneous transitions.
189-191: Optional acceptance string.
Constructingaccepting_tag_stringonly whenm_acceptingis true is an efficient approach.
201-208: Serializing spontaneous transitions with guard checks.
Good handling ofstd::nulloptto indicate serialization issues.
210-211: Format string updated for spontaneous transitions.
The string now includes the set of compiled transitions, aligning with the new design.Also applies to: 215-215
src/log_surgeon/finite_automata/Nfa.hpp (16)
4-4: New includes & optional usage.
Including<cstddef>and<optional>is consistent with changes in constructor arguments.Also applies to: 7-7
19-21: Included Capture headers & TagOperation.
These additions reflect the new approach for capturing transitions.
23-23: UniqueIdGenerator usage.
Including<log_surgeon/UniqueIdGenerator.hpp>clarifies how unique IDs for tags are created.
26-33: Class-level docstring is thorough.
Explains NFA usage, acceptance ofTypedNfaState, and capture group uniqueness assumptions.
39-39: Constructor acceptsconst&for rules.
Avoids unnecessary copies, which is beneficial for performance.
48-50:new_accepting_statecreation.
Declares a state withmatching_variable_id, enabling acceptance logic. Clear naming.Also applies to: 52-52
57-57:new_state_for_negative_captures
Accepts captures, acquires respective tags, then constructs a negation transition. Clean approach to handle negative captures.Also applies to: 61-63
67-68:new_start_and_end_states_for_capture
Sets start/end tags for captures. This approach ensures separation of concerns for capture-specific transitions.Also applies to: 69-69
87-88:serializedocstring clarifiesstd::nullopt.
Matches the shift to optional returns in the NFA states.Also applies to: 89-89
120-126: Constructor setsm_root.
Begins the NFA with a new blank root state. This sets a consistent baseline for building the automaton.
128-140:get_or_create_capture_tagsmethod.
Ensures consistent creation/storage of tag IDs for each capture. The logic is well encapsulated.
149-153:new_accepting_stateusage.
Confirms acceptance state creation is specialized with a matching variable ID.
156-171:new_state_for_negative_captures
Moves tags to the constructor ofTypedNfaStatewithTagOperationType::Negate. Clear approach.
174-178:new_start_and_end_states_for_capture
Initializes the start tag on the root transition and the end tag on a dedicated new state. Good structuring.
188-189:serializereturnsstd::nullopton failure.
Properly aligns with NFA states' optional serialization.Also applies to: 222-223
232-237: Retaining BFS order in serialization.
Appending toserialized_statesensures a stable ordering of states in output.src/log_surgeon/Lexer.tpp (12)
7-7: Runtime error usage.
Including<stdexcept>suggests robust exception handling as needed.
86-87: Re-checking state after newline.
Resetting to the root and matching at line breaks is consistent with the existence of delimiters.Also applies to: 90-91, 92-92
98-98: Null destination state check.
Ends scanning if no transition exists. Straightforward.
217-218: Re-checking newlines in wildcard scan.
Matches the approach inscan(), re-initializing root transitions where needed.Also applies to: 221-222
245-246: Wildcard expansions.
Ensures that?and*behave consistently by traversing potential states or verifying acceptance. Implementation is thorough.Also applies to: 280-282
337-340:prepend_start_of_file_char
Setsm_prev_stateto the root’s transition onutf8::cCharStartOfFile, allowing scanning from a synthetic start token.
361-366:add_ruleusessymbol_id_t.
Changing the function signature to acceptsymbol_id_tclarifies the variable ID usage.
369-370:get_rulematches byvar_id.
Maintains readability and ensures the correct rule is retrieved.Also applies to: 372-372
381-395: Looping over captures to ensure uniqueness.
Throws an exception if a duplicate capture name is discovered. This is crucial for correctness.
397-403: Populatingm_capture_id_to_tag_ids.
Bridges the NFA’s captured tag IDs with the lexer’s symbol IDs. Good separation of concerns.
405-405: Creating the DFA from the updated NFA.
Note that captures are not fully utilized in the DFA yet, as included in the TODO.
408-414: Settingm_is_first_char.
Initializes state checks for potential transitions from the root, facilitating the scanning logic.src/log_surgeon/finite_automata/RegexAST.hpp (30)
23-23: Includes are logically arranged.
No concerns. The inclusion of "Capture.hpp" aligns with the capture-based refactoring.
33-37: Clear explanation of capture usage.
The updated documentation clarifies how captures are maintained in each node’s subtree. This is a good improvement for future maintainers.
85-99: Ensure consistent capture aggregation.
Collecting subtree captures from child nodes and merging them is essential for correct AST composition. The approach here looks consistent. Might be worth confirming that older references to tags have been removed entirely.
101-103: Negative capture handling.
Setting negative captures is consistent with your broader design, ensuring AST nodes can indicate captures that must remain unmatched along a certain path.
110-122: Good use of negative-capture logic.
This helper ensures that negative captures are handled seamlessly before or in parallel with normal transitions. It's also using the recommended comparison style:if (false == m_negative_captures.empty()) {...}, matching your guidelines.
130-148: Serialization approach is well-structured.
Converting negative captures into a user-friendly string is straightforward, and making use offmtsimplifies the operation.
151-153: Subtree and negative captures stored clearly.
This layout ofm_subtree_capturesandm_negative_capturesis intuitive. Them_subtree_capture_idsusage is not immediately evident, so ensure it’s properly maintained if used downstream.
630-630: Maintaining a non-null capture.
The documentation properly stresses thatm_capturemust never be null. This requirement helps maintain consistent semantics throughout the codebase.
641-642: Constructor argument clarity.
Making the capture argument explicit highlights its importance. Throwing an exception on a null pointer is a solid defensive design choice.
653-659: Building capture sets.
You accumulate existing subtree captures fromm_group_regex_astand then add the newm_capture. This logic ensures correct capture inheritance. Make sure any derived classes do the same if extended.
666-668: Copy constructor is straightforward.
Cloning both the group regex AST and capture works as intended, preserving the AST subtree fully.
707-707: Getter method is sufficiently clear.
Returningstd::string_viewis a good choice for performance, and naming the methodget_group_name()is descriptive.
720-722: Serialization of empty node.
Yielding only negative-capture info is well handled, ensuringRegexASTEmptyaligns with the new approach.
738-739: Literal serialization.
Appending negative captures after the literal is consistent with the rest of the design. Looks good.
769-770: Integer serialization.
Excellent approach to ensure the integer digits and negative captures are merged into the final serialized output.
774-783: OR constructor merges capture information.
Ensuring each side is marked with the other’s subtree captures as negative captures is sophisticated but correct for capturing exclusive paths.
789-790: Adding to NFA with negative captures.
Both sides of the OR useadd_to_nfa_with_negative_captures. This is consistent with the design.
794-794: Safe fallback on serialization.
Handles null pointers gracefully for each side of the OR.
799-800: Pub/sub on negative captures in string form.
Appended at the end of the format string to reflect captures that do not apply.
810-811: Concatenation constructor merges captures.
Setting the subtree captures by combining left and right is essential for consecutive expressions.
819-822: Concatenation NFA transitions.
Inserting an intermediate state for the left half, then resetting root, is a clear technique for combining subexpressions.
828-832: Concatenation serialization.
Combining each child’s serialization plus negative captures ensures a standard output format.
837-840: Consistent approach in Multiplication constructor.
Propagating subtree captures from the operand ensures repetition logic remains correct.
844-844: Subtree capture assignment.
This single-line approach is neat. No issues.
854-879: NFA construction for repeated expressions.
The logic properly handles edge cases form_min == 0andis_infinite(), ensuring robust transitions for repeated patterns. Good usage of negative captures in each iteration.
893-894: Multiplication serialization boundary.
Appending"{min, max}"with'inf'when appropriate is a clean representation.
900-921: Capturing docstring improvements.
Fully annotated diagram clarifies the start and end capture transitions, especially with negative captures in mind. Good documentation.
932-932: Creation of start/end states.
Constructing distinct start/end states for the capture is a strong design, isolating the capture’s boundaries.
936-936: Delegating.
Callingm_group_regex_ast->add_to_nfa_with_negative_capturesfrom within capture logic is consistent with the rest of the design.
942-950: Capture serialization format.
Inserting<capture_name>into the output is user-friendly, and the usage ofstd::u32stringfor wide support is wise.src/log_surgeon/finite_automata/Capture.hpp (1)
1-20: Renaming from Tag to Capture is coherent.
The rename aligns with the new capture-based approach. TheCaptureclass is succinct, and the constructor’s move semantics are standard.src/log_surgeon/LexicalRule.hpp (2)
34-37: Address the TODO comment about making the returned pointer constant.The comment indicates a need to make the returned pointer constant. This should be addressed to prevent unintended modifications.
Would you like me to propose a solution for making the returned pointer constant?
44-48: LGTM! Clean refactoring of add_to_nfa.The changes simplify the code by combining state creation and configuration into a single operation.
src/log_surgeon/finite_automata/TagOperation.hpp (1)
24-30: LGTM! Efficient comparison implementation.Good use of std::tie for implementing both operators.
src/log_surgeon/finite_automata/RegisterOperation.hpp (1)
42-44: Follow coding guidelines for boolean comparison.As per coding guidelines, prefer
false == <expression>over!<expression>.The current implementation already follows this guideline correctly.
src/log_surgeon/finite_automata/SpontaneousTransition.hpp (1)
56-62: Great use of C++20 ranges for transforming operations!The use of ranges for transforming tag operations is elegant and modern.
src/log_surgeon/finite_automata/DfaStatePair.hpp (1)
72-76: Improved method naming and condition syntax!The changes enhance readability through:
- More descriptive method name
get_dest_stateinstead ofnext- Clearer null pointer checks following the coding guideline of
nullptr != xsrc/log_surgeon/finite_automata/PrefixTree.hpp (2)
15-20: Documentation terminology updated appropriately!The change from "tag" to "capture" terminology maintains clarity while aligning with the broader codebase changes.
28-28: Good addition of a named constant!The
cDefaultPosconstant improves code clarity by providing a meaningful name for the default position value.tests/test-register-handler.cpp (1)
40-40: Good improvement in const correctness!Adding the
constqualifier toempty_handlerproperly indicates that it shouldn't be modified after initialization.examples/intersect-test.cpp (3)
24-24: Type signature properly updated for the new DFA implementation!The addition of
ByteNfaStateas a template parameter improves type safety and clarity.
44-44: DFA construction updated consistently!The template parameters in the DFA construction match the function signature changes.
82-82: DFA construction in main function properly updated!The template parameters in the DFA construction are consistent with the other changes in the file.
tests/test-nfa.cpp (3)
8-10: LGTM! Header changes align with component restructuring.The header changes appropriately reflect the shift from RegexAST to more specific NFA state and lexical rule components.
41-77: LGTM! Serialization format updated correctly.The changes properly reflect the transition from tag-based to capture-based transitions, with appropriate documentation updates.
80-87: LGTM! Improved error handling with optional values.The changes enhance robustness by properly handling optional values in the serialization process.
tests/test-lexer.cpp (1)
128-153: Address TODO comment for register value verification.The test function should be updated to include register value verification once simulation is implemented.
Would you like me to help implement the register value verification once the simulation functionality is available?
src/log_surgeon/SchemaParser.cpp (3)
3-23: LGTM! Enhanced error handling and type safety.The addition of headers like , , and improves error handling and type safety.
177-177: LGTM! Updated to use Capture instead of Tag.The change correctly implements the transition from tag-based to capture-based system.
212-213: LGTM! Updated terminology in comments.Comments correctly reflect the change from "negative tags" to "negative captures".
Also applies to: 258-259
.github/workflows/build.yaml (3)
31-31: LGTM! Improved build reproducibility.Using a specific Ubuntu version (22.04) instead of latest ensures consistent build environments.
42-42: LGTM! Updated condition matches OS matrix.The condition correctly reflects the change to ubuntu-22.04 in the OS matrix.
56-58: LGTM! Enhanced CI debugging capabilities.The addition of test log printing on failure will help diagnose CI issues more effectively.
tests/CMakeLists.txt (5)
5-8: Finite Automata Header Files Addition:
The inclusion of new finite automata headers (i.e.Capture.hpp,Dfa.hpp,DfaState.hpp, andDfaTransition.hpp) is well organised and clearly supports the new DFA features.
14-14: Register and Transition Header Files Update:
The additions ofRegister.hpp,RegisterOperation.hpp,SpontaneousTransition.hpp, andTagOperation.hppare consistent with the PR’s objectives of introducing register operations and updating DFA transitions.Also applies to: 16-17, 19-19
31-31: Unique Identifier Support:
AddingUniqueIdGenerator.hppensures that unique ID functionality is accessible, supporting the broader finite automata enhancements.
34-42: Enhanced Test Suite Configuration:
The update to theSOURCES_TESTSvariable to includetest-capture.cppandtest-dfa.cpp(in addition to the other test files) effectively aligns with the new features introduced in this PR.
44-50: Executable Target Configuration:
The target definition forunit-testand the associated compile features and include directories are correctly set up.lint-tasks.yml (4)
1-16: Lint Task Configuration Overview:
The initial setup, including the definition of variables and the organization of tasks, is clean and clearly structured. There are no issues identified in these initial sections.
29-32: Expanded Source Patterns for C++ Linting:
The addition of source file patterns for the tests directory (i.e.tests/**/*.cpp,tests/**/*.h,tests/**/*.hpp, andtests/**/*.tpp) enhances linting coverage to include all relevant test files.
69-72: Exclusion of Build Directories:
The updatedfindcommand now excludes files in paths matching*/*build*/*, which is a good practice to avoid unnecessary linting of generated files.
61-68: YAML Linting Task Details:
The configuration for theyml-checkandyml-fixtasks (including activation of the virtual environment and the use of a strict configuration) is appropriate and well-implemented.CMakeLists.txt (2)
114-115: UniqueIdGenerator Header Inclusion:
The inclusion ofUniqueIdGenerator.hppat the end of theSOURCE_FILESlist correctly supports the new functionality for generating unique identifiers.
125-239: Build and Installation Configuration Integrity:
The remainder of the CMake configuration—including external dependencies, target linking, compile options, and installation directives—is set up correctly and follows best practices. No issues were identified in these sections.
78cac0b to
61ad3cb
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (3)
src/log_surgeon/finite_automata/RegisterHandler.hpp (3)
21-27:⚠️ Potential issueAdd overflow check for multiple register additions.
The method should validate that adding
num_reg_to_addregisters won't cause an overflow of the registers vector.Apply this diff to add the overflow check:
auto add_registers(uint32_t const num_reg_to_add) -> std::vector<uint32_t> { + if (m_registers.size() > std::numeric_limits<size_t>::max() - num_reg_to_add) { + throw std::overflow_error("Adding registers would cause overflow"); + } std::vector<uint32_t> added_registers; for (uint32_t i{0}; i < num_reg_to_add; ++i) { added_registers.emplace_back(add_register()); } return added_registers; }
29-35:⚠️ Potential issueAdd overflow check for register ID.
The method should validate that adding a new register won't cause an overflow of the registers vector.
Apply this diff to add the overflow check:
auto add_register() -> reg_id_t { + if (m_registers.size() >= std::numeric_limits<size_t>::max()) { + throw std::overflow_error("Adding register would cause overflow"); + } auto const prefix_tree_node_id{ m_prefix_tree.insert(PrefixTree::cRootId, PrefixTree::cDefaultPos) }; m_registers.emplace_back(prefix_tree_node_id); return m_registers.size() - 1; }
37-43:⚠️ Potential issueAdd overflow check for register ID.
The method should validate that adding a new register won't cause an overflow of the registers vector.
Apply this diff to add the overflow check:
auto add_register(PrefixTree::id_t const prefix_tree_parent_node_id) -> reg_id_t { + if (m_registers.size() >= std::numeric_limits<size_t>::max()) { + throw std::overflow_error("Adding register would cause overflow"); + } auto const prefix_tree_node_id{ m_prefix_tree.insert(prefix_tree_parent_node_id, PrefixTree::cDefaultPos) }; m_registers.emplace_back(prefix_tree_node_id); return m_registers.size() - 1; }
🧹 Nitpick comments (4)
tests/test-nfa.cpp (1)
42-62: Consider adding state transition documentation.While the expected output correctly represents the NFA states and transitions, adding comments to explain the purpose of key states would improve maintainability. For example:
- State 0: Initial state with branches for 'Z' and 'A'
- States 5-8: Letter capture states
- States 14-15: Container ID digit capture states
src/log_surgeon/finite_automata/Dfa.hpp (3)
59-107: Verify the TODO comment about UTF-8 case.The implementation is well-structured, but there's a TODO comment about handling the UTF-8 case. Please ensure this is tracked.
Would you like me to create an issue to track the implementation of UTF-8 case handling?
123-142: Verify the TODO comment about UTF-8 handling.The implementation is well-structured, but there's a TODO comment about handling UTF-8 multi-byte transitions. Please ensure this is tracked.
Would you like me to create an issue to track the implementation of UTF-8 multi-byte transitions?
144-174: Verify the TODO comment about UTF-8 case.The implementation is efficient with proper container reservations, but there's a TODO comment about handling the UTF-8 case. Please ensure this is tracked.
Would you like me to create an issue to track the implementation of UTF-8 case handling in BFS traversal?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
CMakeLists.txt(1 hunks)src/log_surgeon/Lexer.hpp(5 hunks)src/log_surgeon/Lexer.tpp(8 hunks)src/log_surgeon/LexicalRule.hpp(2 hunks)src/log_surgeon/finite_automata/Dfa.hpp(5 hunks)src/log_surgeon/finite_automata/Nfa.hpp(7 hunks)src/log_surgeon/finite_automata/NfaState.hpp(6 hunks)src/log_surgeon/finite_automata/PrefixTree.hpp(1 hunks)src/log_surgeon/finite_automata/RegexAST.hpp(4 hunks)src/log_surgeon/finite_automata/RegisterHandler.hpp(2 hunks)src/log_surgeon/finite_automata/RegisterOperation.hpp(1 hunks)src/log_surgeon/finite_automata/SpontaneousTransition.hpp(1 hunks)src/log_surgeon/finite_automata/TagOperation.hpp(1 hunks)tests/CMakeLists.txt(2 hunks)tests/test-nfa.cpp(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- src/log_surgeon/finite_automata/PrefixTree.hpp
- CMakeLists.txt
- src/log_surgeon/finite_automata/RegisterOperation.hpp
- tests/CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,ts,tsx}`: - Prefer `false ==
**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/LexicalRule.hpp
tests/test-nfa.cpp
src/log_surgeon/finite_automata/Dfa.hpp
src/log_surgeon/finite_automata/TagOperation.hpp
src/log_surgeon/finite_automata/SpontaneousTransition.hpp
src/log_surgeon/finite_automata/RegexAST.hpp
src/log_surgeon/finite_automata/RegisterHandler.hpp
src/log_surgeon/Lexer.hpp
src/log_surgeon/finite_automata/Nfa.hpp
src/log_surgeon/finite_automata/NfaState.hpp
**/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.
src/log_surgeon/LexicalRule.hpptests/test-nfa.cppsrc/log_surgeon/finite_automata/Dfa.hppsrc/log_surgeon/finite_automata/TagOperation.hppsrc/log_surgeon/finite_automata/SpontaneousTransition.hppsrc/log_surgeon/finite_automata/RegexAST.hppsrc/log_surgeon/finite_automata/RegisterHandler.hppsrc/log_surgeon/Lexer.hppsrc/log_surgeon/finite_automata/Nfa.hppsrc/log_surgeon/finite_automata/NfaState.hpp🧠 Learnings (5)
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegisterHandler.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T21:56:13.425Z
Learning: In the `log_surgeon` project, header guards in C++ header files should include `_HPP` at the end to match the filename. For example, in `RegisterHandler.hpp`, the header guard should be `LOG_SURGEON_FINITE_AUTOMATA_REGISTER_HANDLER_HPP`.
src/log_surgeon/Lexer.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
src/log_surgeon/finite_automata/Nfa.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/NfaState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
🔇 Additional comments (37)
tests/test-nfa.cpp (2)
9-10: LGTM! Header additions are well-organized.The new headers align with the PR's focus on enhancing NFA state management and lexical rule handling.
65-67: LGTM! Robust error handling and test diagnostics.The changes demonstrate good practices:
- Using std::optional for safe serialization result handling
- Adding explicit value checks
- Improving test failure diagnostics with CAPTURE
Also applies to: 72-72
src/log_surgeon/finite_automata/RegisterHandler.hpp (4)
45-47: LGTM!The method correctly uses
at()for bounds checking and properly delegates to the prefix tree's set operation.
49-51: LGTM!The method correctly uses
at()for bounds checking and properly copies the prefix tree node ID.
53-56: LGTM!The method correctly uses
at()for bounds checking and properly delegates to the prefix tree's insert operation.
58-61: LGTM!The method correctly uses
at()for bounds checking and properly delegates to the prefix tree's get_reversed_positions operation.src/log_surgeon/finite_automata/RegexAST.hpp (4)
24-24: Include of TagOperation.hpp is consistent with the new architecture
No issues observed with this addition.
863-866: Spontaneous transition for optional repetitions
This approach for handlingm_min == 0via a spontaneous transition is logical and aligns with zero-times repetition semantics.
911-943:Details
✅ Verification successful
Capture group documentation and implementation
The expanded comments on the capture group NFA structure are very clear. Please verify that tests thoroughly exercise these transitions and capture boundaries.
🏁 Script executed:
#!/bin/bash # Searching for references to RegexASTCapture usage in test files to confirm coverage rg -A 10 "RegexASTCapture" testsLength of output: 1809
Verified: Capture Group NFA Testing Coverage is Sufficient
The tests in
tests/test-lexer.cppconfirm thatRegexASTCapture(and its specialised instantiation, e.g.RegexASTCaptureByte) are being exercised. The dynamic casts and assertions—including checks for capture names and properties of the nested capture regex—demonstrate that the transitions and capture boundaries documented insrc/log_surgeon/finite_automata/RegexAST.hpp(lines 911–943) are thoroughly tested.
117-122:Details
❓ Verification inconclusive
Negative captures logic looks correct
This usage ofnew_state_from_negative_capturesis consistent with the existing NFA flow. Ensure corresponding tests confirm that negative captures work as intended.
🏁 Script executed:
#!/bin/bash # Searching for tests referencing "new_state_from_negative_captures" to confirm coverage rg -A 5 "new_state_from_negative_captures" testsLength of output: 48
Negative Captures Logic Verification
The code at lines 117–122 in
src/log_surgeon/finite_automata/RegexAST.hppappears to correctly invokenew_state_from_negative_capturesin a manner that fits the existing NFA flow. However, the search for tests referencingnew_state_from_negative_capturesin thetestsdirectory returned no results. Could you please verify manually that there are tests covering negative captures? If not, please consider adding tests to ensure that the negative captures functionality is working as intended.src/log_surgeon/LexicalRule.hpp (3)
5-7: Headers imported for captures and vectors
These headers are correct and necessary for the new capture functionality.
28-30: Validate lifetime of returned captures vector
Theget_captures()method returns a reference to data owned bym_regex. Ensure that users of this method do not outlivem_regex.
45-48: Integration with new_accepting_state and negative captures
Callingnew_accepting_state(m_variable_id)followed byadd_to_nfa_with_negative_capturesis consistent with the updated negative captures approach.src/log_surgeon/finite_automata/TagOperation.hpp (2)
1-58: Well-structured implementation of TagOperation
The enum, constructor, and serialization logic follow a clear design.
1-58: Consider handling unforeseen TagOperationType cases
If new enum values are introduced in the future, the switch inserialize()may silently omit them. A fallback case or exception can help detect missing cases at compile time.src/log_surgeon/finite_automata/SpontaneousTransition.hpp (4)
26-30: LGTM! Well-designed constructors.The constructors are well-implemented, with efficient use of
std::movefor tag operations.
32-34: LGTM! Well-designed getter methods.The getters follow best practices with const references and proper use of [[nodiscard]].
36-42: LGTM! Well-documented method declaration.The serialize method is well-documented with clear @param and @return descriptions.
49-67: LGTM! Modern and efficient implementation.The implementation makes good use of modern C++ features:
- Proper use of ranges for transforming tag operations
- Efficient string formatting with fmt
- Follows coding guidelines with
false == state_ids.contains(m_dest_state)src/log_surgeon/finite_automata/Dfa.hpp (2)
109-121: LGTM! Efficient state creation.The implementation efficiently creates new DFA states using smart pointers and proper state management.
176-190: LGTM! Clean serialization implementation.The implementation efficiently serializes the DFA using BFS traversal order and proper string formatting.
src/log_surgeon/finite_automata/NfaState.hpp (5)
40-52: LGTM! Well-designed constructors.The constructors are well-implemented with proper initialization and delegation.
54-65: LGTM! Efficient transition addition.The implementation efficiently adds spontaneous transitions with proper memory management.
92-108: LGTM! Well-designed getter methods.The getters follow best practices with proper use of [[nodiscard]] and const references.
168-183: LGTM! Well-implemented epsilon closure.The implementation correctly handles epsilon closure using spontaneous transitions and follows coding guidelines.
185-217: LGTM! Clean serialization implementation.The implementation efficiently serializes the NFA state and follows coding guidelines.
src/log_surgeon/Lexer.hpp (3)
121-124: LGTM! Well-typed DFA accessor.The method correctly returns a const reference to the DFA with proper template parameters.
159-165: LGTM! Well-implemented register ID lookup.The method correctly handles optional values and follows coding guidelines.
216-221: Verify the choice of std::map over std::unordered_map.The implementation uses
std::mapform_tag_to_final_reg_idinstead ofstd::unordered_map. Consider usingstd::unordered_mapif ordered iteration is not required, as it provides better average-case performance.Would you like me to measure the performance impact of using
std::unordered_mapinstead ofstd::map?src/log_surgeon/finite_automata/Nfa.hpp (4)
147-152: LGTM!The function correctly creates a new accepting state with the provided matching variable ID.
154-170: LGTM!The function correctly creates a new state with negative tag operations for the provided captures.
172-185: LGTM!The function correctly creates start and end states with appropriate tag operations for positive captures.
220-238: LGTM!The function now properly handles serialization failures by returning
std::nulloptinstead of potentially invalid data.src/log_surgeon/Lexer.tpp (4)
87-88: LGTM!The function now uses the more descriptive
get_dest_statemethod and has improved error handling with explicit nullptr checks.Also applies to: 91-92, 93-94, 99-100
218-219: LGTM!The function now uses the more descriptive
get_dest_statemethod and has improved error handling with explicit nullptr checks for wildcard scanning.Also applies to: 222-223, 224-225, 230-231, 246-247, 280-283
340-341: LGTM!The function now uses the more descriptive
get_dest_statemethod for handling start-of-file characters.
407-408: LGTM!The function now correctly uses the updated DFA constructor with both
TypedDfaStateandTypedNfaStatetemplate parameters, and has improved error handling with explicit nullptr checks.Also applies to: 410-411
69de670 to
ed7ef25
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
src/log_surgeon/LexicalRule.hpp (1)
34-37: 🛠️ Refactor suggestionAddress the TODO comment by making the returned pointer constant.
The TODO comment indicates that the returned pointer should be constant. This would improve const-correctness and prevent unintended modifications.
Apply this diff to fix the issue:
- [[nodiscard]] auto get_regex() const -> finite_automata::RegexAST<TypedNfaState>* { - // TODO: make the returned pointer constant - return m_regex.get(); + [[nodiscard]] auto get_regex() const -> finite_automata::RegexAST<TypedNfaState> const* { + return m_regex.get();
♻️ Duplicate comments (1)
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)
29-43:⚠️ Potential issueAdd overflow protection when calculating register IDs.
The methods return
size() - 1without checking for vector overflow.auto add_register() -> reg_id_t { + if (m_registers.size() >= std::numeric_limits<reg_id_t>::max()) { + throw std::runtime_error("Register limit exceeded"); + } auto const prefix_tree_node_id{ m_prefix_tree.insert(PrefixTree::cRootId, PrefixTree::cDefaultPos) }; m_registers.emplace_back(prefix_tree_node_id); return m_registers.size() - 1; }
🧹 Nitpick comments (4)
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)
21-27: Consider pre-allocating the vector size for better performance.The vector could grow multiple times during the loop. Pre-allocating the size would prevent reallocations.
auto add_registers(uint32_t const num_reg_to_add) -> std::vector<uint32_t> { std::vector<uint32_t> added_registers; + added_registers.reserve(num_reg_to_add); for (uint32_t i{0}; i < num_reg_to_add; ++i) { added_registers.emplace_back(add_register()); } return added_registers; }src/log_surgeon/finite_automata/Dfa.hpp (1)
144-174: LGTM, but address the UTF-8 TODO comment.The BFS implementation is efficient with proper container usage and capacity reservations. However, the TODO comment about handling the UTF-8 case needs to be addressed.
Would you like me to help implement the UTF-8 case handling?
src/log_surgeon/Lexer.tpp (2)
246-247: Consider using early returns to simplify the wildcard handling logic.The code could be more concise and easier to follow by using early returns.
Apply this diff to improve readability:
- auto* dest_state{state->get_dest_state(byte)}; - if (false == dest_state->is_accepting()) { + auto* dest_state = state->get_dest_state(byte); + if (nullptr == dest_state || false == dest_state->is_accepting()) { token = Token{m_last_match_pos, input_buffer.storage().pos(), input_buffer.storage().get_active_buffer(),Also applies to: 280-282
406-407: Address the TODO comment about DFA ignoring captures.The comment indicates that the DFA is not properly handling captures, which could lead to incorrect parsing.
Would you like me to help implement proper capture handling in the DFA?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
CMakeLists.txt(1 hunks)examples/intersect-test.cpp(3 hunks)src/log_surgeon/Lexer.hpp(4 hunks)src/log_surgeon/Lexer.tpp(8 hunks)src/log_surgeon/LexicalRule.hpp(1 hunks)src/log_surgeon/SchemaParser.cpp(1 hunks)src/log_surgeon/finite_automata/Dfa.hpp(5 hunks)src/log_surgeon/finite_automata/DfaState.hpp(3 hunks)src/log_surgeon/finite_automata/DfaStatePair.hpp(1 hunks)src/log_surgeon/finite_automata/Nfa.hpp(2 hunks)src/log_surgeon/finite_automata/PrefixTree.hpp(1 hunks)src/log_surgeon/finite_automata/RegisterHandler.hpp(2 hunks)tests/CMakeLists.txt(2 hunks)tests/test-register-handler.cpp(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- src/log_surgeon/finite_automata/PrefixTree.hpp
- CMakeLists.txt
- tests/test-register-handler.cpp
- src/log_surgeon/SchemaParser.cpp
- tests/CMakeLists.txt
- src/log_surgeon/finite_automata/Nfa.hpp
- src/log_surgeon/finite_automata/DfaStatePair.hpp
- examples/intersect-test.cpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
src/log_surgeon/LexicalRule.hppsrc/log_surgeon/Lexer.tppsrc/log_surgeon/finite_automata/Dfa.hppsrc/log_surgeon/Lexer.hppsrc/log_surgeon/finite_automata/RegisterHandler.hppsrc/log_surgeon/finite_automata/DfaState.hpp
🧠 Learnings (1)
src/log_surgeon/finite_automata/RegisterHandler.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T21:56:13.425Z
Learning: In the `log_surgeon` project, header guards in C++ header files should include `_HPP` at the end to match the filename. For example, in `RegisterHandler.hpp`, the header guard should be `LOG_SURGEON_FINITE_AUTOMATA_REGISTER_HANDLER_HPP`.
🔇 Additional comments (18)
src/log_surgeon/finite_automata/RegisterHandler.hpp (3)
1-9: LGTM! Header guards and includes are well-structured.The header guard follows the correct naming convention, and the includes are appropriate for the functionality.
11-18: LGTM! Documentation is clear and informative.The class documentation effectively explains the purpose and responsibilities, with a helpful note about register initialization.
45-61: LGTM! Register operations are well-implemented.The methods use
at()for bounds checking and have been correctly updated to usereg_id_t.src/log_surgeon/finite_automata/DfaState.hpp (5)
34-38: LGTM! Proper initialization of transitions array.The constructor correctly initializes all transitions following RAII principles, which prevents potential null pointer issues.
52-55: LGTM! Improved type safety with DfaTransition.The method now accepts a proper DfaTransition object instead of a raw pointer, which enhances type safety.
57-59: LGTM! Clean implementation of accepting operations.The method provides a clear interface for adding register operations.
101-141: LGTM! Well-structured serialization implementation.The serialization logic is thorough and makes good use of the fmt library for string formatting. The code handles all aspects of the DFA state: accepting tags, operations, and transitions.
84-99: 🛠️ Refactor suggestionFix boolean expression to follow coding guidelines.
The implementation looks good, but there's a minor style issue.
Update line 94 to follow the coding guideline preferring 'false ==' over '!':
- if (false == result->empty()) { + if (false == result->empty()) {Likely an incorrect or invalid review comment.
src/log_surgeon/LexicalRule.hpp (3)
5-5: LGTM!The added includes are necessary for the new functionality and follow standard include ordering.
Also applies to: 7-7
28-30: LGTM!The method is well-designed with appropriate const-correctness and
[[nodiscard]]attribute.
44-50: LGTM!The method follows a clear sequence of operations and correctly uses negative captures.
src/log_surgeon/finite_automata/Dfa.hpp (3)
11-13: LGTM!The added includes support the new serialization functionality, and the additional template parameter enhances type safety.
Also applies to: 14-14, 19-19
22-22: LGTM!The changes improve efficiency and const-correctness while adding useful serialization functionality.
Also applies to: 27-27, 34-34
103-103:Details
❓ Verification inconclusive
Verify the empty set usage in byte transitions.
The empty set passed to
add_byte_transitionsuggests potential missing functionality or an incomplete implementation.Run the following script to check for similar patterns:
🏁 Script executed:
#!/bin/bash # Description: Search for other instances of empty set usage in transitions. rg -A 5 'add_byte_transition.*\{\{\},'Length of output: 415
Review of Empty Set Usage in Byte Transitions
The usage at line 103 in
src/log_surgeon/finite_automata/Dfa.hpp—where an empty set is passed in the call todfa_state->add_byte_transition—appears to be intentional, particularly given the nearby comment referencing a future UTF‑8 case implementation. However, please verify that this placeholder behaviour aligns with our design goals and consider adding further documentation or inline comments to clarify its purpose for future maintainability.src/log_surgeon/Lexer.hpp (3)
6-6: LGTM!The added map include supports the new tag-to-register mapping functionality.
216-216: LGTM!The DFA type update improves type safety by making the NFA state type explicit.
221-221:Details
❓ Verification inconclusive
Verify the need for ordered map over unordered_map.
The change from
unordered_maptomapsuggests iteration order is important. Please confirm if this is necessary, asunordered_mapgenerally provides better performance.Run the following script to check if iteration order is used:
🏁 Script executed:
#!/bin/bash # Description: Search for iterations over m_tag_to_final_reg_id rg -A 5 'for.*m_tag_to_final_reg_id'Length of output: 38
Verify the need for an ordered map.
The change from using an
unordered_mapto amapsuggests that the iteration order might be relevant. However, our automated search for loops iterating overm_tag_to_final_reg_idreturned no results. Could you please manually verify whether preserving order is actually required in subsequent code? This additional check would help confirm if the ordered container is necessary or if reverting tounordered_mapfor potential performance gains would be acceptable.src/log_surgeon/Lexer.tpp (1)
87-88: LGTM!The state transition changes improve clarity and type consistency.
Also applies to: 91-92, 218-219, 222-223
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (5)
src/log_surgeon/finite_automata/RegisterOperation.hpp (1)
21-34: Consider adding input validation.While the implementation is correct, consider validating
reg_idandcopy_reg_idto ensure they are within valid ranges.RegisterOperation(reg_id_t const reg_id, RegisterOperationType const type) : m_reg_id{reg_id}, - m_type{type} {} + m_type{type} { + validate_reg_id(reg_id); +} RegisterOperation(reg_id_t const reg_id, reg_id_t const copy_reg_id) : m_reg_id{reg_id}, m_type{RegisterOperationType::Copy}, - m_copy_reg_id{copy_reg_id} {} + m_copy_reg_id{copy_reg_id} { + validate_reg_id(reg_id); + validate_reg_id(copy_reg_id); +} +private: + static void validate_reg_id(reg_id_t const id) { + if (id > MAX_REG_ID) { + throw std::invalid_argument("Register ID out of valid range"); + } + }src/log_surgeon/finite_automata/DfaTransition.hpp (2)
21-29: Add nullptr validation for dest_state.Consider validating that
dest_stateis not nullptr in the constructor to fail fast.DfaTransition(std::vector<RegisterOperation> reg_ops, DfaState<state_type>* dest_state) : m_reg_ops{std::move(reg_ops)}, - m_dest_state{dest_state} {} + m_dest_state{dest_state} { + if (dest_state == nullptr) { + throw std::invalid_argument("dest_state cannot be nullptr"); + } +}
40-67: Optimize vector allocation in serialize().Consider reserving space for transformed_ops to avoid reallocations.
std::vector<std::string> transformed_ops; + transformed_ops.reserve(m_reg_ops.size()); for (auto const& reg_op : m_reg_ops) {tests/test-dfa.cpp (2)
28-40: Consider adding more test cases.While this test case is comprehensive, consider adding tests for:
- Invalid input handling
- Edge cases (empty regex, complex patterns)
- Error conditions
53-69: Consider extracting comparison logic into a helper function.The line-by-line comparison logic could be made more readable by extracting it into a helper function.
+static void compare_serialized_output(string const& actual, string const& expected) { + stringstream ss_actual{actual}; + stringstream ss_expected{expected}; + string actual_line; + string expected_line; + + CAPTURE(actual); + CAPTURE(expected); + while (getline(ss_actual, actual_line) && getline(ss_expected, expected_line)) { + REQUIRE(actual_line == expected_line); + } + getline(ss_actual, actual_line); + REQUIRE(actual_line.empty()); + getline(ss_expected, expected_line); + REQUIRE(expected_line.empty()); +} + - // Compare expected and actual line-by-line - auto const actual_serialized_dfa{dfa.serialize()}; - stringstream ss_actual{actual_serialized_dfa}; - stringstream ss_expected{expected_serialized_dfa}; - string actual_line; - string expected_line; - - CAPTURE(actual_serialized_dfa); - CAPTURE(expected_serialized_dfa); - while (getline(ss_actual, actual_line) && getline(ss_expected, expected_line)) { - REQUIRE(actual_line == expected_line); - } - getline(ss_actual, actual_line); - REQUIRE(actual_line.empty()); - getline(ss_expected, expected_line); - REQUIRE(expected_line.empty()); + compare_serialized_output(dfa.serialize(), expected_serialized_dfa);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/log_surgeon/finite_automata/DfaTransition.hpp(1 hunks)src/log_surgeon/finite_automata/RegisterOperation.hpp(1 hunks)tests/test-dfa.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
src/log_surgeon/finite_automata/DfaTransition.hppsrc/log_surgeon/finite_automata/RegisterOperation.hpptests/test-dfa.cpp
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
- GitHub Check: build (macos-latest, Release)
- GitHub Check: build (macos-latest, Debug)
🔇 Additional comments (4)
src/log_surgeon/finite_automata/RegisterOperation.hpp (3)
13-20: LGTM! Well-structured enum class and class declaration.The enum class is properly scoped with an appropriate underlying type, and the class interface is clean with good encapsulation.
39-53: Improve error handling in serialize().The default case should be replaced with proper error handling as suggested in the previous review.
55-59: LGTM! Well-structured private members.Good use of encapsulation and std::optional for the copy register ID.
src/log_surgeon/finite_automata/DfaTransition.hpp (1)
30-33: LGTM! Well-designed accessor methods.Good use of [[nodiscard]] attribute and proper return type specifications.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/test-dfa.cpp (3)
28-40: Consider improving test setup readability.The test setup could be more readable by extracting the schema creation into a helper function and adding comments explaining the expected DFA structure.
+// Helper function to create schema with a variable +Schema create_schema_with_var(const string& var_name, const string& pattern) { + Schema schema; + schema.add_variable(var_name + ":" + pattern, -1); + return schema; +} TEST_CASE("Test Simple Untagged DFA", "[DFA]") { - Schema schema; - string const var_name{"capture"}; - string const var_schema{var_name + ":" + "userID=123"}; - schema.add_variable(var_schema, -1); + // Create DFA for pattern "userID=123" + // Expected structure: u->s->e->r->I->D->=->[1-3] + auto schema = create_schema_with_var("capture", "userID=123");
55-71: Extract common verification code.The line-by-line comparison logic is duplicated in both test cases. Consider extracting it into a helper function.
+// Helper function to verify DFA serialization +void verify_dfa_serialization(const string& actual, const string& expected) { + stringstream ss_actual{actual}; + stringstream ss_expected{expected}; + string actual_line; + string expected_line; + + CAPTURE(actual); + CAPTURE(expected); + while (getline(ss_actual, actual_line) && getline(ss_expected, expected_line)) { + REQUIRE(actual_line == expected_line); + } + getline(ss_actual, actual_line); + REQUIRE(actual_line.empty()); + getline(ss_expected, expected_line); + REQUIRE(expected_line.empty()); +} - // Compare expected and actual line-by-line - auto const actual_serialized_dfa{dfa.serialize()}; - stringstream ss_actual{actual_serialized_dfa}; - stringstream ss_expected{expected_serialized_dfa}; - string actual_line; - string expected_line; - - CAPTURE(actual_serialized_dfa); - CAPTURE(expected_serialized_dfa); - while (getline(ss_actual, actual_line) && getline(ss_expected, expected_line)) { - REQUIRE(actual_line == expected_line); - } - getline(ss_actual, actual_line); - REQUIRE(actual_line.empty()); - getline(ss_expected, expected_line); - REQUIRE(expected_line.empty()); + verify_dfa_serialization(dfa.serialize(), expected_serialized_dfa);
73-113: Add test cases for edge cases and error conditions.The current test cases cover basic and complex patterns but could be expanded to include:
- Empty patterns
- Invalid patterns
- Patterns with escaped special characters
- Patterns with boundary conditions (^, $)
- Patterns with character classes ([[:digit:]], [[:space:]], etc.)
Would you like me to generate additional test cases for these scenarios?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/test-dfa.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
tests/test-dfa.cpp
🔇 Additional comments (1)
tests/test-dfa.cpp (1)
1-27: LGTM! Well-organized headers and type declarations.The headers are properly organized, and the type aliases improve code readability.
eea8ad5 to
c42a214
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (8)
src/log_surgeon/Lexer.hpp (1)
221-221: Ensure naming consistency.
Since the previous member wasm_tag_to_reg_id, confirm that this new namem_tag_to_final_reg_idis consistently used and explained, avoiding confusion about legacy references.tests/test-dfa.cpp (2)
28-71: Commendable thoroughness in verifying DFA serialization line-by-line.
These checks ensure that even small changes in state ordering or transitions are caught. However, consider adding negative tests (e.g., malformed inputs) to boost confidence in error-handling scenarios.
73-114: Add variety to the tested transitions.
This second test case focuses primarily on branches and simple character classes. Consider adding more complex transitions (like repeated patterns) or verifying that backslashes and special characters are handled as expected by the DFA.src/log_surgeon/finite_automata/Dfa.hpp (3)
11-12: Check for minimalfmtincludes.
If onlyfmt::formatis used, consider including just<fmt/format.h>. Redundant headers can marginally slow down builds or clutter the codebase.
50-55: Ensure BFS logic accounts for multi-byte transitions if necessary.
Currently, the BFS logic loops over single-byte transitions. If future expansions include multi-byte characters, confirm that this function can adapt or be extended seamlessly.
103-103: Validate the usage of empty set in byte transitions.
Here, transitions use{{}, dest_state}instead of a more descriptive structure for register ops. If registers are not applicable, consider a clearer naming or comment to avoid confusion later.src/log_surgeon/finite_automata/DfaTransition.hpp (2)
30-32: Public getters appear consistent.Returning copies of
m_reg_opsis acceptable for an immutable pattern; however, consider returning a reference for performance if data sets are large.
49-67: Pointer ownership caution.
m_dest_stateis a raw pointer, which can lead to dangling references if the pointedDfaStateis destructed first. Consider using references or a smart pointer if feasible.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
CMakeLists.txt(1 hunks)src/log_surgeon/Lexer.hpp(2 hunks)src/log_surgeon/finite_automata/Dfa.hpp(5 hunks)src/log_surgeon/finite_automata/DfaState.hpp(4 hunks)src/log_surgeon/finite_automata/DfaTransition.hpp(1 hunks)src/log_surgeon/finite_automata/RegisterHandler.hpp(2 hunks)src/log_surgeon/finite_automata/RegisterOperation.hpp(1 hunks)tests/CMakeLists.txt(2 hunks)tests/test-dfa.cpp(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- CMakeLists.txt
- tests/CMakeLists.txt
- src/log_surgeon/finite_automata/RegisterHandler.hpp
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
tests/test-dfa.cppsrc/log_surgeon/Lexer.hppsrc/log_surgeon/finite_automata/RegisterOperation.hppsrc/log_surgeon/finite_automata/DfaState.hppsrc/log_surgeon/finite_automata/DfaTransition.hppsrc/log_surgeon/finite_automata/Dfa.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
- GitHub Check: build (macos-latest, Release)
🔇 Additional comments (13)
src/log_surgeon/Lexer.hpp (1)
161-163: Consider verifying performance trade-off of usingstd::mapinstead ofstd::unordered_map.
This switch might be intentional for consistent iteration order, yet for frequent look-ups,std::unordered_mapcan be quicker depending on usage patterns.src/log_surgeon/finite_automata/Dfa.hpp (2)
24-27: Documentation is clear and concise.
The overview for theserialize()method is well-explained, increasing code clarity.
144-190: Excellent addition of a BFS-basedserializemethod.
This BFS traversal ensures a stable ordering of states. For large DFAs, confirm that the queue-based approach remains efficient.src/log_surgeon/finite_automata/RegisterOperation.hpp (3)
1-17: Looks good overall.The enumeration and includes appear correct. The usage of
false == expressioncomplies with the stated coding guidelines.
21-29: Constructors are clear and expressive.They properly initialise
m_typeand optionally setm_copy_reg_id. No concerns here.
39-53: Consider throwing an exception for unknown operation types.Returning
"?"in the default case may mask developer errors and complicate debugging. Throwing or using an exhaustive switch helps avoid unexpected states.src/log_surgeon/finite_automata/DfaState.hpp (5)
34-38: Clear default initialisation.Initialising each transition with an empty set of register operations is a sound approach.
52-56: Transitions stored in an array are well-organised.Assigning the transition object at
[byte]clarifies how transitions are managed for each input character.
57-59: Straightforward acceptance operation tracking.Storing the
RegisterOperationby value is acceptable here. Ensure consumers handle required ownership semantics.
101-141: Serialization is robust and flexible.It neatly handles both accepting tags and transitions. This design fosters easy debugging and logging of the DFA’s structure.
85-87:Details
❓ Verification inconclusive
Verify out-of-bounds risk in
get_dest_state.If
characteris greater than 255 forStateType::Byte, this could cause undefined behaviour. Please confirm thatcharacteris never outside[0, 255].
🏁 Script executed:
#!/bin/bash # Searching for call sites of get_dest_state to ensure no usage beyond [0..255] for Byte states rg -A 5 'get_dest_state\s*\('Length of output: 9851
Action Required: Confirm and Safeguard the Byte Range Invariant
The grep output indicates that call sites—such as those in
Lexer.tpp—appear to pass values that are typically within the [0, 255] range. However, in theif constexpr (StateType::Byte == state_type)branch ofDfaState::get_dest_state, there is no explicit range check forcharacter. This means that if an out-of-bound value is ever passed, it could lead to undefined behaviour due to an out-of-bound array access.Please verify that all call sites guarantee that the
characterargument is within the [0, 255] range. If this invariant might be violated in any scenario, consider one or more of the following actions:
- Enforce the invariant by adding an assertion (e.g.,
assert(character < cSizeOfByte)) right before the array access.- Document the precondition in the function’s comments to clearly state that for
StateType::Byte, the character must be between 0 and 255.- Review usage in all call sites for potential risks if non-byte values are passed into this function.
src/log_surgeon/finite_automata/DfaTransition.hpp (2)
24-28: Constructors are concise.Default construction and move-initialising
m_reg_opsare straightforward; no immediate issues arise.
40-42: Serializable transitions.Allowing an optional result is helpful for error or missing-state scenarios, preventing partial data from creeping in.
Co-authored-by: Lin Zhihao <[email protected]>
Co-authored-by: Lin Zhihao <[email protected]>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/log_surgeon/finite_automata/Dfa.hpp (1)
153-158: Lambda function could be simplified in the context of its usage.The lambda function is only used once in the loop below. Consider inlining the logic where it's used or simplifying if any new change as per the previous review is accepted.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/log_surgeon/finite_automata/Dfa.hpp(5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
src/log_surgeon/finite_automata/Dfa.hpp
🪛 GitHub Actions: Build
src/log_surgeon/finite_automata/Dfa.hpp
[error] 181-181: use of undeclared identifier 'traversal_oerder'; did you mean 'traversal_order'?
[error] 178-178: 'traversal_order' declared here
🔇 Additional comments (7)
src/log_surgeon/finite_automata/Dfa.hpp (7)
11-13: Good addition of the fmt library for formatted output.The inclusion of
fmtlibrary headers is appropriate for the new serialization functionality.
24-27: Well-documented and properly declared new serialize method.The method is correctly declared with
[[nodiscard]]attribute and trailing return type syntax, which is consistent with modern C++ style.
50-54: Good implementation of BFS traversal helper method.The method is properly documented and uses the same style conventions as other methods in the class.
103-103: Correct adaptation to the new DfaTransition interface.The call correctly passes a pair with empty register operations and the destination state.
166-166: Maintain consistency in TODO comments.// TODO: Handle the utf8 case
168-170: Simplify checking and adding states to BFS traversal.This function doesn't need to reuse
add_to_queue_and_visited, so we can simplify the code:- if (nullptr != dest_state) { - add_to_queue_and_visited(dest_state); - } + if (nullptr == dest_state || visited_states.contains(dest_state)) { + continue; + } + visited_states.emplace(dest_state); + state_queue.push(dest_state);
190-190: Well-formatted serialization output using fmt library.The use of fmt::format and fmt::join is a clean way to format the serialized output with newlines.
… all values are valid and the loop doesn't end.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/log_surgeon/finite_automata/Dfa.hpp (1)
79-79: Consider using a data structure optimised for 256 possible keys.
Asascii_transitions_mapmight only handle up to 256 entries, consider using an array or anstd::unordered_map<uint8_t, StateSet>with a reserve of 256 for potential performance benefits.src/log_surgeon/finite_automata/DfaState.hpp (1)
49-51: Method signature fits the new transition design.
Optionally, consider passingdfa_transitionby const reference if it is large, but for small objects this is acceptable.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/log_surgeon/finite_automata/Dfa.hpp(6 hunks)src/log_surgeon/finite_automata/DfaState.hpp(4 hunks)src/log_surgeon/finite_automata/RegisterOperation.hpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
src/log_surgeon/finite_automata/Dfa.hppsrc/log_surgeon/finite_automata/RegisterOperation.hppsrc/log_surgeon/finite_automata/DfaState.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
- GitHub Check: build (macos-latest, Release)
🔇 Additional comments (18)
src/log_surgeon/finite_automata/Dfa.hpp (8)
11-12: All necessary includes appear correctly added.
No issues noted in these newly introduced headers.Also applies to: 14-14
24-28: Good documentation for the new serialize method.
The docstring clearly indicates the return value and usage.
50-55: BFS documentation is concise and clear.
No issues identified in this comment block or signature.
82-82: Loop with uint16_t adequately covers cSizeOfByte.
No issues found in this loop iteration approach.
101-103: Transition creation logic is correct.
The use of{{}, dest_state}for register operations is coherent with the updated transition structure.
143-145: BFS helper function is well-named and straightforward.
No issues noted in these lines.
147-174: BFS state visitation logic is properly implemented.
Reserve calls and the insertion checks look correct, and the lambda usage is neat.
176-191: serialize() method implementation appears correct and efficient.
The BFS traversal approach, state ID mapping, and final joined output are neatly executed.src/log_surgeon/finite_automata/RegisterOperation.hpp (1)
1-85: Overall structure and implementation look solid.
- The factory methods (
create_set_operation,create_negate_operation,create_copy_operation) help prevent misuse of constructors.- Returning
std::nulloptfor invalid states inserialize()is a sensible approach.- Use of
false == m_copy_reg_id.has_value()aligns with the coding guideline.No further issues noted.
src/log_surgeon/finite_automata/DfaState.hpp (9)
4-4: Headers imported look appropriate.
These additions seem necessary for the new functionality (array, fmt, etc.).Also applies to: 8-8, 11-11, 14-15, 18-19
35-35: Constructor usage of fill(...) is a clean way to initialise transitions.
No concerns noted.
53-53: Accepting operations integration is straightforward.
Storing them in a vector is clear and maintainable.
57-62: serialize(...) docstring is clearly explaining the parameters and result.
Implementation details are consistent with the design.
66-66: Minor doc snippet for character transitions.
No issues to report.
72-72: Addition of m_accepting_ops brings register-based logic into DfaState.
No further concerns here.
80-80: Definition of get_dest_state is well-structured.
The if-constexpr handles Byte vs. Utf8 logic effectively.
82-82: Transition returns from m_bytes_transition and tree look correct.
No issues found in these return statements.Also applies to: 85-85, 91-91
97-138: serialize(...) method thoroughly documents the state's structure.
- Accepting tags and operations are neatly enumerated and appended.
- Byte transitions are appended in an easily readable format.
Overall, no issues noted.
LinZhihao-723
left a comment
There was a problem hiding this comment.
Reviewed DfaTransition.hpp.
Co-authored-by: Lin Zhihao <[email protected]>
Co-authored-by: Lin Zhihao <[email protected]>
Co-authored-by: Lin Zhihao <[email protected]>
LinZhihao-723
left a comment
There was a problem hiding this comment.
The unit tests lgtm. Two more comments before we can merge.
Co-authored-by: Lin Zhihao <[email protected]>
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/log_surgeon/finite_automata/DfaTransition.hpp (1)
45-50: Fix minor docstring typo.
In line 49, replace “statd_ids” with “state_ids” to match the method parameter name.- * @return std::nullopt if `m_dest_state` is not in `statd_ids`. + * @return std::nullopt if `m_dest_state` is not in `state_ids`.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/log_surgeon/finite_automata/DfaTransition.hpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
src/log_surgeon/finite_automata/DfaTransition.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
- GitHub Check: build (macos-latest, Release)
🔇 Additional comments (6)
src/log_surgeon/finite_automata/DfaTransition.hpp (6)
1-3: Good use of include guards.
No issues found here. This prevents multiple inclusions and helps maintain consistent declarations.
21-29: Well-structured documentation and class definition.
Thank you for providing clear bullet points describing each aspect of the transition. This directly addresses the previous request for more descriptive docstrings.
31-35: Constructors are concise and efficient.
The default constructor and the parameterized constructor both look good. Usingstd::move(reg_ops)is effective for avoiding unnecessary copies.
37-39: Returning a const reference for register operations is aligned with the previous suggestion.
This choice helps prevent unnecessary copies and is consistent with earlier feedback on returning a const reference.
41-43: Getter for destination state looks fine.
Accessing a const pointer to the destination state is straightforward and efficient.
60-78: Serialization method follows coding guidelines.
The checks use “false == <expression>” consistently, adhering to the stated style convention. The optional-based approach for handling serialization failures is clear and well structured.
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (5)
tests/test-dfa.cpp (1)
28-117: Consider adding tests for register operations.While these tests thoroughly validate DFA construction and serialization, there are no explicit tests for the register operations mentioned in the PR description. Consider adding tests that verify the correct implementation of register operations to ensure complete test coverage.
src/log_surgeon/finite_automata/Dfa.hpp (2)
84-84: Consider using uint8_t for byte-related loop variable.The loop iterates over byte values (0-255) but uses uint16_t, which can represent values up to 65535. While this works correctly, using uint8_t would better match the intended range.
- for (uint16_t i{0}; i < cSizeOfByte; ++i) { + for (uint8_t i{0}; i < cSizeOfByte; ++i) {
168-168: Consider using uint8_t for byte-related loop variable.Since the loop is iterating from 0 to 255 (cSizeOfByte), using uint8_t would be more appropriate than uint32_t.
- for (uint32_t idx{0}; idx < cSizeOfByte; ++idx) { + for (uint8_t idx{0}; idx < cSizeOfByte; ++idx) {src/log_surgeon/finite_automata/DfaState.hpp (2)
54-56: Consider taking RegisterOperation by value for more efficient movesThe method takes a RegisterOperation by const reference but then copies it into the vector. It would be more efficient to take the parameter by value and use std::move to avoid an unnecessary copy.
- auto add_accepting_op(RegisterOperation const reg_op) -> void { - m_accepting_ops.push_back(reg_op); + auto add_accepting_op(RegisterOperation reg_op) -> void { + m_accepting_ops.push_back(std::move(reg_op));
102-108: Inconsistent error handling for accepting operationsThe serialization of accepting operations (lines 102-108) silently continues when serialization fails, unlike the byte transitions handling (lines 119-129) which returns std::nullopt on failure. This inconsistency might be intentional but should be documented or adjusted for consistency.
auto const serialized_accepting_op{accepting_op.serialize()}; - if (serialized_accepting_op.has_value()) { + if (false == serialized_accepting_op.has_value()) { + return std::nullopt; + } - accepting_op_strings.push_back(serialized_accepting_op.value()); - } + accepting_op_strings.push_back(serialized_accepting_op.value());
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
src/log_surgeon/Lexer.tpp(8 hunks)src/log_surgeon/finite_automata/Dfa.hpp(6 hunks)src/log_surgeon/finite_automata/DfaState.hpp(3 hunks)src/log_surgeon/finite_automata/DfaStatePair.hpp(1 hunks)src/log_surgeon/finite_automata/DfaTransition.hpp(1 hunks)tests/test-dfa.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
src/log_surgeon/finite_automata/DfaStatePair.hppsrc/log_surgeon/finite_automata/Dfa.hpptests/test-dfa.cppsrc/log_surgeon/finite_automata/DfaState.hppsrc/log_surgeon/Lexer.tppsrc/log_surgeon/finite_automata/DfaTransition.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
- GitHub Check: build (macos-latest, Debug)
🔇 Additional comments (29)
src/log_surgeon/finite_automata/DfaStatePair.hpp (1)
72-78: The transition handling is correctly implemented.The code properly retrieves transitions using
get_transition()and checks if they have values withhas_value()before accessing destination states throughget_dest_state(). This approach prevents potential null pointer dereferences.tests/test-dfa.cpp (2)
28-73: Well-structured test case for simple DFA.The test case properly sets up a schema, constructs a DFA from lexical rules, and validates the serialized output. The line-by-line comparison approach is thorough and the checks for empty lines ensure exact matching.
75-117: Complex DFA test case properly validates serialization.This test validates a more complex regex pattern involving alternation, character classes, and quantifiers. The serialized output verification is comprehensive.
src/log_surgeon/finite_automata/Dfa.hpp (5)
26-29: Clear documentation for serialize method.The method documentation clearly explains the purpose and return value behavior.
56-56: BFS traversal method is properly documented.The documentation clearly describes the purpose and return value of the method.
103-106: Transition addition is correctly implemented.The code properly creates and adds transitions with empty register operations, adapting to the new DFA transition architecture.
146-177: BFS traversal implementation is robust.The implementation correctly uses a queue for BFS traversal, tracks visited states to avoid cycles, and maintains the order in which states are visited. The lambda function to add states to the queue and visited set is well-structured and reused effectively.
179-198: Serialization implementation is comprehensive.The serialization correctly:
- Gets states in BFS traversal order
- Creates a mapping of states to IDs
- Serializes each state with proper error handling
- Joins the serialized states with newlines
The error handling with std::nullopt is particularly important for robustness.
src/log_surgeon/Lexer.tpp (5)
87-87: Transition handling is correctly updated.The code properly retrieves transitions using
get_transition()instead of directly accessing destination states.
100-100: Proper null check with has_value().The code correctly checks if the transition exists before trying to access its destination state.
169-169: Destination state access is correct.The code properly accesses the destination state from the transition after checking if it exists.
282-289: Well-implemented transition handling for wildcard logic.This section correctly handles transitions by:
- Getting the optional transition
- Checking if it has a value using
has_value()- Accessing the destination state only if it exists
This is a good example of proper transition handling.
419-419: Correct usage of has_value() check.The code correctly uses
has_value()to check if transitions exist when populating them_is_first_chararray.src/log_surgeon/finite_automata/DfaTransition.hpp (9)
17-56: Well-structured DfaTransition implementation with const correctnessThe DfaTransition class is well-designed with appropriate const correctness in the accessor methods and member variables. The class properly encapsulates the destination state and register operations that define a DFA transition.
31-33: Good use of std::move for efficient parameter transferUsing std::move with the reg_ops parameter efficiently transfers ownership of the vector without unnecessary copying, which is a good practice for non-trivial types.
35-37: LGTM: Properly returning const referenceReturning a const reference to the vector prevents unnecessary copying and ensures the caller cannot modify the internal state.
43-51: Well-documented serialize method with proper error handlingThe documentation for the serialize method clearly explains the return values in different scenarios, providing good guidance for users of this class.
54-55: LGTM: Using const pointer for destination stateUsing a const pointer for m_dest_state ensures the destination state cannot be modified through this transition, maintaining the immutability of the DFA structure once built.
58-76: Robust implementation of serialize with proper error handlingThe serialize method properly handles error cases by returning std::nullopt when the destination state isn't in the provided map or when any register operation fails to serialize. The use of fmt::format for string formatting is clean and readable.
62-64: Good adherence to coding guidelines with condition orderingThe condition
false == state_ids.contains(m_dest_state)follows the coding guideline that prefersfalse == <expression>rather than!<expression>.
69-71: Consistent error handling patternThe pattern of checking for nullopt and early return is consistent with the error handling approach used throughout the class.
72-73: LGTM: Efficient use of emplace_backUsing emplace_back is more efficient than push_back for non-trivial types like std::string.
src/log_surgeon/finite_automata/DfaState.hpp (7)
36-36: LGTM: Default constructor meets requirementsThe default constructor is appropriate here since std::array<std::optional<DfaTransition<state_type>>, cSizeOfByte> will initialize all elements with std::nullopt.
50-52: Good transition from raw pointers to DfaTransition objectsThe method now accepts a DfaTransition object rather than a raw pointer, which encapsulates both the destination state and register operations. This improves the design by making the relationship between transitions and their operations explicit.
69-72: Improved state transition access with better error handlingThe
get_transitionmethod now returns an optional DfaTransition, which provides more information and better error handling than just returning the destination state pointer.
75-76: LGTM: Good use of std::array and std::optionalUsing std::array with std::optional for transitions is cleaner and safer than the previous approach using raw pointers.
117-129: Thorough serialization of transitions with proper error handlingThe transition serialization code is well-implemented with proper error checking. It correctly skips transitions that don't exist and returns std::nullopt if any transition serialization fails.
131-138: Clean string formatting using fmt libraryThe format string is well-structured and correctly incorporates all the components of the DFA state, including accepting tags, operations, and transitions.
83-91:Details
❓ Verification inconclusive
Incomplete UTF-8 support in get_transition method
The UTF-8 case for
get_transitionsimply returns std::nullopt without using m_tree_transitions. This might be intentional if UTF-8 support is still in progress, but should be documented or addressed.
🏁 Script executed:
#!/bin/bash # Description: Check if UTF-8 support is implemented elsewhere or if this is a known limitation # Look for usages of m_tree_transitions to see if it's used elsewhere rg "m_tree_transitions" --type cpp # Look for comments or TODOs related to UTF-8 support rg "UTF-8|utf8|unicode" --type cpp -C 3Length of output: 21188
Action Required: Clarify UTF-8 Handling in
get_transitionMethodThe UTF-8 branch in the
get_transitionmethod (in filesrc/log_surgeon/finite_automata/DfaState.hpp, lines 83–91) currently returnsstd::nulloptrather than utilisingm_tree_transitions. Our search indicates that whilem_tree_transitionsis actively used elsewhere (e.g. inNfaState.hpp), the corresponding UTF-8 implementation for DFA transitions is still missing and is noted as a TODO in other parts of the code (seeDfa.hpp).
- Location:
src/log_surgeon/finite_automata/DfaState.hpp(lines 83–91)- Observation: The UTF-8 branch (for
StateType::Utf8) directly returnsstd::nulloptwithout any further processing.- Recommendation:
- If UTF-8 support is intended to be implemented later, include a descriptive comment or TODO note in this branch to indicate that the current behaviour is an acknowledged limitation.
- Otherwise, consider implementing the proper handling by utilising
m_tree_transitionsas done for similar cases elsewhere in the codebase.
| // TODO: Handle the utf8 case | ||
| for (uint32_t idx{0}; idx < cSizeOfByte; ++idx) { |
There was a problem hiding this comment.
💡 Verification agent
🧩 Analysis chain
TODO comment should be addressed.
The TODO comment about handling the UTF-8 case should be addressed in this implementation since the PR focuses on DFA transitions.
🏁 Script executed:
#!/bin/bash
# Check if there are other TODO comments related to UTF-8 in DFA-related files
rg "TODO.*utf8" src/log_surgeon/finite_automata/Length of output: 378
Action Required: Address UTF-8 Handling TODOs
The DFA transition implementation still contains several TODO comments related to UTF-8 support. In particular, note these instances:
- In
src/log_surgeon/finite_automata/Dfa.hpp, the lines with:// TODO: Handle the utf8 case// TODO: add this for the utf8 case(appearing twice)
- In
src/log_surgeon/finite_automata/Nfa.hpp:// TODO: handle the utf8 case
Given that this PR focuses on DFA transitions, please either implement the actual UTF-8 handling logic or provide a clear plan (e.g. a follow-up ticket) to address these comments consistently. This will help ensure that incomplete logic does not propagate in the codebase.
| void Lexer<TypedNfaState, TypedDfaState>::prepend_start_of_file_char(ParserInputBuffer& input_buffer | ||
| ) { | ||
| m_prev_state = m_dfa->get_root()->get_dest_state(utf8::cCharStartOfFile); | ||
| m_prev_state = m_dfa->get_root()->get_transition(utf8::cCharStartOfFile)->get_dest_state(); |
There was a problem hiding this comment.
Missing transition validity check.
The code directly accesses the destination state from a transition without first checking if the transition exists. This could lead to undefined behavior if there's no transition for the start of file character.
- m_prev_state = m_dfa->get_root()->get_transition(utf8::cCharStartOfFile)->get_dest_state();
+ auto const& transition{m_dfa->get_root()->get_transition(utf8::cCharStartOfFile)};
+ if (false == transition.has_value()) {
+ // Handle the case where there's no transition for start of file
+ m_prev_state = m_dfa->get_root();
+ } else {
+ m_prev_state = transition->get_dest_state();
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| m_prev_state = m_dfa->get_root()->get_transition(utf8::cCharStartOfFile)->get_dest_state(); | |
| auto const& transition{m_dfa->get_root()->get_transition(utf8::cCharStartOfFile)}; | |
| if (false == transition.has_value()) { | |
| // Handle the case where there's no transition for start of file | |
| m_prev_state = m_dfa->get_root(); | |
| } else { | |
| m_prev_state = transition->get_dest_state(); | |
| } |
| auto const* dest_state{m_dfa->get_root()->get_transition(next_char)->get_dest_state( | ||
| )}; |
There was a problem hiding this comment.
Missing transition validity check.
The code directly accesses the destination state from a transition without first checking if the transition exists using has_value(). This could lead to undefined behavior if there's no transition for the newline character.
- auto const* dest_state{m_dfa->get_root()->get_transition(next_char)->get_dest_state(
- )};
+ auto const& root_transition{m_dfa->get_root()->get_transition(next_char)};
+ if (false == root_transition.has_value()) {
+ // Handle the case where there's no transition for newline
+ continue;
+ }
+ auto const* dest_state{root_transition->get_dest_state()};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| auto const* dest_state{m_dfa->get_root()->get_transition(next_char)->get_dest_state( | |
| )}; | |
| auto const& root_transition{m_dfa->get_root()->get_transition(next_char)}; | |
| if (false == root_transition.has_value()) { | |
| // Handle the case where there's no transition for newline | |
| continue; | |
| } | |
| auto const* dest_state{root_transition->get_dest_state()}; |
| if (wildcard == '?') { | ||
| for (uint32_t byte = 0; byte < cSizeOfByte; byte++) { | ||
| auto* dest_state{state->get_dest_state(byte)}; | ||
| auto const* dest_state{state->get_transition(byte)->get_dest_state()}; |
There was a problem hiding this comment.
Missing transition validity check in wildcard code.
The code directly accesses the destination state from a transition without checking if it exists using has_value(). This could cause a runtime error if there's no transition for a specific byte.
- auto const* dest_state{state->get_transition(byte)->get_dest_state()};
+ auto const& transition{state->get_transition(byte)};
+ if (false == transition.has_value()) {
+ token = Token{m_last_match_pos,
+ input_buffer.storage().pos(),
+ input_buffer.storage().get_active_buffer(),
+ input_buffer.storage().size(),
+ m_last_match_line,
+ &cTokenUncaughtStringTypes};
+ return ErrorCode::Success;
+ }
+ auto const* dest_state{transition->get_dest_state()};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| auto const* dest_state{state->get_transition(byte)->get_dest_state()}; | |
| auto const& transition{state->get_transition(byte)}; | |
| if (false == transition.has_value()) { | |
| token = Token{m_last_match_pos, | |
| input_buffer.storage().pos(), | |
| input_buffer.storage().get_active_buffer(), | |
| input_buffer.storage().size(), | |
| m_last_match_line, | |
| &cTokenUncaughtStringTypes}; | |
| return ErrorCode::Success; | |
| } | |
| auto const* dest_state{transition->get_dest_state()}; |
| auto const* dest_state{m_dfa->get_root()->get_transition(next_char)->get_dest_state( | ||
| )}; |
There was a problem hiding this comment.
Missing transition validity check.
Similar to the earlier instance, the code directly accesses the destination state from a transition without checking if it exists, which could lead to undefined behavior.
- auto const* dest_state{m_dfa->get_root()->get_transition(next_char)->get_dest_state(
- )};
+ auto const& root_transition{m_dfa->get_root()->get_transition(next_char)};
+ if (false == root_transition.has_value()) {
+ // Handle the case where there's no transition for newline
+ continue;
+ }
+ auto const* dest_state{root_transition->get_dest_state()};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| auto const* dest_state{m_dfa->get_root()->get_transition(next_char)->get_dest_state( | |
| )}; | |
| auto const& root_transition{m_dfa->get_root()->get_transition(next_char)}; | |
| if (false == root_transition.has_value()) { | |
| // Handle the case where there's no transition for newline | |
| continue; | |
| } | |
| auto const* dest_state{root_transition->get_dest_state()}; |
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
There was a problem hiding this comment.
Yea the transition should always exist for those two cases.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/log_surgeon/finite_automata/DfaState.hpp (4)
36-36: Consider explicit initialization.Relying on default construction for
m_bytes_transitionis acceptable. However, if you want to ensure that each element is initially set to an empty optional, you can use an in-class initializer or explicit constructor initialization.
82-82: Address the UTF-8 TODO.Since there's a placeholder comment to handle UTF-8, confirm if you intend to implement a dedicated logic branch for
StateType::Utf8. If so, note that this method currently only handlesm_bytes_transition.
84-87: Revisit return type forget_transition.Returning an
std::optional<DfaTransition<state_type>> const&can cause confusion if the reference is used beyond this method’s scope, especially once the underlying optional is destroyed. Consider returning anstd::optional<DfaTransition<state_type>>by value for safer usage.
90-134: Ensure correct handling of non-printable characters when serializing transitions.Using
static_cast<char>(idx)is sufficient for ASCII symbols but might produce malformed or invisible characters for some transition indexes. If readability beyond standard ASCII is required, consider encoding these values (e.g., with an escape mechanism) before emitting them.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/log_surgeon/finite_automata/DfaState.hpp(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
src/log_surgeon/finite_automata/DfaState.hpp
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: build (macos-latest, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
- GitHub Check: build (ubuntu-22.04, Release)
🔇 Additional comments (2)
src/log_surgeon/finite_automata/DfaState.hpp (2)
4-4: Include directives look fine.These added headers (
<array>,<optional>,<string>,<unordered_map>,<fmt/core.h>,<fmt/format.h>,DfaTransition.hpp, andRegisterOperation.hpp) are suitable for implementing the new transition and register operations while retaining clarity and modularity.Also applies to: 8-9, 12-12, 15-16, 19-20
50-52: Validate the transition index.When assigning to
m_bytes_transition[byte], consider whether the calling code may pass values outside the expected range (0–255). If so, a boundary check would be prudent to avoid undefined behaviour.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/log_surgeon/finite_automata/DfaState.hpp (3)
66-71: Returning a const reference to an optional may risk lifetime misuses.
Though returning a reference can avoid copy overhead, it exposes the possibility that the underlying optional can be invalidated. Consider returning an optional by value for clarity and safety.
82-82: UTF-8 handling remains a TODO.
Implementation of this feature is critical for consistent string handling in multi-byte contexts.Would you like support creating a new issue or drafting an approach for UTF-8 transitions?
90-133: Potential unprintable string output in serialization.
When transitioning on non-printable characters, the serialized format may contain invisible or control characters. If readability or debuggability is paramount, consider escaping or expressing these as numeric codes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/log_surgeon/finite_automata/DfaState.hpp(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ...
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
src/log_surgeon/finite_automata/DfaState.hpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build (ubuntu-22.04, Release)
- GitHub Check: build (ubuntu-22.04, Debug)
🔇 Additional comments (7)
src/log_surgeon/finite_automata/DfaState.hpp (7)
4-20: All included headers appear valid and relevant.
No major concerns. The added headers (array, optional, string, unordered_map, fmt libraries, DfaTransition.hpp, RegisterOperation.hpp) are logically consistent with the file’s functionality and match the needs of the updated code.
36-36: Default constructor looks appropriate.
Providing a default constructor ensures that all array and optional members are automatically value-initialized, simplifying usage of the class.
50-52: Validate bounds when assigning byte transitions.
Although using auint8_tas the key inherently constrains the range to [0..255], verify that any external calls cannot supply invalid data. Otherwise, you could encounter unexpected behaviour.
54-55: Adding accepting register operations is straightforward.
The straightforward approach of pushing the operation into the vector is clear and efficient for storing heterogeneous register actions.
59-65: Documentation comments appropriately describe the serialize function’s return value.
This clarifies thatstd::nulloptmight be forwarded if any nested serialization fails, which aids readers and maintainers in understanding potential failure conditions.
75-76: New fields for accepting operations and byte transitions appear well-structured.
StoringRegisterOperationvectors and leveragingstd::array<std::optional<DfaTransition<state_type>>>ensure clarity around which transitions exist. No immediate concerns.
83-87: Byte-specialized get_transition method is clear.
Returning the optional transition from the array for a single byte index is straightforward and predictable.
LinZhihao-723
left a comment
There was a problem hiding this comment.
For the PR title, how about:
feat: Add `RegisterOperation` for TDFA into DFA transitions.
RegisterOperation for TDFA into DFA transitions.
Description
Validation performed
test-dfato test untagged DFAs.Summary by CodeRabbit
New Features
Tests