Skip to content

Conversation

@SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Dec 17, 2025

Reference

Built on top of #199 so we can directly apply the hyphen fix to new test cases from #199.

Description

Primary Change

  • This change allows for regex outside of ranges to have non escaped hyphens. E.g. "int:-{0,1}\d+" no longer needs to escape the leading hyphen.
  • This change is not breaking. The user can still escape the hyphen outside of ranges if desired. This is conventional and also makes the implementation a lot cleaner.

Cleanup

  • Removed redundant productions for special characters.
  • Removed commented out production.

Validation Performed

  • Existing unit-tests updated to use non-escaped hyphens.
  • New unit-tests for hyphens added.

Summary by CodeRabbit

Release Notes

  • Refactor

    • Simplified regex pattern handling within the schema parser by consolidating internal cancellation logic.
    • Improved code consistency and maintainability through reorganization of pattern definitions.
  • Tests

    • Updated and expanded test cases to validate regex pattern handling improvements.
    • Added validation for enhanced pattern syntax support across schema types.

✏️ Tip: You can customize this high-level summary in your review settings.

@SharafMohamed SharafMohamed requested a review from a team as a code owner December 17, 2025 23:32
@coderabbitai
Copy link

coderabbitai bot commented Dec 18, 2025

Warning

Rate limit exceeded

@SharafMohamed has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 25 minutes and 2 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 95aa1d2 and 2a61ca7.

📒 Files selected for processing (1)
  • docs/schema.md (2 hunks)

Walkthrough

Changes update regex patterns in documentation and source code to replace escaped hyphens with unescaped equivalents in timestamp and numeric patterns. The SchemaParser is refactored to consolidate backslash-based literal cancellation handling through dynamic rule application rather than explicit productions, while retaining functional behaviour.

Changes

Cohort / File(s) Summary
Documentation & Schema
docs/schema.md
Updated regex patterns: replaced escaped hyphens (\-) with unescaped hyphens (-) in header timestamp, log, and numeric rules (int/float patterns). No functional behaviour changes.
Parser Implementation
src/log_surgeon/SchemaParser.cpp
Removed explicit backslash-cancel productions for punctuation (Lparen, Rparen, Star, Plus, Dash, Dot, Lbracket, Rbracket, Hat, Lbrace, Vbar, Rbrace, Langle, Rangle, QuestionMark). Consolidated cancellation handling through dynamic rule application via loop over m_special_regex_characters. Added Dash as valid Literal through regex_literal_rule.
Test Cases
tests/test-buffer-parser.cpp, tests/test-dfa.cpp, tests/test-nfa.cpp, tests/test-schema.cpp
Updated test regex literals and schema strings to remove escaped hyphens: changed patterns like int:\\-{0,1} to int:-{0,1} and timestamp patterns accordingly. In test-schema.cpp: refactored non_unique_capture_names to move schema creation inside loop; added new test case non_escaped_hyphens to validate handling of unescaped hyphens in variable schemas and verify regex AST component types.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Parser refactoring: The consolidation of backslash cancellation logic from explicit productions to dynamic loop-driven application warrants verification that all special characters are still correctly handled and that the loop adequately covers the removed cases.
  • Test case additions: The new non_escaped_hyphens test case in test-schema.cpp combines schema validation, AST structure verification, and dynamic_cast assertions; confirm the test assertions align with the parser changes and expected AST node types.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 9.09% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and accurately summarizes the main change: allowing non-escaped hyphens outside regex ranges, with a reference to the fixed issue.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7afa3f2 and 5c17c15.

📒 Files selected for processing (19)
  • docs/schema.md (5 hunks)
  • examples/common.cpp (1 hunks)
  • examples/schema.txt (1 hunks)
  • src/log_surgeon/Constants.hpp (2 hunks)
  • src/log_surgeon/Lalr1Parser.tpp (1 hunks)
  • src/log_surgeon/Lexer.hpp (2 hunks)
  • src/log_surgeon/LogEvent.cpp (4 hunks)
  • src/log_surgeon/LogEvent.hpp (1 hunks)
  • src/log_surgeon/LogParser.cpp (7 hunks)
  • src/log_surgeon/LogParserOutputBuffer.hpp (3 hunks)
  • src/log_surgeon/Parser.tpp (1 hunks)
  • src/log_surgeon/SchemaParser.cpp (1 hunks)
  • src/log_surgeon/Token.cpp (2 hunks)
  • src/log_surgeon/Token.hpp (2 hunks)
  • tests/test-buffer-parser.cpp (24 hunks)
  • tests/test-dfa.cpp (1 hunks)
  • tests/test-nfa.cpp (1 hunks)
  • tests/test-reader-parser.cpp (2 hunks)
  • tests/test-schema.cpp (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,h,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • src/log_surgeon/LogEvent.hpp
  • src/log_surgeon/Parser.tpp
  • tests/test-schema.cpp
  • src/log_surgeon/Constants.hpp
  • src/log_surgeon/SchemaParser.cpp
  • src/log_surgeon/Token.cpp
  • src/log_surgeon/Token.hpp
  • examples/common.cpp
  • src/log_surgeon/Lalr1Parser.tpp
  • src/log_surgeon/LogEvent.cpp
  • src/log_surgeon/LogParser.cpp
  • src/log_surgeon/Lexer.hpp
  • tests/test-reader-parser.cpp
  • src/log_surgeon/LogParserOutputBuffer.hpp
  • tests/test-nfa.cpp
  • tests/test-buffer-parser.cpp
  • tests/test-dfa.cpp
🧠 Learnings (27)
📓 Common learnings
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 151
File: tests/test-buffer-parser.cpp:959-959
Timestamp: 2025-08-18T16:46:05.687Z
Learning: In the Log-Surgeon codebase, hyphens need to be escaped with backslashes in regular expressions even outside character classes, which differs from standard regex behavior where hyphens don't require escaping outside character classes.
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 199
File: src/log_surgeon/LogParser.cpp:195-195
Timestamp: 2025-12-03T16:22:52.313Z
Learning: In y-scope/log-surgeon, when reviewing code, do not suggest fixing style issues (such as C-style casts vs. static_cast) in lines that were not modified by the PR. The maintainer prefers to keep PRs focused on their intended changes and defer broader style refactoring to dedicated refactor PRs.
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 152
File: src/log_surgeon/wildcard_query_parser/Query.cpp:159-177
Timestamp: 2025-08-25T20:44:48.955Z
Learning: In the log-surgeon codebase, thread safety issues with global state like NonTerminal::m_next_children_start should be addressed comprehensively in dedicated PRs rather than fixed piecemeal in individual feature PRs. The user SharafMohamed prefers to defer such systemic architectural issues to separate PRs.
Learnt from: davidlion
Repo: y-scope/log-surgeon PR: 165
File: src/log_surgeon/LogEvent.cpp:53-55
Timestamp: 2025-10-22T15:40:29.992Z
Learning: In `src/log_surgeon/LogEvent.cpp`, the `get_logtype()` method has two independent mechanisms for adding `<timestamp>` to the logtype string: (1) A prefix added when `has_timestamp()` returns true, indicating a standalone timestamp token from timestamp rules (lines 53-55), and (2) Named capture group processing that adds `<timestamp>` tags for `(?<timestamp>...)` patterns in any variable rule (lines 61-95). These mechanisms do not interfere with each other. Capture groups named `timestamp` do not affect `has_timestamp()` status, as using captures as actual timestamps is not supported.
<!-- [add_learning]
In the log-surgeon codebase, capture groups (e.g., `(?<timestamp>...)`) in schema variable rules are added to the logtype string through the existing capture group processing logic, but they do not create actual timestamp tokens and do not cause `has_timestamp()` to return true. Using capture groups as actual timestamps is not supported; standalone timestamp tokens come from `timestamp:...
📚 Learning: 2025-10-22T15:40:29.992Z
Learnt from: davidlion
Repo: y-scope/log-surgeon PR: 165
File: src/log_surgeon/LogEvent.cpp:53-55
Timestamp: 2025-10-22T15:40:29.992Z
Learning: In `src/log_surgeon/LogEvent.cpp`, the `get_logtype()` method has two independent mechanisms for adding `<timestamp>` to the logtype string: (1) A prefix added when `has_timestamp()` returns true, indicating a standalone timestamp token from timestamp rules (lines 53-55), and (2) Named capture group processing that adds `<timestamp>` tags for `(?<timestamp>...)` patterns in any variable rule (lines 61-95). These mechanisms do not interfere with each other. Capture groups named `timestamp` do not affect `has_timestamp()` status, as using captures as actual timestamps is not supported.
<!-- [add_learning]
In the log-surgeon codebase, capture groups (e.g., `(?<timestamp>...)`) in schema variable rules are added to the logtype string through the existing capture group processing logic, but they do not create actual timestamp tokens and do not cause `has_timestamp()` to return true. Using capture groups as actual timestamps is not supported; standalone timestamp tokens come from `timestamp:...

Applied to files:

  • src/log_surgeon/LogEvent.hpp
  • examples/schema.txt
  • src/log_surgeon/Constants.hpp
  • docs/schema.md
  • examples/common.cpp
  • src/log_surgeon/LogEvent.cpp
  • src/log_surgeon/LogParser.cpp
  • tests/test-reader-parser.cpp
  • src/log_surgeon/LogParserOutputBuffer.hpp
  • tests/test-buffer-parser.cpp
📚 Learning: 2025-12-03T16:38:15.354Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 199
File: src/log_surgeon/LogParser.cpp:154-155
Timestamp: 2025-12-03T16:38:15.354Z
Learning: In `src/log_surgeon/LogParser.cpp` and related parsing code, when `get_reg_ids_from_capture(capture)` returns an optional, calling `.value()` directly without checking is intentional for captures that should always have register IDs (e.g., `"timestamp"` captures). If the optional is nullopt, it indicates an internal programming error/invariant violation, and throwing `std::bad_optional_access` is the desired behavior rather than defensive error handling with continue/return.

Applied to files:

  • src/log_surgeon/LogEvent.hpp
  • examples/common.cpp
  • src/log_surgeon/LogEvent.cpp
  • src/log_surgeon/LogParser.cpp
  • src/log_surgeon/Lexer.hpp
  • tests/test-reader-parser.cpp
  • src/log_surgeon/LogParserOutputBuffer.hpp
  • tests/test-buffer-parser.cpp
📚 Learning: 2025-12-09T21:17:27.546Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 197
File: src/log_surgeon/finite_automata/RegexAST.hpp:1011-1019
Timestamp: 2025-12-09T21:17:27.546Z
Learning: In C++ files under src/log_surgeon in the y-scope/log-surgeon repository, for const pointer parameters in null checks, Yoda conditions (nullptr == ptr) are not required; standard comparison order (ptr == nullptr) is acceptable since const prevents accidental assignment. Apply this pattern to header and related source files within src/log_surgeon.

Applied to files:

  • src/log_surgeon/LogEvent.hpp
  • src/log_surgeon/Constants.hpp
  • src/log_surgeon/Token.hpp
  • src/log_surgeon/Lexer.hpp
  • src/log_surgeon/LogParserOutputBuffer.hpp
📚 Learning: 2024-11-18T16:45:46.073Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.073Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

Applied to files:

  • src/log_surgeon/Parser.tpp
  • src/log_surgeon/Constants.hpp
  • src/log_surgeon/Token.cpp
  • src/log_surgeon/Token.hpp
  • src/log_surgeon/Lalr1Parser.tpp
  • src/log_surgeon/Lexer.hpp
📚 Learning: 2025-08-08T10:23:06.281Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 144
File: src/log_surgeon/wildcard_query_parser/QueryInterpretation.hpp:80-88
Timestamp: 2025-08-08T10:23:06.281Z
Learning: In y-scope/log-surgeon (C++), small function definitions are allowed to remain inline in headers. For src/log_surgeon/wildcard_query_parser/QueryInterpretation.hpp, do not suggest moving small methods like QueryInterpretation::append_variable_token to the .cpp for “consistency” in future reviews.

Applied to files:

  • src/log_surgeon/Parser.tpp
📚 Learning: 2025-09-03T16:45:58.451Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 152
File: src/log_surgeon/wildcard_query_parser/Query.hpp:119-132
Timestamp: 2025-09-03T16:45:58.451Z
Learning: In log-surgeon's wildcard_query_parser, the DFA intersection process used in Query::get_matching_variable_types loses track of variable type priorities from the original ByteLexer. Priority information cannot be preserved by simply changing the return type from std::set to std::vector - it would require post-processing after the DFA intersection.

Applied to files:

  • tests/test-schema.cpp
  • tests/test-dfa.cpp
📚 Learning: 2025-08-13T12:06:36.584Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 150
File: src/log_surgeon/wildcard_query_parser/WildcardExpressionView.cpp:58-77
Timestamp: 2025-08-13T12:06:36.584Z
Learning: In src/log_surgeon/wildcard_query_parser/WildcardExpressionView.cpp, caching SchemaParser::get_special_regex_characters() with a const& reference provides no performance benefit, indicating the method likely returns a reference to an existing container rather than performing expensive computations.

Applied to files:

  • tests/test-schema.cpp
  • src/log_surgeon/SchemaParser.cpp
📚 Learning: 2025-08-13T12:05:00.245Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 150
File: src/log_surgeon/wildcard_query_parser/WildcardExpressionView.cpp:58-77
Timestamp: 2025-08-13T12:05:00.245Z
Learning: In src/log_surgeon/wildcard_query_parser/WildcardExpressionView.cpp, the generate_regex_string() method should not enforce well-formedness checks via assertions. The is_well_formed() method is intended to be used by callers at their discretion, allowing flexibility to generate regex strings from malformed views if desired.

Applied to files:

  • tests/test-schema.cpp
📚 Learning: 2024-11-13T20:02:13.737Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

Applied to files:

  • tests/test-schema.cpp
  • src/log_surgeon/Constants.hpp
  • src/log_surgeon/Lalr1Parser.tpp
  • src/log_surgeon/Lexer.hpp
  • tests/test-nfa.cpp
  • tests/test-dfa.cpp
📚 Learning: 2025-08-08T10:17:43.495Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 144
File: src/log_surgeon/wildcard_query_parser/VariableQueryToken.hpp:28-31
Timestamp: 2025-08-08T10:17:43.495Z
Learning: In src/log_surgeon/wildcard_query_parser/VariableQueryToken.hpp/.cpp (C++), do not suggest defaulting operator<=>. The project prefers a custom out-of-line comparator that explicitly handles the bool member (via explicit cast) to avoid implicit conversions; keep the current manual implementation.

Applied to files:

  • tests/test-schema.cpp
📚 Learning: 2025-08-08T10:00:20.963Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 144
File: src/log_surgeon/wildcard_query_parser/VariableQueryToken.hpp:28-31
Timestamp: 2025-08-08T10:00:20.963Z
Learning: In src/log_surgeon/wildcard_query_parser/QueryInterpretation.hpp (C++), do not default the comparison operators: the class stores std::vector<std::variant<StaticQueryToken, VariableQueryToken>>, which yields only a weak ordering. A custom operator<=> that maps the variant’s weak ordering to std::strong_ordering is required; avoid suggesting defaulting there in future reviews.

Applied to files:

  • tests/test-schema.cpp
  • tests/test-reader-parser.cpp
📚 Learning: 2025-08-26T10:13:00.368Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 152
File: tests/test-query.cpp:61-67
Timestamp: 2025-08-26T10:13:00.368Z
Learning: In Catch2 unit tests, the REQUIRE macro already provides detailed debugging output when container comparisons (like std::set equality) fail, showing both expected and actual values. Additional CAPTURE statements are typically unnecessary for such comparisons.

Applied to files:

  • tests/test-schema.cpp
  • tests/test-reader-parser.cpp
  • tests/test-buffer-parser.cpp
📚 Learning: 2025-08-18T16:46:05.687Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 151
File: tests/test-buffer-parser.cpp:959-959
Timestamp: 2025-08-18T16:46:05.687Z
Learning: In the Log-Surgeon codebase, hyphens need to be escaped with backslashes in regular expressions even outside character classes, which differs from standard regex behavior where hyphens don't require escaping outside character classes.

Applied to files:

  • tests/test-schema.cpp
  • src/log_surgeon/SchemaParser.cpp
📚 Learning: 2025-08-26T10:06:22.914Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 152
File: src/log_surgeon/wildcard_query_parser/Query.hpp:9-11
Timestamp: 2025-08-26T10:06:22.914Z
Learning: In y-scope/log-surgeon project, it's acceptable to include headers like log_surgeon/Lexer.hpp directly in header files rather than using forward declarations, even when only references are used in the interface. The project prefers the simplicity of direct includes over header coupling optimization through forward declarations.

Applied to files:

  • src/log_surgeon/Constants.hpp
  • src/log_surgeon/Lalr1Parser.tpp
  • src/log_surgeon/Lexer.hpp
  • src/log_surgeon/LogParserOutputBuffer.hpp
📚 Learning: 2025-08-18T12:00:40.545Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 151
File: tests/test-buffer-parser.cpp:867-870
Timestamp: 2025-08-18T12:00:40.545Z
Learning: For schema documentation in code comments, multi-line rules should be supported and are expected since complex regex patterns may not fit comfortably on a single line within linting constraints. The "..." notation or backslash continuation can be used to indicate pattern continuation in documentation blocks.

Applied to files:

  • docs/schema.md
  • src/log_surgeon/LogParser.cpp
📚 Learning: 2025-08-15T00:13:46.717Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 150
File: tests/test-wildcard-expression.cpp:0-0
Timestamp: 2025-08-15T00:13:46.717Z
Learning: In src/log_surgeon/wildcard_query_parser/WildcardExpression.cpp, the escaping logic treats consecutive backslashes as escape-literal pairs, not as individual escape characters. For input like R"(a\*b\?c\\)", position 7 is the escape character and position 8 is the literal backslash being escaped, so only position 7 should have is_escape() return true.

Applied to files:

  • src/log_surgeon/SchemaParser.cpp
📚 Learning: 2025-08-26T09:31:28.567Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 152
File: src/log_surgeon/wildcard_query_parser/ExpressionView.hpp:46-70
Timestamp: 2025-08-26T09:31:28.567Z
Learning: In src/log_surgeon/wildcard_query_parser, escaped wildcards (\* or \?) lose their wildcard nature entirely and become literal characters, while escaped delimiters retain their delimiter property in the delimiter table regardless of being escaped. This means boundary detection logic correctly treats escaped wildcards as non-wildcards but escaped delimiters as still being delimiters.

Applied to files:

  • src/log_surgeon/SchemaParser.cpp
📚 Learning: 2025-08-15T01:04:03.681Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 150
File: src/log_surgeon/wildcard_query_parser/ExpressionView.cpp:11-11
Timestamp: 2025-08-15T01:04:03.681Z
Learning: In y-scope/log-surgeon C++ codebase, prefer explicit includes in implementation files (.cpp) for types that are directly used, even when those headers are already transitively included through the corresponding header file. This makes dependencies explicit and improves maintainability.

Applied to files:

  • src/log_surgeon/LogEvent.cpp
  • src/log_surgeon/LogParserOutputBuffer.hpp
📚 Learning: 2024-10-24T15:54:19.228Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:99-105
Timestamp: 2024-10-24T15:54:19.228Z
Learning: In `src/log_surgeon/finite_automata/RegexNFA.hpp`, it's acceptable to have constructors without the `explicit` specifier. Do not suggest adding `explicit` to constructors in this file.

Applied to files:

  • src/log_surgeon/LogParser.cpp
  • src/log_surgeon/Lexer.hpp
  • tests/test-nfa.cpp
📚 Learning: 2024-10-11T16:16:02.866Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 38
File: src/log_surgeon/finite_automata/RegexAST.hpp:663-669
Timestamp: 2024-10-11T16:16:02.866Z
Learning: In `RegexASTLiteral::serialize()`, to properly handle Unicode characters beyond the ASCII range, cast `m_character` to `char32_t` and use `U"{}{}"` in `fmt::format`.

Applied to files:

  • src/log_surgeon/LogParser.cpp
📚 Learning: 2024-10-24T15:54:35.193Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:442-456
Timestamp: 2024-10-24T15:54:35.193Z
Learning: In the C++ file `src/log_surgeon/finite_automata/RegexNFA.hpp`, for the `RegexNFA::serialize()` function, prioritize code clarity over efficiency when handling string operations.

Applied to files:

  • src/log_surgeon/LogParser.cpp
  • tests/test-nfa.cpp
📚 Learning: 2024-11-27T22:25:35.608Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.

Applied to files:

  • src/log_surgeon/Lexer.hpp
📚 Learning: 2024-11-27T21:56:13.425Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T21:56:13.425Z
Learning: In the `log_surgeon` project, header guards in C++ header files should include `_HPP` at the end to match the filename. For example, in `RegisterHandler.hpp`, the header guard should be `LOG_SURGEON_FINITE_AUTOMATA_REGISTER_HANDLER_HPP`.

Applied to files:

  • src/log_surgeon/Lexer.hpp
  • src/log_surgeon/LogParserOutputBuffer.hpp
📚 Learning: 2024-11-13T22:25:54.168Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 48
File: tests/test-tag.cpp:10-10
Timestamp: 2024-11-13T22:25:54.168Z
Learning: In the log-surgeon codebase (C++), particularly in the finite automata components involving the `Tag` class (`src/log_surgeon/finite_automata/Tag.hpp`), it's important to ensure that `Tag*` pointers in other objects cannot be `nullptr`. Test cases should focus on validating that these `Tag*` pointers are not null where they are used, and handle `nullptr` appropriately.

Applied to files:

  • tests/test-reader-parser.cpp
📚 Learning: 2025-11-06T11:16:52.917Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 184
File: tests/test-reader-parser.cpp:150-154
Timestamp: 2025-11-06T11:16:52.917Z
Learning: In Catch2 test files for the y-scope/log-surgeon repository, an explicit `if (false == optional.has_value()) { return; }` check may be required after `REQUIRE(optional.has_value())` to prevent clang-tidy errors, even though the code is unreachable at runtime. clang-tidy's static analyzer doesn't recognize that REQUIRE aborts execution and may flag unsafe optional access without the explicit check.

Applied to files:

  • tests/test-reader-parser.cpp
  • tests/test-buffer-parser.cpp
📚 Learning: 2025-08-15T12:07:58.626Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 150
File: tests/test-expression-view.cpp:7-7
Timestamp: 2025-08-15T12:07:58.626Z
Learning: In tests/test-expression-view.cpp, the `<catch2/catch_message.hpp>` header is required for clang-tidy to pass, even though it may not be directly used in the visible code.

Applied to files:

  • tests/test-reader-parser.cpp
🧬 Code graph analysis (8)
src/log_surgeon/LogEvent.hpp (2)
src/log_surgeon/LogEvent.cpp (2)
  • nodiscard (33-35)
  • nodiscard (37-48)
src/log_surgeon/LogParserOutputBuffer.hpp (8)
  • nodiscard (37-37)
  • nodiscard (43-45)
  • nodiscard (49-49)
  • nodiscard (53-55)
  • nodiscard (57-59)
  • nodiscard (63-63)
  • nodiscard (67-67)
  • nodiscard (69-69)
tests/test-schema.cpp (1)
src/log_surgeon/LogParser.hpp (1)
  • schema_ast (158-158)
src/log_surgeon/Token.cpp (1)
src/log_surgeon/Token.hpp (1)
  • start_pos (63-66)
tests/test-reader-parser.cpp (1)
src/log_surgeon/Lexer.hpp (2)
  • capture (189-190)
  • capture (218-219)
src/log_surgeon/LogParserOutputBuffer.hpp (2)
src/log_surgeon/LogEvent.cpp (2)
  • nodiscard (33-35)
  • nodiscard (37-48)
src/log_surgeon/LogEvent.hpp (4)
  • nodiscard (53-55)
  • nodiscard (60-60)
  • nodiscard (66-69)
  • nodiscard (85-85)
tests/test-nfa.cpp (1)
src/log_surgeon/Schema.hpp (1)
  • var_schema (40-40)
tests/test-buffer-parser.cpp (2)
src/log_surgeon/Lexer.hpp (2)
  • capture (189-190)
  • capture (218-219)
tests/test-reader-parser.cpp (3)
  • parse_and_validate (66-70)
  • parse_and_validate (78-173)
  • parse_and_validate (78-82)
tests/test-dfa.cpp (1)
src/log_surgeon/Schema.hpp (1)
  • var_schema (40-40)
🪛 Cppcheck (2.18.0)
src/log_surgeon/Parser.tpp

[information] 18-18: Include file

(missingIncludeSystem)


[information] 19-19: Include file

(missingIncludeSystem)


[information] 20-20: Include file

(missingIncludeSystem)


[information] 21-21: Include file

(missingIncludeSystem)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)

tests/test-schema.cpp

[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[style] 220-220: The function 'scan_with_wildcard' is never used.

(unusedFunction)

src/log_surgeon/SchemaParser.cpp

[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)

src/log_surgeon/Token.cpp

[information] 4-4: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[style] 39-39: The function 'get_sub_token' is never used.

(unusedFunction)


[style] 46-46: The function 'set' is never used.

(unusedFunction)


[style] 48-48: The function 'get_num_regs' is never used.

(unusedFunction)

examples/common.cpp

[style] 35-35: The function 'set_has_header' is never used.

(unusedFunction)


[style] 37-37: The function 'has_header' is never used.

(unusedFunction)


[style] 39-39: The function 'set_timestamp' is never used.

(unusedFunction)


[style] 38-38: The function 'append_position' is never used.

(unusedFunction)

src/log_surgeon/Lalr1Parser.tpp

[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)

src/log_surgeon/LogEvent.cpp

[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingIncludeSystem)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[performance] 35-35: Searching before insertion is not necessary. Instead of 'm_lexer.m_symbol_id[name]=m_lexer.m_symbol_id.size()' consider using 'm_lexer.m_symbol_id.try_emplace(name, m_lexer.m_symbol_id.size());'.

(stlFindInsert)


[style] 37-37: The function 'is_multi_valued' is never used.

(unusedFunction)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[style] 53-53: The function 'static_size' is never used.

(unusedFunction)


[style] 64-64: The function 'set_active_buffer' is never used.

(unusedFunction)


[style] 73-73: The function 'get_mutable_active_buffer' is never used.

(unusedFunction)


[style] 76-76: The function 'add_token_group' is never used.

(unusedFunction)


[style] 84-84: The function 'add_token_chain' is never used.

(unusedFunction)


[style] 53-53: The function 'get_variables' is never used.

(unusedFunction)


[style] 63-63: The function 'get_curr_token' is never used.

(unusedFunction)


[style] 68-68: The function 'add_delimiter' is never used.

(unusedFunction)


[style] 69-69: The function 'get_first_char' is never used.

(unusedFunction)


[style] 67-67: The function 'is_root' is never used.

(unusedFunction)


[style] 74-74: The function 'get_parent_id_unsafe' is never used.

(unusedFunction)


[style] 81-81: The function 'get_position' is never used.

(unusedFunction)


[style] 73-73: The function 'token_cast' is never used.

(unusedFunction)


[style] 81-81: The function 'non_terminal_cast' is never used.

(unusedFunction)


[style] 85-85: The function 'get_parser_ast' is never used.

(unusedFunction)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[style] 110-110: The function 'add_production' is never used.

(unusedFunction)

src/log_surgeon/LogParser.cpp

[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[style] 73-73: The function 'get_mutable_active_buffer' is never used.

(unusedFunction)


[style] 76-76: The function 'add_token_group' is never used.

(unusedFunction)


[style] 84-84: The function 'add_token_chain' is never used.

(unusedFunction)


[style] 68-68: The function 'add_delimiter' is never used.

(unusedFunction)


[style] 67-67: The function 'is_root' is never used.

(unusedFunction)


[style] 74-74: The function 'get_parent_id_unsafe' is never used.

(unusedFunction)


[style] 81-81: The function 'get_position' is never used.

(unusedFunction)


[style] 73-73: The function 'token_cast' is never used.

(unusedFunction)


[style] 81-81: The function 'non_terminal_cast' is never used.

(unusedFunction)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[style] 120-120: The function 'get_log_event_view' is never used.

(unusedFunction)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[style] 156-156: The function 'get_delim_table' is never used.

(unusedFunction)


[style] 164-164: The function 'get_dfa' is never used.

(unusedFunction)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)

tests/test-reader-parser.cpp

[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[style] 119-119: The function 'set_type_ids' is never used.

(unusedFunction)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)

tests/test-nfa.cpp

[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)

tests/test-buffer-parser.cpp

[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[style] 94-94: The function 'get_input_pos' is never used.

(unusedFunction)


[style] 97-97: The function 'release_parser_ast' is never used.

(unusedFunction)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[style] 428-428: The function 'negate' is never used.

(unusedFunction)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)


[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)

tests/test-dfa.cpp

[information] Limiting analysis of branches. Use --check-level=exhaustive to analyze all branches.

(normalCheckLevelMaxBranches)

🪛 LanguageTool
examples/schema.txt

[uncategorized] ~47-~47: Possible missing comma found.
Context: ...{2} \d{2} \d{2}:\d{2}:\d{2}) // E.g. 01-21 11:56:42.392 header:(?\d{2}...

(AI_HYDRA_LEO_MISSING_COMMA)

docs/schema.md

[uncategorized] ~91-~91: Loose punctuation mark.
Context: ...pture within a header rule. * $PREFIX$, $SUFFIX$, and $TIMESTAMP-PATTERN$ a...

(UNLIKELY_OPENING_PUNCTUATION)

🔇 Additional comments (44)
src/log_surgeon/Constants.hpp (1)

38-38: LGTM! Token vocabulary updated to header-based terminology.

The replacement of timestamp-related tokens with TokenHeader and cTokenHeader is consistent with the broader architectural shift to header-based log event detection.

Also applies to: 47-47

src/log_surgeon/Parser.tpp (1)

17-17: LGTM! Symbol mappings correctly updated.

The bidirectional mappings for TokenHeader are correctly established, consistent with the pattern used for other token types.

Also applies to: 25-25

src/log_surgeon/Lalr1Parser.tpp (1)

62-62: LGTM! Terminal symbol set updated correctly.

The addition of TokenHeader to the terminals set aligns with the vocabulary changes in Constants.hpp and Parser.tpp.

src/log_surgeon/SchemaParser.cpp (2)

621-621: LGTM! Core fix for non-escaped hyphens outside character ranges.

This production allows hyphens to be treated as literals when they appear outside character classes (outside [...]). Inside character classes, hyphens are still handled by the LiteralRange production at line 641. This change aligns log-surgeon's behavior with standard regex semantics.

Based on learnings, the previous behavior required escaping hyphens everywhere, which differed from standard regex behavior.


666-669: LGTM! Refactored to consolidate escape literal productions.

The loop-based approach dynamically generates escape literal productions for all special regex characters (including Dash), ensuring backward compatibility—users can still escape hyphens if desired.

tests/test-dfa.cpp (1)

192-192: LGTM! Test updated to demonstrate non-escaped hyphen syntax.

The removal of the backslash escape before the hyphen demonstrates the new behavior. The hyphen outside the character class no longer requires escaping, aligning with standard regex conventions. The expected DFA serialization remains unchanged, confirming the behavioral equivalence.

tests/test-nfa.cpp (1)

192-192: LGTM! Test updated consistently with DFA test.

The removal of the backslash escape demonstrates the new hyphen syntax, consistent with the corresponding change in test-dfa.cpp.

examples/schema.txt (2)

1-1: LGTM! Schema updated to header-based timestamp capture.

The migration from timestamp: rules to header:(?<timestamp>...) patterns is consistent with the architectural shift to header-based log event detection. The (?<timestamp>...) named capture provides the timestamp value for each event.

Also applies to: 3-3, 5-5, 7-7, 9-9, 12-12, 14-14, 16-16, 20-20, 22-22, 24-24, 26-26, 28-28, 30-30, 32-32, 34-34, 38-38, 40-40, 42-42, 44-44, 46-46, 48-48, 50-50


59-59: LGTM! Demonstrates non-escaped hyphen syntax.

The removal of the backslash escape before the hyphen in the integer pattern demonstrates the new behavior, consistent with the test updates.

src/log_surgeon/LogEvent.hpp (1)

72-74: LGTM! API updated to return timestamp as optional string.

The change from Token* to std::optional<std::string> const& aligns with the header-based timestamp representation. The implementation correctly delegates to m_log_output_buffer->get_timestamp() (as shown in the relevant code snippets).

Note: This is a breaking API change for existing users of get_timestamp().

src/log_surgeon/Token.hpp (1)

57-71: LGTM!

The new get_sub_token() and get_first_char() methods are well-documented and correctly implemented. The exception documentation for get_sub_token clearly states the std::out_of_range behavior.

examples/common.cpp (1)

35-41: LGTM!

The code correctly adapts to the new optional-based timestamp API. The has_value() check and subsequent value() access follow proper optional usage patterns.

src/log_surgeon/Token.cpp (1)

39-59: LGTM!

The get_sub_token() implementation correctly validates bounds:

  • Checks for negative positions (appropriate since position_t is signed).
  • Validates against buffer size with correct semantics (end_pos > m_buffer.size() allows one-past-the-end).
  • Returns a new Token with the sub-span while preserving relevant metadata.
tests/test-schema.cpp (3)

148-149: LGTM!

These changes correctly demonstrate the PR's objective: hyphens outside character ranges no longer require escaping. Based on learnings, this was previously required but is now optional.


198-199: LGTM!

Good improvement to test structure: moving Schema schema inside the loop ensures clean state per iteration, and CAPTURE(var_name) improves diagnostic output if a test fails.


220-251: Good test coverage for the hyphen feature.

This test case thoroughly validates the new non-escaped hyphen behaviour:

  1. ID-(?<id>\d{4}) — hyphen outside character range (non-escaped)
  2. \d{4}-\d{4}-\d{4} — multiple hyphens outside ranges (non-escaped)
  3. [a\-z] — hyphen inside character class (still requires escaping)

The AST type verification (RegexASTCatByte vs RegexASTGroupByte) appropriately confirms correct parsing.

tests/test-reader-parser.cpp (3)

116-120: LGTM!

The timestamp validation correctly uses the new optional-based API. The false == event.get_timestamp().has_value() pattern follows the coding guidelines.


122-125: LGTM!

The header presence check correctly uses the new has_header() method, and the false == ...has_header() pattern follows the coding guidelines.


158-158: LGTM!

The simplified destructuring reflects the updated get_reg_ids_from_capture() API that now returns std::pair<reg_id_t, reg_id_t> directly instead of an optional.

src/log_surgeon/Lexer.hpp (2)

9-9: LGTM! Correct include for the new exception usage.

The addition of <stdexcept> is appropriate for the std::runtime_error exceptions thrown in get_reg_ids_from_capture.


216-237: LGTM! Improved error handling with explicit exceptions.

The change from returning std::optional to throwing std::runtime_error with descriptive messages aligns well with the codebase's intent that missing register IDs for captures indicates an internal programming error. Based on learnings, this is the desired behaviour rather than defensive error handling. The error messages clearly identify the capture name and the specific missing element (tag ids, start reg id, or end reg id).

src/log_surgeon/LogEvent.cpp (5)

5-5: LGTM! Explicit include for the new return type.

The addition of <optional> is appropriate since get_timestamp() now returns std::optional<std::string> const&.


33-35: LGTM! Return type change aligns with the new header-based architecture.

The return type change from Token* to std::optional<std::string> const& provides a safer, value-based API that avoids raw pointer semantics.


40-67: LGTM! Consistent migration to header-based logic.

The changes correctly replace timestamp-based checks with has_header() checks for determining buffer boundaries and first-token calculations. The logic is consistent with the header-based parsing model.


72-85: LGTM! Simplified capture register ID retrieval.

The code now uses structured bindings to directly destructure the pair returned by get_reg_ids_from_capture, which is cleaner since the function now throws on error rather than returning an optional.


107-115: LGTM! Constructor updated for header-based semantics.

The LogEvent copy constructor correctly propagates the header flag and timestamp using the new setter methods.

docs/schema.md (3)

19-22: LGTM! Clear documentation of the header-based parsing model.

The introduction of headers as log event delimiters with optional timestamp capture is well explained.


76-106: LGTM! Comprehensive Headers section.

The Headers section clearly documents:

  • Multiple header support
  • Optional timestamp captures
  • Behaviour when timestamps are parsed or omitted
  • Reserved naming for timestamp captures
  • Event boundary detection logic

115-119: Hyphen examples in schema documentation are actually escaped, not unescaped.

The examples shown in lines 115-119 use \-{0,1} (escaped hyphens), not -{0,1} as claimed in the review. The int and float rules both properly escape hyphens: int:\-{0,1}\d+ and float:\-{0,1}\d+\.\d+. This aligns with Log-Surgeon's regex requirements, which mandate escaped hyphens even outside character classes. The documentation examples are correct as-is.

Likely an incorrect or invalid review comment.

src/log_surgeon/LogParserOutputBuffer.hpp (3)

4-5: LGTM! Include for optional support.

The addition of <optional> is necessary for the new m_timestamp member.


28-45: LGTM! Clean API for header and timestamp management.

The new accessors provide a clear interface for managing the header flag and timestamp value. The set_timestamp accepting std::optional<std::string_view> and storing as std::optional<std::string> allows for efficient setting while maintaining ownership.


72-73: LGTM! Private members for header-based parsing.

The replacement of m_has_timestamp with m_has_header and m_timestamp provides more flexibility in the header-based parsing model where a header may or may not contain a timestamp.

tests/test-buffer-parser.cpp (5)

92-101: LGTM! Updated timestamp and header validation.

The test validation correctly uses the new optional-based timestamp API and has_header() for event offset calculation.


136-136: LGTM! Simplified capture register retrieval.

Using structured bindings to directly destructure the pair is cleaner now that the function throws on error.


426-431: LGTM! Schema updated for header-based parsing with non-escaped hyphens.

The schema correctly uses header: with timestamp capture and non-escaped hyphens in int: and float: patterns, aligning with the PR objective to allow non-escaped hyphens outside of regex ranges.


519-519: LGTM! Non-escaped hyphen in int pattern.

The change from \-{0,1} to -{0,1} aligns with the PR objective. Based on learnings, hyphens previously required escaping even outside character classes in log-surgeon, but this PR removes that requirement.


1140-1410: LGTM! Comprehensive multiple_headers test case.

This test case thoroughly validates:

  • Multiple header patterns with different capture configurations
  • Optional timestamp captures within headers
  • Alternation patterns (hex|timestamp) with captures
  • Various combinations of captured and uncaptured timestamps
  • Correct position tracking across nine different events

The test provides good coverage for the new header-based parsing functionality.

src/log_surgeon/LogParser.cpp (7)

67-84: Rule preprocessing logic is well-structured.

The differentiation between header rules (which can start at file beginning or after newline) and other rules (which require a delimiter prefix) is correct. The use of move semantics and RAII is appropriate.


103-103: LGTM: Cleaner function signature.

Removing the redundant LogParser:: qualification improves readability.


183-188: Message boundary detection logic is correct.

The boolean condition correctly identifies the start of the next message in both header-based and non-header-based modes. The condition properly follows the coding guideline using false == output_buffer->has_header() rather than negation.


193-200: LGTM: Correct newline handling for non-header mode.

The check correctly uses false == output_buffer->has_header() following the coding guideline, and the logic appropriately handles newline tokens in non-header mode.


232-252: LGTM: Metadata generation logic is sound.

The start offset calculation, token iteration, and multiline detection logic are all correct. The code properly follows the coding guideline using false == m_log_event_view->m_log_output_buffer->has_header().


67-247: Inconsistency between PR objectives and file changes.

The PR objectives state this PR is about allowing "non-escaped hyphens outside of regex ranges," but all visible changes in this file relate to header-based parsing architecture (replacing timestamp-based parsing with header-based parsing). Since the PR description mentions it's "built on top of PR #199," these changes likely originate from the base PR rather than the hyphen-handling feature.

This inconsistency makes it difficult to distinguish which changes are intended for this PR versus inherited from the base. If the hyphen-handling changes are in other files, this file may just contain prerequisite changes from PR #199.


147-164: Use the boolean negation style consistent with the codebase guidelines.

Lines 156-157 correctly check for negative positions—position_t is int32_t, and negative values are valid per the PrefixTree documentation to indicate unmatched captures. However, align the style with the coding guidelines and LogEvent.cpp: use false == starts.empty() && false == ends.empty() instead of .empty() || .empty() checks, and prefer -1 < starts[0] over starts[0] < 0 for consistency.

Copy link
Member

@davidlion davidlion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment about the doc, but otherwise lgtm. I'll look again after we merge since I might have missed something. Title looks fine to me too.

(I copied the code rabbit comments to the other PR since they should be applied there.)

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/schema.md (1)

163-165: Clarify updated hyphen escaping rules.

The note about special characters states that hyphens must always be escaped, but this contradicts the PR's feature allowing non-escaped hyphens outside character ranges. The documentation should clarify that hyphens only require escaping when used within character ranges (e.g., [a-z]), not when appearing outside them.

🔎 Proposed clarification for hyphen escaping rules
 * Special characters include: ( ) * + - . [ \ ] ^ { | } < > ?
   - Escape these with '\' when used literally (e.g., \., \(, \\).
-  - Special characters must be escaped even in ranges.
+  - When used within character ranges (e.g., [a-z]), hyphens must be escaped or positioned at the start/end of the range.
+  - Hyphens outside character ranges do not require escaping (e.g., `-{0,1}` is valid).
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5c17c15 and 48ea212.

📒 Files selected for processing (3)
  • docs/schema.md (4 hunks)
  • src/log_surgeon/Token.cpp (2 hunks)
  • src/log_surgeon/Token.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,h,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • src/log_surgeon/Token.cpp
  • src/log_surgeon/Token.hpp
🧠 Learnings (17)
📓 Common learnings
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 151
File: tests/test-buffer-parser.cpp:959-959
Timestamp: 2025-08-18T16:46:05.687Z
Learning: In the Log-Surgeon codebase, hyphens need to be escaped with backslashes in regular expressions even outside character classes, which differs from standard regex behavior where hyphens don't require escaping outside character classes.
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 152
File: src/log_surgeon/wildcard_query_parser/Query.cpp:159-177
Timestamp: 2025-08-25T20:44:48.955Z
Learning: In the log-surgeon codebase, thread safety issues with global state like NonTerminal::m_next_children_start should be addressed comprehensively in dedicated PRs rather than fixed piecemeal in individual feature PRs. The user SharafMohamed prefers to defer such systemic architectural issues to separate PRs.
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 199
File: src/log_surgeon/LogParser.cpp:195-195
Timestamp: 2025-12-03T16:22:52.313Z
Learning: In y-scope/log-surgeon, when reviewing code, do not suggest fixing style issues (such as C-style casts vs. static_cast) in lines that were not modified by the PR. The maintainer prefers to keep PRs focused on their intended changes and defer broader style refactoring to dedicated refactor PRs.
📚 Learning: 2025-10-22T15:40:29.992Z
Learnt from: davidlion
Repo: y-scope/log-surgeon PR: 165
File: src/log_surgeon/LogEvent.cpp:53-55
Timestamp: 2025-10-22T15:40:29.992Z
Learning: In `src/log_surgeon/LogEvent.cpp`, the `get_logtype()` method has two independent mechanisms for adding `<timestamp>` to the logtype string: (1) A prefix added when `has_timestamp()` returns true, indicating a standalone timestamp token from timestamp rules (lines 53-55), and (2) Named capture group processing that adds `<timestamp>` tags for `(?<timestamp>...)` patterns in any variable rule (lines 61-95). These mechanisms do not interfere with each other. Capture groups named `timestamp` do not affect `has_timestamp()` status, as using captures as actual timestamps is not supported.
<!-- [add_learning]
In the log-surgeon codebase, capture groups (e.g., `(?<timestamp>...)`) in schema variable rules are added to the logtype string through the existing capture group processing logic, but they do not create actual timestamp tokens and do not cause `has_timestamp()` to return true. Using capture groups as actual timestamps is not supported; standalone timestamp tokens come from `timestamp:...

Applied to files:

  • docs/schema.md
📚 Learning: 2025-08-18T12:00:40.545Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 151
File: tests/test-buffer-parser.cpp:867-870
Timestamp: 2025-08-18T12:00:40.545Z
Learning: For schema documentation in code comments, multi-line rules should be supported and are expected since complex regex patterns may not fit comfortably on a single line within linting constraints. The "..." notation or backslash continuation can be used to indicate pattern continuation in documentation blocks.

Applied to files:

  • docs/schema.md
📚 Learning: 2025-12-03T16:38:15.354Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 199
File: src/log_surgeon/LogParser.cpp:154-155
Timestamp: 2025-12-03T16:38:15.354Z
Learning: In `src/log_surgeon/LogParser.cpp` and related parsing code, when `get_reg_ids_from_capture(capture)` returns an optional, calling `.value()` directly without checking is intentional for captures that should always have register IDs (e.g., `"timestamp"` captures). If the optional is nullopt, it indicates an internal programming error/invariant violation, and throwing `std::bad_optional_access` is the desired behavior rather than defensive error handling with continue/return.

Applied to files:

  • src/log_surgeon/Token.cpp
  • src/log_surgeon/Token.hpp
📚 Learning: 2025-11-06T11:16:52.917Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 184
File: tests/test-reader-parser.cpp:150-154
Timestamp: 2025-11-06T11:16:52.917Z
Learning: In Catch2 test files for the y-scope/log-surgeon repository, an explicit `if (false == optional.has_value()) { return; }` check may be required after `REQUIRE(optional.has_value())` to prevent clang-tidy errors, even though the code is unreachable at runtime. clang-tidy's static analyzer doesn't recognize that REQUIRE aborts execution and may flag unsafe optional access without the explicit check.

Applied to files:

  • src/log_surgeon/Token.cpp
  • src/log_surgeon/Token.hpp
📚 Learning: 2025-08-15T12:07:58.626Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 150
File: tests/test-expression-view.cpp:7-7
Timestamp: 2025-08-15T12:07:58.626Z
Learning: In tests/test-expression-view.cpp, the `<catch2/catch_message.hpp>` header is required for clang-tidy to pass, even though it may not be directly used in the visible code.

Applied to files:

  • src/log_surgeon/Token.cpp
  • src/log_surgeon/Token.hpp
📚 Learning: 2025-08-08T10:17:43.495Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 144
File: src/log_surgeon/wildcard_query_parser/VariableQueryToken.hpp:28-31
Timestamp: 2025-08-08T10:17:43.495Z
Learning: In src/log_surgeon/wildcard_query_parser/VariableQueryToken.hpp/.cpp (C++), do not suggest defaulting operator<=>. The project prefers a custom out-of-line comparator that explicitly handles the bool member (via explicit cast) to avoid implicit conversions; keep the current manual implementation.

Applied to files:

  • src/log_surgeon/Token.cpp
  • src/log_surgeon/Token.hpp
📚 Learning: 2025-08-08T10:23:06.281Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 144
File: src/log_surgeon/wildcard_query_parser/QueryInterpretation.hpp:80-88
Timestamp: 2025-08-08T10:23:06.281Z
Learning: In y-scope/log-surgeon (C++), small function definitions are allowed to remain inline in headers. For src/log_surgeon/wildcard_query_parser/QueryInterpretation.hpp, do not suggest moving small methods like QueryInterpretation::append_variable_token to the .cpp for “consistency” in future reviews.

Applied to files:

  • src/log_surgeon/Token.cpp
  • src/log_surgeon/Token.hpp
📚 Learning: 2025-08-08T10:00:20.963Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 144
File: src/log_surgeon/wildcard_query_parser/VariableQueryToken.hpp:28-31
Timestamp: 2025-08-08T10:00:20.963Z
Learning: In src/log_surgeon/wildcard_query_parser/QueryInterpretation.hpp (C++), do not default the comparison operators: the class stores std::vector<std::variant<StaticQueryToken, VariableQueryToken>>, which yields only a weak ordering. A custom operator<=> that maps the variant’s weak ordering to std::strong_ordering is required; avoid suggesting defaulting there in future reviews.

Applied to files:

  • src/log_surgeon/Token.cpp
  • src/log_surgeon/Token.hpp
📚 Learning: 2024-11-13T22:25:54.168Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 48
File: tests/test-tag.cpp:10-10
Timestamp: 2024-11-13T22:25:54.168Z
Learning: In the log-surgeon codebase (C++), particularly in the finite automata components involving the `Tag` class (`src/log_surgeon/finite_automata/Tag.hpp`), it's important to ensure that `Tag*` pointers in other objects cannot be `nullptr`. Test cases should focus on validating that these `Tag*` pointers are not null where they are used, and handle `nullptr` appropriately.

Applied to files:

  • src/log_surgeon/Token.cpp
  • src/log_surgeon/Token.hpp
📚 Learning: 2025-08-15T01:04:03.681Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 150
File: src/log_surgeon/wildcard_query_parser/ExpressionView.cpp:11-11
Timestamp: 2025-08-15T01:04:03.681Z
Learning: In y-scope/log-surgeon C++ codebase, prefer explicit includes in implementation files (.cpp) for types that are directly used, even when those headers are already transitively included through the corresponding header file. This makes dependencies explicit and improves maintainability.

Applied to files:

  • src/log_surgeon/Token.cpp
📚 Learning: 2024-11-27T21:56:13.425Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T21:56:13.425Z
Learning: In the `log_surgeon` project, header guards in C++ header files should include `_HPP` at the end to match the filename. For example, in `RegisterHandler.hpp`, the header guard should be `LOG_SURGEON_FINITE_AUTOMATA_REGISTER_HANDLER_HPP`.

Applied to files:

  • src/log_surgeon/Token.cpp
  • src/log_surgeon/Token.hpp
📚 Learning: 2025-05-05T14:55:34.455Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 106
File: src/log_surgeon/Lalr1Parser.tpp:661-665
Timestamp: 2025-05-05T14:55:34.455Z
Learning: The log-surgeon codebase follows a design approach where function contracts (like `ErrorCode::Success` guaranteeing a valid token) are trusted, and contract violations are allowed to throw exceptions rather than being explicitly checked at every call site.

Applied to files:

  • src/log_surgeon/Token.cpp
📚 Learning: 2024-11-18T16:45:46.073Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.073Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

Applied to files:

  • src/log_surgeon/Token.hpp
📚 Learning: 2025-08-26T10:06:22.914Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 152
File: src/log_surgeon/wildcard_query_parser/Query.hpp:9-11
Timestamp: 2025-08-26T10:06:22.914Z
Learning: In y-scope/log-surgeon project, it's acceptable to include headers like log_surgeon/Lexer.hpp directly in header files rather than using forward declarations, even when only references are used in the interface. The project prefers the simplicity of direct includes over header coupling optimization through forward declarations.

Applied to files:

  • src/log_surgeon/Token.hpp
📚 Learning: 2024-11-02T09:18:31.046Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-02T09:18:31.046Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.

Applied to files:

  • src/log_surgeon/Token.hpp
📚 Learning: 2025-12-09T21:17:27.546Z
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 197
File: src/log_surgeon/finite_automata/RegexAST.hpp:1011-1019
Timestamp: 2025-12-09T21:17:27.546Z
Learning: In C++ files under src/log_surgeon in the y-scope/log-surgeon repository, for const pointer parameters in null checks, Yoda conditions (nullptr == ptr) are not required; standard comparison order (ptr == nullptr) is acceptable since const prevents accidental assignment. Apply this pattern to header and related source files within src/log_surgeon.

Applied to files:

  • src/log_surgeon/Token.hpp
🧬 Code graph analysis (1)
src/log_surgeon/Token.cpp (1)
src/log_surgeon/Token.hpp (1)
  • start_pos (62-65)
🪛 Cppcheck (2.18.0)
src/log_surgeon/Token.cpp

[information] 4-4: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[information] 4-4: Include file

(missingIncludeSystem)


[style] 38-38: The function 'get_sub_token' is never used.

(unusedFunction)


[style] 46-46: The function 'set' is never used.

(unusedFunction)


[style] 38-38: The function 'append_position' is never used.

(unusedFunction)


[style] 48-48: The function 'get_num_regs' is never used.

(unusedFunction)

🪛 LanguageTool
docs/schema.md

[uncategorized] ~91-~91: Loose punctuation mark.
Context: ...pture within a header rule. * $PREFIX$, $SUFFIX$, and $TIMESTAMP-PATTERN$ a...

(UNLIKELY_OPENING_PUNCTUATION)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: lint (macos-15)
  • GitHub Check: lint (ubuntu-24.04)
  • GitHub Check: build (ubuntu-22.04, debug)
  • GitHub Check: build (macos-15, release)
  • GitHub Check: build (ubuntu-24.04, release)
  • GitHub Check: build (ubuntu-22.04, release)
  • GitHub Check: build (ubuntu-24.04, debug)
🔇 Additional comments (2)
src/log_surgeon/Token.hpp (1)

56-65: LGTM!

The get_sub_token method declaration is well-documented with clear parameter descriptions and exception specification.

src/log_surgeon/Token.cpp (1)

4-4: LGTM!

Correct include for std::out_of_range.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48ea212 and c1e4390.

📒 Files selected for processing (1)
  • examples/CMakeLists.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 151
File: tests/test-buffer-parser.cpp:959-959
Timestamp: 2025-08-18T16:46:05.687Z
Learning: In the Log-Surgeon codebase, hyphens need to be escaped with backslashes in regular expressions even outside character classes, which differs from standard regex behavior where hyphens don't require escaping outside character classes.
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 152
File: src/log_surgeon/wildcard_query_parser/Query.cpp:159-177
Timestamp: 2025-08-25T20:44:48.955Z
Learning: In the log-surgeon codebase, thread safety issues with global state like NonTerminal::m_next_children_start should be addressed comprehensively in dedicated PRs rather than fixed piecemeal in individual feature PRs. The user SharafMohamed prefers to defer such systemic architectural issues to separate PRs.
Learnt from: SharafMohamed
Repo: y-scope/log-surgeon PR: 199
File: src/log_surgeon/LogParser.cpp:195-195
Timestamp: 2025-12-03T16:22:52.313Z
Learning: In y-scope/log-surgeon, when reviewing code, do not suggest fixing style issues (such as C-style casts vs. static_cast) in lines that were not modified by the PR. The maintainer prefers to keep PRs focused on their intended changes and defer broader style refactoring to dedicated refactor PRs.
Learnt from: davidlion
Repo: y-scope/log-surgeon PR: 165
File: src/log_surgeon/LogEvent.cpp:53-55
Timestamp: 2025-10-22T15:40:29.992Z
Learning: In `src/log_surgeon/LogEvent.cpp`, the `get_logtype()` method has two independent mechanisms for adding `<timestamp>` to the logtype string: (1) A prefix added when `has_timestamp()` returns true, indicating a standalone timestamp token from timestamp rules (lines 53-55), and (2) Named capture group processing that adds `<timestamp>` tags for `(?<timestamp>...)` patterns in any variable rule (lines 61-95). These mechanisms do not interfere with each other. Capture groups named `timestamp` do not affect `has_timestamp()` status, as using captures as actual timestamps is not supported.
<!-- [add_learning]
In the log-surgeon codebase, capture groups (e.g., `(?<timestamp>...)`) in schema variable rules are added to the logtype string through the existing capture group processing logic, but they do not create actual timestamp tokens and do not cause `has_timestamp()` to return true. Using capture groups as actual timestamps is not supported; standalone timestamp tokens come from `timestamp:...
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: build (ubuntu-22.04, release)
  • GitHub Check: build (ubuntu-22.04, debug)
  • GitHub Check: build (macos-15, release)
  • GitHub Check: build (macos-15, debug)
  • GitHub Check: build (ubuntu-24.04, release)
  • GitHub Check: build (ubuntu-24.04, debug)
  • GitHub Check: lint (ubuntu-24.04)
  • GitHub Check: lint (macos-15)

@davidlion davidlion merged commit 1a8d52e into y-scope:main Dec 19, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants