Merged
Commits
46 commits
1f4b310
Kinda working.
SharafMohamed Nov 14, 2025
dfa411e
Merge branch 'main' into log-event-boundary-new
SharafMohamed Nov 28, 2025
3ad08ea
Fix edge cases; Update tests.
SharafMohamed Dec 3, 2025
2cc83c9
Update test docstring.
SharafMohamed Dec 3, 2025
3a21375
Fix examples/common.cpp.
SharafMohamed Dec 3, 2025
fc520fd
Cleanup.
SharafMohamed Dec 3, 2025
caca5bc
Add clarifying comment.
SharafMohamed Dec 3, 2025
dd8e45e
Fix accessors for timestamp in output buffer.
SharafMohamed Dec 3, 2025
db78cd6
Fix caching.
SharafMohamed Dec 3, 2025
1351a71
reset m_has_timestamp.
SharafMohamed Dec 3, 2025
1f11aa2
Always get capture_string on wrap around.
SharafMohamed Dec 3, 2025
25b2874
Remove m_has_timestamp and replace it with optional m_timestamp.
SharafMohamed Dec 15, 2025
4bd25cb
Update reader parser to use header check.
SharafMohamed Dec 15, 2025
80376bc
Compile examples.
SharafMohamed Dec 15, 2025
db50803
Replace macros.
SharafMohamed Dec 15, 2025
3d02572
Have get_reg_ids_from_capture throw instead of returning optional.
SharafMohamed Dec 15, 2025
0dfd168
Update readme.
SharafMohamed Dec 15, 2025
82bfd86
Update example schema.
SharafMohamed Dec 15, 2025
a64e9b5
Update readme again.
SharafMohamed Dec 15, 2025
39ebb8f
Improve example schema.
SharafMohamed Dec 15, 2025
67d6e72
Improve example schema.
SharafMohamed Dec 15, 2025
df74af5
Remove extra : in example schema.
SharafMohamed Dec 15, 2025
494f8d1
Fix wording.
SharafMohamed Dec 15, 2025
725a71e
Return captures as tokens to prevent invalidating cache during nested…
SharafMohamed Dec 16, 2025
482df96
Merge branch 'main' into log-event-boundary-new
SharafMohamed Dec 16, 2025
919b598
Update readme to use as $txt$ placeholders and add examples to make i…
SharafMohamed Dec 17, 2025
bd0f4fd
Rename get_capture_token to get_sub_token.
SharafMohamed Dec 17, 2025
23d466c
Add newline.
SharafMohamed Dec 17, 2025
6cb38ea
Update get_sub_token.
SharafMohamed Dec 17, 2025
5c17c15
Allow for non escaped hyphens outside of ranges.
SharafMohamed Dec 17, 2025
7b4d618
Remove unused header.
SharafMohamed Dec 18, 2025
aa61693
Remove unused header.
SharafMohamed Dec 18, 2025
6771a4f
Replace txt with plaintext.
SharafMohamed Dec 18, 2025
48ea212
Merge branch 'log-event-boundary-new' into hyphen_fix
SharafMohamed Dec 18, 2025
c1e4390
Fix cmakelists.
SharafMohamed Dec 18, 2025
791e83e
Fix cmakelists.
SharafMohamed Dec 18, 2025
a47ab26
Merge branch 'main' into log-event-boundary-new
SharafMohamed Dec 19, 2025
351bd18
Merge branch 'main' into log-event-boundary-new
davidlion Dec 19, 2025
096ca4b
Merge branch 'main' into log-event-boundary-new
SharafMohamed Dec 19, 2025
9e538b7
Merge branch 'log-event-boundary-new' into hyphen_fix
SharafMohamed Dec 19, 2025
9a7d974
Merge branch 'log-event-boundary-new' of https://github.com/SharafMoh…
SharafMohamed Dec 19, 2025
3b4dd08
Lint.
SharafMohamed Dec 19, 2025
9002260
Merge branch 'log-event-boundary-new' into hyphen_fix
SharafMohamed Dec 19, 2025
fdc5178
Merge branch 'main' into hyphen_fix
SharafMohamed Dec 19, 2025
95aa1d2
Remove escaped hyphens.
SharafMohamed Dec 19, 2025
2a61ca7
update schema.md for hyphens.
SharafMohamed Dec 19, 2025
73 changes: 50 additions & 23 deletions docs/schema.md
@@ -16,8 +16,10 @@ There are three types of rules in a schema file:

* [Variables](#variables): Defines patterns for capturing specific pieces of the log.
* [Delimiters](#delimiters): Specifies the characters that separate tokens in the log.
* [Timestamps](#timestamps): Identifies the boundary between log events. Timestamps are also treated
as variables.
* [Headers](#headers): Identifies the boundary between log events. Headers are also treated as
variables.
* The first capture named `timestamp` matched within a header pattern is considered the log
event's timestamp.

For documentation, the schema allows for user comments by ignoring any text preceded by `//`.

@@ -26,15 +28,21 @@ For documentation, the schema allows for user comments by ignoring any text preceded by `//`.
**Syntax:**

```txt
<variable-name>:<variable-pattern>
$VARIABLE_NAME$:$VARIABLE_PATTERN$
```

* `variable-name` may contain any alphanumeric characters, but may not be the reserved names
`delimiters` or `timestamp`.
* `variable-pattern` is a regular expression using the supported
* `$VARIABLE_NAME$` may contain any alphanumeric characters, but may not use the reserved names
`delimiters`, `header`, or `timestamp`.
* `$VARIABLE_PATTERN$` is a regular expression using the supported
[syntax](#regular-expression-syntax).

Note that:
**Example:**

```txt
equalsCapture:.*=(?<equals>.*[a-zA-Z0-9].*)
```

**Note that:**

* A schema file may contain zero or more variable rules.
* Repeating the same variable name in another rule will `OR` the regular expressions (perform an
@@ -47,36 +55,54 @@ Note that:
**Syntax:**

```txt
delimiters:<characters>
delimiters:$CHARACTERS$
```

* `delimiters` is a reserved name for this rule.
* `characters` is a set of characters that should be treated as delimiters. These characters define
the boundaries between tokens in the log.
* `$CHARACTERS$` is a set of characters that should be treated as delimiters. These characters
define the boundaries between tokens in the log.

Note that:
**Example:**

```txt
delimiters: \t\r\n:,!;%
```

**Note that:**

* A schema file must contain at least one `delimiters` rule. If multiple `delimiters` rules are
specified, only the last one will be used.

### Timestamps
### Headers

**Syntax:**

```txt
timestamp:<timestamp-pattern>
header:$PREFIX$(?<timestamp>$TIMESTAMP-PATTERN$)$SUFFIX$
```

* `timestamp` is a reserved name for this rule.
* `timestamp-pattern` is a regular expression using the supported
* Multiple headers can be specified within a schema.
* The timestamp capture can be omitted if the log-event boundary does not contain a timestamp.
* Multiple timestamp captures are allowed within a header. These can exist within regex repetitions
or alternations.
* If no timestamps are parsed, the event's logtype has no timestamp.
* If one or more timestamps are parsed, the event's logtype uses the first timestamp.
* `timestamp` is a reserved name for the capture within a header rule.
* `$PREFIX$`, `$SUFFIX$`, and `$TIMESTAMP-PATTERN$` are regular expressions using the supported
[syntax](#regular-expression-syntax).

Note that:
**Example:**

```txt
header:Log (?<pid>\d+) (?<timestamp>\[\d{8}\-\d{2}:\d{2}:\d{2}\]){0,1}
```

**Note that:**

* The parser uses a timestamp to denote the start of a new log event if:
* The parser uses a header to denote the start of a new log event if:
* It appears as the first token in the input, or
* It appears after a newline character.
* Until a timestamp is found, the parser will use a newline character to denote the start of a new
* Until a header is found, the parser will use a newline character to denote the start of a new
log event.

## Example schema file
@@ -86,10 +112,11 @@
delimiters: \t\r\n:,!;%

// Keywords
timestamp:\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}(\.\d{3}){0,1}
timestamp:\[\d{8}\-\d{2}:\d{2}:\d{2}\]
int:\-{0,1}[0-9]+
float:\-{0,1}[0-9]+\.[0-9]+
header:(?<timestamp>\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}(\.\d{3}){0,1})
header:Log (?<pid>\d+) (?<timestamp>\[\d{8}\-\d{2}:\d{2}:\d{2}\]){0,1}
header:--- Log:
int:\-{0,1}\d+
float:\-{0,1}\d+\.\d+

// Custom variables
hex:[a-fA-F]+
@@ -99,7 +126,7 @@ equalsCapture:.*=(?<equals>.*[a-zA-Z0-9].*)

* `delimiters: \t\r\n:,!;%` indicates that ` `, `\t`, `\r`, `\n`, `:`, `,`, `!`, `;`, and `%` are
delimiters.
* `timestamp` matches two different patterns:
* `header` matches two different timestamp patterns:
* `2023-04-19 12:32:08.064`
* `[20230419-12:32:08]`
* `int`, `float`, `hex`, `hasNumber`, and `equalsCapture` all match different user defined
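To make the boundary rules above concrete, here is an illustrative input (not taken from the PR) processed against the first header rule of the example schema, `header:(?<timestamp>\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}(\.\d{3}){0,1})`:

```txt
2023-04-19 12:32:08.064 Server listening on port 8080
    stack trace line one
    stack trace line two
2023-04-19 12:32:09.120 Client connected
```

Only the first and last lines begin with a matching header, so the parser would produce two log events: one spanning the first three lines, with timestamp `2023-04-19 12:32:08.064`, and one for the final line. The log messages themselves are hypothetical and serve only to illustrate the documented behaviour.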
9 changes: 3 additions & 6 deletions examples/common.cpp
@@ -31,16 +31,13 @@ auto check_input(std::vector<std::string> const& args) -> int {
}

auto print_timestamp_loglevel(LogEventView const& event, uint32_t loglevel_id) -> void {
Token* timestamp{event.get_timestamp()};
Token* loglevel{nullptr};
if (nullptr != timestamp) {
auto const& optional_timestamp{event.get_log_output_buffer()->get_timestamp()};
if (optional_timestamp.has_value()) {
if (auto const& vec{event.get_variables(loglevel_id)}; false == vec.empty()) {
loglevel = vec[0];
}
}
if (nullptr != timestamp) {
cout << "timestamp: ";
cout << timestamp->to_string_view();
cout << "timestamp: " << optional_timestamp.value();
}
if (nullptr != loglevel) {
cout << ", loglevel:";
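For context on the accessor change above, here is a minimal sketch (not part of the PR) of how a caller might branch on the optional timestamp now exposed through the output buffer. The include path and the assumption that the stored timestamp value is streamable are mine; the accessor names follow the diff.

```cpp
#include <iostream>

// Assumed include path for LogEventView; adjust to the project's actual layout.
#include <log_surgeon/LogEvent.hpp>

using log_surgeon::LogEventView;

// Prints the event's timestamp if the header that delimited it captured one.
auto print_timestamp_if_present(LogEventView const& event) -> void {
    auto const& optional_timestamp{event.get_log_output_buffer()->get_timestamp()};
    if (optional_timestamp.has_value()) {
        std::cout << "timestamp: " << optional_timestamp.value() << "\n";
    } else {
        // Either no header has been matched yet, or the header had no `timestamp` capture.
        std::cout << "timestamp: <none>\n";
    }
}
```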
46 changes: 23 additions & 23 deletions examples/schema.txt
@@ -1,53 +1,53 @@
// Timestamps (using the `timestamp` keyword)
// Timestamps (using a `header` rule with a `timestamp` named capture)
// E.g. 2015-01-31T15:50:45.392
timestamp:\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}.\d{3}
header:(?<timestamp>\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}.\d{3})
// E.g. 2015-01-31T15:50:45,392
timestamp:\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2},\d{3}
header:(?<timestamp>\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2},\d{3})
// E.g. [2015-01-31T15:50:45
timestamp:\[\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}
header:(?<timestamp>\[\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2})
// E.g. [20170106-16:56:41]
timestamp:\[\d{4}\d{2}\d{2}\-\d{2}:\d{2}:\d{2}\]
header:(?<timestamp>\[\d{4}\d{2}\d{2}\-\d{2}:\d{2}:\d{2}\])
// E.g. 2015-01-31 15:50:45,392
// E.g. INFO [main] 2015-01-31 15:50:45,085
timestamp:\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2},\d{3}
header:(?<timestamp>\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2},\d{3})
// E.g. 2015-01-31 15:50:45.392
timestamp:\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}.\d{3}
header:(?<timestamp>\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}.\d{3})
// E.g. [2015-01-31 15:50:45,085]
timestamp:\[\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2},\d{3}\]
header:(?<timestamp>\[\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2},\d{3}\])
// E.g. 2015-01-31 15:50:45
// E.g. Started POST /api/v3/internal/allowed for 127.0.0.1 at 2017-06-18 00:20:44
// E.g. update-alternatives 2015-01-31 15:50:45
timestamp:\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}
header:(?<timestamp>\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2})
// E.g. Start-Date: 2015-01-31 15:50:45
timestamp:\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}
header:(?<timestamp>\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2})
// E.g. 2015/01/31 15:50:45
timestamp:\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}
header:(?<timestamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})
// E.g. 15/01/31 15:50:45
timestamp:\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}
header:(?<timestamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})
// E.g. 150131 9:50:45
timestamp:\d{2}\d{2}\d{2} [ 0-9]{2}:\d{2}:\d{2}
header:(?<timestamp>\d{2}\d{2}\d{2} [ 0-9]{2}:\d{2}:\d{2})
// E.g. 01 Jan 2016 15:50:17,085
timestamp:\d{2} [A-Z][a-z]{2} \d{4} \d{2}:\d{2}:\d{2},\d{3}
header:(?<timestamp>\d{2} [A-Z][a-z]{2} \d{4} \d{2}:\d{2}:\d{2},\d{3})
// E.g. Jan 01, 2016 3:50:17 PM
timestamp:[A-Z][a-z]{2} \d{2}, \d{4} [ 0-9]{2}:\d{2}:\d{2} [AP]M
header:(?<timestamp>[A-Z][a-z]{2} \d{2}, \d{4} [ 0-9]{2}:\d{2}:\d{2} [AP]M)
// E.g. January 31, 2015 15:50
timestamp:[A-Z][a-z]+ \d{2}, \d{4} \d{2}:\d{2}
header:(?<timestamp>[A-Z][a-z]+ \d{2}, \d{4} \d{2}:\d{2})
// E.g. E [31/Jan/2015:15:50:45
// E.g. localhost - - [01/Jan/2016:15:50:17
// E.g. 192.168.4.5 - - [01/Jan/2016:15:50:17
timestamp:\[\d{2}/[A-Z][a-z]{2}/\d{4}:\d{2}:\d{2}:\d{2}
header:(?<timestamp>\[\d{2}/[A-Z][a-z]{2}/\d{4}:\d{2}:\d{2}:\d{2})
// E.g. 192.168.4.5 - - [01/01/2016:15:50:17
timestamp:\[\d{2}/\d{2}/\d{4}:\d{2}:\d{2}:\d{2}
header:(?<timestamp>\[\d{2}/\d{2}/\d{4}:\d{2}:\d{2}:\d{2})
// E.g. ERROR: apport (pid 4557) Sun Jan 1 15:50:45 2015
timestamp:[A-Z][a-z]{2} [A-Z][a-z]{2} [ 0-9]{2} \d{2}:\d{2}:\d{2} \d{4}
header:(?<timestamp>[A-Z][a-z]{2} [A-Z][a-z]{2} [ 0-9]{2} \d{2}:\d{2}:\d{2} \d{4})
// E.g. <<<2016-11-10 03:02:29:936
timestamp:\<\<\<\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}:\d{3}
header:(?<timestamp>\<\<\<\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}:\d{3})
// E.g. Jan 21 11:56:42
timestamp:[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2}
header:(?<timestamp>[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2})
// E.g. 01-21 11:56:42.392
timestamp:\d{2}\-\d{2} \d{2}:\d{2}:\d{2}.\d{3}
header:(?<timestamp>\d{2}\-\d{2} \d{2}:\d{2}:\d{2}.\d{3})
// E.g. 2016-05-08 11:34:04.083464
timestamp:\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}.\d{6}
header:(?<timestamp>\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}.\d{6})

// Delimiters
delimiters: \t\r\n:,!;%
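Every rewrite in this file follows the same mechanical pattern, summarized below with `$PATTERN$` as a placeholder rather than literal schema syntax:

```txt
// Old syntax: a dedicated timestamp rule
timestamp:$PATTERN$

// New syntax: a header rule with the pattern wrapped in a `timestamp` named capture
header:(?<timestamp>$PATTERN$)
```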
6 changes: 2 additions & 4 deletions src/log_surgeon/Constants.hpp
@@ -35,8 +35,7 @@ enum class SymbolId : uint32_t {
TokenInt,
TokenFloat,
TokenHex,
TokenFirstTimestamp,
TokenNewlineTimestamp,
TokenHeader,
TokenNewline
};

@@ -45,8 +44,7 @@ constexpr char cTokenUncaughtString[] = "$UncaughtString";
constexpr char cTokenInt[] = "int";
constexpr char cTokenFloat[] = "float";
constexpr char cTokenHex[] = "hex";
constexpr char cTokenFirstTimestamp[] = "firstTimestamp";
constexpr char cTokenNewlineTimestamp[] = "newLineTimestamp";
constexpr char cTokenHeader[] = "header";
constexpr char cTokenNewline[] = "newLine";
// Buffer size cannot be odd, so always use a multiple of 2
constexpr uint32_t cStaticByteBuffSize = 48'000;
3 changes: 1 addition & 2 deletions src/log_surgeon/Lalr1Parser.tpp
@@ -59,8 +59,7 @@ Lalr1Parser<TypedNfaState, TypedDfaState>::Lalr1Parser() {
m_terminals.insert((uint32_t)SymbolId::TokenInt);
m_terminals.insert((uint32_t)SymbolId::TokenFloat);
m_terminals.insert((uint32_t)SymbolId::TokenHex);
m_terminals.insert((uint32_t)SymbolId::TokenFirstTimestamp);
m_terminals.insert((uint32_t)SymbolId::TokenNewlineTimestamp);
m_terminals.insert((uint32_t)SymbolId::TokenHeader);
m_terminals.insert((uint32_t)SymbolId::TokenNewline);
}

11 changes: 6 additions & 5 deletions src/log_surgeon/Lexer.hpp
@@ -6,6 +6,7 @@
#include <memory>
#include <optional>
#include <set>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
@@ -212,24 +213,24 @@
* Retrieves the register IDs for the start and end tags associated with a given capture.
* @param capture Pointer to the capture to search for.
* @return A pair of register IDs corresponding to the start and end tags of the capture.
* @return std::nullopt if no such capture is found.
* @throw runtime_error if capture does not have tag ids or register ids.
*/
[[nodiscard]] auto get_reg_ids_from_capture(finite_automata::Capture const* const capture) const
-> std::optional<std::pair<reg_id_t, reg_id_t>> {
-> std::pair<reg_id_t, reg_id_t> {
auto const optional_tag_id_pair{get_tag_id_pair_from_capture(capture)};
if (false == optional_tag_id_pair.has_value()) {
return std::nullopt;
throw std::runtime_error(capture->get_name() + " has no tag ids");
}
auto const [start_tag_id, end_tag_id]{optional_tag_id_pair.value()};

auto const optional_start_reg_id{get_reg_id_from_tag_id(start_tag_id)};
if (false == optional_start_reg_id.has_value()) {
return std::nullopt;
throw std::runtime_error(capture->get_name() + " has no start reg id");
}

auto const optional_end_reg_id{get_reg_id_from_tag_id(end_tag_id)};
if (false == optional_end_reg_id.has_value()) {
return std::nullopt;
throw std::runtime_error(capture->get_name() + " has no end reg id");
}

return std::make_pair(optional_start_reg_id.value(), optional_end_reg_id.value());
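Because `get_reg_ids_from_capture` now throws instead of returning `std::nullopt`, call sites need an exception-based failure path. A minimal sketch of the new calling convention (hypothetical caller; `lexer` and `capture` are assumed to be an initialized lexer and a capture registered with it):

```cpp
#include <iostream>
#include <stdexcept>

// Hypothetical caller illustrating the new error-handling contract.
template <typename LexerT, typename CaptureT>
auto lookup_capture_registers(LexerT const& lexer, CaptureT const* capture) -> void {
    try {
        auto const [start_reg_id, end_reg_id]{lexer.get_reg_ids_from_capture(capture)};
        std::cout << "start register: " << start_reg_id << ", end register: " << end_reg_id
                  << "\n";
    } catch (std::runtime_error const& e) {
        // Cases that previously surfaced as std::nullopt now arrive here.
        std::cerr << "capture lookup failed: " << e.what() << "\n";
    }
}
```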