feat: Handle tracking multi-valued tags for capture regex. #130

@SharafMohamed

Request

Currently, a variable's regex may contain capture groups inside a repetition, but the parser returns only the last match of each capture instead of all matches.

  • Essentially, for multi-valued tags (e.g. ([a]+=(?<val>[a0]+),){4}) we fail to track every instance of the tag.
  • More generally, if a variable is ambiguous, the DFA tracks the tags for only one interpretation. If the user chooses a different interpretation after lexing, the tag positions will be incorrect.
  • Additionally, if a variable is partially ambiguous (only a prefix is ambiguous), similar problems can occur: the DFA begins by tracking one interpretation's tags, and if that interpretation proves incorrect for the full variable, the needed tags for the correct interpretation were never tracked.
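The "last match only" behaviour is not specific to this project; it can be reproduced with `std::regex` as an analogy. The sketch below is illustrative and does not use log-surgeon. `last_val_capture` is a hypothetical helper name, and because `std::regex` (ECMAScript grammar) has no named groups, capture group 2 stands in for `(?<val>...)`.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper (not part of log-surgeon): returns {start, length} of the
// `val` group that std::regex reports after matching the whole input.
// Group 2 stands in for (?<val>...), since std::regex has no named groups.
inline std::pair<std::ptrdiff_t, std::ptrdiff_t> last_val_capture(std::string const& input) {
    static std::regex const cPattern{R"(([A-Za-z]+=([a-zA-Z0-9]+),){4})"};
    std::smatch match;
    if (false == std::regex_match(input, match, cPattern)) {
        return {-1, -1};
    }
    // Only the final iteration of the repetition survives in the match result;
    // the captures from the first three iterations are overwritten.
    return {match.position(2), match.length(2)};
}
```

For the input `userID=123,age=30,height=70,weight=100,` this reports only the position of `100`; the positions of `123`, `30`, and `70` are lost, which is the gap this issue asks to close.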

Possible implementation

Fix this test case:

/**
 * @ingroup test_buffer_parser_newline_vars
 *
 * @brief Test capture group repetition and backtracking.
 *
 * @details
 * This test checks `BufferParser`'s handling of a variable with a regex containing capture groups
 * repeated multiple times. It verifies the positions of captured subgroups within the parsed token
 * and ensures correct tokenization of the repeated pattern.
 *
 * @section schema Schema Definition
 * @code
 * delimiters: \n\r\[:,)
 * keyValuePairs:([A-Za-z]+=(?<val>[a-zA-Z0-9]+),){4}
 * @endcode
 *
 * @section input Test Input
 * @code
 * "userID=123,age=30,height=70,weight=100,"
 * @endcode
 *
 * @section expected Expected Logtype
 * @code
 * "userID=<val>,age=<val>,height=<val>,weight=<val>,"
 * @endcode
 *
 * @section expected Expected Tokenization
 * @code
 * "userID=123,age=30,height=70,weight=100," -> "keyValuePairs" with:
 *   "123" -> "val", "30" -> "val", "70" -> "val", "100" -> "val"
 * @endcode
 */
TEST_CASE("Test buffer parser with capture group repetition and backtracking", "[BufferParser]") {
    constexpr string_view cDelimitersSchema{R"(delimiters: \n\r\[:,)"};
    constexpr string_view cVarSchema{"keyValuePairs:([A-Za-z]+=(?<val>[a-zA-Z0-9]+),){4}"};
    constexpr string_view cInput{"userID=123,age=30,height=70,weight=100,"};
    ExpectedEvent const expected_event{
            .m_logtype{R"(userID=<val>,age=<val>,height=<val>,weight=<val>,)"},
            .m_timestamp_raw{""},
            .m_tokens{
                    {{"userID=123,age=30,height=70,weight=100,",
                      "keyValuePairs",
                      {{{"val", {{35, 25, 15, 7}, {38, 27, 17, 10}}}}}}}
            }
    };

    Schema schema;
    schema.add_delimiters(cDelimitersSchema);
    schema.add_variable(cVarSchema, -1);
    BufferParser buffer_parser{std::move(schema.release_schema_ast_ptr())};

    parse_and_validate(buffer_parser, cInput, {expected_event});
    // TODO: add backtracking case
}
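The expected capture positions in a test like this can be cross-checked independently of the parser. The sketch below is a rough aid, not part of log-surgeon: `all_val_positions` and `Positions` are hypothetical names, and it assumes offsets are half-open `[start, end)` ranges reported most recent first, as the ordering of the start positions in the test suggests. It uses `std::regex` with an unnamed group in place of `(?<val>...)`.

```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

using Positions = std::vector<std::ptrdiff_t>;

// Hypothetical helper: returns {starts, ends} for every `val` occurrence in
// `input`, most recent occurrence first, with exclusive end offsets (assumed).
inline std::pair<Positions, Positions> all_val_positions(std::string const& input) {
    // One key=value pair; group 1 stands in for (?<val>...).
    static std::regex const cPair{R"([A-Za-z]+=([a-zA-Z0-9]+),)"};
    Positions starts;
    Positions ends;
    std::sregex_iterator const end_it;
    for (std::sregex_iterator it{input.begin(), input.end(), cPair}; it != end_it; ++it) {
        auto const& val = (*it)[1];
        // Offsets are measured from the start of the whole input.
        starts.push_back(val.first - input.begin());
        ends.push_back(val.second - input.begin());
    }
    // Reorder so the last occurrence comes first.
    std::reverse(starts.begin(), starts.end());
    std::reverse(ends.begin(), ends.end());
    return {starts, ends};
}
```

Under those assumptions, the test input yields starts `{35, 25, 15, 7}` and ends `{38, 27, 17, 10}`.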

Labels: enhancement (New feature or request)