Skip to content

Conversation

@davidlion
Copy link
Member

Description

This draft PR is meant to track the gap between the prototype branch and main.

Possible small PRs to split out:

  • experimental flag
  • output handler refactor
  • Array.hpp

Potentially, the stat features can be split out into a separate PR. In this case NodeType::TypedVar would need to be reverted temporarily as it only exists to propagate a variable's type name.

Some other cleanup is still necessary (e.g. TODOs removed or converted to issues).

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

…d issue where only some string variables can filter with partial wildcards.
The following files will likely have merge conflicts when upstream clp
updates log surgeon and it should be fine to use --theirs:
    components/core/src/clp/GrepCore.cpp
    components/core/src/clp/streaming_archive/writer/Archive.cpp
    components/core/src/clp_s/log_converter/LogConverter.cpp
…essing to archive. Only trivial output currently.
@davidlion davidlion requested review from a team and gibber9809 as code owners November 24, 2025 14:53
@davidlion davidlion marked this pull request as draft November 24, 2025 14:53
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 24, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left an initial set of comments mostly focused on format-related stuff.

The most major thing we need a resolution on is escaping for the variable placeholder since that's breaking, but it would also be nice to get rid of TypedVar. We should definitely thing through stuff related to the other comments, but doesn't necessarily require any action for the upcoming release since its guarded by the experimental flag.

Will also try to do a pass through the rest of the code to see if I notice anything else, but the storage format stuff is definitely the most important.

Integer = 0x11,
Dictionary = 0x12,
Float = 0x13,
Schema = 0x14,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new variable placeholder is added here, but it seems like is_variable_placeholder in clp/ir/parsing.hpp hasn't been updated to include the placeholder. This means that the escaping code won't end up properly escaping 0x14 when it appears in logtext, which is a bug.

Granted, if we do properly escape it then this constitutes a major breaking format change that isn't guarded by the experimental flag. Given that, it might actually be better to intentionally leave this bug for now to avoid the breaking format change. The only alternative I can think of is guarding everything related to VariablePlacehodler::Schema by a bunch of compile time macros, but then we would need to ship a separate build for this experimental change.

Comment on lines 49 to 54
/**
*
* @return the total size of header data that will be written to the compressor in bytes
*/
virtual auto get_stats() const -> size_t { return 0; }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/**
*
* @return the total size of header data that will be written to the compressor in bytes
*/
virtual auto get_stats() const -> size_t { return 0; }

Unused?


std::shared_ptr<LogTypeDictionaryWriter> m_log_dict;

std::vector<encoded_log_dict_id_t> m_logtypes;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we end up just storing clp::logtype_dictioary_id_t with a static cast to encoded_log_dict_id_t. Not strictly necessary for this prototype, but I'd probably just change this to directly storing clp::logtype_dictionary_id_t -- no real reason to use encoded_log_dict_id_t here.

DictionaryFloat,
LogMessage,
LogType,
TypedVar,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TypedVar,

Since the actual storage format for TypedVar is identical to VarString, I think it's probably best to just get rid of TypedVar. Reasoning generally being that it gets harder to manage the storage format over time if we have multiple column types that are basically the same thing, and also that we have a limited number of NodeTypes (since they're 8 bits), so we generally want to avoid creating new ones when possible.

You'd need to move the variable dictionary encoding + the stat tracking into JsonParser (or maybe some utility in ArchiveWriter that you call from JsonParser) & change ParsedMessage/VarStringColumnWriter to allow directly passing a clp::variable_dictionary_id_t, but I think it's worth it.

DeltaInteger,
FormattedFloat,
DictionaryFloat,
LogMessage,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LogMessage,
LogMessage = 100,

I think this is a worthwhile precaution just to make sure that the IDs for these unfinalized node types that'll only be produced under the experimental flag don't overlap with IDs for node types in non-experimental releases for the foreseeable future.

Comment on lines +1494 to +1498
// Storing both the full match and capture groups creates potential duplication of the capture
// group's substring.
// One option is to re-write the full match similar to a logtype where the capture groups are
// replaced with encoded placeholders. The full match would need to be rebuilt using its capture
// groups to display.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also maybe possible to just store the parts of the full match that aren't the capture groups as part of the static logtext? Not sure if that easily fits into how log-surgeon works though.

Comment on lines +1475 to +1508
// TODO clpsls: variable repetition inside a log message
// It is possible to find multiple variables of the same type (therefore having the same name)
// inside the a single log event. This would result in the same node id in the clps schema tree. It
// is invalid JSON to have the same key in an object meaning we either need to create an array of
// values for the key or append to the key (variable type name / token name) to make it unique.
// Storing as an array complicates (named) search as the type of a variable's node would now be
// T|Array[T] (rather than just T).
//
// If it is possible to efficiently search all keys starting with a prefix, we could uniquely store
// each variable instance by appending to the key name (e.g. a node in a log message with no
// repetition would be named "var" while nodes in a message with repetition would be named var.0
// var.1, ..., var.n).
//
//
// TODO clpsls: capture group repetition
// Similarly, to variable repetition storing the capture group variable node as an array in cases
// with repetition creates a node type of T|Array[T].
//
// TODO clpsls:
// Storing both the full match and capture groups creates potential duplication of the capture
// group's substring.
// One option is to re-write the full match similar to a logtype where the capture groups are
// replaced with encoded placeholders. The full match would need to be rebuilt using its capture
// groups to display.
// Additionally wildcard search must know to avoid searching both the full match and capture groups.
// Using placeholders allows for rebuilding the variable (and log message) through node
// position/ordering as the placeholders in the full match enable you to know how many subsequent
// nodes are captures. Otherwise we either need to specially type the node (e.g. VarString and
// CaptureString) or do some other indexing/checking.
//
// ... Storing as a separate type seems the most sensible.
// There is a possible trade-off where de-duplicating the capture group string using placeholders in
// the full match node improves compression ration, but slows down search in certain cases as you
// maybe need to rebuild the full match's value.
Copy link
Contributor

@gibber9809 gibber9809 Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely all questions we should have answers for before this makes it into a proper non-experimental release, but seems ok for this prototype. I am a bit worried the full match/capture groups duplication will make it a bit harder to get a baseline for how these changes affect compression ratio though.

Comment on lines 1636 to 1642
logtype_dict_entry.add_schema_var();
m_current_parsed_message.add_unordered_value(
ParsedMessage::TypedVar{std::string{cFullMatch}, token_view.to_string()}
);
m_current_schema.insert_unordered(
m_archive_writer->add_node(capture_node_id, NodeType::TypedVar, cFullMatch)
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works, but its a little bit strange from the perspective of the current implementation.

Issue here is that the schema is supposed to be a set of leaf nodes that dominate some subtree and imply the full structure of the record. Here we're storing the full match and all of the captures as siblings, but arguably the full match is the parent of the captures. That said, we definitely don't support having a schema tree node and its children be part of the same schema, so the way you're storing this is probably fine. I guess the "conventional" thing to do would be to just store the captures, but that requires some thought.

Should be fine for now since this is a prototype, but its something we need to think through.

capture_view.set_start_pos(start_positions[0]);
capture_view.set_end_pos(end_positions[0]);

logtype_dict_entry.add_schema_var();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the logtype end up looking like? It seems like we add a schema var for the full match and all of the captures, which seems weird.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants