Clps + ls prototype draft #1653

davidlion · 2025-11-24T14:53:08Z

Description

This draft PR is meant to track the gap between the prototype branch and main.

Possible small PRs to split out:

experimental flag
output handler refactor
Array.hpp

Potentially, the stat features can be split out into a separate PR. In this case NodeType::TypedVar would need to be reverted temporarily as it only exists to propagate a variable's type name.

Some other cleanup is still necessary (e.g. TODOs removed or converted to issues).

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

The following files will likely have merge conflicts when upstream clp updates log surgeon, and it should be fine to use --theirs:

components/core/src/clp/GrepCore.cpp
components/core/src/clp/streaming_archive/writer/Archive.cpp
components/core/src/clp_s/log_converter/LogConverter.cpp

coderabbitai · 2025-11-24T14:53:25Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


gibber9809

Left an initial set of comments mostly focused on format-related stuff.

The most important thing we need a resolution on is escaping for the variable placeholder, since that's breaking, but it would also be nice to get rid of TypedVar. We should definitely think through the issues raised in the other comments, but they don't necessarily require any action for the upcoming release since it's all guarded by the experimental flag.

Will also try to do a pass through the rest of the code to see if I notice anything else, but the storage format stuff is definitely the most important.

gibber9809 · 2025-11-25T15:48:09Z

components/core/src/clp/ir/types.hpp

    Integer = 0x11,
    Dictionary = 0x12,
    Float = 0x13,
+    Schema = 0x14,


A new variable placeholder is added here, but it looks like is_variable_placeholder in clp/ir/parsing.hpp hasn't been updated to include it. This means the escaping code won't properly escape 0x14 when it appears in log text, which is a bug.

Granted, if we do properly escape it, this constitutes a major breaking format change that isn't guarded by the experimental flag. Given that, it might actually be better to intentionally leave this bug for now to avoid the breaking format change. The only alternative I can think of is guarding everything related to VariablePlaceholder::Schema with compile-time macros, but then we would need to ship a separate build for this experimental change.
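For reference, a minimal sketch of what the escaping fix could look like. Only the four numeric placeholder values come from the diff above; the enum layout, the escape character, and the escape_static_text helper are assumptions made for illustration and may not match the real code in clp/ir/parsing.hpp.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical mirror of the placeholder enum; only the numeric values are taken from the diff.
enum class VariablePlaceholder : char {
    Integer = 0x11,
    Dictionary = 0x12,
    Float = 0x13,
    Schema = 0x14,
};

// Assumed escape character; the real code may use a different one.
constexpr char cEscapeChar{'\\'};

// Returns true for every placeholder byte, including the new Schema value.
constexpr auto is_variable_placeholder(char c) -> bool {
    return static_cast<char>(VariablePlaceholder::Integer) == c
           || static_cast<char>(VariablePlaceholder::Dictionary) == c
           || static_cast<char>(VariablePlaceholder::Float) == c
           || static_cast<char>(VariablePlaceholder::Schema) == c;
}

// Hypothetical helper: prefixes placeholder bytes (and the escape character itself) found in
// static log text with the escape character.
auto escape_static_text(std::string_view text) -> std::string {
    std::string escaped;
    escaped.reserve(text.size());
    for (char const c : text) {
        if (is_variable_placeholder(c) || cEscapeChar == c) {
            escaped.push_back(cEscapeChar);
        }
        escaped.push_back(c);
    }
    return escaped;
}
```

With a check like this, a literal 0x14 byte in log text gains an escape prefix, which is exactly the on-disk change that would make the fix breaking.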

gibber9809 · 2025-11-25T15:55:44Z

components/core/src/clp_s/ColumnWriter.hpp

+    /**
+     *
+     * @return the total size of header data that will be written to the compressor in bytes
+     */
+    virtual auto get_stats() const -> size_t { return 0; }
+


Suggested change

-    /**
-     *
-     * @return the total size of header data that will be written to the compressor in bytes
-     */
-    virtual auto get_stats() const -> size_t { return 0; }

Unused?

gibber9809 · 2025-11-25T15:59:12Z

components/core/src/clp_s/ColumnWriter.hpp

+
+    std::shared_ptr<LogTypeDictionaryWriter> m_log_dict;
+
+    std::vector<encoded_log_dict_id_t> m_logtypes;


Looks like we end up just storing clp::logtype_dictionary_id_t with a static cast to encoded_log_dict_id_t. Not strictly necessary for this prototype, but I'd probably just change this to store clp::logtype_dictionary_id_t directly; there's no real reason to use encoded_log_dict_id_t here.

gibber9809 · 2025-11-25T16:08:08Z

components/core/src/clp_s/SchemaTree.hpp

    DictionaryFloat,
+    LogMessage,
+    LogType,
+    TypedVar,


Suggested change

-    TypedVar,

Since the actual storage format for TypedVar is identical to VarString, I think it's probably best to just get rid of TypedVar. Reasoning generally being that it gets harder to manage the storage format over time if we have multiple column types that are basically the same thing, and also that we have a limited number of NodeTypes (since they're 8 bits), so we generally want to avoid creating new ones when possible.

You'd need to move the variable dictionary encoding + the stat tracking into JsonParser (or maybe some utility in ArchiveWriter that you call from JsonParser) & change ParsedMessage/VarStringColumnWriter to allow directly passing a clp::variable_dictionary_id_t, but I think it's worth it.

gibber9809 · 2025-11-25T16:10:37Z

components/core/src/clp_s/SchemaTree.hpp

    DeltaInteger,
    FormattedFloat,
    DictionaryFloat,
+    LogMessage,


Suggested change

-    LogMessage,
+    LogMessage = 100,

I think this is a worthwhile precaution just to make sure that the IDs for these unfinalized node types that'll only be produced under the experimental flag don't overlap with IDs for node types in non-experimental releases for the foreseeable future.
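A small sketch of the idea, assuming a trimmed-down enum. The real NodeType in SchemaTree.hpp has more members and its numeric values are not shown in this thread; cFirstPrototypeNodeTypeId and is_prototype_node_type are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, trimmed-down NodeType. Only the member names come from the diff above.
enum class NodeType : std::uint8_t {
    DeltaInteger,
    FormattedFloat,
    DictionaryFloat,
    // Prototype-only node types start at a high, explicit ID so that node types added to
    // non-experimental releases can keep using the next free low ID.
    LogMessage = 100,
    LogType,
};

constexpr std::uint8_t cFirstPrototypeNodeTypeId{100};

// Hypothetical helper: true for node types that are only produced under the experimental flag.
constexpr auto is_prototype_node_type(NodeType type) -> bool {
    return static_cast<std::uint8_t>(type) >= cFirstPrototypeNodeTypeId;
}

// Compile-time checks that the two ranges don't overlap.
static_assert(static_cast<std::uint8_t>(NodeType::DictionaryFloat) < cFirstPrototypeNodeTypeId);
static_assert(101 == static_cast<std::uint8_t>(NodeType::LogType));
```

The static_assert lines would fail the build if a stable node type ever grew into the reserved range.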

gibber9809 · 2025-11-25T16:23:28Z

components/core/src/clp_s/JsonParser.cpp

+// Storing both the full match and capture groups creates potential duplication of the capture
+// group's substring.
+// One option is to re-write the full match similar to a logtype where the capture groups are
+// replaced with encoded placeholders. The full match would need to be rebuilt using its capture
+// groups to display.


Also maybe possible to just store the parts of the full match that aren't the capture groups as part of the static logtext? Not sure if that easily fits into how log-surgeon works though.
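To make the placeholder option concrete, here is a rough sketch of splitting a full match into static text plus captures and rebuilding it for display. Everything in it (CaptureRange, EncodedMatch, encode_match, rebuild_match, and the placeholder byte) is hypothetical and not taken from the PR or from log-surgeon. It ignores nested capture groups and escaping of placeholder bytes that occur in the text itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical placeholder byte marking where a capture group was cut out.
constexpr char cCapturePlaceholder{'\x14'};

// Half-open [begin, end) byte range of one capture group inside the full match.
struct CaptureRange {
    std::size_t begin;
    std::size_t end;
};

struct EncodedMatch {
    std::string static_text;            // Full match with captures replaced by placeholders
    std::vector<std::string> captures;  // Capture substrings in order of appearance
};

// Assumes `ranges` is sorted, non-overlapping, and within `full_match`.
auto encode_match(std::string_view full_match, std::vector<CaptureRange> const& ranges)
        -> EncodedMatch {
    EncodedMatch encoded;
    std::size_t pos{0};
    for (auto const& range : ranges) {
        assert(pos <= range.begin && range.begin <= range.end
               && range.end <= full_match.size());
        encoded.static_text.append(full_match.substr(pos, range.begin - pos));
        encoded.static_text.push_back(cCapturePlaceholder);
        encoded.captures.emplace_back(full_match.substr(range.begin, range.end - range.begin));
        pos = range.end;
    }
    encoded.static_text.append(full_match.substr(pos));
    return encoded;
}

// Rebuilds the original full match by substituting captures back in order.
auto rebuild_match(EncodedMatch const& encoded) -> std::string {
    std::string rebuilt;
    std::size_t next_capture{0};
    for (char const c : encoded.static_text) {
        if (cCapturePlaceholder == c && next_capture < encoded.captures.size()) {
            rebuilt.append(encoded.captures[next_capture++]);
        } else {
            rebuilt.push_back(c);
        }
    }
    return rebuilt;
}
```

Each capture substring is stored once, at the cost of a rebuild step whenever the full match has to be displayed or searched as a whole.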

gibber9809 · 2025-11-25T16:26:09Z

components/core/src/clp_s/JsonParser.cpp

+// TODO clpsls: variable repetition inside a log message
+// It is possible to find multiple variables of the same type (therefore having the same name)
+// inside a single log event. This would result in the same node id in the clps schema tree. It
+// is invalid JSON to have the same key in an object meaning we either need to create an array of
+// values for the key or append to the key (variable type name / token name) to make it unique.
+// Storing as an array complicates (named) search as the type of a variable's node would now be
+// T|Array[T] (rather than just T).
+//
+// If it is possible to efficiently search all keys starting with a prefix, we could uniquely store
+// each variable instance by appending to the key name (e.g. a node in a log message with no
+// repetition would be named "var" while nodes in a message with repetition would be named var.0,
+// var.1, ..., var.n).
+//
+//
+// TODO clpsls: capture group repetition
+// Similarly to variable repetition, storing the capture group variable node as an array in cases
+// with repetition creates a node type of T|Array[T].
+//
+// TODO clpsls:
+// Storing both the full match and capture groups creates potential duplication of the capture
+// group's substring.
+// One option is to re-write the full match similar to a logtype where the capture groups are
+// replaced with encoded placeholders. The full match would need to be rebuilt using its capture
+// groups to display.
+// Additionally wildcard search must know to avoid searching both the full match and capture groups.
+// Using placeholders allows for rebuilding the variable (and log message) through node
+// position/ordering as the placeholders in the full match enable you to know how many subsequent
+// nodes are captures. Otherwise we either need to specially type the node (e.g. VarString and
+// CaptureString) or do some other indexing/checking.
+//
+// ... Storing as a separate type seems the most sensible.
+// There is a possible trade-off where de-duplicating the capture group string using placeholders in
+// the full match node improves compression ratio, but slows down search in certain cases as you
+// may need to rebuild the full match's value.


Definitely all questions we should have answers for before this makes it into a proper non-experimental release, but seems ok for this prototype. I am a bit worried the full match/capture groups duplication will make it a bit harder to get a baseline for how these changes affect compression ratio though.
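As an illustration of the key-suffix option from the first TODO, a minimal sketch follows. make_unique_keys is a hypothetical helper, not part of the PR; it only shows the naming rule described in the comment (a bare name when there is no repetition, name.0 ... name.n otherwise).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: maps the variable type names found in one log event, in order, to keys
// that are unique within that event.
auto make_unique_keys(std::vector<std::string> const& var_names) -> std::vector<std::string> {
    // First pass: count how often each name occurs in this event.
    std::unordered_map<std::string, std::size_t> totals;
    for (auto const& name : var_names) {
        ++totals[name];
    }

    // Second pass: keep unrepeated names as-is and suffix repeated ones with their index.
    std::unordered_map<std::string, std::size_t> seen;
    std::vector<std::string> keys;
    keys.reserve(var_names.size());
    for (auto const& name : var_names) {
        if (1 == totals[name]) {
            keys.push_back(name);
        } else {
            keys.push_back(name + "." + std::to_string(seen[name]++));
        }
    }
    return keys;
}
```

Under this scheme a named search for a variable has to match both the bare key and every key starting with the name followed by a dot, which is why the TODO ties it to efficient prefix search.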

gibber9809 · 2025-11-25T16:46:40Z

components/core/src/clp_s/JsonParser.cpp

+                logtype_dict_entry.add_schema_var();
+                m_current_parsed_message.add_unordered_value(
+                        ParsedMessage::TypedVar{std::string{cFullMatch}, token_view.to_string()}
+                );
+                m_current_schema.insert_unordered(
+                        m_archive_writer->add_node(capture_node_id, NodeType::TypedVar, cFullMatch)
+                );


I think this works, but it's a little bit strange from the perspective of the current implementation.

The issue here is that the schema is supposed to be a set of leaf nodes that dominate some subtree and imply the full structure of the record. Here we're storing the full match and all of the captures as siblings, but arguably the full match is the parent of the captures. That said, we definitely don't support having a schema tree node and its children be part of the same schema, so the way you're storing this is probably fine. I guess the "conventional" thing to do would be to just store the captures, but that requires some thought.

Should be fine for now since this is a prototype, but it's something we need to think through.

gibber9809 · 2025-11-25T16:47:26Z

components/core/src/clp_s/JsonParser.cpp

+                        capture_view.set_start_pos(start_positions[0]);
+                        capture_view.set_end_pos(end_positions[0]);
+
+                        logtype_dict_entry.add_schema_var();


What does the logtype end up looking like? It seems like we add a schema var for the full match and all of the captures, which seems weird.

davidlion added 30 commits October 3, 2025 08:55

Squash starting code. Compression + search run, but search always displays full LogMessage.

1bf8e5c

Merge remote-tracking branch 'upstream/main' into clpsls-prototype

457052b

Merge remote-tracking branch 'upstream/main' into clpsls-prototype

8e10d94

Initial code for reading plain text files. Need change type of ls parser based on file type.

eacfaed

Convert FullMatch and LogType to VarStringT for search.

92b6134

Add capture group support. Support LogType pure-wildcard search. Weird issue where only some string variables can filter with partial wildcards.

d511a22

Remove some debug logs.

48fc474

Merge remote-tracking branch 'upstream/main' into clpsls-prototype

7a1fbdd

Drop FullMatch type as it is unnecessary.

922bc67

Use DictionaryFloat to work around float conversion; Small refactors.

9300edc

Added FormattedFloat using stod.

8d4c317

Merge remote-tracking branch 'upstream/main' into clpsls-prototype

5e0bdda

Merge remote-tracking branch 'upstream/main' into clpsls-prototype

7dff61f

Remove cmake/Modules/FindLibArchive.cmake due to conflict with task version.

1c0e50b

Fix error code linking issue in indexer.

591a457

Bump libarchive.

44adcb2

Add experimental cli arg with printing placeholders.

c66ca59

Random tweak.

98a6443

Refactor argument passing and log surgeon parser creation.

d979fc4

Switch back to log surgeon main.

b2d941c

feat: Add logtype and variable stats tracking and compressing/decompressing to archive. Only trivial output currently.

bf58b50

Merge remote-tracking branch 'upstream/main' into clpsls-prototype

150123f

Fix newline in clpsls debug logs.

d355b8b

Fix type typo.

b9d67aa

Refactor the creation of an output handler and use it with experimental stats output.

da1ff34

Merge remote-tracking branch 'upstream/main' into clpsls-prototype

edb5376

Call flush on output_handler.

f4d1779

namespace fix

c4397c0

Improve experimental flag handling.

7296e0a

davidlion added 11 commits November 12, 2025 17:00

Refactor experimental stats to be in a separate file inside an archive.

d1dc63c

Update cli args before validating.

437b2ce

Remove dead code for a feature that will be added separately later.

b9a58d6

Fix non-experimental path to correctly use ClpStrings again.

016c3bf

Merge remote-tracking branch 'upstream/main' into clpsls-prototype

7c75d79

Add functionality to Array to fix possible holes/gaps between var dictionary and var stats.

0fc78fa

Add TypedVar to QueryRunner initialize_reader.

5626d3c

Merge remote-tracking branch 'upstream/main' into clpsls-prototype

8e4a36a

Revert dep changes to build.

b3fd4c0

Revert some other changes.

0bdb9cf

Small linting fix.

0a1ae76

davidlion requested review from a team and gibber9809 as code owners November 24, 2025 14:53

davidlion marked this pull request as draft November 24, 2025 14:53

davidlion added 2 commits November 24, 2025 09:54

Revert unnecessary change to clean up diff.

6c5fc8b

Refactor archive experimental option handling; unit tests passing.

fec1cb0

gibber9809 requested changes Nov 25, 2025

davidlion added 6 commits November 25, 2025 17:52

Merge remote-tracking branch 'upstream/main' into clpsls-prototype

1254976

Remove TypedVar and add stat helpers to ArchiveWriter.

92fa397

Remove unused get_stats from ColumnReader.

e2e7765

Set prototype NodeType IDs to be large avoiding any conflicts with as non-prototype NodeTypes are added.

a97404b

Remove unused encoded logtype id from LogTypeColumn{Reader,Writer}.

c35dcd9

Merge remote-tracking branch 'upstream/main' into clpsls-prototype

f2bc72f
