-
Notifications
You must be signed in to change notification settings - Fork 83
Clps + ls prototype draft #1653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…plays full LogMessage.
…ser based on file type.
…d issue where only some string variables can filter with partial wildcards.
The following files will likely have merge conflicts when upstream clp
updates log surgeon and it should be fine to use --theirs:
components/core/src/clp/GrepCore.cpp
components/core/src/clp/streaming_archive/writer/Archive.cpp
components/core/src/clp_s/log_converter/LogConverter.cpp
…essing to archive. Only trivial output currently.
…tionary and var stats.
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the ✨ Finishing touches🧪 Generate unit tests (beta)
Tip 📝 Customizable high-level summaries are now available in beta!You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
gibber9809
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left an initial set of comments mostly focused on format-related stuff.
The most major thing we need a resolution on is escaping for the variable placeholder since that's breaking, but it would also be nice to get rid of TypedVar. We should definitely thing through stuff related to the other comments, but doesn't necessarily require any action for the upcoming release since its guarded by the experimental flag.
Will also try to do a pass through the rest of the code to see if I notice anything else, but the storage format stuff is definitely the most important.
| Integer = 0x11, | ||
| Dictionary = 0x12, | ||
| Float = 0x13, | ||
| Schema = 0x14, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A new variable placeholder is added here, but it seems like is_variable_placeholder in clp/ir/parsing.hpp hasn't been updated to include the placeholder. This means that the escaping code won't end up properly escaping 0x14 when it appears in logtext, which is a bug.
Granted, if we do properly escape it then this constitutes a major breaking format change that isn't guarded by the experimental flag. Given that, it might actually be better to intentionally leave this bug for now to avoid the breaking format change. The only alternative I can think of is guarding everything related to VariablePlacehodler::Schema by a bunch of compile time macros, but then we would need to ship a separate build for this experimental change.
| /** | ||
| * | ||
| * @return the total size of header data that will be written to the compressor in bytes | ||
| */ | ||
| virtual auto get_stats() const -> size_t { return 0; } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /** | |
| * | |
| * @return the total size of header data that will be written to the compressor in bytes | |
| */ | |
| virtual auto get_stats() const -> size_t { return 0; } |
Unused?
|
|
||
| std::shared_ptr<LogTypeDictionaryWriter> m_log_dict; | ||
|
|
||
| std::vector<encoded_log_dict_id_t> m_logtypes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we end up just storing clp::logtype_dictioary_id_t with a static cast to encoded_log_dict_id_t. Not strictly necessary for this prototype, but I'd probably just change this to directly storing clp::logtype_dictionary_id_t -- no real reason to use encoded_log_dict_id_t here.
| DictionaryFloat, | ||
| LogMessage, | ||
| LogType, | ||
| TypedVar, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| TypedVar, |
Since the actual storage format for TypedVar is identical to VarString, I think it's probably best to just get rid of TypedVar. Reasoning generally being that it gets harder to manage the storage format over time if we have multiple column types that are basically the same thing, and also that we have a limited number of NodeTypes (since they're 8 bits), so we generally want to avoid creating new ones when possible.
You'd need to move the variable dictionary encoding + the stat tracking into JsonParser (or maybe some utility in ArchiveWriter that you call from JsonParser) & change ParsedMessage/VarStringColumnWriter to allow directly passing a clp::variable_dictionary_id_t, but I think it's worth it.
| DeltaInteger, | ||
| FormattedFloat, | ||
| DictionaryFloat, | ||
| LogMessage, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| LogMessage, | |
| LogMessage = 100, |
I think this is a worthwhile precaution just to make sure that the IDs for these unfinalized node types that'll only be produced under the experimental flag don't overlap with IDs for node types in non-experimental releases for the foreseeable future.
| // Storing both the full match and capture groups creates potential duplication of the capture | ||
| // group's substring. | ||
| // One option is to re-write the full match similar to a logtype where the capture groups are | ||
| // replaced with encoded placeholders. The full match would need to be rebuilt using its capture | ||
| // groups to display. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also maybe possible to just store the parts of the full match that aren't the capture groups as part of the static logtext? Not sure if that easily fits into how log-surgeon works though.
| // TODO clpsls: variable repetition inside a log message | ||
| // It is possible to find multiple variables of the same type (therefore having the same name) | ||
| // inside the a single log event. This would result in the same node id in the clps schema tree. It | ||
| // is invalid JSON to have the same key in an object meaning we either need to create an array of | ||
| // values for the key or append to the key (variable type name / token name) to make it unique. | ||
| // Storing as an array complicates (named) search as the type of a variable's node would now be | ||
| // T|Array[T] (rather than just T). | ||
| // | ||
| // If it is possible to efficiently search all keys starting with a prefix, we could uniquely store | ||
| // each variable instance by appending to the key name (e.g. a node in a log message with no | ||
| // repetition would be named "var" while nodes in a message with repetition would be named var.0 | ||
| // var.1, ..., var.n). | ||
| // | ||
| // | ||
| // TODO clpsls: capture group repetition | ||
| // Similarly, to variable repetition storing the capture group variable node as an array in cases | ||
| // with repetition creates a node type of T|Array[T]. | ||
| // | ||
| // TODO clpsls: | ||
| // Storing both the full match and capture groups creates potential duplication of the capture | ||
| // group's substring. | ||
| // One option is to re-write the full match similar to a logtype where the capture groups are | ||
| // replaced with encoded placeholders. The full match would need to be rebuilt using its capture | ||
| // groups to display. | ||
| // Additionally wildcard search must know to avoid searching both the full match and capture groups. | ||
| // Using placeholders allows for rebuilding the variable (and log message) through node | ||
| // position/ordering as the placeholders in the full match enable you to know how many subsequent | ||
| // nodes are captures. Otherwise we either need to specially type the node (e.g. VarString and | ||
| // CaptureString) or do some other indexing/checking. | ||
| // | ||
| // ... Storing as a separate type seems the most sensible. | ||
| // There is a possible trade-off where de-duplicating the capture group string using placeholders in | ||
| // the full match node improves compression ration, but slows down search in certain cases as you | ||
| // maybe need to rebuild the full match's value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely all questions we should have answers for before this makes it into a proper non-experimental release, but seems ok for this prototype. I am a bit worried the full match/capture groups duplication will make it a bit harder to get a baseline for how these changes affect compression ratio though.
| logtype_dict_entry.add_schema_var(); | ||
| m_current_parsed_message.add_unordered_value( | ||
| ParsedMessage::TypedVar{std::string{cFullMatch}, token_view.to_string()} | ||
| ); | ||
| m_current_schema.insert_unordered( | ||
| m_archive_writer->add_node(capture_node_id, NodeType::TypedVar, cFullMatch) | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this works, but its a little bit strange from the perspective of the current implementation.
Issue here is that the schema is supposed to be a set of leaf nodes that dominate some subtree and imply the full structure of the record. Here we're storing the full match and all of the captures as siblings, but arguably the full match is the parent of the captures. That said, we definitely don't support having a schema tree node and its children be part of the same schema, so the way you're storing this is probably fine. I guess the "conventional" thing to do would be to just store the captures, but that requires some thought.
Should be fine for now since this is a prototype, but its something we need to think through.
| capture_view.set_start_pos(start_positions[0]); | ||
| capture_view.set_end_pos(end_positions[0]); | ||
|
|
||
| logtype_dict_entry.add_schema_var(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does the logtype end up looking like? It seems like we add a schema var for the full match and all of the captures, which seems weird.
… non-prototype NodeTypes are added.
Description
This draft PR is meant to track the gap between the prototype branch and main.
Possible small PRs to split out:
Potentially, the stat features can be split out into a separate PR. In this case
NodeType::TypedVarwould need to be reverted temporarily as it only exists to propagate a variable's type name.Some other cleanup is still necessary (e.g. TODOs removed or converted to issues).
Checklist
breaking change.
Validation performed