-
Notifications
You must be signed in to change notification settings - Fork 84
feat(log-converter): Add log-converter binary which converts unstructured text logs into KV-IR; Update log-surgeon to yscope/log-surgeon@840f262. #1460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds a new log-converter executable and its sources (CLI, converter, serializer, runtime entry); makes FileWriter movable-only; updates CMake, packaging, dependency metadata, docs, and integration-test configuration to include the new binary and its build/runtime outputs. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant Main as log-converter (main)
participant CLI as CommandLineArguments
participant Converter as LogConverter
participant Reader as ReaderInterface
participant Serializer as LogSerializer
participant Writer as FileWriter
User->>Main: invoke log-converter
Main->>CLI: parse_arguments(argc, argv)
CLI-->>Main: ParsingResult (Success / InfoCommand / Failure)
alt Success
Main->>Converter: instantiate
loop per input path
Main->>Reader: open reader for path
Reader-->>Main: reader instance / error
alt reader OK
Main->>Converter: convert_file(path, reader, out_dir)
loop read/parse cycle
Converter->>Reader: read into buffer
Reader-->>Converter: bytes / EOF / error
Converter->>Converter: parse events
alt event has timestamp
Converter->>Serializer: add_message(timestamp, message)
else
Converter->>Serializer: add_message(message)
end
Serializer->>Writer: write / flush as needed
end
Converter->>Serializer: close()
else reader error
Main->>Main: log and continue/abort
end
end
else InfoCommand
Main->>User: print usage/info
else Failure
Main->>User: print error and exit non-zero
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used📓 Path-based instructions (1)**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}⚙️ CodeRabbit configuration file
Files:
🧬 Code graph analysis (1)components/core/src/clp_s/log_converter/LogSerializer.hpp (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
taskfile.yaml (1)
155-164: Add log-converter to the rsync list to package the built binary.The verification confirms the issue: log-converter is built (listed in the
generatessection at line 216) but is not included in the rsync command at lines 155-164. The suggested fix is correct—add the binary to the rsync list alongside other core components.rsync -a "{{.G_CORE_COMPONENT_BUILD_DIR}}/clg" "{{.G_CORE_COMPONENT_BUILD_DIR}}/clo" "{{.G_CORE_COMPONENT_BUILD_DIR}}/clp" "{{.G_CORE_COMPONENT_BUILD_DIR}}/clp-s" "{{.G_CORE_COMPONENT_BUILD_DIR}}/indexer" + "{{.G_CORE_COMPONENT_BUILD_DIR}}/log-converter" "{{.G_CORE_COMPONENT_BUILD_DIR}}/reducer-server" "{{.G_SPIDER_BUILD_DIR}}/spider-build/src/spider/spider_scheduler" "{{.G_SPIDER_BUILD_DIR}}/spider-build/src/spider/spider_worker" "{{.OUTPUT_DIR}}/bin/"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (11)
components/core/src/clp_s/CMakeLists.txt(1 hunks)components/core/src/clp_s/FileWriter.hpp(1 hunks)components/core/src/clp_s/log_converter/CMakeLists.txt(1 hunks)components/core/src/clp_s/log_converter/CommandLineArguments.cpp(1 hunks)components/core/src/clp_s/log_converter/CommandLineArguments.hpp(1 hunks)components/core/src/clp_s/log_converter/LogConverter.cpp(1 hunks)components/core/src/clp_s/log_converter/LogConverter.hpp(1 hunks)components/core/src/clp_s/log_converter/LogSerializer.cpp(1 hunks)components/core/src/clp_s/log_converter/LogSerializer.hpp(1 hunks)components/core/src/clp_s/log_converter/log_converter.cpp(1 hunks)taskfile.yaml(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/log_converter/LogConverter.hppcomponents/core/src/clp_s/log_converter/CommandLineArguments.hppcomponents/core/src/clp_s/log_converter/log_converter.cppcomponents/core/src/clp_s/log_converter/LogConverter.cppcomponents/core/src/clp_s/FileWriter.hppcomponents/core/src/clp_s/log_converter/LogSerializer.cppcomponents/core/src/clp_s/log_converter/LogSerializer.hppcomponents/core/src/clp_s/log_converter/CommandLineArguments.cpp
🧬 Code graph analysis (6)
components/core/src/clp_s/log_converter/LogConverter.hpp (1)
components/core/src/clp_s/log_converter/LogSerializer.hpp (1)
output_dir(40-41)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (2)
components/core/src/clp_s/log_converter/CommandLineArguments.cpp (1)
nodiscard(36-69)components/core/src/clp_s/CommandLineArguments.hpp (1)
m_network_auth(51-51)
components/core/src/clp_s/log_converter/log_converter.cpp (2)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (2)
CommandLineArguments(22-22)CommandLineArguments(22-22)components/core/src/clp_s/log_converter/LogConverter.hpp (3)
LogConverter(21-21)path(31-35)reader(49-49)
components/core/src/clp_s/log_converter/LogConverter.cpp (3)
components/core/src/clp_s/log_converter/LogConverter.hpp (2)
reader(49-49)path(31-35)components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
output_dir(40-41)message(56-56)timestamp(49-49)components/core/src/clp_s/log_converter/LogSerializer.cpp (2)
create(22-45)create(22-23)
components/core/src/clp_s/log_converter/LogSerializer.cpp (1)
components/core/src/clp_s/log_converter/LogSerializer.hpp (4)
output_dir(40-41)timestamp(49-49)message(56-56)flush_buffer(61-65)
components/core/src/clp_s/log_converter/CommandLineArguments.cpp (1)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (5)
nodiscard(27-27)nodiscard(29-29)nodiscard(31-33)nodiscard(35-35)argc(25-25)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: rust-checks (macos-15)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (24)
taskfile.yaml (1)
216-224: Core build integration looks good.Target added to generates and TARGETS. No further action here.
components/core/src/clp_s/CMakeLists.txt (1)
2-2: Add subdirectory: OK.Subcomponent hook-up is correct.
components/core/src/clp_s/log_converter/CommandLineArguments.cpp (1)
180-189: Auth validation and CLI flow look solid.Parsing, help handling, inputs merging, and auth validation are correct.
If S3 is frequently used, consider surfacing missing AWS env vars with a specific error message before conversion starts.
components/core/src/clp_s/log_converter/LogConverter.hpp (3)
1-13: LGTM!The include guards and headers are properly structured and include the necessary dependencies for the class.
14-36: LGTM!The class design and public API are well-structured with clear documentation and appropriate return types.
37-54: LGTM!The private implementation details are well-designed with appropriate buffer sizing, error handling via
std::optional, and proper member initialization.components/core/src/clp_s/log_converter/LogSerializer.hpp (4)
1-14: LGTM!Headers and include guards are properly structured for the serialization component.
19-32: LGTM!The special member functions correctly implement move-only semantics, which is appropriate for a class managing file resources.
33-66: LGTM!The public API is well-designed with a factory pattern for safe initialization and clear methods for message serialization. The inline
close()implementation is appropriate for its simplicity.
67-97: LGTM!The private implementation is well-structured with appropriate constants, move semantics, and buffer management.
components/core/src/clp_s/log_converter/LogSerializer.cpp (4)
1-20: LGTM!All necessary headers are included for the implementation.
22-45: LGTM!The factory method correctly handles initialization failures by returning
std::nullopt. The UUID-based filename generation ensures unique output files, and the broad exception handling is appropriate for a factory pattern.
47-67: LGTM!The method correctly constructs MsgPack objects, serializes them, and manages the buffer. The
string_viewparameters are safely used since serialization occurs immediately within the function scope.
69-83: LGTM!The non-timestamped message serialization follows the same safe pattern as the timestamped version with appropriate buffer management.
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (3)
1-10: LGTM!Headers and include guards are properly structured.
11-36: LGTM!The public API is well-designed with explicit constructor,
[[nodiscard]]attributes for safety, and const-correct getters.
37-46: LGTM!Private members are appropriately organized with sensible default values.
components/core/src/clp_s/log_converter/LogConverter.cpp (7)
1-19: LGTM!All necessary headers are included for the implementation.
20-36: LGTM!The timestamp schema is appropriately documented as non-exhaustive and covers common timestamp patterns. The TODO comment for future unicode support is noted.
38-79: LGTM!The buffer management logic is well-implemented with proper handling of overlapping memory regions, exponential growth with an upper bound, and appropriate error handling.
81-96: LGTM!The initialization sequence correctly sets up the schema, parser, and serializer with appropriate error handling.
141-144: LGTM!Proper cleanup and success return handling.
117-119: Consider refactoring for consistency with coding guidelines.While enum comparisons using
!=are acceptable, the coding guidelines preferfalse == <expression>style. Consider refactoring for consistency:As per coding guidelines.
- if (log_surgeon::ErrorCode::Success != err) { + if (false == (log_surgeon::ErrorCode::Success == err)) { return false; }Alternatively, you could keep the current form as enum inequality checks are commonly accepted and readable.
Likely an incorrect or invalid review comment.
124-131: Timestamp position assumption is verified as correct—log-surgeon's schema documentation guarantees timestamps appear at message start.The log-surgeon unstructured schema specification explicitly states "Timestamps must appear at the start of the message," which validates the code's assumption. The
substr(timestamp.length())operation at line 127 is safe and correct given this API contract.The code requires no changes. Optionally, adding a comment referencing the schema guarantee would improve clarity for future maintainers.
| std::error_code ec{}; | ||
| if (false == std::filesystem::create_directory(command_line_arguments.get_output_dir(), ec) | ||
| && ec) | ||
| { | ||
| SPDLOG_ERROR( | ||
| "Can not create output directory {} - {}", | ||
| command_line_arguments.get_output_dir(), | ||
| ec.message() | ||
| ); | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create nested output dirs and validate path type.
create_directory fails for nested paths and doesn’t catch “path exists but not a dir”. Use create_directories and add an exists/is_directory check.
- std::error_code ec{};
- if (false == std::filesystem::create_directory(command_line_arguments.get_output_dir(), ec)
- && ec)
- {
- SPDLOG_ERROR(
- "Can not create output directory {} - {}",
- command_line_arguments.get_output_dir(),
- ec.message()
- );
- return false;
- }
+ std::error_code ec{};
+ auto const& outdir = command_line_arguments.get_output_dir();
+ if (false == std::filesystem::exists(outdir, ec)) {
+ ec.clear();
+ if (false == std::filesystem::create_directories(outdir, ec) && ec) {
+ SPDLOG_ERROR("Cannot create output directory {} - {}", outdir, ec.message());
+ return false;
+ }
+ } else if (false == std::filesystem::is_directory(outdir, ec)) {
+ SPDLOG_ERROR("Output path exists but is not a directory: {}", outdir);
+ return false;
+ }🤖 Prompt for AI Agents
In components/core/src/clp_s/log_converter/log_converter.cpp around lines 67 to
77, replace std::filesystem::create_directory with
std::filesystem::create_directories to handle nested paths, and before
attempting creation check if the path exists and is a directory (using exists()
and is_directory()); if the path exists but is not a directory log an error and
return false; when calling create_directories, still pass an error_code, check
for failure and log ec.message() as before. Ensure the logic covers: if path
exists and is_directory -> proceed, if path exists and not directory -> error,
else try create_directories and on error log and return false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/log_converter/LogConverter.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/log_converter/LogConverter.cpp
🧬 Code graph analysis (1)
components/core/src/clp_s/log_converter/LogConverter.cpp (3)
components/core/src/clp_s/log_converter/LogConverter.hpp (2)
reader(49-49)path(31-35)components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
output_dir(40-41)message(56-56)timestamp(49-49)components/core/src/clp_s/log_converter/LogSerializer.cpp (2)
create(22-45)create(22-23)
🔇 Additional comments (3)
components/core/src/clp_s/log_converter/LogConverter.cpp (3)
38-79: Buffer management implementation looks solid.The refill logic correctly uses
memmovefor overlapping memory when shifting consumed data,memcpyfor the non-overlapping buffer growth scenario, and enforces a maximum buffer size with appropriate error handling.
86-96: Schema setup and serializer initialization are correct.The log_surgeon schema configuration and serializer creation follow the expected pattern, with proper error handling using the codebase's style guidelines.
125-135: Verify timestamp extraction logic with log_surgeon's event representation.The code extracts the timestamp string view from the event (line 126), then substrings the full message by the timestamp's length to obtain the remainder (lines 127-129). This assumes that:
- The timestamp always appears at the beginning of the message string returned by
event.to_string()- The length of the timestamp string view corresponds exactly to its position in the message
While this may be guaranteed by log_surgeon's parsing behaviour, please verify that
event.to_string()andevent.get_timestamp()->to_string_view()have this relationship. If the timestamp can appear elsewhere or have different formatting in the message string, this logic would produce incorrect results.Consider generating a test case that validates this assumption, or add a runtime check:
if (nullptr != event.get_timestamp()) { auto timestamp{event.get_timestamp()->to_string_view()}; std::string_view message_view{message}; // Verify timestamp is at the beginning if (message_view.starts_with(timestamp)) { auto message_without_timestamp{message_view.substr(timestamp.length())}; message_serialized_successfully = serializer.add_message(timestamp, message_without_timestamp); } else { // Fallback or error handling message_serialized_successfully = serializer.add_message(message); } }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (3)
components/core/src/clp_s/FileWriter.hpp (2)
38-40: Mark move constructornoexceptfor container compatibility.The move constructor should be marked
noexceptto enable optimizations in standard containers (e.g.,std::vectorwill only use move operations during reallocation if they arenoexcept). This issue was flagged in the previous review but remains unaddressed.Apply this diff:
- FileWriter(FileWriter&& writer) + FileWriter(FileWriter&& writer) noexcept : m_file{std::exchange(writer.m_file, nullptr)}, m_fd{std::exchange(writer.m_fd, -1)} {}
42-49: Critical resource leak: Current file handle not closed before move assignment.The move assignment operator still has two issues from the previous review that remain unaddressed:
- Resource leak (CRITICAL): If
m_fileis already open when move assignment occurs, the current file handle is overwritten without being closed, causing a file descriptor andFILE*leak.- Missing
noexcept: The move assignment should be markednoexceptfor container compatibility.Apply this diff to fix both issues:
- auto operator=(FileWriter&& writer) -> FileWriter& { + auto operator=(FileWriter&& writer) noexcept -> FileWriter& { if (this == &writer) { return *this; } + if (nullptr != m_file) { + // Best-effort close to avoid leak; ignore errors in move. + std::fclose(m_file); + } m_file = std::exchange(writer.m_file, nullptr); m_fd = std::exchange(writer.m_fd, -1); return *this; }components/core/src/clp_s/log_converter/log_converter.cpp (1)
67-77: Create nested output dirs and validate path type.Use create_directories and handle “exists but not a directory”. Also fix wording to “Cannot”. [duplicate of a prior review; still unresolved]
- std::error_code ec{}; - if (false == std::filesystem::create_directory(command_line_arguments.get_output_dir(), ec) - && ec) - { - SPDLOG_ERROR( - "Can not create output directory {} - {}", - command_line_arguments.get_output_dir(), - ec.message() - ); - return false; - } + std::error_code ec{}; + auto const& outdir = command_line_arguments.get_output_dir(); + if (false == std::filesystem::exists(outdir, ec)) { + ec.clear(); + if (false == std::filesystem::create_directories(outdir, ec) && ec) { + SPDLOG_ERROR("Cannot create output directory {} - {}", outdir, ec.message()); + return false; + } + } else if (false == std::filesystem::is_directory(outdir, ec)) { + SPDLOG_ERROR("Output path exists but is not a directory: {}", outdir); + return false; + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
components/core/src/clp_s/FileWriter.hpp(2 hunks)components/core/src/clp_s/log_converter/CMakeLists.txt(1 hunks)components/core/src/clp_s/log_converter/LogConverter.cpp(1 hunks)components/core/src/clp_s/log_converter/log_converter.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/FileWriter.hppcomponents/core/src/clp_s/log_converter/log_converter.cppcomponents/core/src/clp_s/log_converter/LogConverter.cpp
🧬 Code graph analysis (2)
components/core/src/clp_s/log_converter/log_converter.cpp (2)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (3)
CommandLineArguments(22-22)CommandLineArguments(22-22)argc(25-25)components/core/src/clp_s/log_converter/LogConverter.hpp (3)
LogConverter(21-21)path(31-35)reader(49-49)
components/core/src/clp_s/log_converter/LogConverter.cpp (4)
components/core/src/clp_s/log_converter/LogConverter.hpp (2)
reader(49-49)path(31-35)components/core/src/clp_s/FileWriter.hpp (1)
path(127-127)components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
output_dir(40-41)message(56-56)timestamp(49-49)components/core/src/clp_s/log_converter/LogSerializer.cpp (2)
create(22-45)create(22-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: build (ubuntu-24.04)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
components/core/src/clp_s/FileWriter.hpp (2)
8-8: LGTM: Required include for std::exchange.The
<utility>header is correctly added to supportstd::exchangeused in the move semantics implementation.
34-35: LGTM: Correctly disables copy semantics.Deleting the copy constructor and assignment operator is appropriate for a move-only resource-managing class.
components/core/src/clp_s/log_converter/CMakeLists.txt (1)
18-30: Link set looks correct; duplicate removed.fmt and clp_s::clp_dependencies are present; target properties are sane. LGTM.
components/core/src/clp_s/log_converter/log_converter.cpp (1)
101-106: Review comment is incorrect — nested_readers is guaranteed to be non-empty for FileType::LogText.The code analysis shows that
nested_readerswill always have at least one element whenFileType::LogTextis returned. Intry_deduce_reader_type(InputConfig.cpp line 369), thereadersvector is initialized with the original reader, then at line 380,buffered_readeris added to the vector before the switch statement evaluates the type. When the code reaches theFileType::LogTextcase (line 384), it immediately returns{std::move(readers), type}afterreadershas been populated with both the original reader and the buffered reader. Therefore, calling.back()onnested_readersat log_converter.cpp:104 is safe and will never access an empty container.The suggested diff is unnecessary.
Likely an incorrect or invalid review comment.
| std::ignore = check_and_log_curl_error(path, reader); | ||
| SPDLOG_ERROR("Received input that was not unstructured logtext: {}.", path.path); | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick | 🔵 Trivial
Avoid assigning to std::ignore; drop the dependency.
Use a void cast to discard the result. If you keep std::ignore, include .
- std::ignore = check_and_log_curl_error(path, reader);
+ (void)check_and_log_curl_error(path, reader);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| std::ignore = check_and_log_curl_error(path, reader); | |
| SPDLOG_ERROR("Received input that was not unstructured logtext: {}.", path.path); | |
| return false; | |
| } | |
| (void)check_and_log_curl_error(path, reader); | |
| SPDLOG_ERROR("Received input that was not unstructured logtext: {}.", path.path); | |
| return false; | |
| } |
🤖 Prompt for AI Agents
In components/core/src/clp_s/log_converter/log_converter.cpp around lines 95 to
98, the code assigns the result of check_and_log_curl_error to std::ignore which
is incorrect and forces a dependency on <tuple>; replace the assignment with a
void cast: (void)check_and_log_curl_error(path, reader); and remove the
unnecessary #include <tuple> (or alternatively, if you choose to keep
std::ignore, add #include <tuple>), ensuring no assignment to std::ignore
remains.
| if (log_surgeon::ErrorCode::BufferOutOfBounds == err) { | ||
| break; | ||
| } | ||
| if (log_surgeon::ErrorCode::Success != err) { | ||
| return false; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick | 🔵 Trivial
Add minimal error logging to aid diagnosis.
Log parser errors before returning to ease debugging in field runs.
- if (log_surgeon::ErrorCode::Success != err) {
- return false;
- }
+ if (log_surgeon::ErrorCode::Success != err) {
+ SPDLOG_ERROR("Parser error {} while processing {}", static_cast<int>(err), path.path);
+ return false;
+ }Want me to push the logging tweak across other early-return paths too?
🤖 Prompt for AI Agents
In components/core/src/clp_s/log_converter/LogConverter.cpp around lines
115-121, the code returns false on non-success parse errors without any logging;
add a minimal log statement that records the error code (and a brief context
like "parse error in LogConverter") immediately before returning false so field
diagnostics capture which ErrorCode occurred; use the existing logging facility
used elsewhere in the file (same logger and log level) and keep the message
concise; also apply the same minimal logging pattern to other early-return paths
in this function to aid debugging.
LinZhihao-723
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a brief look. Still waiting for Bill to make a full review.
| namespace clp_s::log_converter { | ||
| auto LogSerializer::create(std::string_view output_dir, std::string_view original_file_path) | ||
| -> std::optional<LogSerializer> { | ||
| nlohmann::json metadata{{cOriginalFileMetadataKey, original_file_path}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid {} initializer for nlohmann::json. It has many weird behavior...
| .key = msgpack::object{cTimestampKey}, | ||
| .val = msgpack::object{timestamp} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kirkrodrigues shall we add a inner layer like what we did in clp-loglib-py? Just in case we need to add timezone offset support in future release...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While Devin was designing the new timestamp parsing for clp-s, he did mention that the inner layer thing has some issues with another feature we want to implement in clp-s.
Specifically, if the user compressed different datasets, each with a different authoritative timestamp key (e.g., ts_key_1 and ts_key_2), currently, the user can't reliably query for logs across both datasets in a certain time range. Yes, the user could do ts_key_1: ... OR ts_key_2: ..., but that's cumbersome and ts_key_2 might exist in both datasets, but might only be authoritative in one.
Ideally, we want to have some mapping from a special key to either of those timestamp keys. This would allow a user to query their logs with something like ": ..." regardless of what the authoritative timestamp key was in their original logs.
However, if the authoritative timestamp key was an object in the original logs, this becomes more difficult to do properly. In contrast, with the new parsed timestamp format in clp-s, we should be able to store the timezone as part of the format.
So tl;dr, we might want to rethink the way the clp-logging Python library writes timestamps to KV-IR streams. Sorry, I didn't realize the issues sooner.
| * Creates an instance of LogSerializer. | ||
| * @param output_dir The destination directory for generated KV-IR. | ||
| * @param original_file_path The original path for the file being converted to KV-IR. | ||
| * @return An instance of LogSerializer on success, or std::nullopt on failure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the current Serializer isn't designed to return error code with Result properly, but can we apply Result in this class instead as we will plan to integrate Result to the serializer? It will make this class naturally compatible after migration.
Bill-hbrhbr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comments:
- I think it's sensible to add
[[nodiscard]]to any function that expects caller to check/use its return value - Out-of-range fixes:
log-convertershould be added to thesourcessections oftask:packageas well as itsrsynccmds - I feel like
std::optional<T>should be replaced withystdlib::error_handling::Result<T>and some bools should be replaced withResult<void> - We could add integration tests for log-converter. shouldn't be too hard.
Due to time constraint, you can skip 3/4. Just writing them down now in case we forget in the future.
| if (m_cur_offset > 0) { | ||
| // NOLINTBEGIN(cppcoreguidelines-pro-bounds-pointer-arithmetic) | ||
| std::memmove( | ||
| m_buffer.data(), | ||
| m_buffer.data() + m_cur_offset, | ||
| m_bytes_occupied - m_cur_offset | ||
| ); | ||
| // NOLINTEND(cppcoreguidelines-pro-bounds-pointer-arithmetic) | ||
| m_bytes_occupied -= m_cur_offset; | ||
| m_cur_offset = 0; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like this could be its own helper function with a very self-explanatory title. I know this is compacting the buffer.
| if (m_buffer.size() == m_bytes_occupied) { | ||
| size_t const new_size{2 * m_buffer.size()}; | ||
| if (new_size > cMaxBufferSize) { | ||
| return std::nullopt; | ||
| } | ||
| ystdlib::containers::Array<char> new_buffer(new_size); | ||
| std::memcpy(new_buffer.data(), m_buffer.data(), m_bytes_occupied); | ||
| m_buffer = std::move(new_buffer); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This too can be its own helper, although I question the likelihood of the buffer being full when we've just parsed a new log and wants to refill. Why don't we just set this buffer to its maximum size and never re-allocate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case where this happens is when a single log event is larger than the buffer size. Growing in this way allows us to use less memory for datasets that don't have really long log messages, but adapt to datasets that do. That could potentially end up mattering if this ever needs to run in a resource-constrained environment.
But yeah, I can move it to a helper.
| } | ||
| } | ||
|
|
||
| serializer.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be in serializer's destructor upon variable scope exit.
Bill-hbrhbr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, approving because I won't be awake to re-approve
Co-authored-by: Bingran Hu <[email protected]>
|
Also, please update
Maybe do a grep throughout the repo to find the remaining places |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (5)
components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
42-43: Addnoexceptto move assignment for consistency.The move constructor is marked
noexcept(line 42), but the move assignment operator (line 43) is not. For consistency and to ensure optimal performance in STL containers, both should benoexcept.Apply this diff:
LogSerializer(LogSerializer&&) noexcept = default; -[[nodiscard]] auto operator=(LogSerializer&&) -> LogSerializer& = default; +[[nodiscard]] auto operator=(LogSerializer&&) noexcept -> LogSerializer& = default;
98-105: Skip zero-length writes inflush_buffer().The method currently calls
m_writer.write()even when the buffer is empty, resulting in an unnecessary system call.Apply this diff:
void flush_buffer() { auto const buffer{m_serializer.get_ir_buf_view()}; + if (0U == buffer.size_bytes()) { + return; + } m_writer.write( clp::size_checked_pointer_cast<char const>(buffer.data()), buffer.size_bytes() ); m_serializer.clear_ir_buf(); }
73-77: Makeclose()idempotent to prevent double-close issues.Calling
close()multiple times will write the EOF marker repeatedly and attempt to close the writer multiple times, which could lead to errors. Add a guard flag to make the method idempotent.Apply this diff to add a guard:
void close() { - flush_buffer(); - m_writer.write_numeric_value(clp::ffi::ir_stream::cProtocol::Eof); - m_writer.close(); + if (false == m_closed) { + flush_buffer(); + m_writer.write_numeric_value(clp::ffi::ir_stream::cProtocol::Eof); + m_writer.close(); + m_closed = true; + } }And add the member variable in the private section:
clp::ffi::ir_stream::Serializer<clp::ir::eight_byte_encoded_variable_t> m_serializer; clp_s::FileWriter m_writer; + bool m_closed{false};components/core/src/clp_s/log_converter/LogConverter.cpp (2)
56-58: Use semantically appropriate error code for I/O failures.Returning
std::errc::not_enough_memoryfor non-EOF read errors is misleading. The actual error could be a network failure, permission issue, disk error, or other I/O problem unrelated to memory.Apply this diff:
if (clp::ErrorCode_Success != rc) { - return std::errc::not_enough_memory; + return std::errc::io_error; }
128-132: Timestamp removal incorrectly assumes prefix position.The code assumes the timestamp is at the beginning of the message by using
substr(timestamp.length()). If the timestamp appears elsewhere in the message, this will incorrectly remove the wrong portion of the message.Apply this diff to find and remove the timestamp correctly:
if (nullptr != event.get_timestamp()) { auto const timestamp{event.get_timestamp()->to_string_view()}; - auto const message_without_timestamp{ - std::string_view{message}.substr(timestamp.length()) - }; + auto message_view{std::string_view{message}}; + auto const pos{message_view.find(timestamp)}; + std::string_view message_without_timestamp{}; + if (std::string_view::npos != pos) { + message_without_timestamp = message_view.substr(pos + timestamp.length()); + } else { + message_without_timestamp = message_view; + } YSTDLIB_ERROR_HANDLING_TRYV( serializer.add_message(timestamp, message_without_timestamp) );
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/core/src/clp_s/log_converter/LogConverter.cpp(1 hunks)components/core/src/clp_s/log_converter/LogConverter.hpp(1 hunks)components/core/src/clp_s/log_converter/LogSerializer.hpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/log_converter/LogSerializer.hppcomponents/core/src/clp_s/log_converter/LogConverter.cppcomponents/core/src/clp_s/log_converter/LogConverter.hpp
🧬 Code graph analysis (3)
components/core/src/clp_s/log_converter/LogSerializer.hpp (1)
components/core/src/clp_s/FileWriter.hpp (2)
writer(42-49)writer(42-42)
components/core/src/clp_s/log_converter/LogConverter.cpp (3)
components/core/src/clp_s/log_converter/LogConverter.hpp (2)
reader(56-57)path(35-39)components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
output_dir(33-34)message(67-68)timestamp(57-58)components/core/src/clp_s/log_converter/LogSerializer.cpp (2)
create(28-49)create(28-29)
components/core/src/clp_s/log_converter/LogConverter.hpp (1)
components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
nodiscard(39-39)nodiscard(43-43)output_dir(33-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: package-image
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: rust-checks (macos-15)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
♻️ Duplicate comments (4)
components/core/src/clp_s/log_converter/log_converter.cpp (2)
67-77: Create nested output dirs and validate path type.Handle pre-existing paths and multi-level dirs; error if path exists but isn’t a dir.
- std::error_code ec{}; - if (false == std::filesystem::create_directory(command_line_arguments.get_output_dir(), ec) - && ec) - { - SPDLOG_ERROR( - "Can not create output directory {} - {}", - command_line_arguments.get_output_dir(), - ec.message() - ); - return false; - } + std::error_code ec{}; + auto const& outdir = command_line_arguments.get_output_dir(); + if (false == std::filesystem::exists(outdir, ec)) { + ec.clear(); + if (false == std::filesystem::create_directories(outdir, ec) && ec) { + SPDLOG_ERROR("Cannot create output directory {} - {}", outdir, ec.message()); + return false; + } + } else if (false == std::filesystem::is_directory(outdir, ec)) { + SPDLOG_ERROR("Output path exists but is not a directory: {}", outdir); + return false; + }
95-98: Don’t assign to std::ignore; drop the dependency.Discard the result with a void cast.
- std::ignore = check_and_log_curl_error(path, reader); + (void)check_and_log_curl_error(path, reader);components/core/src/clp_s/log_converter/LogConverter.cpp (2)
94-109: Map read failures to I/O, not memory.Use a semantically correct errc for non-EOF failures.
- m_bytes_occupied += num_bytes_read; + m_num_bytes_buffered += num_bytes_read; if (clp::ErrorCode_EndOfFile == rc) { return num_bytes_read; } if (clp::ErrorCode_Success != rc) { - return std::errc::not_enough_memory; + return std::errc::io_error; }
70-82: Timestamp removal assumes prefix; fix position + trim delimiters.Find the timestamp within the message and trim leading separators from the remainder.
- auto const& event{parser.get_log_parser().get_log_event_view()}; - auto const message{event.to_string()}; - if (nullptr != event.get_timestamp()) { - auto const timestamp{event.get_timestamp()->to_string_view()}; - auto const message_without_timestamp{ - std::string_view{message}.substr(timestamp.length()) - }; - YSTDLIB_ERROR_HANDLING_TRYV( - serializer.add_message(timestamp, message_without_timestamp) - ); - } else { - YSTDLIB_ERROR_HANDLING_TRYV(serializer.add_message(message)); - } + auto const& event{parser.get_log_parser().get_log_event_view()}; + auto const message{event.to_string()}; + if (nullptr != event.get_timestamp()) { + auto const timestamp{event.get_timestamp()->to_string_view()}; + auto full{std::string_view{message}}; + auto const pos{full.find(timestamp)}; + std::string_view remainder{full}; + if (std::string_view::npos != pos) { + remainder = full.substr(pos + timestamp.length()); + } + auto const is_delim = [](char c) { + switch (c) { + case ' ': case '\t': case '\r': case '\n': case '[': case '(': case ':': + return true; + default: + return false; + } + }; + while (false == remainder.empty() && true == is_delim(remainder.front())) { + remainder.remove_prefix(1); + } + YSTDLIB_ERROR_HANDLING_TRYV(serializer.add_message(timestamp, remainder)); + } else { + YSTDLIB_ERROR_HANDLING_TRYV(serializer.add_message(message)); + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/core/src/clp_s/log_converter/LogConverter.cpp(1 hunks)components/core/src/clp_s/log_converter/LogConverter.hpp(1 hunks)components/core/src/clp_s/log_converter/log_converter.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/log_converter/log_converter.cppcomponents/core/src/clp_s/log_converter/LogConverter.hppcomponents/core/src/clp_s/log_converter/LogConverter.cpp
🧬 Code graph analysis (3)
components/core/src/clp_s/log_converter/log_converter.cpp (2)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (3)
CommandLineArguments(22-22)CommandLineArguments(22-22)argc(25-25)components/core/src/clp_s/log_converter/LogConverter.hpp (3)
LogConverter(20-20)path(35-36)reader(53-54)
components/core/src/clp_s/log_converter/LogConverter.hpp (1)
components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
nodiscard(39-39)nodiscard(43-43)output_dir(33-34)
components/core/src/clp_s/log_converter/LogConverter.cpp (4)
components/core/src/clp_s/log_converter/LogConverter.hpp (2)
path(35-36)reader(53-54)components/core/src/clp_s/FileWriter.hpp (1)
path(127-127)components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
output_dir(33-34)message(67-68)timestamp(57-58)components/core/src/clp_s/log_converter/LogSerializer.cpp (2)
create(28-49)create(28-29)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: package-image
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: rust-checks (macos-15)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: antlr-code-committed (macos-15)
🔇 Additional comments (1)
components/core/src/clp_s/log_converter/log_converter.cpp (1)
100-106: The review comment is incorrect.The concern about undefined behaviour from calling
nested_readers.back()on an empty stack does not exist. The implementation oftry_deduce_reader_typeestablishes a contract: when it returnsFileType::LogText, the readers vector is guaranteed to contain at least two elements (the original reader at line 369, plus the buffered reader added at line 380 before the return at line 385).The calling code in log_converter.cpp only reaches the
.back()call whenfile_type == LogText(the other cases return early at lines 90–99). Therefore, the precondition for.back()is always satisfied.Likely an incorrect or invalid review comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (5)
components/core/src/clp_s/log_converter/LogConverter.cpp (5)
66-69: Log parser errors before returning (keeps field diagnostics actionable).Add a minimal error log with the code and path, then return as you do now. Also add the spdlog include (see separate comment).
- if (log_surgeon::ErrorCode::Success != err) { - return std::errc::no_message; - } + if (log_surgeon::ErrorCode::Success != err) { + SPDLOG_ERROR( + "log-converter parse error {} while processing {}", + static_cast<int>(err), + path.path + ); + return std::errc::no_message; + }
70-80: Do not assume the timestamp is at index 0; find and trim delimiters.Current substr(timestamp.length()) corrupts messages when the timestamp isn’t a prefix. Find the timestamp within the message and trim leading separators before serializing.
- if (nullptr != event.get_timestamp()) { - auto const timestamp{event.get_timestamp()->to_string_view()}; - auto const message_without_timestamp{ - std::string_view{message}.substr(timestamp.length()) - }; - YSTDLIB_ERROR_HANDLING_TRYV( - serializer.add_message(timestamp, message_without_timestamp) - ); - } else { + if (nullptr != event.get_timestamp()) { + auto const timestamp{event.get_timestamp()->to_string_view()}; + auto const full{std::string_view{message}}; + auto const pos{full.find(timestamp)}; + std::string_view remainder{full}; + if (std::string_view::npos != pos) { + remainder = full.substr(pos + timestamp.length()); + } + // Trim leading delimiters used around timestamps. + auto const is_delim = [](char c) { + switch (c) { + case ' ': case '\t': case '\r': case '\n': case '[': case '(': case ':': + return true; + default: + return false; + } + }; + while (false == remainder.empty() + && true == is_delim(remainder.front())) { + remainder.remove_prefix(1); + } + YSTDLIB_ERROR_HANDLING_TRYV(serializer.add_message(timestamp, remainder)); + } else { YSTDLIB_ERROR_HANDLING_TRYV(serializer.add_message(message)); }
3-14: Add spdlog include if adopting error logs.Required for SPDLOG_* macros used above.
#include <string_view> #include <system_error> #include <utility> +#include <spdlog/spdlog.h>
17-17: Remove unused include.InputConfig.hpp isn’t referenced in this file.
-#include "../InputConfig.hpp"
106-108: Use an I/O error for non-EOF read failures.Mapping read failures to not_enough_memory is misleading; use io_error.
- if (clp::ErrorCode_Success != rc) { - return std::errc::not_enough_memory; - } + if (clp::ErrorCode_Success != rc) { + return std::errc::io_error; + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/log_converter/LogConverter.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/log_converter/LogConverter.cpp
🧠 Learnings (2)
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp_s/log_converter/LogConverter.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/log_converter/LogConverter.cpp
🧬 Code graph analysis (1)
components/core/src/clp_s/log_converter/LogConverter.cpp (3)
components/core/src/clp_s/log_converter/LogConverter.hpp (2)
path(35-36)reader(53-54)components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
output_dir(33-34)message(67-68)timestamp(57-58)components/core/src/clp_s/log_converter/LogSerializer.cpp (2)
create(28-49)create(28-29)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: package-image
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, true)
🔇 Additional comments (1)
components/core/src/clp_s/log_converter/LogConverter.cpp (1)
85-86: Handle close() errors (or switch to RAII).Currently close()’s result is ignored; failures will be lost. At minimum, propagate it. Ideally, make LogSerializer’s destructor call close() idempotently and remove the explicit call to avoid leaks on early returns.
[ suggest_essential_refactor ]
- serializer.close(); + YSTDLIB_ERROR_HANDLING_TRYV(serializer.close()); return ystdlib::error_handling::success();
LinZhihao-723
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed log_converter.
LinZhihao-723
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed commandline args.
components/core/src/clp_s/log_converter/CommandLineArguments.cpp
Outdated
Show resolved
Hide resolved
| /** | ||
| * Reads and returns a list of paths from a file containing newline-delimited paths. | ||
| * @param input_path_list_file_path Path to the file containing the list of paths. | ||
| * @param path_destination The vector that the paths are pushed into. | ||
| * @return Whether paths were read successfully or not. | ||
| */ | ||
| [[nodiscard]] auto read_paths_from_file( | ||
| std::string const& input_path_list_file_path, | ||
| std::vector<std::string>& path_destination | ||
| ) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should split the declaration and definition.
Co-authored-by: Lin Zhihao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (3)
components/core/src/clp_s/log_converter/log_converter.cpp (2)
61-71: Create nested output dirs and validate path type.Handle nested paths and “exists but not a directory”; switch to create_directories and add exists/is_directory checks. Also fix the log message grammar.
Apply this diff:
- std::error_code ec{}; - if (false == std::filesystem::create_directory(command_line_arguments.get_output_dir(), ec) - && ec) - { - SPDLOG_ERROR( - "Can not create output directory {} - {}", - command_line_arguments.get_output_dir(), - ec.message() - ); - return false; - } + std::error_code ec{}; + auto const& outdir = command_line_arguments.get_output_dir(); + if (false == std::filesystem::exists(outdir, ec)) { + ec.clear(); + if (false == std::filesystem::create_directories(outdir, ec) && ec) { + SPDLOG_ERROR("Cannot create output directory {} - {}", outdir, ec.message()); + return false; + } + } else if (false == std::filesystem::is_directory(outdir, ec)) { + SPDLOG_ERROR("Output path exists but is not a directory: {}", outdir); + return false; + }
89-92: Don’t assign to std::ignore; discard via void cast.Avoids relying on and is clearer when intentionally ignoring the result.
Apply this diff:
- std::ignore = check_and_log_curl_error(path, reader.get()); + (void)check_and_log_curl_error(path, reader.get());components/core/src/clp_s/log_converter/CommandLineArguments.cpp (1)
59-69: Trim CRLF in files-from list.Strip trailing ‘\r’ so Windows line endings don’t corrupt paths.
Apply this diff:
std::string line; while (true) { error_code = reader.try_read_to_delimiter('\n', false, false, line); if (ErrorCodeSuccess != error_code) { break; } - if (false == line.empty()) { - path_destination.push_back(line); - } + if (false == line.empty()) { + if ('\r' == line.back()) { + line.pop_back(); + } + if (false == line.empty()) { + path_destination.push_back(line); + } + } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp_s/log_converter/CommandLineArguments.cpp(1 hunks)components/core/src/clp_s/log_converter/log_converter.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/log_converter/log_converter.cppcomponents/core/src/clp_s/log_converter/CommandLineArguments.cpp
🧬 Code graph analysis (2)
components/core/src/clp_s/log_converter/log_converter.cpp (2)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (7)
CommandLineArguments(22-22)CommandLineArguments(22-22)nodiscard(27-27)nodiscard(29-29)nodiscard(31-33)nodiscard(35-35)argc(25-25)components/core/src/clp_s/log_converter/LogConverter.hpp (3)
LogConverter(20-20)path(35-36)reader(53-54)
components/core/src/clp_s/log_converter/CommandLineArguments.cpp (1)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (5)
nodiscard(27-27)nodiscard(29-29)nodiscard(31-33)nodiscard(35-35)argc(25-25)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: rust-checks (macos-15)
components/core/src/clp_s/log_converter/CommandLineArguments.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (4)
components/core/src/clp_s/log_converter/LogSerializer.hpp (3)
42-43: Addnoexceptto move assignment operator for consistency.The move constructor is marked
noexceptbut the move assignment operator is not. For consistency and to ensure exception safety, both should have matching exception specifications. Since the move constructor is defaulted andnoexcept, the move assignment should be too.Apply this diff:
LogSerializer(LogSerializer&&) noexcept = default; -[[nodiscard]] auto operator=(LogSerializer&&) -> LogSerializer& = default; +[[nodiscard]] auto operator=(LogSerializer&&) noexcept -> LogSerializer& = default;
77-81: Makeclose()idempotent to prevent double-close bugs.Currently, calling
close()multiple times will write multiple EOF markers and attempt to close the writer repeatedly, which can lead to corruption or runtime errors. Add a guard flag to ensure close operations happen only once.Apply this diff:
void close() { - flush_buffer(); - m_writer.write_numeric_value(clp::ffi::ir_stream::cProtocol::Eof); - m_writer.close(); + if (false == m_closed) { + flush_buffer(); + m_writer.write_numeric_value(clp::ffi::ir_stream::cProtocol::Eof); + m_writer.close(); + m_closed = true; + } }And add the flag to private members:
clp::ffi::ir_stream::Serializer<clp::ir::eight_byte_encoded_variable_t> m_serializer; clp_s::FileWriter m_writer; +bool m_closed{false};
102-109: Skip zero-length writes to avoid unnecessary syscalls.The current implementation always calls
m_writer.write()even when the buffer is empty. Add an early return when there's no data to flush.Apply this diff:
void flush_buffer() { auto const buffer{m_serializer.get_ir_buf_view()}; + if (0U == buffer.size_bytes()) { + return; + } m_writer.write( clp::size_checked_pointer_cast<char const>(buffer.data()), buffer.size_bytes() ); m_serializer.clear_ir_buf(); }components/core/src/clp_s/log_converter/CommandLineArguments.cpp (1)
67-76: Handle CRLF line endings to prevent path corruption.The existing past review comment remains unaddressed. When reading paths from files with Windows line endings (CRLF), the trailing
\rcharacter will corrupt the paths.Apply this diff to trim trailing
\r:std::string line; while (true) { error_code = reader.try_read_to_delimiter('\n', false, false, line); if (ErrorCodeSuccess != error_code) { break; } if (false == line.empty()) { + if ('\r' == line.back()) { + line.pop_back(); + } path_destination.push_back(line); } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp_s/log_converter/CommandLineArguments.cpp(1 hunks)components/core/src/clp_s/log_converter/LogSerializer.hpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/log_converter/LogSerializer.hppcomponents/core/src/clp_s/log_converter/CommandLineArguments.cpp
🧬 Code graph analysis (2)
components/core/src/clp_s/log_converter/LogSerializer.hpp (1)
components/core/src/clp_s/FileWriter.hpp (2)
writer(42-49)writer(42-42)
components/core/src/clp_s/log_converter/CommandLineArguments.cpp (1)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (5)
nodiscard(27-27)nodiscard(29-29)nodiscard(31-33)nodiscard(35-35)argc(25-25)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: package-image
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: rust-checks (macos-15)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: rust-checks (ubuntu-22.04)
🔇 Additional comments (1)
components/core/src/clp_s/log_converter/LogSerializer.hpp (1)
91-96: LGTM!The private constructor correctly takes ownership of the serializer and writer via move semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (3)
components/core/src/clp_s/log_converter/log_converter.cpp (3)
61-71: Create nested output directories and validate existing path type.Use exists/is_directory checks and create_directories; also fix message grammar.
- std::error_code ec{}; - if (false == std::filesystem::create_directory(command_line_arguments.get_output_dir(), ec) - && ec) - { - SPDLOG_ERROR( - "Can not create output directory {} - {}", - command_line_arguments.get_output_dir(), - ec.message() - ); - return false; - } + std::error_code ec{}; + auto const& outdir = command_line_arguments.get_output_dir(); + if (false == std::filesystem::exists(outdir, ec)) { + ec.clear(); + if (false == std::filesystem::create_directories(outdir, ec) && ec) { + SPDLOG_ERROR("Cannot create output directory {} - {}", outdir, ec.message()); + return false; + } + } else if (false == std::filesystem::is_directory(outdir, ec)) { + SPDLOG_ERROR("Output path exists but is not a directory: {}", outdir); + return false; + }
89-92: Invalid use of std::ignore; this won’t compile. Consume or explicitly discard the result.Assigning to std::ignore is ill-formed. Either cast to void or, better, use the boolean to avoid duplicate logs and satisfy [[nodiscard]].
Minimal fix:
- std::ignore = check_and_log_curl_error(path, reader.get()); + (void)check_and_log_curl_error(path, reader.get());Recommended (use the result to prevent double-logging):
- std::ignore = check_and_log_curl_error(path, reader.get()); - SPDLOG_ERROR("Received input that was not unstructured logtext: {}.", path.path); - return false; + auto const had_curl_error = check_and_log_curl_error(path, reader.get()); + if (false == had_curl_error) { + SPDLOG_ERROR("Received input that was not unstructured logtext: {}.", path.path); + } + return false;
95-99: Guard nested_readers.back() to avoid UB when vector is empty.Empty vector leads to undefined behaviour; also guard against null element.
- auto const convert_result{log_converter.convert_file( - path, - nested_readers.back().get(), - command_line_arguments.get_output_dir() - )}; + if (nested_readers.empty() || nullptr == nested_readers.back()) { + (void)check_and_log_curl_error(path, reader.get()); + SPDLOG_ERROR("No reader available after type deduction for {}.", path.path); + return false; + } + auto const convert_result{log_converter.convert_file( + path, + nested_readers.back().get(), + command_line_arguments.get_output_dir() + )};
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/log_converter/log_converter.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/log_converter/log_converter.cpp
🧬 Code graph analysis (1)
components/core/src/clp_s/log_converter/log_converter.cpp (2)
components/core/src/clp_s/log_converter/CommandLineArguments.hpp (7)
CommandLineArguments(22-22)CommandLineArguments(22-22)nodiscard(27-27)nodiscard(29-29)nodiscard(31-33)nodiscard(35-35)argc(25-25)components/core/src/clp_s/log_converter/LogConverter.hpp (3)
LogConverter(20-20)path(35-36)reader(53-54)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: package-image
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: rust-checks (macos-15)
LinZhihao-723
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Pushed some nit fixes. Otherwise lgtm.
- Tested by parsing and converting private users' Android log. Everything worked fine.
- PR title lgtm.
Description
This PR adds a new log-converter binary which can be used to convert unstructured logtext into semi-structured KV-IR. This binary should allow us to offer some support for the use-case in #1308.
The implementation uses log-surgeon to separate the timestamp and logtext from unstructured text logs, and serialize these fields into KV-IR for each message.
The utility currently has several limitations:
Currently the original file name is stored in the metadata of the generated KV-IR as "original_file". clp-s will ingest this metadata and store it in the range-index, but the
_file_namefield in the range index will still correspond to the file name of the generated KV-IR. Ideally we would add support for forcing clp-s to use a specific_file_name, e.g., by allowing metadata keys from KV-IR to override default properties inserted by clp-s such as_file_name.Checklist
breaking change.
Validation performed
log-convertercan produce accurate KV-IR for unstructured log-text fileslog-convertercan convert content stored on S3Summary by CodeRabbit
New Features
Chores
Maintenance
Tests
Documentation