Skip to content

Conversation

@Bill-hbrhbr
Copy link
Contributor

@Bill-hbrhbr Bill-hbrhbr commented Nov 12, 2025

Description

This PR adds integration tests for #1460 by testing examples mentioned in #1308.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • uv run pytest -k test_log_converter_clp_s_identity_transform passes.

Summary by CodeRabbit

  • Tests

    • Added an identity-transformation integration test for the log-conversion pipeline.
    • Added multiple unstructured log test datasets (simple, escaped-sequence, Hive 24hr, OpenStack 24hr, Hadoop multiline).
  • Chores

    • Added test infrastructure support for the log-converter binary and a canonical-output filename constant.
    • Improved test cleanup to remove prior outputs before re-running.

@Bill-hbrhbr Bill-hbrhbr requested a review from a team as a code owner November 12, 2025 21:01
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 12, 2025

Walkthrough

Adds multiple unstructured-log JSON fixtures, a new identity transformation integration test that invokes the log-converter and verifies CLP-S compress/decompress canonical output, and exposes a log_converter_binary_path property in test config.

Changes

Cohort / File(s) Change Summary
Unstructured log test fixtures
integration-tests/tests/data/unstructured-logs/00-hello-world/converted.json, integration-tests/tests/data/unstructured-logs/01-escape-seq/converted.json, integration-tests/tests/data/unstructured-logs/02-hive-24hr/converted.json, integration-tests/tests/data/unstructured-logs/03-openstack-24hr/converted.json, integration-tests/tests/data/unstructured-logs/04-hadoop-multiline/converted.json
Added JSON files containing expected converted unstructured log entries (single- and multi-line records) used as ground-truth for identity transformation tests.
Identity transformation tests
integration-tests/tests/test_identity_transformation.py
Added constant CLP_S_CANONICAL_OUTPUT_FILENAME, introduced test_log_converter_clp_s_identity_transform to run the log-converter binary, clean previous outputs, produce KV-IR, run CLP‑S compress/decompress, and compare canonical output to converted.json. Minor refactor to reuse the filename constant.
Test configuration utility
integration-tests/tests/utils/config.py
Added log_converter_binary_path property to CoreConfig returning the Path to the log-converter executable.

Sequence Diagram(s)

sequenceDiagram
    %% Styling: subtle coloured notes for key components
    participant Test as Test (pytest)
    participant FS as File System
    participant LogConv as log-converter
    participant CLPS as CLP-S pipeline
    participant Validator as Comparator

    Test->>FS: enumerate test case dirs
    Test->>FS: remove previous KV-IR outputs
    Test->>LogConv: invoke `log-converter` (via log_converter_binary_path)
    LogConv->>FS: write KV-IR output files
    Test->>CLPS: run compress then decompress (CLP-S)
    CLPS->>FS: produce canonical output file (`original`)
    Test->>FS: read `converted.json` (expected)
    Test->>Validator: compare canonical output vs expected
    Validator-->>Test: result (pass/fail)
Loading

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

  • New tests and multiple data fixtures increase review surface but follow consistent patterns.
  • Areas to pay extra attention:
    • test_log_converter_clp_s_identity_transform — ensure subprocess invocation, cleanup, and paths are correct and portable.
    • log_converter_binary_path in integration-tests/tests/utils/config.py — verify resolution logic and test environment expectations.
    • The new converted.json fixtures for correct encoding/escaping of multiline and escaped sequences.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: adding integration tests for the log-converter binary and clp-s pipeline, which is the core objective of the PR.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 89cfc6b and 54407a6.

⛔ Files ignored due to path filters (3)
  • integration-tests/tests/data/unstructured-logs/02-hive-24hr/raw.log is excluded by !**/*.log
  • integration-tests/tests/data/unstructured-logs/03-openstack-24hr/raw.log is excluded by !**/*.log
  • integration-tests/tests/data/unstructured-logs/04-hadoop-multiline/raw.log is excluded by !**/*.log
📒 Files selected for processing (3)
  • integration-tests/tests/data/unstructured-logs/02-hive-24hr/converted.json (1 hunks)
  • integration-tests/tests/data/unstructured-logs/03-openstack-24hr/converted.json (1 hunks)
  • integration-tests/tests/data/unstructured-logs/04-hadoop-multiline/converted.json (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
🪛 Biome (2.1.2)
integration-tests/tests/data/unstructured-logs/02-hive-24hr/converted.json

[error] 1-2: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 2-3: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 3-4: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 4-5: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 5-6: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 6-7: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 7-8: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 8-9: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 9-10: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 10-11: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 11-12: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 12-13: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 13-14: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 14-15: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 15-16: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 16-17: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 17-18: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)

integration-tests/tests/data/unstructured-logs/03-openstack-24hr/converted.json

[error] 1-2: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 2-3: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 3-4: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (3)
integration-tests/tests/data/unstructured-logs/02-hive-24hr/converted.json (1)

1-18: Static analysis hints are false positives for JSONL format.

The Biome linter is flagging each line as "End of file expected" and suggesting array format, but these are false positives. This file correctly uses JSONL (JSON Lines) format—one JSON object per line—which is the standard format for streaming and incremental log processing. This format is appropriate for integration test log fixtures and matches the structure of related fixture files in this PR.

integration-tests/tests/data/unstructured-logs/04-hadoop-multiline/converted.json (1)

1-1: Multiline log entry correctly formatted and appropriate for test coverage.

The exception trace with embedded newlines is properly escaped and provides good test coverage for handling multiline logs during conversion. The Hadoop BlockPlacementPolicy failure is realistic test data for HDFS logging scenarios.

integration-tests/tests/data/unstructured-logs/03-openstack-24hr/converted.json (1)

1-4: Verify intent of shell prompt and service termination markers in entry 4.

Entry 4 (lines 4) includes "Terminated", "c-sch failed to start", and a shell prompt ("openstack@openstack-0:~/devstack$") embedded in the log message. Confirm whether this mixed content is intentional for testing edge cases in log conversion (e.g., robustness against non-standard terminal output captured in logs) or if this represents a data collection issue.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Bill-hbrhbr
Copy link
Contributor Author

@coderabbitai
open an issue for

        assert is_json_file_structurally_equal(expected_out, actual_out), (
            f"Mismatch between {expected_out}(expected) and {actual_out}(actual)."
        )

where the pytest fail msg may be truncated due to path being too long. I don't know what's a good solution.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 12, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find the file containing the is_json_file_structurally_equal assertion
rg -n "is_json_file_structurally_equal" --type=py -C 3

Length of output: 2497


🏁 Script executed:

#!/bin/bash
# Get the current PR URL and comment details for reference
gh pr view 1591 --json url,number,title

Length of output: 167


🏁 Script executed:

#!/bin/bash
# Create the GitHub issue
gh issue create \
  --title "Pytest assertion message truncation with long file paths in JSON comparison" \
  --body "## Problem

In the integration tests, pytest assertion messages can be truncated when file paths are too long, making debugging difficult.

The issue occurs in assertions like:
\`\`\`python
assert is_json_file_structurally_equal(expected_out, actual_out), (
    f\"Mismatch between {expected_out}(expected) and {actual_out}(actual).\"
)
\`\`\`

When \`expected_out\` and \`actual_out\` contain long paths, pytest may truncate the failure message, hiding important information about which files were being compared.

## Context

This issue was identified during review of PR #1591.

**Referenced by:** @Bill-hbrhbr in https://github.com/y-scope/clp/pull/1591

## Additional Notes

A good solution for this issue is still under investigation." \
  --assignee Bill-hbrhbr

Length of output: 42


@Bill-hbrhbr I've opened issue #1592 to track the pytest assertion message truncation problem with long file paths in JSON comparison.

The issue documents:

  • The problem with is_json_file_structurally_equal assertions where long paths cause pytest to truncate failure messages
  • The specific code location (lines 192-194 in integration-tests/tests/test_identity_transformation.py)
  • A backlink to this PR and your comment
  • Notes that a solution is still under investigation

The issue has been assigned to you. You can view it here: #1592


🧠 Learnings used
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/test_identity_transformation.py:133-137
Timestamp: 2025-07-28T08:33:57.487Z
Learning: In CLP integration tests, the `run_and_assert` utility function should only be used for commands that test CLP package functionality (like clp/clp-s compression/decompression operations). For processing or helper commands (like jq, sort, etc.), direct `subprocess.run` calls should be used instead, as they serve different purposes and don't need the same error handling patterns.

Learnt from: haiqi96
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/test_identity_transformation.py:87-95
Timestamp: 2025-08-08T21:15:10.905Z
Learning: In the CLP project's integration tests (Python code), variable names should use "logs" (plural) rather than "log" (singular) when referring to test logs or log-related entities, as this aligns with the naming conventions used throughout the codebase.

Copy link
Collaborator

@quinntaylormitchell quinntaylormitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good; a few comments in the code, and some high-level comments as well:

  1. Fundamentally, I think that test_log_converter_clp_s_identity_transform is testing the functionality of the log-converter binary, but in some ways it seems like it's testing the clp-s binary's ability to compress/decompress KV-IR as well. I think that if we're going to test the veracity of KV-IR compression/decompression, that should happen in a separate test, and we should make it clear that test_log_converter_clp_s_identity_transform is testing the veracity of the functionality of the log-converter binary by way of compressing and decompressing the KV-IR that log-converter produces. To that end, I've suggested a new docstring; have a look.
  2. The concept of having the test cases enumerated by the directory structure in integration-tests/tests/data doesn't really sit right with me, although I do see how it simplifies the code. Feel free to disregard, I just wanted to bring it up.
  3. I think that overall, the tests should be logged better. The logging doesn't need to be verbose at all, but it would be good to have some little indications telling the dev what the test is doing while it's doing it.

Comment on lines +153 to +157
Validate the end-to-end functionality of the `log-converter` and `clp-s` pipeline.
This test ensures that:
1. `log-converter` correctly transforms unstructured logs into key-value IR format.
2. The kv-IR output can be compressed and decompressed by `clp-s` without data loss.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Validate the end-to-end functionality of the `log-converter` and `clp-s` pipeline.
This test ensures that:
1. `log-converter` correctly transforms unstructured logs into key-value IR format.
2. The kv-IR output can be compressed and decompressed by `clp-s` without data loss.
Validates the end-to-end functionality of the `log-converter` binary by compressing and
decompressing the KV-IR it produces upon ingesting an unstructured log file.

See high-level comment.

log_converter_bin_path_str = str(core_config.log_converter_binary_path)

unstructured_logs_dir = Path(__file__).resolve().parent / "data" / "unstructured-logs"
for test_case_dir in unstructured_logs_dir.iterdir():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my high-level comment about the dependence on directory structure to enumerate test cases.

log_converter_out_dir.mkdir(parents=True, exist_ok=True)
log_converter_bin_path_str = str(core_config.log_converter_binary_path)

unstructured_logs_dir = Path(__file__).resolve().parent / "data" / "unstructured-logs"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
unstructured_logs_dir = Path(__file__).resolve().parent / "data" / "unstructured-logs"
unstructured_logs_dir = (
Path(get_env_var("INTEGRATION_TESTS_DIR")).expanduser().resolve()
/ "data"
/ "unstructured-logs"
)

How would you feel about having this be dependent on a new env variable in .pytest.ini? It clearly makes the code here longer, but I'm worried that having file-location-dependent path buried here in this code will be more difficult to notice/remember to change, if we ever need to.


test_name = test_case_dir.name
kv_ir_out = log_converter_out_dir / test_name
unlink(kv_ir_out)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels odd to have this unlink here; I realize it isn't wrong per se, but is there a reason you've put it here at the beginning of the test case rather than at the end with the test_paths.clear_test_outputs() line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because there may be leftover files from a previous test that we'd like to remove

Copy link
Collaborator

@quinntaylormitchell quinntaylormitchell Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, but if the kv_ir_out directory is only ever used during this test, wouldn't it make sense to just unlink it when we're done, rather than unlinking it only when a dev runs the test again? kind of like a "clean up everything when you're done" policy?

_clp_s_compress_and_decompress(core_config, test_paths)

expected_out = test_case_dir / "converted.json"
actual_out = test_paths.decompression_dir / CLP_S_CANONICAL_OUTPUT_FILENAME
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a bit of trouble with the value of CLP_S_CANONICAL_OUTPUT_FILENAME being "original", because it seems to me that the output of a decompression task is, by definition, not the original file (even though it should be structurally identical to the original). What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants