feat: clp-s ordered compression and decompression identity transformation integration tests. #1657

@Bill-hbrhbr

Request

This issue tracks the changes needed to support a complete clp-s identity transformation testing workflow under both ordered and unordered modes.

Motivation

The primary motivation is to introduce stronger and more explicit assertions for clp-s identity transformation tests. clp-s compression does not preserve the input directory structure, but it is able to preserve the log ordering.

End result

The tests verify that compression followed by decompression loses no data in the following two ways:

  • Run clp-s c with defaults and clp-s x with --ordered
  • Run clp-s c with --disable-log-order and clp-s x with defaults

Implementation Plan

  1. Remove the potentially redundant clp-s compression and decompression merge step

For comparison purposes, both the input and the output of the clp-s workflow must be a single JSON file. However, some downloaded datasets contain split JSON logs.

The current approach merges them by running an additional cycle of clp-s compression and decompression.

  • This is inefficient and should be replaced by a simple merge operation that follows the same log ingestion order used internally by clp-s.
  • This merge step should be skipped entirely when the downloaded logs are already a single JSON file.
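The merge described above could look roughly like the following. This is a minimal sketch, not the project's implementation: `merge_json_logs` is a hypothetical helper, the logs are assumed to be newline-delimited JSON, and lexicographic path order is only a stand-in for the clp-s ingestion order, which would need to be confirmed against clp-s itself.

```python
from pathlib import Path


def merge_json_logs(source_dir: Path, merged_path: Path) -> Path:
    """Return the path of a single file holding every log under source_dir.

    Hypothetical helper. merged_path should live outside source_dir so that a
    rerun does not pick up the previous merge result as an input.
    """
    # Assumption: sorted path order approximates the clp-s ingestion order.
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    if not files:
        raise FileNotFoundError(f"no log files found under {source_dir}")
    if len(files) == 1:
        # Already a single file: skip the merge entirely.
        return files[0]

    with merged_path.open("wb") as out:
        for path in files:
            last_byte = b"\n"
            with path.open("rb") as src:
                # Stream in 1 MiB chunks so large datasets are not held in memory.
                while chunk := src.read(1 << 20):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            # Keep records from adjacent files on separate lines.
            if last_byte != b"\n":
                out.write(b"\n")
    return merged_path
```

Returning the original path in the single-file case lets callers treat both cases uniformly without copying data.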
  2. Strip the top-level directory when extracting downloaded dataset tarballs.

For example, build/integration-tests/postgresql.tar.gz currently extracts into build/integration-tests/postgresql/postgresql/postgresql.log, introducing an unnecessary extra directory layer. Stripping this top-level layer lets the tests accurately detect when a dataset contains a single log file, which is required for the merge skip logic described in step 1.
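As an illustration of the stripping behaviour, the sketch below shells out to `tar` with its `--strip-components` option. The helper name `extract_tarball` and its parameters are hypothetical and are not taken from the implementation PR; the sketch also assumes a gzip-compressed archive and a `tar` binary on `PATH`.

```python
import subprocess
from pathlib import Path


def extract_tarball(archive: Path, dest_dir: Path, strip_components: int = 0) -> None:
    """Extract a .tar.gz archive into dest_dir, optionally dropping leading path parts.

    Hypothetical helper: with strip_components=1, an entry such as
    postgresql/postgresql.log is written to dest_dir/postgresql.log.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    cmd = ["tar", "-xzf", str(archive), "-C", str(dest_dir)]
    if strip_components > 0:
        cmd.append(f"--strip-components={strip_components}")
    # check=True raises CalledProcessError if tar exits with a non-zero status.
    subprocess.run(cmd, check=True)
```

Keeping the strip count optional means archives that have no wrapping directory can still be extracted unchanged.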

  3. Add a dedicated assertion for clp-s decompression producing a single file.

Since clp-s does not preserve the input directory structure, and the default flag value --target-ordered-chunk-size=0 guarantees a single output file, the tests should include an explicit assertion verifying this condition.

The output file may also have a uniquely generated UUID name that differs across test runs (see step 4), so the assertion must focus on cardinality rather than filename.

  4. Locate the output filename dynamically for comparison.

The current implementation assumes that the single output file described in step 3 will always be named original, which only holds when clp-s x is run with default flags. When --ordered is provided during decompression, the output filename becomes a UUID. The comparison logic must therefore be updated to locate the single output file dynamically rather than relying on a fixed name.
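The single-file assertion and the dynamic lookup can be sketched together as one helper. `find_single_output_file` is a hypothetical name; the only assumption is that the decompression output directory contains nothing besides the decompressed logs.

```python
from pathlib import Path


def find_single_output_file(output_dir: Path) -> Path:
    """Assert that output_dir holds exactly one file and return its path.

    Hypothetical helper: it checks cardinality only, so it works whether the
    file is named `original` or carries a generated UUID.
    """
    files = sorted(p for p in output_dir.rglob("*") if p.is_file())
    assert len(files) == 1, (
        f"expected exactly one decompressed file in {output_dir}, "
        f"found {len(files)}: {[str(p) for p in files]}"
    )
    return files[0]
```

The comparison logic can then use the returned path instead of a hard-coded filename.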

  5. Extend the JSON sorting workflow to support ordered comparisons.

The deterministic JSON sorting function currently sorts both keys and rows.

The sorter should allow preserving the original row order when comparing results from clp-s ordered compression and decompression.
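One way to expose this control is a flag on the normalization step. The sketch below is an assumption about the shape of such a function, not the existing sorter: `normalize_json_lines` and its `sort_rows` parameter are hypothetical, and the input is assumed to be newline-delimited JSON small enough to hold in memory.

```python
import json
from pathlib import Path


def normalize_json_lines(src: Path, dst: Path, *, sort_rows: bool) -> None:
    """Write a canonical copy of src to dst for byte-wise comparison.

    Keys are always sorted. Rows are sorted only when sort_rows is True, so
    ordered workflows can pass sort_rows=False to keep the original row order.
    """
    rows = []
    with src.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                # Re-serialize each record with its keys in sorted order.
                rows.append(json.dumps(json.loads(line), sort_keys=True))
    if sort_rows:
        rows.sort()
    dst.write_text("".join(row + "\n" for row in rows), encoding="utf-8")
```

Applying the same normalization to both the input and the decompressed output keeps the comparison independent of key order, and of row order only when that is intended.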

  6. Add explicit test workflows for both ordered and unordered clp-s modes.

These workflows should cover the two cases listed in the End result section and use the updated JSON sorter.
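The two workflows could be captured as data so that one test body runs under both modes. This is a sketch under stated assumptions: `ClpSMode`, `MODES`, and `build_commands` are hypothetical names, and the positional layout (archives directory followed by the input or output path) is assumed rather than taken from this issue. Only the `c` and `x` subcommands and the `--ordered` and `--disable-log-order` flags come from the text above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ClpSMode:
    """Hypothetical description of one identity-transformation workflow."""

    name: str
    compress_flags: tuple[str, ...]
    decompress_flags: tuple[str, ...]
    sort_rows: bool  # Whether rows must be sorted before comparison.


MODES = (
    # Default compression keeps log order; --ordered restores it on output.
    ClpSMode("ordered", (), ("--ordered",), sort_rows=False),
    # Order is discarded at compression time, so rows are sorted to compare.
    ClpSMode("unordered", ("--disable-log-order",), (), sort_rows=True),
)


def build_commands(
    mode: ClpSMode,
    clp_s_bin: Path,
    archives_dir: Path,
    logs_path: Path,
    output_dir: Path,
) -> tuple[list[str], list[str]]:
    """Return the (compress, decompress) argument lists for a mode."""
    # Assumed argument layout: <binary> <subcommand> [flags] <archives> <path>.
    compress = [str(clp_s_bin), "c", *mode.compress_flags, str(archives_dir), str(logs_path)]
    decompress = [str(clp_s_bin), "x", *mode.decompress_flags, str(archives_dir), str(output_dir)]
    return compress, decompress
```

A test framework's parametrization feature could iterate over `MODES`, run both commands, and compare the normalized input against the single decompressed file.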

Optional features:

A. Rename CompressionTestPathConfig's field logs_source_dir to logs_source_path, since the input to clp-s c can be a JSON file, a key-value IR file, or a directory containing a list of files.

B. See issue #1645

C. Add support for testing the integrity of log-converter key-value IR input files using the ordered clp-s workflow.

Implementation

The implementation PRs are as follows:

  1. TBA
  2. feat(integration-tests): Use tar for extracting tarball downloads with optional leading directory component stripping. #1661
  3. TBA
  4. TBA
  5. (draft) feat(integration-tests): Test ordered and default clp-s decompressions; add key/row-level sorting controls for JSON normalization in clp-s test files. #1605
  6. TBA

A. TBA
B. TBA
C. (hold) #1591

Labels: enhancement