Skip to content

Conversation

@Bill-hbrhbr
Copy link
Contributor

@Bill-hbrhbr Bill-hbrhbr commented Nov 25, 2025

Description

This PR belongs to the integration-tests feature plan described in #1657.

This change replaces the previous archive extraction workflow, which used shutil.unpack_archive, with an explicit tar based implementation. The new approach correctly handles stripping a leading directory component when the tarball includes a top level folder. For example, the old workflow extracted build/integration-tests/postgresql.tar.gz into:

build/integration-tests/postgresql/postgresql/postgresql.log

which adds an unnecessary extra level of nesting due to the preserved top level directory inside the archive. Since we wish to control the extraction directory name, we explicitly remove that leading component. The extraction path becomes:

build/integration-tests/postgresql/postgresql.log

Because shutil.unpack_archive does not support stripping directory components, switching to tar provides the required extraction results. The chosen extraction flags remain compatible with both macOS and Linux.

Checklist

  • The PR satisfies the contribution guidelines.
  • Necessary docs have been updated, OR no docs need to be updated.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.

Validation performed

Summary by CodeRabbit

  • Chores
    • Enhanced internal testing infrastructure with improved reliability and robustness. Modernized dataset extraction workflows with better error handling mechanisms. Implemented flexible options for directory component management during extraction. Updated permission handling to ensure consistent access across extraction operations.

✏️ Tip: You can customize this high-level summary in your review settings.

@Bill-hbrhbr Bill-hbrhbr requested a review from a team as a code owner November 25, 2025 20:12
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 25, 2025

Walkthrough

Modified _download_and_extract_dataset function to replace shutil.unpack_archive with curl and tar-based extraction. Added has_leading_directory_component parameter to optionally strip leading directory components. Updated permission handling for both tarball and extraction directory, with added tar executable validation and per-step error handling.

Changes

Cohort / Cohort / File(s) Summary
Download and Extract Dataset Function
integration-tests/tests/fixtures/integration_test_logs.py
Replaced shutil.unpack_archive-based extraction with curl and tar subprocess calls; added has_leading_directory_component parameter (defaults True); updated permission handling to apply gu+w to both tarball and extraction directory; added tar executable lookup and per-step error handling; added function docstring documenting new parameter behaviour.

Sequence Diagram

sequenceDiagram
    participant Test as Test Code
    participant Func as _download_and_extract_dataset
    participant Curl as curl
    participant Tar as tar
    participant FS as Filesystem

    Test->>Func: Call with dataset name & URL
    
    rect rgb(200, 220, 240)
    Note over Func: Old Flow (shutil)
    Func->>Curl: Download tarball
    Curl-->>Func: Save file
    Func->>FS: shutil.unpack_archive()
    FS-->>Func: Extract complete
    end
    
    rect rgb(220, 240, 200)
    Note over Func: New Flow (curl + tar)
    Func->>Curl: Download tarball
    Curl-->>Func: Save file
    Func->>FS: Apply permissions (gu+w)
    Func->>Func: Locate tar executable
    Func->>Tar: tar -xf with -C target dir
    Tar-->>Func: Extract complete
    alt has_leading_directory_component = true
        Func->>FS: Strip leading directory
        FS-->>Func: Done
    end
    end
    
    Func-->>Test: Return IntegrationTestLogs
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Subprocess error handling: Verify tar executable validation and per-step error handling for curl and tar commands
  • Permission handling changes: Confirm that applying gu+w to both tarball and extraction directory is intentional and secure
  • Directory stripping logic: Review conditional logic for has_leading_directory_component parameter and edge cases
  • Backward compatibility: Validate that defaulting has_leading_directory_component to True maintains existing behaviour for current callers

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: switching to tar-based extraction with optional leading directory component stripping for tarball downloads in integration tests.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 21b8da6 and 3ad919f.

📒 Files selected for processing (1)
  • integration-tests/tests/fixtures/integration_test_logs.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/utils/config.py:121-123
Timestamp: 2025-08-20T22:07:04.953Z
Learning: For the CLP integration tests codebase, do not suggest deriving tarball filenames from URLs instead of hard-coding ".tar.gz" extension. The user has explicitly rejected this suggestion.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:59.010Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.
🧬 Code graph analysis (1)
integration-tests/tests/fixtures/integration_test_logs.py (2)
integration-tests/tests/utils/config.py (1)
  • IntegrationTestLogs (104-128)
integration-tests/tests/utils/utils.py (1)
  • unlink (65-86)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
integration-tests/tests/fixtures/integration_test_logs.py (1)

83-112: curl/tar/chmod pipeline and has_leading_directory_component flag look solid

The explicit curl+tar flow with --fail/--location, -C into the extraction dir, and the optional --strip-components 1 governed by has_leading_directory_component is clear and should behave well on both macOS and Linux. The post-extraction chmod -R gu+w on both the tarball and extraction directory aligns with the goal of making cached datasets easily overwritable between runs.

Also applies to: 118-120

Comment on lines +52 to +61
"""
:param request:
:param integration_test_path_config:
"param name:
:param tarball_url:
:param has_leading_directory_component: Whether all files inside the tarball are stored under a
single top level directory. Defaults to True.
:return: The IntegrationTestPathConfig object with its associated logs properly downloaded,
extracted, and permission changed to be overritable.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Docstring is malformed and does not match the actual return type

The function docstring has a stray quote, a missing :param name: entry, a typo (“overritable”), and still claims to return IntegrationTestPathConfig instead of IntegrationTestLogs. This can mislead future readers.

-    """
-    :param request:
-    :param integration_test_path_config:
-    "param name:
-    :param tarball_url:
-    :param has_leading_directory_component: Whether all files inside the tarball are stored under a
-    single top level directory. Defaults to True.
-    :return: The IntegrationTestPathConfig object with its associated logs properly downloaded,
-    extracted, and permission changed to be overritable.
-    """
+    """
+    Download and extract an integration-test log dataset tarball.
+
+    :param request: The pytest fixture request, used for accessing the cache.
+    :param integration_test_path_config: Paths configuration for integration tests.
+    :param name: Dataset name (used for the cache key and file/directory naming).
+    :param tarball_url: URL of the dataset tarball.
+    :param has_leading_directory_component: Whether all files inside the tarball are stored under a
+        single top-level directory. Defaults to True.
+    :return: The IntegrationTestLogs describing the tarball and extraction directory after download.
+    """
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"""
:param request:
:param integration_test_path_config:
"param name:
:param tarball_url:
:param has_leading_directory_component: Whether all files inside the tarball are stored under a
single top level directory. Defaults to True.
:return: The IntegrationTestPathConfig object with its associated logs properly downloaded,
extracted, and permission changed to be overritable.
"""
"""
Download and extract an integration-test log dataset tarball.
:param request: The pytest fixture request, used for accessing the cache.
:param integration_test_path_config: Paths configuration for integration tests.
:param name: Dataset name (used for the cache key and file/directory naming).
:param tarball_url: URL of the dataset tarball.
:param has_leading_directory_component: Whether all files inside the tarball are stored under a
single top-level directory. Defaults to True.
:return: The IntegrationTestLogs describing the tarball and extraction directory after download.
"""
🤖 Prompt for AI Agents
In integration-tests/tests/fixtures/integration_test_logs.py around lines 52 to
61, the function docstring is malformed (a stray quote, a missing :param name:
entry), contains a typo (“overritable”), and lists the wrong return type; update
the docstring to remove the stray quote, add a proper :param name: description
matching the function signature, fix the spelling to “overwritable”, and change
the :return: to correctly state and describe IntegrationTestLogs (and ensure all
param names and descriptions exactly match the function parameters).

Comment on lines +71 to +73
integration_test_logs.tarball_path.unlink(missing_ok=True)
shutil.rmtree(integration_test_logs.extraction_dir)
integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
Copy link
Contributor

@coderabbitai coderabbitai bot Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

shutil.rmtree will raise on first run when the extraction directory does not exist

Calling shutil.rmtree(integration_test_logs.extraction_dir) unconditionally will raise a FileNotFoundError on the very first run (before the directory has ever been created), causing the fixture to fail before any download/extraction happens. The earlier helper-based cleanup was tolerant of missing paths.

-    integration_test_logs.tarball_path.unlink(missing_ok=True)
-    shutil.rmtree(integration_test_logs.extraction_dir)
-    integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
+    integration_test_logs.tarball_path.unlink(missing_ok=True)
+    try:
+        shutil.rmtree(integration_test_logs.extraction_dir)
+    except FileNotFoundError:
+        pass
+    integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)

If you still want the previous “force remove with sudo” behaviour for stubborn paths, you could instead wire this back through tests.utils.utils.unlink, but the above is the minimal fix for the first-run failure.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
integration_test_logs.tarball_path.unlink(missing_ok=True)
shutil.rmtree(integration_test_logs.extraction_dir)
integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
integration_test_logs.tarball_path.unlink(missing_ok=True)
try:
shutil.rmtree(integration_test_logs.extraction_dir)
except FileNotFoundError:
pass
integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
🤖 Prompt for AI Agents
In integration-tests/tests/fixtures/integration_test_logs.py around lines 71 to
73, calling shutil.rmtree(integration_test_logs.extraction_dir) will raise
FileNotFoundError on first run; update the cleanup to tolerate missing
directories by either checking existence before removal or calling
shutil.rmtree(integration_test_logs.extraction_dir, ignore_errors=True) so the
fixture no longer fails when the extraction dir doesn't exist.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, I did get an error when I ran this directly after a fresh build. The rabbit's committable suggestion works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quinntaylormitchell, thank you for confirming the issue and validating the fix! I appreciate you testing it on a fresh build. The try/except approach should ensure the fixture works reliably on first run while still cleaning up stale directories on subsequent runs.

   (\(\
   ( -.-)
   o_(")(")
   [Happy to help!]

🧠 Learnings used
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.

Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/utils/config.py:121-123
Timestamp: 2025-08-20T22:07:04.953Z
Learning: For the CLP integration tests codebase, do not suggest deriving tarball filenames from URLs instead of hard-coding ".tar.gz" extension. The user has explicitly rejected this suggestion.

Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Copy link
Collaborator

@quinntaylormitchell quinntaylormitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good; a few comments.

Also, I would write a little blurb in the "Validation performed" section.

Comment on lines +55 to +61
"param name:
:param tarball_url:
:param has_leading_directory_component: Whether all files inside the tarball are stored under a
single top level directory. Defaults to True.
:return: The IntegrationTestPathConfig object with its associated logs properly downloaded,
extracted, and permission changed to be overritable.
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"param name:
:param tarball_url:
:param has_leading_directory_component: Whether all files inside the tarball are stored under a
single top level directory. Defaults to True.
:return: The IntegrationTestPathConfig object with its associated logs properly downloaded,
extracted, and permission changed to be overritable.
"""
:param name:
:param tarball_url:
:param has_leading_directory_component: Whether all files inside the tarball are stored under a
single top-level directory.
:return: The IntegrationTestLogs object describing the donloaded and extracted logs.
"""

clarity and typo fixes.

Comment on lines +71 to +73
integration_test_logs.tarball_path.unlink(missing_ok=True)
shutil.rmtree(integration_test_logs.extraction_dir)
integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, I did get an error when I ran this directly after a fresh build. The rabbit's committable suggestion works.

Comment on lines +71 to +73
integration_test_logs.tarball_path.unlink(missing_ok=True)
shutil.rmtree(integration_test_logs.extraction_dir)
integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple things about these three lines (in addition to the rabbit comment, which is correct):

  1. I remember that in tests.utils.utils, you define a function called unlink; you should use it. That being said, I've realized that function doesn't actually do what it says in the docstring. It says Remove a file or directory at path., but it only uses shutil.rmtree, which doesn't remove files (afaik). I think that we should do either one of two things: A. we should update that function to something like def remove_file_or_dir, make it do what the docstring says, and then use it everywhere in our integration tests, or B. we should remove that function entirely and use only a mixture of shutil.rmtree for directories and os.unlink for files.
  2. Why don't we make extraction_dir in the IntegrationTestLogs __post_init__?

integration_test_path_config: IntegrationTestPathConfig,
name: str,
tarball_url: str,
has_leading_directory_component: bool = True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is a bit awkward; how about

Suggested change
has_leading_directory_component: bool = True,
strip_leading_dir: bool = True,

or

Suggested change
has_leading_directory_component: bool = True,
strip_top_dir: bool = True,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants