-
Notifications
You must be signed in to change notification settings - Fork 83
feat(integration-tests): Use tar for extracting tarball downloads with optional leading directory component stripping.
#1661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughModified Changes
Sequence DiagramsequenceDiagram
participant Test as Test Code
participant Func as _download_and_extract_dataset
participant Curl as curl
participant Tar as tar
participant FS as Filesystem
Test->>Func: Call with dataset name & URL
rect rgb(200, 220, 240)
Note over Func: Old Flow (shutil)
Func->>Curl: Download tarball
Curl-->>Func: Save file
Func->>FS: shutil.unpack_archive()
FS-->>Func: Extract complete
end
rect rgb(220, 240, 200)
Note over Func: New Flow (curl + tar)
Func->>Curl: Download tarball
Curl-->>Func: Save file
Func->>FS: Apply permissions (gu+w)
Func->>Func: Locate tar executable
Func->>Tar: tar -xf with -C target dir
Tar-->>Func: Extract complete
alt has_leading_directory_component = true
Func->>FS: Strip leading directory
FS-->>Func: Done
end
end
Func-->>Test: Return IntegrationTestLogs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
integration-tests/tests/fixtures/integration_test_logs.py(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/utils/config.py:121-123
Timestamp: 2025-08-20T22:07:04.953Z
Learning: For the CLP integration tests codebase, do not suggest deriving tarball filenames from URLs instead of hard-coding ".tar.gz" extension. The user has explicitly rejected this suggestion.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1050
File: components/clp-package-utils/clp_package_utils/scripts/archive_manager.py:178-178
Timestamp: 2025-08-14T05:06:59.010Z
Learning: In the decompress.py script, the --dataset argument was only added to the json_extraction_parser but not to the ir_extraction_parser, causing AttributeError when handle_extract_stream_cmd tries to access parsed_args.dataset for IR extraction commands.
🧬 Code graph analysis (1)
integration-tests/tests/fixtures/integration_test_logs.py (2)
integration-tests/tests/utils/config.py (1)
IntegrationTestLogs(104-128)integration-tests/tests/utils/utils.py (1)
unlink(65-86)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: package-image
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
integration-tests/tests/fixtures/integration_test_logs.py (1)
83-112: curl/tar/chmod pipeline andhas_leading_directory_componentflag look solidThe explicit curl+tar flow with
--fail/--location,-Cinto the extraction dir, and the optional--strip-components 1governed byhas_leading_directory_componentis clear and should behave well on both macOS and Linux. The post-extractionchmod -R gu+won both the tarball and extraction directory aligns with the goal of making cached datasets easily overwritable between runs.Also applies to: 118-120
| """ | ||
| :param request: | ||
| :param integration_test_path_config: | ||
| "param name: | ||
| :param tarball_url: | ||
| :param has_leading_directory_component: Whether all files inside the tarball are stored under a | ||
| single top level directory. Defaults to True. | ||
| :return: The IntegrationTestPathConfig object with its associated logs properly downloaded, | ||
| extracted, and permission changed to be overritable. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring is malformed and does not match the actual return type
The function docstring has a stray quote, a missing :param name: entry, a typo (“overritable”), and still claims to return IntegrationTestPathConfig instead of IntegrationTestLogs. This can mislead future readers.
- """
- :param request:
- :param integration_test_path_config:
- "param name:
- :param tarball_url:
- :param has_leading_directory_component: Whether all files inside the tarball are stored under a
- single top level directory. Defaults to True.
- :return: The IntegrationTestPathConfig object with its associated logs properly downloaded,
- extracted, and permission changed to be overritable.
- """
+ """
+ Download and extract an integration-test log dataset tarball.
+
+ :param request: The pytest fixture request, used for accessing the cache.
+ :param integration_test_path_config: Paths configuration for integration tests.
+ :param name: Dataset name (used for the cache key and file/directory naming).
+ :param tarball_url: URL of the dataset tarball.
+ :param has_leading_directory_component: Whether all files inside the tarball are stored under a
+ single top-level directory. Defaults to True.
+ :return: The IntegrationTestLogs describing the tarball and extraction directory after download.
+ """📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| """ | |
| :param request: | |
| :param integration_test_path_config: | |
| "param name: | |
| :param tarball_url: | |
| :param has_leading_directory_component: Whether all files inside the tarball are stored under a | |
| single top level directory. Defaults to True. | |
| :return: The IntegrationTestPathConfig object with its associated logs properly downloaded, | |
| extracted, and permission changed to be overritable. | |
| """ | |
| """ | |
| Download and extract an integration-test log dataset tarball. | |
| :param request: The pytest fixture request, used for accessing the cache. | |
| :param integration_test_path_config: Paths configuration for integration tests. | |
| :param name: Dataset name (used for the cache key and file/directory naming). | |
| :param tarball_url: URL of the dataset tarball. | |
| :param has_leading_directory_component: Whether all files inside the tarball are stored under a | |
| single top-level directory. Defaults to True. | |
| :return: The IntegrationTestLogs describing the tarball and extraction directory after download. | |
| """ |
🤖 Prompt for AI Agents
In integration-tests/tests/fixtures/integration_test_logs.py around lines 52 to
61, the function docstring is malformed (a stray quote, a missing :param name:
entry), contains a typo (“overritable”), and lists the wrong return type; update
the docstring to remove the stray quote, add a proper :param name: description
matching the function signature, fix the spelling to “overwritable”, and change
the :return: to correctly state and describe IntegrationTestLogs (and ensure all
param names and descriptions exactly match the function parameters).
| integration_test_logs.tarball_path.unlink(missing_ok=True) | ||
| shutil.rmtree(integration_test_logs.extraction_dir) | ||
| integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shutil.rmtree will raise on first run when the extraction directory does not exist
Calling shutil.rmtree(integration_test_logs.extraction_dir) unconditionally will raise a FileNotFoundError on the very first run (before the directory has ever been created), causing the fixture to fail before any download/extraction happens. The earlier helper-based cleanup was tolerant of missing paths.
- integration_test_logs.tarball_path.unlink(missing_ok=True)
- shutil.rmtree(integration_test_logs.extraction_dir)
- integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)
+ integration_test_logs.tarball_path.unlink(missing_ok=True)
+ try:
+ shutil.rmtree(integration_test_logs.extraction_dir)
+ except FileNotFoundError:
+ pass
+ integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False)If you still want the previous “force remove with sudo” behaviour for stubborn paths, you could instead wire this back through tests.utils.utils.unlink, but the above is the minimal fix for the first-run failure.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| integration_test_logs.tarball_path.unlink(missing_ok=True) | |
| shutil.rmtree(integration_test_logs.extraction_dir) | |
| integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False) | |
| integration_test_logs.tarball_path.unlink(missing_ok=True) | |
| try: | |
| shutil.rmtree(integration_test_logs.extraction_dir) | |
| except FileNotFoundError: | |
| pass | |
| integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False) |
🤖 Prompt for AI Agents
In integration-tests/tests/fixtures/integration_test_logs.py around lines 71 to
73, calling shutil.rmtree(integration_test_logs.extraction_dir) will raise
FileNotFoundError on first run; update the cleanup to tolerate missing
directories by either checking existence before removal or calling
shutil.rmtree(integration_test_logs.extraction_dir, ignore_errors=True) so the
fixture no longer fails when the extraction dir doesn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct, I did get an error when I ran this directly after a fresh build. The rabbit's committable suggestion works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@quinntaylormitchell, thank you for confirming the issue and validating the fix! I appreciate you testing it on a fresh build. The try/except approach should ensure the fixture works reliably on first run while still cleaning up stale directories on subsequent runs.
(\(\
( -.-)
o_(")(")
[Happy to help!]
🧠 Learnings used
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/utils/config.py:121-123
Timestamp: 2025-08-20T22:07:04.953Z
Learning: For the CLP integration tests codebase, do not suggest deriving tarball filenames from URLs instead of hard-coding ".tar.gz" extension. The user has explicitly rejected this suggestion.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
quinntaylormitchell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly good; a few comments.
Also, I would write a little blurb in the "Validation performed" section.
| "param name: | ||
| :param tarball_url: | ||
| :param has_leading_directory_component: Whether all files inside the tarball are stored under a | ||
| single top level directory. Defaults to True. | ||
| :return: The IntegrationTestPathConfig object with its associated logs properly downloaded, | ||
| extracted, and permission changed to be overritable. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| "param name: | |
| :param tarball_url: | |
| :param has_leading_directory_component: Whether all files inside the tarball are stored under a | |
| single top level directory. Defaults to True. | |
| :return: The IntegrationTestPathConfig object with its associated logs properly downloaded, | |
| extracted, and permission changed to be overritable. | |
| """ | |
| :param name: | |
| :param tarball_url: | |
| :param has_leading_directory_component: Whether all files inside the tarball are stored under a | |
| single top-level directory. | |
| :return: The IntegrationTestLogs object describing the donloaded and extracted logs. | |
| """ |
clarity and typo fixes.
| integration_test_logs.tarball_path.unlink(missing_ok=True) | ||
| shutil.rmtree(integration_test_logs.extraction_dir) | ||
| integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct, I did get an error when I ran this directly after a fresh build. The rabbit's committable suggestion works.
| integration_test_logs.tarball_path.unlink(missing_ok=True) | ||
| shutil.rmtree(integration_test_logs.extraction_dir) | ||
| integration_test_logs.extraction_dir.mkdir(parents=True, exist_ok=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple things about these three lines (in addition to the rabbit comment, which is correct):
- I remember that in
tests.utils.utils, you define a function calledunlink; you should use it. That being said, I've realized that function doesn't actually do what it says in the docstring. It saysRemove a file or directory atpath., but it only usesshutil.rmtree, which doesn't remove files (afaik). I think that we should do either one of two things: A. we should update that function to something likedef remove_file_or_dir, make it do what the docstring says, and then use it everywhere in our integration tests, or B. we should remove that function entirely and use only a mixture ofshutil.rmtreefor directories andos.unlinkfor files. - Why don't we make
extraction_dirin theIntegrationTestLogs__post_init__?
| integration_test_path_config: IntegrationTestPathConfig, | ||
| name: str, | ||
| tarball_url: str, | ||
| has_leading_directory_component: bool = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is a bit awkward; how about
| has_leading_directory_component: bool = True, | |
| strip_leading_dir: bool = True, |
or
| has_leading_directory_component: bool = True, | |
| strip_top_dir: bool = True, |
Description
This PR belongs to the integration-tests feature plan described in #1657.
This change replaces the previous archive extraction workflow, which used
shutil.unpack_archive, with an explicittarbased implementation. The new approach correctly handles stripping a leading directory component when the tarball includes a top level folder. For example, the old workflow extractedbuild/integration-tests/postgresql.tar.gzinto:which adds an unnecessary extra level of nesting due to the preserved top level directory inside the archive. Since we wish to control the extraction directory name, we explicitly remove that leading component. The extraction path becomes:
Because
shutil.unpack_archivedoes not support stripping directory components, switching totarprovides the required extraction results. The chosen extraction flags remain compatible with both macOS and Linux.Checklist
breaking change.
Validation performed
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.