-
Notifications
You must be signed in to change notification settings - Fork 84
docs(clp-s): Describe more compression options; Update out-of-date description of archive-path option for decompression and search.
#1030
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs(clp-s): Describe more compression options; Update out-of-date description of archive-path option for decompression and search.
#1030
Conversation
…details about search and decompression usage.
WalkthroughThe documentation for the Changes
Suggested reviewers
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
docs/src/user-guide/core-clp-s.md(4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/src/user-guide/core-clp-s.md
[uncategorized] ~29-~29: Possible missing comma found.
Context: ...ath is a URL. When S3 authentication is enabled we issue a GET request following th...
(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~32-~32: Possible missing comma found.
Context: ... For more information on usage with S3 see our [dedicated guide](#guides-u...
(AI_HYDRA_LEO_MISSING_COMMA)
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/core-clp-s.md
33-33: Link fragments should be valid
null
(MD051, link-fragments)
82-82: Trailing spaces
Expected: 0 or 2; Actual: 1
(MD009, no-trailing-spaces)
105-105: Trailing spaces
Expected: 0 or 2; Actual: 1
(MD009, no-trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (2)
docs/src/user-guide/core-clp-s.md (2)
33-33: Verify link fragment – may be invalid
[dedicated guide](#guides-using-object-storage/index)triggers MD051. Confirm that the fragment exists or replace it with the correct target.
56-63: Great real-world exampleIncluding an authenticated S3 KV-IR example dramatically improves usability. No issues spotted here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (3)
docs/src/user-guide/core-clp-s.md (3)
29-29: Still missing a comma after “enabled”This grammatical fix was pointed out in the earlier review but hasn’t been applied yet.
82-82: Trailing whitespace still present (MD009)Line 82 ends with an extra space; please trim it.
105-105: Trailing whitespace still present (MD009)Line 105 ends with an extra space; please trim it.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
docs/src/user-guide/core-clp-s.md(4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/src/user-guide/core-clp-s.md
[uncategorized] ~29-~29: Possible missing comma found.
Context: ...ath is a URL. When S3 authentication is enabled we issue a GET request following th...
(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~32-~32: Possible missing comma found.
Context: ... For more information on usage with S3 see our [dedicated guide](#guides-u...
(AI_HYDRA_LEO_MISSING_COMMA)
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/core-clp-s.md
33-33: Link fragments should be valid
null
(MD051, link-fragments)
82-82: Trailing spaces
Expected: 0 or 2; Actual: 1
(MD009, no-trailing-spaces)
105-105: Trailing spaces
Expected: 0 or 2; Actual: 1
(MD009, no-trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: lint-check (ubuntu-latest)
docs/src/user-guide/core-clp-s.md
Outdated
| timestamp (`--timestamp-key <field-path>`), or whether to fully parse array entries and encode | ||
| them into dedicated columns (`--structurize-arrays`). | ||
| * For a complete list, run `./clp-s c --help` | ||
| * `input-path` is any new-line-delimited JSON (ndjson) log file, KV-IR file, URL pointing to such |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The author of the ndjson spec has expressed willingness to deprecate the spec: https://www.github.com/ndjson/ndjson-spec/issues/35
Instead, JSON Lines (JSONL) was recommended. Shall we rename the references to JSONL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Right now I think we use ndjson very consistently throughout all of our documentation though, so it might be better to put this up as an issue and change all of the references at once in a separate PR so that the docs stay internally consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up as #1034
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (3)
docs/src/user-guide/core-clp-s.md (3)
82-83: Remove trailing spaces on line 82.Markdown-lint (MD009) still flags this line. A quick whitespace trim keeps the linter quiet.
-* `archives-path` is a directory containing archives, a path to an archive, or a URL pointing to a +* `archives-path` is a directory containing archives, a path to an archive, or a URL pointing to a
29-30: Insert the missing comma after “enabled”.Minor punctuation slip ‒ the comma separates the introductory clause and improves readability.
- if the input path is a URL. When S3 authentication is enabled we issue a GET request following + if the input path is a URL. When S3 authentication is enabled, we issue a GET request following
32-33: Fix the broken anchor and add file extension.
#guides-using-object-storage/indexis not a valid local fragment and triggers MD051. Point to the actual Markdown file so the link resolves both in-repo and on rendered sites.- [dedicated guide](#guides-using-object-storage/index). + [dedicated guide](../guides/using-object-storage/index.md).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
docs/src/user-guide/core-clp-s.md(4 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/src/user-guide/core-clp-s.md
[uncategorized] ~29-~29: Possible missing comma found.
Context: ...ath is a URL. When S3 authentication is enabled we issue a GET request following th...
(AI_HYDRA_LEO_MISSING_COMMA)
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/core-clp-s.md
33-33: Link fragments should be valid
null
(MD051, link-fragments)
82-82: Trailing spaces
Expected: 0 or 2; Actual: 1
(MD009, no-trailing-spaces)
105-105: Trailing spaces
Expected: 0 or 2; Actual: 1
(MD009, no-trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: lint-check (ubuntu-latest)
docs/src/user-guide/core-clp-s.md
Outdated
| * `input-path` is any new-line-delimited JSON (ndjson) log file, KV-IR file, URL pointing to such | ||
| files, or directory containing such files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Consider adopting “JSON Lines (JSONL)” nomenclature.
The NDJSON spec is being deprecated in favour of JSON Lines. Updating the wording now avoids future churn and keeps terminology modern.
🤖 Prompt for AI Agents
In docs/src/user-guide/core-clp-s.md around lines 15 to 16, the term "ndjson" is
used to describe new-line-delimited JSON files, but the NDJSON specification is
being deprecated. Replace "ndjson" with "JSON Lines (JSONL)" to adopt the modern
and preferred nomenclature, ensuring the documentation stays current and avoids
future updates.
kirkrodrigues
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions to simplify the language and break up the text for readability.
Fwiw, I don't like many levels of list indents, but it seems like the best solution for now.
docs/src/user-guide/core-clp-s.md
Outdated
| encoded into dedicated columns. | ||
| * `--auth <s3|none>` specifies the authentication method that should be used for network requests | ||
| if the input path is a URL. When S3 authentication is enabled we issue a GET request following | ||
| the presigned URL v4 specification. This request draws on the environment variables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we link to the spec?
Co-authored-by: kirkrodrigues <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
docs/src/user-guide/core-clp-s.md (2)
15-18: Still uses “ndjson” despite earlier deprecation discussionThe bullet list re-introduces the “ndjson” term that previous reviews flagged for future replacement by “JSON Lines (JSONL)”.
If the decision is still to switch terminology repo-wide later, feel free to ignore; otherwise, consider updating now for consistency.
22-24: Same terminology drift inside--file-typedescriptionThe option description again says “ndjson”. Align with whatever final wording you choose for the earlier bullet to avoid mixed vocabulary.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
docs/src/user-guide/core-clp-s.md(5 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/src/user-guide/core-clp-s.md
[uncategorized] ~28-~28: You might be missing the article “the” here.
Context: ... * This option significantly affects compression ratio. * --structurize-arrays speci...
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/core-clp-s.md
159-159: Files should end with a single newline character
null
(MD047, single-trailing-newline)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (1)
docs/src/user-guide/core-clp-s.md (1)
33-37: Let’s confirm the exact filenames in theguides-using-object-storagedirectory:#!/bin/bash # List all files in the S3 guide directory ls docs/src/user-guide/guides-using-object-storage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
docs/src/user-guide/core-clp-s.md (1)
36-37: Broken relative link to the object-storage guide
guides-using-object-storage/indexlacks both the correct relative path and the.mdsuffix, tripping MD051 and leading to a 404 when rendered on GitHub.- * For more information on usage with S3, see our - [dedicated guide](guides-using-object-storage/index). + * For more information on usage with S3, see our + [dedicated guide](../guides/using-object-storage/index.md).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
docs/src/user-guide/core-clp-s.md(5 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/src/user-guide/core-clp-s.md
[uncategorized] ~28-~28: You might be missing the article “the” here.
Context: ... * This option significantly affects compression ratio. * --structurize-arrays speci...
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
🔇 Additional comments (1)
docs/src/user-guide/core-clp-s.md (1)
11-14: Parameter naming drifts between commandsCompression still uses
<archives-dir>while decompression and search use<archives-path>. Unless there is a deliberate semantic difference (output dir vs. input path), consider harmonising the placeholder to avoid reader confusion.
Co-authored-by: kirkrodrigues <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (2)
docs/src/user-guide/core-clp-s.md (2)
56-58: Minor grammar improvements in the tip block.Add the article and a comma for clarity.
-Specifying the timestamp-key will create a range-index for the timestamp column which can increase compression ratio and search performance. +Specifying the timestamp key will create a range index for the timestamp column, which can increase the compression ratio and search performance.
10-15: Consider harmonising “archives-dir” vs “archives-path”.Using two different parameter names for conceptually similar inputs may confuse users. Unless the semantics differ intentionally, adopting a single term across compression, decompression, and search would streamline the CLI.
♻️ Duplicate comments (2)
docs/src/user-guide/core-clp-s.md (2)
26-28: Insert the definite article before “compression ratio”.The noun phrase is missing the, which reads unidiomatically.
- * This option significantly affects compression ratio. + * This option significantly affects the compression ratio.
37-38: Correct the broken relative link to the S3 guide.The current fragment
guides-using-object-storage/indexwon’t resolve during doc builds.- [dedicated guide](guides-using-object-storage/index). + [dedicated guide](../guides/using-object-storage/index.md).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
docs/src/user-guide/core-clp-s.md(5 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: quinntaylormitchell
PR: y-scope/clp#968
File: docs/src/user-guide/quick-start/overview.md:73-109
Timestamp: 2025-06-18T20:39:05.899Z
Learning: The CLP project team prefers to use video content to demonstrate detailed procedural steps (like tarball extraction) rather than including every step in the written documentation, keeping the docs focused on conceptual guidance.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: quinntaylormitchell
PR: y-scope/clp#961
File: docs/src/dev-guide/design-clp-structured/single-file-archive-format.md:216-219
Timestamp: 2025-06-18T14:35:20.485Z
Learning: In clp-s documentation, technical abbreviations like "MPT" (Merged Parse Tree) should be defined at first use to improve reader clarity and comprehension.
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Learnt from: quinntaylormitchell
PR: y-scope/clp#968
File: docs/src/user-guide/quick-start/overview.md:53-54
Timestamp: 2025-06-18T20:48:48.990Z
Learning: CLP is designed to run on Linux systems where Python is typically pre-installed, so Python installation links are generally not needed in CLP documentation.
docs/src/user-guide/core-clp-s.md (11)
Learnt from: quinntaylormitchell
PR: y-scope/clp#968
File: docs/src/user-guide/quick-start/overview.md:73-109
Timestamp: 2025-06-18T20:39:05.899Z
Learning: The CLP project team prefers to use video content to demonstrate detailed procedural steps (like tarball extraction) rather than including every step in the written documentation, keeping the docs focused on conceptual guidance.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: quinntaylormitchell
PR: y-scope/clp#961
File: docs/src/dev-guide/design-clp-structured/single-file-archive-format.md:216-219
Timestamp: 2025-06-18T14:35:20.485Z
Learning: In clp-s documentation, technical abbreviations like "MPT" (Merged Parse Tree) should be defined at first use to improve reader clarity and comprehension.
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Learnt from: LinZhihao-723
PR: y-scope/clp#593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.
Learnt from: kirkrodrigues
PR: y-scope/clp#881
File: components/core/tools/scripts/lib_install/macos/install-all.sh:11-12
Timestamp: 2025-05-06T10:07:04.654Z
Learning: In CLP installation scripts, temporary directories with downloaded files should not be automatically cleaned up on failure (e.g., with EXIT traps) to preserve artifacts for debugging purposes.
Learnt from: jackluo923
PR: y-scope/clp#718
File: components/core/tools/scripts/utils/create-debian-package.py:41-41
Timestamp: 2025-02-12T22:24:17.723Z
Learning: For the clp-core Debian package creation script, strict version format validation is considered unnecessary complexity and should be avoided.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/job-orchestration/job_orchestration/scheduler/job_config.py:29-39
Timestamp: 2025-01-15T16:36:48.932Z
Learning: The S3 service in the clp codebase only supports AWS S3, where region_code is mandatory. Other S3-like services are not supported.
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/clp-py-utils/clp_py_utils/s3_utils.py:11-16
Timestamp: 2024-12-12T19:20:59.778Z
Learning: S3 roles provided may not have permission to perform `head_bucket` and `delete_object` operations; verification logic should avoid using these methods.
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: lint-check (ubuntu-latest)
kirkrodrigues
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the PR title, how about:
docs(clp-s): Describe more compression options; Update out-of-date description of `archive-path` option for decompression and search.
archive-path option for clp-s decompression and search.archive-path option for decompression and search.
…scription of `archive-path` option for decompression and search. (y-scope#1030) Co-authored-by: kirkrodrigues <[email protected]>
Description
This PR adds more detail to the usage guide for clp-s compression in order to make users aware of options to:
We also correct the description of the
archives-pathargument for clp-s decompression and search to make it clear that we support:Checklist
breaking change.
Validation performed
Summary by CodeRabbit
Summary by CodeRabbit