Skip to content

Conversation

@sitaowang1998
Copy link
Contributor

@sitaowang1998 sitaowang1998 commented Nov 21, 2025

Description

This PR:

  • Adds guides-using-spider.md for instructions on setting up Spider config in CLP package.
  • Adds the guides-using-spider.md into the toctree.
  • Updates guides-multi-node.md for instructions on running Spider containers manually.
  • Updates design-deployment-orchestration for Spider containers and some minor improvements on existing graph.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • task docs:serve runs successfully.
  • Website served looks good.

Summary by CodeRabbit

  • Documentation
    • Added a new user guide for configuring and using Spider with prerequisites and setup steps.
    • Added a Guides navigation entry pointing to the new Spider guide.
    • Updated deployment and orchestration docs and diagrams to include Spider components and four deployment types.
    • Extended service tables and architecture visuals to show spider-scheduler and spider-compression-worker.
    • Clarified that queue and Redis services are optional when using Celery.

✏️ Tip: You can customize this high-level summary in your review settings.

@sitaowang1998 sitaowang1998 requested a review from a team as a code owner November 21, 2025 00:19
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 21, 2025

Warning

Rate limit exceeded

@sitaowang1998 has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 6 minutes and 21 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 8caa973 and 1c1636e.

📒 Files selected for processing (1)
  • docs/src/user-docs/guides-using-spider.md (1 hunks)

Walkthrough

Documentation updates add Spider integration: new "Using Spider" guide and index entry, mark Celery queue/Redis as optional, and introduce Spider services (spider-scheduler, spider-compression-worker) across orchestration, deployment, and multi-host docs. No public API or code changes.

Changes

Cohort / File(s) Summary
New user guide & index
docs/src/user-docs/guides-using-spider.md, docs/src/user-docs/index.md
Added a "Using Spider" guide and registered it in the user docs toctree; includes requirements, config examples (set compression_scheduler.type: "spider"), optional Spider DB/credentials, and startup steps.
Multi-host doc updates
docs/src/user-docs/guides-multi-host.md
Annotated Celery/Spider components; clarified that queue and Redis are optional when using Celery; documented service starts for spider-scheduler and spider-compression-worker.
Design / deployment orchestration
docs/src/dev-docs/design-deployment-orchestration.md
Integrated Spider components into architecture diagrams, tables and deployment model: added spider_scheduler and spider_compression_worker entities, updated dependencies/edges/subgraphs, expanded deployment types (including Spider Base/Full) and referenced new docker-compose variants; updated service table and styling for Spider visuals.

Sequence Diagram(s)

sequenceDiagram
    participant User as CLI/User
    participant CLP as CLP Package
    participant SpiderSched as spider-scheduler
    participant SpiderWorker as spider-compression-worker
    participant DB as spider-db (optional)
    participant Storage as ObjectStore

    rect rgba(173,216,230,0.12)
    Note right of CLP: When compression_scheduler.type = "spider"
    end

    User->>CLP: start package
    CLP->>SpiderSched: submit compression job
    SpiderSched->>SpiderWorker: schedule/compression task
    SpiderWorker->>Storage: read input blobs
    SpiderWorker->>Storage: write compressed output
    SpiderWorker->>DB: record task metadata (optional)
    SpiderSched->>CLP: report job status
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus review on:
    • docs/src/dev-docs/design-deployment-orchestration.md — diagram consistency, subgraph/link styling, deployment-type references.
    • docs/src/user-docs/guides-using-spider.md — command/version accuracy and YAML examples.
    • docs/src/user-docs/guides-multi-host.md — placement and clarity of optional annotations for queue/Redis and service startup steps.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding documentation about using Spider in Docker Compose packaging, which aligns directly with the PR's primary objective and file changes.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sitaowang1998 sitaowang1998 marked this pull request as ready for review November 24, 2025 17:21
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b6c1fb and b12c48d.

📒 Files selected for processing (4)
  • docs/src/dev-docs/design-deployment-orchestration.md (8 hunks)
  • docs/src/user-docs/guides-multi-host.md (3 hunks)
  • docs/src/user-docs/guides-using-spider.md (1 hunks)
  • docs/src/user-docs/index.md (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-10-27T07:07:37.901Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.

Applied to files:

  • docs/src/dev-docs/design-deployment-orchestration.md
📚 Learning: 2025-08-25T16:27:50.549Z
Learnt from: davemarco
Repo: y-scope/clp PR: 1198
File: components/webui/server/src/plugins/app/Presto.ts:38-43
Timestamp: 2025-08-25T16:27:50.549Z
Learning: In the CLP webui Presto configuration, host and port are set via package settings (configurable), while user, catalog, and schema are set via environment variables (environment-specific). This mixed approach is intentional - settings are typically set by package and some values don't need to be package-configurable.

Applied to files:

  • docs/src/dev-docs/design-deployment-orchestration.md
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Applied to files:

  • docs/src/user-docs/guides-multi-host.md
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • docs/src/user-docs/guides-multi-host.md
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Applied to files:

  • docs/src/user-docs/guides-multi-host.md
🪛 LanguageTool
docs/src/dev-docs/design-deployment-orchestration.md

[duplication] ~87-~87: Possible typo: you repeated a word.
Context: ...Engine] queue redis end end subgraph Initialization jobs db_...

(ENGLISH_WORD_REPEAT_RULE)

docs/src/user-docs/guides-using-spider.md

[typographical] ~7-~7: It appears that a comma is missing after this introductory phrase.
Context: ...tion with CLP may change in the future. Right now Spider only supports executing CLP comp...

(RIGHT_NOW_PUNCTUATION_COMMA)


[style] ~50-~50: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ..." port: 6000 ``` 3. Optionally, before starting the package, update th...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🔇 Additional comments (13)
docs/src/user-docs/guides-multi-host.md (3)

165-175: Clear annotations for optional Celery infrastructure services.

The addition of "(optional, only if using Celery)" comments provides helpful context for users choosing between compression frameworks.


199-203: Spider scheduler service startup instruction is clear.

The new spider-scheduler service follows the established pattern and includes appropriate optional annotation. The formatting and docker compose syntax are consistent with surrounding commands.


239-249: Spider compression worker service startup instructions align with guide structure.

Both the standard compression-worker (now marked as Celery-optional) and the new spider-compression-worker service follow consistent patterns. The optional annotations properly guide users on when to start each service.

docs/src/dev-docs/design-deployment-orchestration.md (5)

32-33: Theme variables for diagram styling are correctly added.

The new tertiaryColor and clusterBkg theme variables provide proper styling support for the Spider and Celery subgraph visualizations that follow.


45-45: Spider component nodes and dependencies are logically structured.

The new spider-scheduler and spider-compression-worker nodes are correctly positioned in the diagram, and their dependencies from db_table_creator match the pattern used for other initialization-dependent services. This aligns with the learnings about compression scheduler initialization requirements.

Also applies to: 47-47, 69-70


84-102: Subgraph organization and styling effectively distinguish framework choices.

The Celery and Spider subgraphs clearly separate framework-specific components, with appropriate colour-coding (gold for Celery, cyan for Spider). The styling helps users visually understand the architectural differences between deployment types. However, ensure the deployed docker-compose files (docker-compose.yaml, docker-compose-base.yaml, docker-compose-spider.yaml, docker-compose-spider-base.yaml) actually exist and match the deployment types table.

Also applies to: 123-129


141-157: Service table descriptions are clear and concise.

The new rows for spider_scheduler, spider_compression_worker, queue, and redis correctly describe their roles. The optional nature of queue and redis for different configurations is implicit in the context of the deployment types section.


236-244: Deployment types table provides essential configuration guidance.

The table clearly explains how package.compression_scheduler.type and package.query_engine settings map to deployment variants and their corresponding docker-compose files. This is a valuable addition for users selecting their deployment strategy.

docs/src/user-docs/index.md (1)

69-69: Guides toctree updated correctly with Spider guide entry.

The new guides-using-spider entry is properly positioned in the Guides section and follows the established toctree structure.

docs/src/user-docs/guides-using-spider.md (4)

1-10: Title and introduction are clear; note properly contextualizes Spider's current scope.

The introductory note appropriately sets expectations that Spider is under active development and currently supports only compression tasks. This manages user expectations effectively.


12-17: Requirements section is specific and actionable.

Version constraints are clearly specified, making it straightforward for users to verify their environment. The inclusion of python3-venv is a helpful detail often overlooked.


25-58: Configuration examples are clear and well-structured.

The YAML snippets demonstrate the required compression_scheduler.type: "spider" setting clearly, and the optional configurations for spider_db, spider_scheduler, and credentials are appropriately documented. The reference back to the quick-start guide for next steps maintains good flow between documents.


60-63: Reference links are consistent with project documentation patterns.

External references to CLP releases, Docker, Docker Compose, and Spider are properly formatted and link to authoritative sources.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b12c48d and 45f05ba.

📒 Files selected for processing (1)
  • docs/src/user-docs/guides-using-spider.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/src/user-docs/guides-using-spider.md

[style] ~7-~7: For conciseness, consider replacing this expression with an adverb.
Context: ...tion with CLP may change in the future. Right now, Spider only supports executing CLP com...

(AT_THE_MOMENT)

🪛 markdownlint-cli2 (0.18.1)
docs/src/user-docs/guides-using-spider.md

12-12: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


19-19: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)


35-35: Trailing spaces
Expected: 0 or 2; Actual: 6

(MD009, no-trailing-spaces)


37-37: Trailing spaces
Expected: 0 or 2; Actual: 3

(MD009, no-trailing-spaces)


44-44: Trailing spaces
Expected: 0 or 2; Actual: 3

(MD009, no-trailing-spaces)


49-49: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


57-57: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


63-63: Files should end with a single newline character

(MD047, single-trailing-newline)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
docs/src/user-docs/guides-using-spider.md (2)

25-62: Consider consolidating optional configuration into a dedicated subsection.

The three optional configuration items (spider_db, spider_scheduler, and credentials.yaml updates) are currently interspersed with step 2 as separate sub-bullets, which creates a somewhat confusing structure with "(Optional)" repeated multiple times. This has been noted in prior reviews.

For improved readability and to reduce repetition, consider creating a dedicated "Optional Configuration" subsection after the main numbered steps to group all optional settings together:

 1. Follow the [quick-start](quick-start/index.md) guide to download and extract the CLP package,
    but don't start the package just yet.
+
 2. Before starting the package, update the package's config file (`etc/clp-config.yaml`) as follows:
-
-    * Set the `compression_scheduler.type` key to `"spider"`.
+
+    Set the `compression_scheduler.type` key to `"spider"`:
 
       ```yaml
       compression_scheduler:
         type: "spider"
       ```
 
-    * (Optional) Set the `spider_db`.
-   
-        ```yaml
-        spider_db:
-            db_name: "spider-db"
-        ```
-
-    * (Optional) Set the `spider_scheduler`.
-   
-        ```yaml
-        spider_scheduler:
-            host: "localhost"
-            port: 6000
-        ```
-
-3. (Optional) Before starting the package, update the package's credential file (`etc/credentials.yaml`)
+
+3. Continue following the [quick-start](./quick-start/index.md#using-clp) guide to start CLP.
+
+### Optional Configuration
+
+You can optionally customize the following settings:
+
+* **spider_db**: Set the Spider database configuration.
+
+  ```yaml
+  spider_db:
+      db_name: "spider-db"
+  ```
+
+* **spider_scheduler**: Set the Spider scheduler connection details.
+
+  ```yaml
+  spider_scheduler:
+      host: "localhost"
+      port: 6000
+  ```
+
+* **Spider database credentials**: Before starting the package, update the package's credential file (`etc/credentials.yaml`)
    to add Spider database credentials as follows:
 
     ```yaml
     spider_db:
       username: "spider_user"
       password: "spider_password"
     ```
-
-4. Continue following the [quick-start](./quick-start/index.md#using-clp) guide to start CLP.

This improves the flow by keeping the main procedural steps sequential while clearly separating optional customizations.


64-67: Add trailing newline to end of file.

The file is missing a trailing newline character, which is required by the MD047 linting rule.

 [clp-releases]: https://github.com/y-scope/clp/releases
 [docker-compose]: https://docs.docker.com/compose/install/
 [Docker]: https://docs.docker.com/engine/install/
 [Spider]: https://github.com/y-scope/spider
+
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45f05ba and 8caa973.

📒 Files selected for processing (1)
  • docs/src/user-docs/guides-using-spider.md (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-17T16:19:24.478Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/README.md:5-6
Timestamp: 2025-08-17T16:19:24.478Z
Learning: In the CLP project, README.md files can have multiple H1 headings when it makes sense structurally (e.g., "Contributing" as a top-level section alongside the main title). The MD025 markdown linting rule should be ignored in such cases where multiple H1s are intentional.

Applied to files:

  • docs/src/user-docs/guides-using-spider.md
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.

Applied to files:

  • docs/src/user-docs/guides-using-spider.md
🪛 LanguageTool
docs/src/user-docs/guides-using-spider.md

[style] ~7-~7: For conciseness, consider replacing this expression with an adverb.
Context: ...tion with CLP may change in the future. Right now, Spider only supports executing CLP com...

(AT_THE_MOMENT)

🪛 markdownlint-cli2 (0.18.1)
docs/src/user-docs/guides-using-spider.md

39-39: Trailing spaces
Expected: 0 or 2; Actual: 3

(MD009, no-trailing-spaces)


46-46: Trailing spaces
Expected: 0 or 2; Actual: 3

(MD009, no-trailing-spaces)


67-67: Files should end with a single newline character

(MD047, single-trailing-newline)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: antlr-code-committed (macos-15)
🔇 Additional comments (1)
docs/src/user-docs/guides-using-spider.md (1)

1-10: Documentation content is accurate and well-structured.

The guide provides clear setup instructions for integrating Spider with CLP. The introduction appropriately notes that Spider is under active development, and the requirements section specifies concrete version dependencies. Setup steps follow a logical flow from downloading the package to configuration to starting CLP.

@kirkrodrigues kirkrodrigues requested review from junhaoliao and removed request for LinZhihao-723 November 28, 2025 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant