
Conversation

@sitaowang1998 (Contributor) commented Nov 15, 2025

Description

This PR integrates Spider compression orchestration into the docker compose package. Specifically, it:

  • Adds Spider to ClpConfig, with minor cleanup of unnecessary dump_to_primitive_type usage.
  • Adds setup scripts for the Spider database.
  • Adds native scripts to start multiple Spider workers in a container.
  • Adds docker compose files to start the Spider scheduler and worker containers.

This PR also:

  • Bumps the mariadb Python dependency to >=1.1.14.
  • Bumps the spider-py version to 0.2.0.
  • Bumps Spider to the latest version for a fix in starting Python workers.

This PR is a breaking change because it updates dependency versions.

The doc change will be put in #1647.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Spider compression workflow works end-to-end.
  • Search works with the compressed archives produced by the previous step.
  • GitHub workflows pass.

Summary by CodeRabbit

  • New Features

    • Optional Spider orchestration: Spider scheduler, Spider DB provisioning/init tool, spider-worker CLI, and Spider credentials validation/loading.
  • Improved Behaviour

    • Task payloads/results use bytes/UTF‑8 end‑to‑end; runtime selects Spider or Celery task manager. Environment setup and compose selection now respect optional queue/redis/spider components and inject Spider env when enabled.
  • Chores

    • New Spider-focused compose packages and dependency version updates.
  • Removed

    • Legacy byte-list helper utilities replaced by direct UTF‑8 handling.
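The "bytes/UTF‑8 end‑to‑end" payload handling mentioned above can be illustrated with a minimal sketch. This is a hypothetical example, not the PR's actual task-manager API; the function names `encode_payload` and `decode_result` are invented for illustration:

```python
import json


def encode_payload(payload: dict) -> bytes:
    """Serialize a task payload directly to UTF-8 bytes (illustrative only)."""
    return json.dumps(payload).encode("utf-8")


def decode_result(raw: bytes) -> dict:
    """Deserialize a task result from UTF-8 bytes (illustrative only)."""
    return json.loads(raw.decode("utf-8"))


# Payloads stay as bytes across the wire; no byte-list helper is needed.
payload = encode_payload({"job_id": 42, "paths": ["/mnt/logs/app.log"]})
result = decode_result(payload)
print(result["job_id"])  # → 42
```

The point of moving to direct UTF‑8 handling is that the serialization boundary is a single encode/decode pair rather than helper utilities that convert between byte lists and strings.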


junhaoliao and others added 30 commits October 9, 2025 22:38
This reverts commit 7357a7e.
# Conflicts:
#	docs/src/dev-docs/building-package.md
…lp-package Dockerfile (fixes y-scope#1379); Reduce layers using multi-line ENV (fixes y-scope#1378).
…r` and `spider_worker`) in the `clp-package` image; Set `mariadb-connector-cpp` build type to `Release` (fixes y-scope#1410).
@coderabbitai bot (Contributor) left a comment
Actionable comments posted: 2

♻️ Duplicate comments (2)
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (2)

269-271: Critical: Parameterize Spider DB password in CREATE USER statement.

The password is embedded directly in the f-string, creating two concrete problems:

  • Correctness: Passwords containing quotes or special characters will break the SQL statement.
  • Security hygiene: This bypasses the project's standard practice of using parameterized SQL with %s placeholders and tuple arguments to prevent injection risks. Based on learnings

The database and user names are already validated by _validate_name, so they remain safely interpolated. The password must be passed as a bound parameter:

Apply this diff:

             db_cursor.execute(
-                f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY '{spider_db_password}'"""
+                f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY %s""",
+                (spider_db_password,)
             )

This lets the MariaDB driver handle escaping the secret value correctly.
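To see why binding matters, here is a self-contained demonstration using sqlite3 (chosen only because it ships with Python; the MariaDB driver uses `%s` placeholders instead of `?`, but the escaping principle is identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (name TEXT, password TEXT)")

# A password like this would break a quoted f-string interpolation.
password = "p@ss'word\"with-quotes"

# Safe: the driver escapes the bound value, quotes and all.
cur.execute("INSERT INTO users VALUES (?, ?)", ("spider", password))

cur.execute("SELECT password FROM users WHERE name = ?", ("spider",))
stored = cur.fetchone()[0]
print(stored == password)  # → True: the secret round-trips intact
```

Had the password been interpolated into the SQL string directly, the embedded single quote would have terminated the literal early and produced a syntax error (or worse, an injection vector).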


245-250: Add clarifying comment on SqlAdapter initialization for future maintainability.

The SqlAdapter is initialized with clp_config.database because root operations (CREATE DATABASE, CREATE USER, GRANT) must execute against the CLP database server where root credentials are available. This is correct for the current single-MariaDB-instance Docker setup where Spider DB shares the same server.

However, a clarifying comment would help prevent confusion during future refactors that might introduce separate Spider DB hosts.

Add a comment above line 246:

+        # Note: SqlAdapter uses clp_config.database to connect with root credentials
+        # for CREATE DATABASE/USER/GRANT operations. In the current Docker setup,
+        # Spider DB resides on the same MariaDB instance as CLP metadata DB.
         sql_adapter = SqlAdapter(clp_config.database)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 93a5af2 and a2c449d.

📒 Files selected for processing (4)
  • components/clp-py-utils/clp_py_utils/create-db-tables.py (2 hunks)
  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1 hunks)
  • components/package-template/src/etc/clp-config.yaml (1 hunks)
  • components/package-template/src/etc/credentials.template.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-04-17T16:55:23.796Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:23.796Z
Learning: In the CLP project, SQL queries should use parameterized queries with placeholders (%s) and pass values as a tuple to `db_cursor.execute()` to prevent SQL injection, rather than directly interpolating values into the query string.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-01-23T17:08:55.566Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
🧬 Code graph analysis (2)
components/clp-py-utils/clp_py_utils/create-db-tables.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (6)
  • ClpConfig (682-952)
  • StorageEngine (101-103)
  • load_credentials_from_env (254-259)
  • load_credentials_from_env (284-289)
  • load_credentials_from_env (367-371)
  • load_credentials_from_env (432-437)
components/clp-py-utils/clp_py_utils/core.py (1)
  • read_yaml_config_file (58-64)
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (4)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
  • ClpConfig (682-952)
  • load_credentials_from_env (254-259)
  • load_credentials_from_env (284-289)
  • load_credentials_from_env (367-371)
  • load_credentials_from_env (432-437)
components/clp-py-utils/clp_py_utils/core.py (1)
  • read_yaml_config_file (58-64)
components/clp-py-utils/clp_py_utils/sql_adapter.py (2)
  • SqlAdapter (60-145)
  • create_root_mariadb_connection (109-119)
components/clp-py-utils/clp_py_utils/create-db-tables.py (1)
  • main (21-72)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-macos (macos-15, true)
🔇 Additional comments (4)
components/clp-py-utils/clp_py_utils/create-db-tables.py (1)

7-8: LGTM!

The new imports are correctly scoped and necessary for Spider DB initialization logic.

components/package-template/src/etc/credentials.template.yaml (1)

14-18: LGTM!

The Spider DB credentials template follows the same pattern as existing credential sections and provides appropriate scaffolding for Spider-based deployments.

components/clp-py-utils/clp_py_utils/initialize-spider-db.py (2)

228-241: LGTM!

Configuration loading and credential validation are correctly implemented. The Spider-specific credential loader properly reads SPIDER_DB_USER and SPIDER_DB_PASS environment variables, and exception handling appropriately distinguishes between validation errors and unexpected failures.


289-298: LGTM!

The name validation pattern correctly restricts database and user identifiers to alphanumeric characters, underscores, and hyphens, preventing SQL injection through identifier interpolation. This aligns with the team's preference for sanitizing even trusted inputs for robustness. Based on learnings
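The identifier validation described above might look like the following sketch. The name `validate_name` and the exact character class are assumptions for illustration; the PR's `_validate_name` may differ in detail:

```python
import re

# Allow only alphanumerics, underscores, and hyphens in SQL identifiers.
_NAME_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")


def validate_name(name: str) -> None:
    """Reject identifiers that are unsafe to interpolate into SQL (sketch)."""
    if _NAME_PATTERN.match(name) is None:
        raise ValueError(f"Invalid identifier: {name!r}")


validate_name("spider-db")       # accepted
validate_name("spider_user_01")  # accepted
try:
    validate_name("x'; DROP TABLE users; --")
except ValueError as e:
    print(e)  # rejected: quotes, spaces, and semicolons fail the pattern
```

Because identifiers cannot be bound as parameters in most SQL drivers, a strict allowlist like this is the standard way to make their interpolation safe.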

@coderabbitai bot (Contributor) left a comment
Actionable comments posted: 3

♻️ Duplicate comments (3)
tools/deployment/package/docker-compose-all.yaml (2)

305-305: Rename hostname to avoid ambiguity with compression-worker.

Both the compression-worker (line 270) and spider-compression-worker (line 305) services use the same hostname "compression_worker", which creates ambiguity in logs and metrics. Change the spider-compression-worker hostname to a distinct value.

  spider-compression-worker:
    <<: *service_defaults
-   hostname: "compression_worker"
+   hostname: "spider_compression_worker"

303-334: Restore Spider scheduler connection arguments in spider-compression-worker command.

The spider-compression-worker service is missing the --host and --port arguments required to connect to the Spider scheduler for task placement and coordination. Previous review iterations explicitly flagged this inconsistency, and the spider-scheduler service (lines 227–228) properly demonstrates the pattern using environment variables.

Additionally, SPIDER_SCHEDULER_HOST and SPIDER_SCHEDULER_PORT are not defined in the environment section (lines 306–313), even though the command needs them.

Apply this diff to add the missing scheduler connection arguments and environment variables:

  spider-compression-worker:
    <<: *service_defaults
    hostname: "spider_compression_worker"
    environment:
      CLP_CONFIG_PATH: "/etc/clp-config.yaml"
      CLP_HOME: "/opt/clp"
      CLP_LOGGING_LEVEL: "${CLP_COMPRESSION_WORKER_LOGGING_LEVEL:-INFO}"
      CLP_LOGS_DIR: "/var/log/compression_worker"
      CLP_WORKER_LOG_PATH: "/var/log/compression_worker/worker.log"
      PYTHONPATH: "/opt/clp/lib/python3/site-packages"
      SPIDER_LOG_DIR: "/var/log/compression_worker"
+     SPIDER_SCHEDULER_HOST: "${SPIDER_SCHEDULER_HOST:?Please set a value.}"
+     SPIDER_SCHEDULER_PORT: "${SPIDER_SCHEDULER_PORT:?Please set a value.}"
    volumes:
      - *volume_clp_config_readonly
      - *volume_clp_logs
      - *volume_clp_tmp
      - "${CLP_ARCHIVE_OUTPUT_DIR_HOST:-empty}:/var/data/archives"
      - "${CLP_AWS_CONFIG_DIR_HOST:-empty}:/opt/clp/.aws:ro"
      - "${CLP_LOGS_INPUT_DIR_HOST:-empty}:${CLP_LOGS_INPUT_DIR_CONTAINER:-/mnt/logs}"
      - "${CLP_STAGED_ARCHIVE_OUTPUT_DIR_HOST:-empty}:/var/data/staged-archives"
    depends_on:
      db-table-creator:
        condition: "service_completed_successfully"
    command: [
      "python3",
      "-u",
      "/opt/clp/lib/python3/site-packages/clp_py_utils/start-spider-worker.py",
      "--concurrency", "${CLP_COMPRESSION_WORKER_CONCURRENCY:-1}",
      "--storage-url", "jdbc:mariadb://database:${CLP_DB_PORT:-3306}/\
        ${SPIDER_DB_NAME:-spider-db}?\
        user=${SPIDER_DB_USER:?Please set a value.}\
        &password=${SPIDER_DB_PASS:?Please set a value.}",
+     "--host", "${SPIDER_SCHEDULER_HOST}",
+     "--port", "${SPIDER_SCHEDULER_PORT}",
    ]
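The `${VAR:?Please set a value.}` expansions in the compose file above make startup fail fast when a required variable is unset. Compose models this on POSIX parameter expansion, which can be demonstrated directly in a shell (a minimal illustration, not part of the PR):

```shell
#!/bin/sh
# POSIX "error if unset" expansion, the same form docker compose interpolates.
unset SPIDER_DB_USER
if sh -c ': "${SPIDER_DB_USER:?Please set a value.}"' 2>/dev/null; then
    echo "unset: accepted (unexpected)"
else
    echo "unset: rejected"          # expansion aborts with the message
fi

SPIDER_DB_USER=clp
export SPIDER_DB_USER
sh -c 'echo "user=${SPIDER_DB_USER:?Please set a value.}"'  # prints user=clp
```

Using `:?` for credentials (rather than a `:-` default) is deliberate: a missing secret should stop the container, never silently fall back.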
components/clp-py-utils/clp_py_utils/start-spider-worker.py (1)

57-61: Wait for terminated processes to prevent zombies.

The exception handler terminates processes but doesn't wait for them to finish cleanup, which can leave zombie processes. This issue was previously flagged in past reviews and remains unresolved.

Apply this diff to ensure proper cleanup:

     except OSError as e:
         logger.error(f"Failed to start spider worker: {e}")
         for process in processes:
             process.terminate()
+            try:
+                process.wait(timeout=5)
+            except subprocess.TimeoutExpired:
+                process.kill()
+                process.wait()
         exit(1)
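The terminate-then-wait pattern suggested in the diff can be sketched as a runnable example (stand-in `sleep` subprocesses replace real Spider workers; the helper name `shutdown_workers` is invented for illustration):

```python
import subprocess
import sys


def shutdown_workers(processes: list) -> None:
    """Terminate workers, then wait so no zombies are left behind (sketch)."""
    for process in processes:
        process.terminate()
        try:
            process.wait(timeout=5)
        except subprocess.TimeoutExpired:
            process.kill()  # escalate if SIGTERM is ignored
            process.wait()


# Spawn a couple of stand-in "workers" that would otherwise run for a minute.
procs = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    for _ in range(2)
]
shutdown_workers(procs)
print(all(p.returncode is not None for p in procs))  # → True: all reaped
```

Without the `wait()` calls, each terminated child would remain a zombie until the parent exits, holding its PID and exit status in the kernel's process table.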
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2293d50 and 60ef08b.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/start-spider-worker.py (1 hunks)
  • tools/deployment/package/docker-compose-all.yaml (5 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Applied to files:

  • components/clp-py-utils/clp_py_utils/start-spider-worker.py
  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-20T21:05:30.417Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
🔇 Additional comments (7)
tools/deployment/package/docker-compose-all.yaml (2)

238-238: Verify intentionality of compression-scheduler stop_grace_period increase.

The compression-scheduler service now has stop_grace_period: "300s" instead of the previous 60 seconds. Confirm this longer timeout is necessary for Spider integration and document the reason if it persists across the codebase.


42-45: Volume anchor consolidation applied successfully.

The new volume_clp_tmp anchor eliminates duplication across compression-worker and spider-compression-worker services. This aligns with earlier review suggestions and improves maintainability.

components/clp-py-utils/clp_py_utils/start-spider-worker.py (5)

1-16: LGTM!

The imports and logging setup are appropriate and follow standard Python practices.


35-40: LGTM!

The argument parsing and concurrency validation correctly ensure at least one worker is spawned. This addresses the previous review feedback about validating the concurrency parameter.


43-47: LGTM!

The worker path resolution and existence check are well-implemented and address previous feedback about validating the executable before attempting to spawn subprocesses.


63-70: LGTM!

The process monitoring and failure handling correctly track and report worker failures, ensuring the script exits with a non-zero status if any worker fails. This addresses previous feedback about handling non-zero exit codes.


73-74: LGTM!

Standard Python entry point.

@LinZhihao-723 (Member) left a comment
Overall lgtm.

@coderabbitai bot (Contributor) left a comment
Actionable comments posted: 1

♻️ Duplicate comments (1)
components/clp-py-utils/clp_py_utils/start-spider-worker.py (1)

66-68: Add wait() after terminate() to prevent zombie processes.

When an error occurs during worker startup, the script terminates already-started processes but doesn't wait for them to complete cleanup. This can leave zombie processes.

Apply this diff to ensure proper cleanup:

     except OSError as e:
         logger.error(f"Failed to start Spider worker: {e}")
         for process in processes:
             process.terminate()
+            process.wait()
         exit(1)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74a3149 and 01c60a2.

📒 Files selected for processing (1)
  • components/clp-py-utils/clp_py_utils/start-spider-worker.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Applied to files:

  • components/clp-py-utils/clp_py_utils/start-spider-worker.py
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/start-spider-worker.py (2)
components/clp-py-utils/clp_py_utils/create-db-tables.py (1)
  • main (21-72)
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1)
  • main (223-286)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: rust-checks
🔇 Additional comments (4)
components/clp-py-utils/clp_py_utils/start-spider-worker.py (4)

1-16: LGTM! Logger setup is appropriate.

The imports are correct and the logger configuration follows Python best practices. The script consistently uses the logger instead of print statements throughout.


38-40: Concurrency validation is correct.

The check concurrency >= 1 is appropriate because running a single worker (concurrency=1) is a valid use case. There's no need to require multiple workers (>1).


43-54: LGTM! Proper error handling and validation.

The hostname resolution includes appropriate exception handling for socket errors, and the script validates the spider_worker executable exists before attempting to spawn processes. This prevents common failure modes.


70-77: LGTM! Exit code handling is well-implemented.

The script correctly waits for all worker processes, logs non-zero exit codes, and returns an appropriate exit status to indicate failure. This follows best practices for process orchestration.
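The spawn-wait-and-propagate pattern the review praises can be sketched as follows. This is an illustrative reduction of the behaviour, not the PR's `start-spider-worker.py`; the helper name `run_workers` is invented:

```python
import subprocess
import sys


def run_workers(cmd: list, concurrency: int) -> int:
    """Spawn `concurrency` copies of cmd, wait for all, return 1 on any failure."""
    processes = [subprocess.Popen(cmd) for _ in range(concurrency)]
    exit_code = 0
    for process in processes:
        process.wait()
        if process.returncode != 0:
            print(f"worker {process.pid} exited with code {process.returncode}")
            exit_code = 1  # remember the failure, but keep reaping the rest
    return exit_code


ok = run_workers([sys.executable, "-c", "pass"], concurrency=2)
bad = run_workers([sys.executable, "-c", "import sys; sys.exit(3)"], concurrency=1)
print(ok, bad)  # → 0 1
```

The key detail is that the loop waits on every worker before returning, so one early failure neither orphans the remaining processes nor masks their exit codes.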

@coderabbitai bot (Contributor) left a comment
Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01c60a2 and 3c502d2.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-19T18:28:26.747Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 1169
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:463-469
Timestamp: 2025-09-19T18:28:26.747Z
Learning: In the compression scheduler (components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py), the SIGTERM handling uses a coarse-grained approach that only checks received_sigterm before calling search_and_schedule_new_tasks. The maintainers consider the race condition where SIGTERM arrives mid-execution to be benign, as scheduling a few extra tasks during shutdown is acceptable and will be handled by cleanup mechanisms.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (6)
  • OrchestrationType (117-119)
  • load_credentials_from_env (254-259)
  • load_credentials_from_env (280-285)
  • load_credentials_from_env (363-367)
  • load_credentials_from_env (428-433)
  • get_container_url (276-278)
components/job-orchestration/job_orchestration/scheduler/compress/task_manager/spider_task_manager.py (1)
  • SpiderTaskManager (14-48)
components/job-orchestration/job_orchestration/scheduler/compress/task_manager/celery_task_manager.py (1)
  • CeleryTaskManager (12-27)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: rust-checks
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

20-20: LGTM!

The new imports for OrchestrationType and SpiderTaskManager are correctly added and necessary for the Spider orchestration integration.

Also applies to: 37-37


523-523: LGTM!

The forward declaration of task_manager with a union type annotation improves scope clarity and addresses the suggestion from the previous review.

Comment on lines +526 to +528
elif clp_config.compression_scheduler.type == OrchestrationType.spider:
    clp_config.spider_db.load_credentials_from_env()
    task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())

⚠️ Potential issue | 🔴 Critical

Wrap Spider credential loading in exception handling.

The call to clp_config.spider_db.load_credentials_from_env() at line 527 is outside the try-except block (lines 509-518), which means if required environment variables are missing, the program will crash with an unhandled ValueError instead of logging an error and exiting gracefully. This is inconsistent with the database credential loading at line 511, which is properly wrapped in exception handling.

Apply this diff to add proper exception handling:

 logger.info(f"Starting {COMPRESSION_SCHEDULER_COMPONENT_NAME}")
 sql_adapter = SqlAdapter(clp_config.database)

 task_manager: CeleryTaskManager | SpiderTaskManager
 if clp_config.compression_scheduler.type == OrchestrationType.celery:
     task_manager = CeleryTaskManager()
 elif clp_config.compression_scheduler.type == OrchestrationType.spider:
-    clp_config.spider_db.load_credentials_from_env()
-    task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
+    try:
+        clp_config.spider_db.load_credentials_from_env()
+        task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
+    except Exception as err:
+        logger.error(f"Failed to initialize Spider task manager: {err}")
+        return -1
 else:
     logger.error(
         f"Unsupported compression scheduler type: {clp_config.compression_scheduler.type}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
elif clp_config.compression_scheduler.type == OrchestrationType.spider:
    clp_config.spider_db.load_credentials_from_env()
    task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
elif clp_config.compression_scheduler.type == OrchestrationType.spider:
    try:
        clp_config.spider_db.load_credentials_from_env()
        task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
    except Exception as err:
        logger.error(f"Failed to initialize Spider task manager: {err}")
        return -1
🤖 Prompt for AI Agents
In
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
around lines 526 to 528, the call to
clp_config.spider_db.load_credentials_from_env() must be wrapped in the same
try/except used earlier for DB credential loading: enclose the
load_credentials_from_env() call in a try block and catch ValueError (or the
same exception type used at lines 509-518), log a clear error with the existing
logger, and terminate the process (use the same exit path as the surrounding
code) so missing Spider env credentials do not raise an unhandled exception.
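The log-and-exit pattern being requested can be sketched in isolation. A minimal, hypothetical example — the env var names and messages here are assumptions based on the discussion above, not the scheduler's actual code:

```python
import os
import sys


def load_spider_credentials() -> tuple:
    """Read Spider DB credentials, raising ValueError if any are missing."""
    try:
        return os.environ["SPIDER_DB_USER"], os.environ["SPIDER_DB_PASS"]
    except KeyError as err:
        raise ValueError(f"Missing required environment variable: {err}") from err


def main() -> int:
    # Mirror the requested pattern: log the failure and return an error code
    # instead of letting the ValueError propagate as an unhandled exception.
    try:
        user, password = load_spider_credentials()
    except ValueError as err:
        print(f"Failed to initialize Spider task manager: {err}", file=sys.stderr)
        return -1
    print(f"Connecting as {user}", file=sys.stderr)
    return 0
```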

LinZhihao-723
LinZhihao-723 previously approved these changes Nov 25, 2025
Member

@LinZhihao-723 LinZhihao-723 left a comment

The orchestration part lgtm.
Leave the rest to @junhaoliao

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
components/clp-package-utils/clp_package_utils/general.py (1)

509-513: Add None check for spider_db to prevent AttributeError.

The function validate_and_load_spider_db_credentials_file calls clp_config.spider_db.load_credentials_from_file() without verifying that spider_db is not None. If called when spider_db is unconfigured, this will raise an AttributeError.

Apply this diff to add defensive None handling:

 def validate_and_load_spider_db_credentials_file(
     clp_config: ClpConfig, clp_home: pathlib.Path, generate_default_file: bool
 ):
     validate_credentials_file_path(clp_config, clp_home, generate_default_file)
+    if clp_config.spider_db is None:
+        raise ValueError("Spider DB configuration must be present to load credentials.")
     clp_config.spider_db.load_credentials_from_file(clp_config.credentials_file_path)

Alternatively, if callers are expected to check spider_db presence before calling, add a docstring clarifying this precondition.

components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1)

269-271: Critical: Parameterize the Spider DB password in CREATE USER statement.

The CREATE USER statement embeds spider_db_password directly into an f-string, which:

  1. Breaks correctness for passwords containing quotes or special characters
  2. Bypasses parameterized SQL practice established in this codebase

Based on learnings, SQL queries should use parameterized queries with placeholders and pass values as a tuple to prevent injection and ensure proper escaping.

Apply this diff to parameterize the password:

             db_cursor.execute(
-                f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY '{spider_db_password}'"""
+                f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY %s""",
+                (spider_db_password,)
             )

Note: The database/user names remain interpolated since they are validated by _validate_name() which restricts to alphanumeric, underscore, and hyphen characters.
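The failure mode is easy to reproduce outside MariaDB. A minimal sketch using the standard library's sqlite3 (qmark placeholders rather than the MariaDB driver's %s; the table and values are made up for illustration):

```python
import sqlite3

# A password containing a quote breaks naive f-string interpolation but is
# handled safely by a placeholder.
password = "pa'ss"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")

# Broken: the embedded quote terminates the SQL string literal early.
try:
    conn.execute(f"INSERT INTO users VALUES ('spider-user', '{password}')")
except sqlite3.OperationalError as err:
    print(f"interpolation failed: {err}")

# Safe: the driver escapes the value, so any password round-trips intact.
conn.execute("INSERT INTO users VALUES (?, ?)", ("spider-user", password))
stored = conn.execute("SELECT password FROM users").fetchone()[0]
assert stored == password
```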

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c312c19 and bd9a78d.

📒 Files selected for processing (10)
  • components/clp-package-utils/clp_package_utils/controller.py (7 hunks)
  • components/clp-package-utils/clp_package_utils/general.py (3 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (8 hunks)
  • components/clp-py-utils/clp_py_utils/create-db-tables.py (2 hunks)
  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1 hunks)
  • components/clp-py-utils/clp_py_utils/sql_adapter.py (2 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1 hunks)
  • components/package-template/src/etc/clp-config.yaml (1 hunks)
  • components/package-template/src/etc/credentials.template.yaml (1 hunks)
  • tools/deployment/package/docker-compose-all.yaml (5 hunks)
🧰 Additional context used
🧠 Learnings (23)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-11-28T15:12:53.506Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1655
File: components/clp-package-utils/clp_package_utils/controller.py:183-189
Timestamp: 2025-11-28T15:12:53.506Z
Learning: In the y-scope/clp project, root database credentials (root username and password) are required configuration items for all deployments. The `credentials[ClpDbUserType.ROOT]` entry is guaranteed to exist and can be accessed directly without optional handling.

Applied to files:

  • components/clp-py-utils/clp_py_utils/sql_adapter.py
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Applied to files:

  • components/clp-py-utils/clp_py_utils/sql_adapter.py
  • components/clp-package-utils/clp_package_utils/controller.py
  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-27T07:07:37.901Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
  • components/clp-package-utils/clp_package_utils/controller.py
  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-04-17T16:55:23.796Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:23.796Z
Learning: In the CLP project, SQL queries should use parameterized queries with placeholders (%s) and pass values as a tuple to `db_cursor.execute()` to prevent SQL injection, rather than directly interpolating values into the query string.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-01-23T17:08:55.566Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
  • components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-08-17T16:10:38.722Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
  • components/clp-py-utils/clp_py_utils/clp_config.py
  • components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
  • components/clp-py-utils/clp_py_utils/clp_config.py
  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:02:31.113Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-25T06:29:59.610Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:14:12.225Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1466
File: .github/workflows/clp-rust-checks.yaml:14-15
Timestamp: 2025-10-22T21:14:12.225Z
Learning: Repository y-scope/clp: In GitHub Actions workflows (e.g., .github/workflows/clp-rust-checks.yaml), YAML anchors/aliases are acceptable and preferred to avoid duplication; if actionlint flags an alias node (e.g., on push.paths) as an error, treat it as a tool limitation and do not require inlining unless the team asks to silence the warning.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-20T21:05:30.417Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/create-db-tables.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
  • ClpConfig (774-1050)
  • load_credentials_from_env (326-345)
  • load_credentials_from_env (372-377)
  • load_credentials_from_env (459-463)
  • load_credentials_from_env (524-529)
components/clp-py-utils/clp_py_utils/core.py (1)
  • read_yaml_config_file (58-64)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: rust-checks
  • GitHub Check: uv-checks (ubuntu-24.04)
🔇 Additional comments (21)
components/package-template/src/etc/credentials.template.yaml (1)

16-20: Template addition for Spider DB credentials looks good.

The new commented section follows the established pattern and remains inactive (all entries commented). Ensure the example username and password are consistent with the setup scripts and initialization tooling mentioned in the PR objectives.

Please verify that the example credentials "spider-user" and "pass" match what is used in the Spider DB setup scripts added elsewhere in this PR.

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)

618-623: LGTM! Consistent serialization in error path.

The addition of .model_dump() ensures that both the error path and success path (line 676) return a serialized dictionary, which aligns with the PR's objective of switching task payloads to raw bytes.
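The shape contract can be illustrated with the standard library's dataclasses; the real code uses pydantic's model_dump(), which plays the same role of producing a plain dict on both paths (the TaskResult fields here are hypothetical):

```python
from dataclasses import asdict, dataclass
from typing import Optional


# Hypothetical stand-in for the task's result model; the real class is a
# pydantic model and is serialized with .model_dump() instead of asdict().
@dataclass
class TaskResult:
    task_id: int
    status: str
    error_message: Optional[str] = None


ok = TaskResult(task_id=1, status="succeeded")
failed = TaskResult(task_id=2, status="failed", error_message="bad input")

# Both paths hand the caller the same primitive structure, so downstream
# serialization (e.g. msgpack or JSON) never receives a model object.
payloads = [asdict(ok), asdict(failed)]
assert all(isinstance(p, dict) for p in payloads)
```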

tools/deployment/package/docker-compose-all.yaml (4)

42-45: ✅ Volume anchor for clp-tmp is well-structured.

The new volume definition follows the established pattern and eliminates repetition across services.


96-97: ✅ db-table-creator environment variables correctly include Spider credentials.

The SPIDER_DB_PASS and SPIDER_DB_USER variables enable proper Spider database initialization during setup.


227-248: ✅ spider-scheduler service is correctly configured with proper dependencies and port exposure.

The service properly depends on db-table-creator and exposes port 6000. The JDBC storage URL is correctly constructed with required credentials.


299-299: I apologize, but I'm unable to complete the verification of this review comment due to persistent access issues in the sandbox environment. Despite multiple attempts using different approaches (direct shell access, GitHub CLI, web search), I cannot:

  1. Access the docker-compose-all.yaml file to verify the claimed duplicate hostname at lines 284 and 319
  2. Examine the start-spider-worker.py script to understand how scheduler connection parameters are handled (via CLI flags, environment variables, or defaults)
  3. Validate the specific claims about missing --host and --port arguments
  4. Confirm the proposed diff is accurate and complete

Without access to the actual codebase, I cannot determine whether:

  • The duplicate hostname issue actually exists
  • The spider-compression-worker service truly lacks scheduler connection parameters
  • The suggested fix is complete and addresses the root cause
  • There are any undocumented defaults or environment variable mechanisms that make the explicit flags unnecessary

Recommendation: Please verify this review comment manually by:

  1. Checking lines 284 and 319 of tools/deployment/package/docker-compose-all.yaml for the hostname values
  2. Examining start-spider-worker.py to understand its parameter expectations
  3. Comparing the spider-scheduler and spider-compression-worker service definitions for consistency

components/package-template/src/etc/clp-config.yaml (1)

29-38: LGTM! Spider configuration scaffolding is complete.

The commented configuration blocks for compression_scheduler.type, spider_db, and spider_scheduler correctly match the corresponding model definitions in clp_config.py. The port field for spider_scheduler now correctly defaults to 6000, aligning with SpiderScheduler.DEFAULT_PORT.

components/clp-py-utils/clp_py_utils/sql_adapter.py (1)

73-73: LGTM! Return type annotations improved for better type safety.

The updated return type annotations correctly reflect the actual return types from mysql.connector.connect() and mariadb.connect(). This aligns with the MariaDB driver version bump (>=1.1.14) in this PR.

Also applies to: 154-154

components/clp-py-utils/clp_py_utils/initialize-spider-db.py (2)

245-249: Good: Root connection approach is correct.

The script correctly uses SqlAdapter(clp_config.database) with ClpDbUserType.ROOT to obtain root privileges for database/user creation. This is the expected pattern since Spider DB runs on the same MariaDB instance as the CLP database, sharing root credentials.


26-220: LGTM! Spider database schema is comprehensive.

The DDL statements define a complete schema for Spider task orchestration including:

  • Core tables: drivers, schedulers, jobs, tasks
  • Relationship tables: input_tasks, output_tasks, task_dependencies
  • Data storage: data, task_inputs, task_outputs
  • Operational tables: task_instances, scheduler_leases, data_locality
  • KV storage: client_kv_data, task_kv_data

Foreign key constraints with appropriate ON DELETE CASCADE behaviour ensure referential integrity.
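The cascade behaviour can be exercised with a toy two-table version of the jobs/tasks relationship (sqlite here purely for illustration; the actual schema targets MariaDB):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite disables FK enforcement by default

# Toy parent/child pair mirroring the jobs -> tasks relationship.
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY)")
conn.execute(
    """CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        job_id INTEGER NOT NULL REFERENCES jobs(id) ON DELETE CASCADE
    )"""
)

conn.execute("INSERT INTO jobs VALUES (1)")
conn.executemany("INSERT INTO tasks VALUES (?, 1)", [(10,), (11,)])

# Deleting the parent job removes its tasks automatically.
conn.execute("DELETE FROM jobs WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
assert remaining == 0
```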

components/clp-py-utils/clp_py_utils/create-db-tables.py (1)

53-68: LGTM! Conditional Spider DB initialization is correctly implemented.

The logic properly:

  1. Loads CLP configuration and credentials
  2. Checks if spider_db is configured before attempting Spider DB initialization
  3. Logs an informative message and exits successfully when Spider DB is not configured
  4. Only runs initialize-spider-db when Spider DB configuration is present

The broad Exception catch at line 59 is consistent with existing patterns in this file and provides adequate error reporting for configuration loading failures.

components/clp-package-utils/clp_package_utils/general.py (1)

463-463: LGTM! Spider DB credentials generation follows established pattern.

The Spider DB entry correctly mirrors the existing credential generation pattern with a distinct username (spider-user) and appropriately sized password (8 characters, consistent with the database user).

components/clp-package-utils/clp_package_utils/controller.py (4)

330-350: LGTM! Spider DB environment setup is correctly implemented.

The _set_up_env_for_spider_db method properly:

  • Uses the SPIDER_DB_COMPONENT_NAME constant for logging
  • Sets SPIDER_DB_NAME from configuration
  • Uses the imported environment variable name constants for credentials

The credentials access pattern (spider_db.credentials.username/password) correctly matches the SpiderDb model's credentials: DbUserCredentials field.


352-369: LGTM! Spider scheduler environment setup follows established patterns.

The _set_up_env_for_spider_scheduler method correctly:

  • Uses SPIDER_SCHEDULER_COMPONENT_NAME for logging
  • Resolves the hostname to IP using _get_ip_from_hostname
  • Sets both SPIDER_SCHEDULER_HOST and SPIDER_SCHEDULER_PORT

916-922: LGTM! Conditional environment setup prevents errors when components are unconfigured.

The conditional checks properly guard against AttributeError when queue, redis, or Spider components are not configured:

  • Queue/Redis are now Optional and only set up when present
  • Spider components are only set up when OrchestrationType.spider is active

1010-1016: LGTM! Docker Compose file selection correctly handles Spider orchestration.

The logic properly selects the appropriate Docker Compose file based on deployment type and orchestration:

  • BASE + Spider → docker-compose-spider-base.yaml
  • BASE + Celery → docker-compose.base.yaml
  • FULL + Spider → docker-compose-spider.yaml
  • FULL + Celery → docker-compose.yaml
components/clp-py-utils/clp_py_utils/clp_config.py (5)

352-394: SpiderDb implementation is well-structured with proper credential separation.

The SpiderDb class correctly:

  • Extends Database with Spider-specific defaults
  • Enforces MariaDB via validate_type with proper return value
  • Uses separate DbUserCredentials instead of the dict-based credentials
  • Implements dedicated load_credentials_from_env/file methods reading Spider-specific env vars and config keys

396-405: LGTM! SpiderScheduler follows established patterns.

The SpiderScheduler class correctly implements:

  • DEFAULT_PORT ClassVar for consistency
  • DomainStr type for host
  • transform_for_container() method for container orchestration

997-1021: LGTM! Orchestration validators enforce correct configuration combinations.

The validators properly ensure:

  • Spider orchestration requires both spider_db and spider_scheduler, plus MariaDB for the main database
  • Celery orchestration requires both queue and redis

This prevents runtime errors from misconfigured orchestration setups.
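The invariants can be restated as a small standalone check, independent of pydantic (the component objects and names here are placeholders for the actual config models):

```python
def validate_orchestration(
    orchestration,
    spider_db=None,
    spider_scheduler=None,
    db_type="mariadb",
    queue=None,
    redis=None,
):
    """Raise ValueError when an orchestration type lacks its required components."""
    if orchestration == "spider":
        if spider_db is None or spider_scheduler is None:
            raise ValueError("Spider orchestration requires spider_db and spider_scheduler.")
        if db_type != "mariadb":
            raise ValueError("Spider orchestration requires a MariaDB main database.")
    elif orchestration == "celery":
        if queue is None or redis is None:
            raise ValueError("Celery orchestration requires queue and redis.")


# A misconfigured Spider setup fails fast at config-load time.
try:
    validate_orchestration("spider")
except ValueError as err:
    print(err)

# A complete Spider setup passes.
validate_orchestration("spider", spider_db=object(), spider_scheduler=object())
```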


978-983: LGTM! dump_to_primitive_dict correctly handles optional fields.

The updated serialization:

  • Includes spider_db in the credential-redacting custom fields
  • Properly handles None values for optional components (queue, redis, spider_db)

This prevents credential leakage while supporting the new optional field patterns.


1038-1045: LGTM! Container transformation properly guards optional components.

The conditional calls to transform_for_container() prevent AttributeError when optional components (queue, redis, spider_db, spider_scheduler) are None.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)

456-466: Handle Spider DB credential / task-manager initialisation errors gracefully

The Spider branch currently calls clp_config.spider_db.load_credentials_from_env() and SpiderTaskManager(...) without protection. If Spider DB env vars are missing or get_container_url() fails, the scheduler will crash with an uncaught exception instead of logging and exiting cleanly, unlike the earlier DB-credential path.

Wrap the Spider initialisation in a try/except and return -1 on failure, e.g.:

-    task_manager: CeleryTaskManager | SpiderTaskManager
-    if clp_config.compression_scheduler.type == OrchestrationType.celery:
-        task_manager = CeleryTaskManager()
-    elif clp_config.compression_scheduler.type == OrchestrationType.spider:
-        clp_config.spider_db.load_credentials_from_env()
-        task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
-    else:
-        logger.error(
-            f"Unsupported compression scheduler type: {clp_config.compression_scheduler.type}"
-        )
-        return -1
+    task_manager: CeleryTaskManager | SpiderTaskManager
+    if clp_config.compression_scheduler.type == OrchestrationType.celery:
+        task_manager = CeleryTaskManager()
+    elif clp_config.compression_scheduler.type == OrchestrationType.spider:
+        try:
+            clp_config.spider_db.load_credentials_from_env()
+            task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
+        except Exception:
+            logger.exception("Failed to initialise Spider task manager.")
+            return -1
+    else:
+        logger.error(
+            "Unsupported compression scheduler type: %s",
+            clp_config.compression_scheduler.type,
+        )
+        return -1

This keeps Spider initialisation behaviour aligned with the existing configuration/DB error handling and avoids uncaught ValueErrors on missing Spider env vars.

components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1)

247-279: Parameterise Spider DB password in CREATE USER DDL

spider_db_password is interpolated directly into the CREATE USER statement:

db_cursor.execute(
    f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY '{spider_db_password}'"""
)

Now that password validation allows arbitrary characters, this is brittle (breaks on quotes) and bypasses the project’s practice of using parameterised SQL for values. Identifiers here are already validated via _validate_name, so the remaining risk/value is the password.

Use a %s placeholder for the password and let the driver escape it:

-            db_cursor.execute(f"""CREATE DATABASE IF NOT EXISTS `{spider_db_name}`""")
+            db_cursor.execute(f"CREATE DATABASE IF NOT EXISTS `{spider_db_name}`")

             if spider_db_password is None:
                 logger.error("Password must be set for Spider database user.")
                 return -1
             db_cursor.execute(
-                f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY '{spider_db_password}'"""
+                f"CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY %s",
+                (spider_db_password,),
             )

This keeps identifier handling as-is while ensuring passwords with special characters work correctly and follow the repo’s parameterisation guidelines. Based on learnings.
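The value of parameterisation can be demonstrated with a driver-agnostic sketch. Here sqlite3 (with `?` placeholders) stands in for the MariaDB driver's `%s` placeholders, and the table and credentials are purely illustrative:

```python
# Demonstration (sqlite3 as a stand-in driver; MariaDB-style %s
# placeholders behave analogously): parameterised values survive
# quotes and other characters that break f-string interpolation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")

password = "p@ss'word\"--"  # would corrupt a naively interpolated statement
conn.execute("INSERT INTO users VALUES (?, ?)", ("spider", password))

(stored,) = conn.execute(
    "SELECT password FROM users WHERE name = ?", ("spider",)
).fetchone()
assert stored == password  # round-trips intact; the driver handles escaping
```

The same principle applies to the `CREATE USER ... IDENTIFIED BY %s` statement above: the driver, not string formatting, is responsible for quoting the value.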

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd9a78d and 5135fe8.

📒 Files selected for processing (5)
  • components/clp-package-utils/clp_package_utils/controller.py (7 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (8 hunks)
  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1 hunks)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3 hunks)
  • components/package-template/src/etc/clp-config.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (20)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
  • components/package-template/src/etc/clp-config.yaml
  • components/clp-py-utils/clp_py_utils/clp_config.py
  • components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
  • components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-09-19T18:28:26.747Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 1169
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:463-469
Timestamp: 2025-09-19T18:28:26.747Z
Learning: In the compression scheduler (components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py), the SIGTERM handling uses a coarse-grained approach that only checks received_sigterm before calling search_and_schedule_new_tasks. The maintainers consider the race condition where SIGTERM arrives mid-execution to be benign, as scheduling a few extra tasks during shutdown is acceptable and will be handled by cleanup mechanisms.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-10-27T07:07:37.901Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
  • components/clp-py-utils/clp_py_utils/clp_config.py
  • components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-04-17T16:55:23.796Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:23.796Z
Learning: In the CLP project, SQL queries should use parameterized queries with placeholders (%s) and pass values as a tuple to `db_cursor.execute()` to prevent SQL injection, rather than directly interpolating values into the query string.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-01-23T17:08:55.566Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
  • components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:02:31.113Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-25T06:29:59.610Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:14:12.225Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1466
File: .github/workflows/clp-rust-checks.yaml:14-15
Timestamp: 2025-10-22T21:14:12.225Z
Learning: Repository y-scope/clp: In GitHub Actions workflows (e.g., .github/workflows/clp-rust-checks.yaml), YAML anchors/aliases are acceptable and preferred to avoid duplication; if actionlint flags an alias node (e.g., on push.paths) as an error, treat it as a tool limitation and do not require inlining unless the team asks to silence the warning.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
🧬 Code graph analysis (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
  • OrchestrationType (130-132)
  • load_credentials_from_env (327-346)
  • load_credentials_from_env (382-387)
  • load_credentials_from_env (472-476)
  • load_credentials_from_env (537-542)
components/job-orchestration/job_orchestration/scheduler/compress/task_manager/spider_task_manager.py (1)
  • SpiderTaskManager (14-48)
components/job-orchestration/job_orchestration/scheduler/compress/task_manager/celery_task_manager.py (1)
  • CeleryTaskManager (12-27)
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
  • ClpDbUserType (189-193)
  • load_credentials_from_env (327-346)
  • load_credentials_from_env (382-387)
  • load_credentials_from_env (472-476)
  • load_credentials_from_env (537-542)
components/clp-py-utils/clp_py_utils/core.py (1)
  • read_yaml_config_file (58-64)
components/clp-py-utils/clp_py_utils/sql_adapter.py (1)
  • create_connection (69-85)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: package-image
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: rust-checks
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: build-macos (macos-15, true)
🔇 Additional comments (6)
components/package-template/src/etc/clp-config.yaml (2)

32-39: Verify spider_db port configuration to avoid conflicts.

Both the primary database (line 23) and the new spider_db (line 34) are configured with port 3306 by default. This will cause a conflict if both services are bundled in the same deployment. Confirm whether spider_db should use a different default port or whether the configuration logic ensures these services run on separate hosts.
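One way to surface such collisions early is a small pre-deployment check. This is a hypothetical helper (the service names and endpoints are illustrative, not the project's actual config model):

```python
# Hypothetical helper: detect host:port collisions between configured
# services before deployment starts.
def find_port_conflicts(
    services: dict[str, tuple[str, int]],
) -> list[tuple[str, str]]:
    seen: dict[tuple[str, int], str] = {}
    conflicts: list[tuple[str, str]] = []
    for name, endpoint in services.items():
        if endpoint in seen:
            conflicts.append((seen[endpoint], name))
        else:
            seen[endpoint] = name
    return conflicts

conflicts = find_port_conflicts({
    "database": ("localhost", 3306),
    "spider_db": ("localhost", 3306),
})
assert conflicts == [("database", "spider_db")]
```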


30-30: Approve compression_scheduler type addition.

The addition of type: "celery" to the compression_scheduler section aligns with the PR objective of integrating Spider compression orchestration. The configuration template now properly reflects the Celery-based scheduler setup.

components/clp-package-utils/clp_package_utils/controller.py (2)

13-50: Env wiring for queue/redis and Spider components looks consistent

  • Using CLP_QUEUE_* and CLP_REDIS_PASS_ENV_VAR_NAME to expose queue/redis credentials keeps env names centralised in clp_config and matches the corresponding load_credentials_from_env implementations.
  • New _set_up_env_for_spider_db() / _set_up_env_for_spider_scheduler() helpers follow the same pattern as other components and are only invoked when compression_scheduler.type == OrchestrationType.spider, which pairs well with ClpConfig.validate_spider_config.
  • Guarding _set_up_env_for_queue() / _set_up_env_for_redis() behind queue/redis being non-None makes the Optional orchestration config safe.

No functional issues spotted in these changes.

Also applies to: 248-375, 915-928
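The credential-loading pattern described above can be sketched as follows. The environment variable names here are assumptions for illustration, not the project's actual constants:

```python
# Minimal sketch (env var names are illustrative): load Spider DB
# credentials from the environment and fail loudly when one is missing.
import os


def load_credentials_from_env(
    user_var: str = "SPIDER_DB_USER",
    pass_var: str = "SPIDER_DB_PASSWORD",
) -> tuple[str, str]:
    try:
        return os.environ[user_var], os.environ[pass_var]
    except KeyError as e:
        raise ValueError(
            f"Missing required environment variable: {e.args[0]}"
        ) from None


os.environ["SPIDER_DB_USER"] = "spider"
os.environ["SPIDER_DB_PASSWORD"] = "secret"
assert load_credentials_from_env() == ("spider", "secret")
```

Raising a descriptive error (rather than letting a bare `KeyError` propagate) is what makes the scheduler-side `try/except` wrapping discussed earlier produce useful logs.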


1015-1022: Docker Compose file selection for Spider orchestration is sane

Deriving compression_scheduler_type once and branching on both deployment type and orchestration type:

  • Uses docker-compose-spider-base.yaml / docker-compose-spider.yaml when Spider orchestration is enabled.
  • Falls back to the existing base/full compose files for Celery.

The control flow is straightforward and avoids ambiguity between base/full + spider/Celery combinations.
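The branching can be reduced to a small pure function. The Spider file names are taken from the review text; the Celery-side names are assumptions for illustration:

```python
# Illustrative selection logic (Celery file names are assumed):
# pick a compose file from deployment type and orchestration type.
def select_compose_file(deployment: str, orchestration: str) -> str:
    if orchestration == "spider":
        if deployment == "base":
            return "docker-compose-spider-base.yaml"
        return "docker-compose-spider.yaml"
    if deployment == "base":
        return "docker-compose-base.yaml"
    return "docker-compose.yaml"


assert select_compose_file("base", "spider") == "docker-compose-spider-base.yaml"
assert select_compose_file("full", "celery") == "docker-compose.yaml"
```

Keeping the decision in one function avoids scattering base/full and Spider/Celery checks across the controller.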

components/clp-py-utils/clp_py_utils/clp_config.py (2)

130-136: Orchestration types, Spider config, and container transforms are wired coherently

  • OrchestrationType and CompressionScheduler.type give a clear, enum-backed switch between Celery and Spider orchestration.
  • SpiderDb:
    • Enforces MariaDB via a field validator.
    • Uses dedicated env var names (SPIDER_DB_USER_ENV_VAR_NAME, SPIDER_DB_PASS_ENV_VAR_NAME) and credentials loading helpers.
    • Is redacted correctly in ClpConfig.dump_to_primitive_dict(), matching database/queue/redis behaviour.
  • SpiderScheduler:
    • Follows existing patterns (DEFAULT_PORT, DomainStr host, transform_for_container() using SPIDER_SCHEDULER_COMPONENT_NAME).
  • ClpConfig:
    • Makes queue and redis optional while validate_celery_config() enforces their presence for Celery.
    • validate_spider_config() enforces spider_db/spider_scheduler presence and a MariaDB main metadata DB for Spider, which matches the deployment model.
    • transform_for_container() now guards optional queue/redis and transforms Spider components when present, so container deployments see consistent hostnames.

These pieces fit together cleanly and should make the Celery/Spider split explicit and robust.

Also applies to: 353-415, 787-807, 991-1035, 1036-1058


353-381: Based on my verification attempts, I encountered limitations that prevent full confirmation of the review comment:

Web search findings:

  • Generic JDBC URL format confirmed: jdbc:subprotocol:subname
  • Standard JDBC practice: username/password passed as separate parameters (properties dict) rather than embedded in the URL query string
  • No authoritative documentation found for the specific spider_py package's Driver class

Codebase access:

  • Repository clone failed; unable to verify actual usage patterns, tests, or deployment context for get_container_url()
  • Cannot confirm whether spider_py actually expects credentials in the query string format used here, or if they're handled through a separate parameters dict

Key insight:
The review comment's concern about URL-encoding is theoretically valid from first principles (special characters like &, ?, = in raw credentials would corrupt a query string). However, the standard JDBC pattern suggests credentials should be passed separately, which would make this entire code path potentially incorrect—not just missing encoding. Without access to spider_py specifics or actual codebase context, I cannot determine:

  1. Whether this code is actively used
  2. Whether spider_py expects this exact URL format
  3. Whether credentials with special characters actually fail in practice

Recommendation:
The original review comment's request for verification against spider_py.Driver's expected URL format remains appropriate and necessary before deciding on a fix approach.
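If credentials do end up embedded in a URL query string, the encoding concern is easy to demonstrate. This sketch is not spider_py's actual API; it only shows the percent-encoding step:

```python
# Illustrative only (not spider_py's API): percent-encode credentials
# so characters like '&', '?', and '=' cannot corrupt the query string.
from urllib.parse import quote_plus


def build_db_url(host: str, port: int, db: str, user: str, password: str) -> str:
    return (
        f"jdbc:mariadb://{host}:{port}/{db}"
        f"?user={quote_plus(user)}&password={quote_plus(password)}"
    )


url = build_db_url("spider-db", 3306, "spider", "clp", "p&ss=word?")
assert "p&ss=word?" not in url      # raw password never appears in the URL
assert "p%26ss%3Dword%3F" in url    # its encoded form does
```

If spider_py instead accepts credentials through a separate parameters dict (the standard JDBC pattern), no encoding is needed and this code path should be restructured rather than patched.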

Comment on lines +228 to +237
config_path = pathlib.Path(parsed_args.config)
try:
clp_config = ClpConfig.model_validate(read_yaml_config_file(config_path))
clp_config.database.load_credentials_from_env()
clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.CLP)
clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.ROOT)
if clp_config.spider_db is None:
logger.error("Spider database configuration not found in CLP configuration.")
return -1
clp_config.spider_db.load_credentials_from_env()

🧹 Nitpick | 🔵 Trivial

Avoid redundant CLP DB credential loads

You call clp_config.database.load_credentials_from_env() for the default user and then immediately call it again for ClpDbUserType.CLP and ClpDbUserType.ROOT. The first call is redundant:

-        clp_config = ClpConfig.model_validate(read_yaml_config_file(config_path))
-        clp_config.database.load_credentials_from_env()
-        clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.CLP)
-        clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.ROOT)
+        clp_config = ClpConfig.model_validate(read_yaml_config_file(config_path))
+        clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.CLP)
+        clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.ROOT)

Not a correctness issue, but it simplifies the intent.

🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/initialize-spider-db.py around lines 228
to 237, the unparameterized call clp_config.database.load_credentials_from_env()
is redundant because you immediately call the same method for ClpDbUserType.CLP
and ClpDbUserType.ROOT; remove the first call and keep the explicit user_type
calls (or replace the three calls with a single loop over the required user
types) so credentials are loaded only once per intended user, preserving the
spider_db null check and its subsequent load_credentials_from_env() call.

Comment on lines +37 to +39
#spider_scheduler:
# host: "localhost"
# port: 6000

🧹 Nitpick | 🔵 Trivial

Consider adding logging_level field to spider_scheduler for consistency.

The compression_scheduler (line 29) and query_scheduler (line 46) both include logging_level configuration, but spider_scheduler omits it. For consistency and to allow users to easily configure logging verbosity for the Spider scheduler, consider adding a logging_level field (e.g., logging_level: "INFO").

Apply this diff to add the logging_level field:

 #spider_scheduler:
 #  host: "localhost"
 #  port: 6000
+#  logging_level: "INFO"
🤖 Prompt for AI Agents
In components/package-template/src/etc/clp-config.yaml around lines 37 to 39,
the commented-out spider_scheduler block lacks a logging_level key for
consistency with compression_scheduler and query_scheduler; restore or update
the spider_scheduler block to include a logging_level field (for example
logging_level: "INFO") alongside host and port so users can configure verbosity,
ensuring indentation and quoting match the surrounding entries.
