
Conversation

@sitaowang1998 (Contributor) commented Nov 15, 2025

Description

This PR integrates Spider compression orchestration into the docker compose package. Specifically, it:

  • Adds Spider to ClpConfig, with minor cleanup of unnecessary dump_to_primitive_type usage.
  • Adds setup scripts for the Spider database.
  • Adds native scripts to start multiple Spider workers in a container.
  • Adds docker compose files to start the Spider scheduler and worker containers.

This PR also:

  • Bumps the mariadb Python dependency to >=1.1.14.
  • Bumps the spider-py version to 0.2.0.
  • Bumps Spider to the latest version for a fix in starting Python workers.

This PR is a breaking change because it updates dependency versions.

The doc change will be put in #1647.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Spider compression workflow works end-to-end.
  • Search works with the compressed archives produced by the previous step.
  • GitHub workflows pass.

Summary by CodeRabbit

  • New Features

    • Optional Spider orchestration: Spider scheduler, Spider DB provisioning/init tool, spider-worker CLI, and Spider credentials validation/loading.
  • Improved Behaviour

    • Task payloads/results use bytes/UTF‑8 end‑to‑end; runtime selects Spider or Celery task manager. Environment setup and compose selection now respect optional queue/redis/spider components and inject Spider env when enabled.
  • Chores

    • New Spider-focused compose packages and dependency version updates.
  • Removed

    • Legacy byte-list helper utilities replaced by direct UTF‑8 handling.
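The "bytes/UTF‑8 end‑to‑end" payload handling mentioned above can be illustrated with a minimal sketch. This is a hypothetical example, not the PR's actual task-manager API; the function names `encode_payload` and `decode_result` are invented for illustration:

```python
import json


def encode_payload(payload: dict) -> bytes:
    """Serialize a task payload directly to UTF-8 bytes (illustrative only)."""
    return json.dumps(payload).encode("utf-8")


def decode_result(raw: bytes) -> dict:
    """Deserialize a task result from UTF-8 bytes (illustrative only)."""
    return json.loads(raw.decode("utf-8"))


# Payloads stay as bytes across the wire; no byte-list helper is needed.
payload = encode_payload({"job_id": 42, "paths": ["/mnt/logs/app.log"]})
result = decode_result(payload)
print(result["job_id"])  # → 42
```

The point of moving to direct UTF‑8 handling is that the serialization boundary is a single encode/decode pair rather than helper utilities that convert between byte lists and strings.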


junhaoliao and others added 30 commits October 9, 2025 22:38
This reverts commit 7357a7e.
# Conflicts:
#	docs/src/dev-docs/building-package.md
…lp-package Dockerfile (fixes y-scope#1379); Reduce layers using multi-line ENV (fixes y-scope#1378).
…r` and `spider_worker`) in the `clp-package` image; Set `mariadb-connector-cpp` build type to `Release` (fixes y-scope#1410).
@coderabbitai bot (Contributor) left a comment
Actionable comments posted: 2

♻️ Duplicate comments (2)
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (2)

269-271: Critical: Parameterize Spider DB password in CREATE USER statement.

The password is embedded directly in the f-string, creating two concrete problems:

  • Correctness: Passwords containing quotes or special characters will break the SQL statement.
  • Security hygiene: This bypasses the project's standard practice of using parameterized SQL with %s placeholders and tuple arguments to prevent injection risks. Based on learnings

The database and user names are already validated by _validate_name, so they remain safely interpolated. The password must be passed as a bound parameter:

Apply this diff:

             db_cursor.execute(
-                f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY '{spider_db_password}'"""
+                f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY %s""",
+                (spider_db_password,)
             )

This lets the MariaDB driver handle escaping the secret value correctly.
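To see why binding matters, here is a self-contained demonstration using sqlite3 (chosen only because it ships with Python; the MariaDB driver uses `%s` placeholders instead of `?`, but the escaping principle is identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (name TEXT, password TEXT)")

# A password like this would break a quoted f-string interpolation.
password = "p@ss'word\"with-quotes"

# Safe: the driver escapes the bound value, quotes and all.
cur.execute("INSERT INTO users VALUES (?, ?)", ("spider", password))

cur.execute("SELECT password FROM users WHERE name = ?", ("spider",))
stored = cur.fetchone()[0]
print(stored == password)  # → True: the secret round-trips intact
```

Had the password been interpolated into the SQL string directly, the embedded single quote would have terminated the literal early and produced a syntax error (or worse, an injection vector).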


245-250: Add clarifying comment on SqlAdapter initialization for future maintainability.

The SqlAdapter is initialized with clp_config.database because root operations (CREATE DATABASE, CREATE USER, GRANT) must execute against the CLP database server where root credentials are available. This is correct for the current single-MariaDB-instance Docker setup where Spider DB shares the same server.

However, a clarifying comment would help prevent confusion during future refactors that might introduce separate Spider DB hosts.

Add a comment above line 246:

+        # Note: SqlAdapter uses clp_config.database to connect with root credentials
+        # for CREATE DATABASE/USER/GRANT operations. In the current Docker setup,
+        # Spider DB resides on the same MariaDB instance as CLP metadata DB.
         sql_adapter = SqlAdapter(clp_config.database)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 93a5af2 and a2c449d.

📒 Files selected for processing (4)
  • components/clp-py-utils/clp_py_utils/create-db-tables.py (2 hunks)
  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1 hunks)
  • components/package-template/src/etc/clp-config.yaml (1 hunks)
  • components/package-template/src/etc/credentials.template.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-04-17T16:55:23.796Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:23.796Z
Learning: In the CLP project, SQL queries should use parameterized queries with placeholders (%s) and pass values as a tuple to `db_cursor.execute()` to prevent SQL injection, rather than directly interpolating values into the query string.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-01-23T17:08:55.566Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
🧬 Code graph analysis (2)
components/clp-py-utils/clp_py_utils/create-db-tables.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (6)
  • ClpConfig (682-952)
  • StorageEngine (101-103)
  • load_credentials_from_env (254-259)
  • load_credentials_from_env (284-289)
  • load_credentials_from_env (367-371)
  • load_credentials_from_env (432-437)
components/clp-py-utils/clp_py_utils/core.py (1)
  • read_yaml_config_file (58-64)
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (4)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
  • ClpConfig (682-952)
  • load_credentials_from_env (254-259)
  • load_credentials_from_env (284-289)
  • load_credentials_from_env (367-371)
  • load_credentials_from_env (432-437)
components/clp-py-utils/clp_py_utils/core.py (1)
  • read_yaml_config_file (58-64)
components/clp-py-utils/clp_py_utils/sql_adapter.py (2)
  • SqlAdapter (60-145)
  • create_root_mariadb_connection (109-119)
components/clp-py-utils/clp_py_utils/create-db-tables.py (1)
  • main (21-72)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-macos (macos-15, true)
🔇 Additional comments (4)
components/clp-py-utils/clp_py_utils/create-db-tables.py (1)

7-8: LGTM!

The new imports are correctly scoped and necessary for Spider DB initialization logic.

components/package-template/src/etc/credentials.template.yaml (1)

14-18: LGTM!

The Spider DB credentials template follows the same pattern as existing credential sections and provides appropriate scaffolding for Spider-based deployments.

components/clp-py-utils/clp_py_utils/initialize-spider-db.py (2)

228-241: LGTM!

Configuration loading and credential validation are correctly implemented. The Spider-specific credential loader properly reads SPIDER_DB_USER and SPIDER_DB_PASS environment variables, and exception handling appropriately distinguishes between validation errors and unexpected failures.


289-298: LGTM!

The name validation pattern correctly restricts database and user identifiers to alphanumeric characters, underscores, and hyphens, preventing SQL injection through identifier interpolation. This aligns with the team's preference for sanitizing even trusted inputs for robustness. Based on learnings
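The identifier validation described above might look like the following sketch. The name `validate_name` and the exact character class are assumptions for illustration; the PR's `_validate_name` may differ in detail:

```python
import re

# Allow only alphanumerics, underscores, and hyphens in SQL identifiers.
_NAME_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")


def validate_name(name: str) -> None:
    """Reject identifiers that are unsafe to interpolate into SQL (sketch)."""
    if _NAME_PATTERN.match(name) is None:
        raise ValueError(f"Invalid identifier: {name!r}")


validate_name("spider-db")       # accepted
validate_name("spider_user_01")  # accepted
try:
    validate_name("x'; DROP TABLE users; --")
except ValueError as e:
    print(e)  # rejected: quotes, spaces, and semicolons fail the pattern
```

Because identifiers cannot be bound as parameters in most SQL drivers, a strict allowlist like this is the standard way to make their interpolation safe.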

@coderabbitai bot (Contributor) left a comment
Actionable comments posted: 3

♻️ Duplicate comments (3)
tools/deployment/package/docker-compose-all.yaml (2)

305-305: Rename hostname to avoid ambiguity with compression-worker.

Both the compression-worker (line 270) and spider-compression-worker (line 305) services use the same hostname "compression_worker", which creates ambiguity in logs and metrics. Change the spider-compression-worker hostname to a distinct value.

  spider-compression-worker:
    <<: *service_defaults
-   hostname: "compression_worker"
+   hostname: "spider_compression_worker"

303-334: Restore Spider scheduler connection arguments in spider-compression-worker command.

The spider-compression-worker service is missing the --host and --port arguments required to connect to the Spider scheduler for task placement and coordination. Previous review iterations explicitly flagged this inconsistency, and the spider-scheduler service (lines 227–228) properly demonstrates the pattern using environment variables.

Additionally, SPIDER_SCHEDULER_HOST and SPIDER_SCHEDULER_PORT are not defined in the environment section (lines 306–313), even though the command needs them.

Apply this diff to add the missing scheduler connection arguments and environment variables:

  spider-compression-worker:
    <<: *service_defaults
    hostname: "spider_compression_worker"
    environment:
      CLP_CONFIG_PATH: "/etc/clp-config.yaml"
      CLP_HOME: "/opt/clp"
      CLP_LOGGING_LEVEL: "${CLP_COMPRESSION_WORKER_LOGGING_LEVEL:-INFO}"
      CLP_LOGS_DIR: "/var/log/compression_worker"
      CLP_WORKER_LOG_PATH: "/var/log/compression_worker/worker.log"
      PYTHONPATH: "/opt/clp/lib/python3/site-packages"
      SPIDER_LOG_DIR: "/var/log/compression_worker"
+     SPIDER_SCHEDULER_HOST: "${SPIDER_SCHEDULER_HOST:?Please set a value.}"
+     SPIDER_SCHEDULER_PORT: "${SPIDER_SCHEDULER_PORT:?Please set a value.}"
    volumes:
      - *volume_clp_config_readonly
      - *volume_clp_logs
      - *volume_clp_tmp
      - "${CLP_ARCHIVE_OUTPUT_DIR_HOST:-empty}:/var/data/archives"
      - "${CLP_AWS_CONFIG_DIR_HOST:-empty}:/opt/clp/.aws:ro"
      - "${CLP_LOGS_INPUT_DIR_HOST:-empty}:${CLP_LOGS_INPUT_DIR_CONTAINER:-/mnt/logs}"
      - "${CLP_STAGED_ARCHIVE_OUTPUT_DIR_HOST:-empty}:/var/data/staged-archives"
    depends_on:
      db-table-creator:
        condition: "service_completed_successfully"
    command: [
      "python3",
      "-u",
      "/opt/clp/lib/python3/site-packages/clp_py_utils/start-spider-worker.py",
      "--concurrency", "${CLP_COMPRESSION_WORKER_CONCURRENCY:-1}",
      "--storage-url", "jdbc:mariadb://database:${CLP_DB_PORT:-3306}/\
        ${SPIDER_DB_NAME:-spider-db}?\
        user=${SPIDER_DB_USER:?Please set a value.}\
        &password=${SPIDER_DB_PASS:?Please set a value.}",
+     "--host", "${SPIDER_SCHEDULER_HOST}",
+     "--port", "${SPIDER_SCHEDULER_PORT}",
    ]
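The `${VAR:?Please set a value.}` expansions in the compose file above make startup fail fast when a required variable is unset. Compose models this on POSIX parameter expansion, which can be demonstrated directly in a shell (a minimal illustration, not part of the PR):

```shell
#!/bin/sh
# POSIX "error if unset" expansion, the same form docker compose interpolates.
unset SPIDER_DB_USER
if sh -c ': "${SPIDER_DB_USER:?Please set a value.}"' 2>/dev/null; then
    echo "unset: accepted (unexpected)"
else
    echo "unset: rejected"          # expansion aborts with the message
fi

SPIDER_DB_USER=clp
export SPIDER_DB_USER
sh -c 'echo "user=${SPIDER_DB_USER:?Please set a value.}"'  # prints user=clp
```

Using `:?` for credentials (rather than a `:-` default) is deliberate: a missing secret should stop the container, never silently fall back.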
components/clp-py-utils/clp_py_utils/start-spider-worker.py (1)

57-61: Wait for terminated processes to prevent zombies.

The exception handler terminates processes but doesn't wait for them to finish cleanup, which can leave zombie processes. This issue was previously flagged in past reviews and remains unresolved.

Apply this diff to ensure proper cleanup:

     except OSError as e:
         logger.error(f"Failed to start spider worker: {e}")
         for process in processes:
             process.terminate()
+            try:
+                process.wait(timeout=5)
+            except subprocess.TimeoutExpired:
+                process.kill()
+                process.wait()
         exit(1)
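The terminate-then-wait pattern suggested in the diff can be sketched as a runnable example (stand-in `sleep` subprocesses replace real Spider workers; the helper name `shutdown_workers` is invented for illustration):

```python
import subprocess
import sys


def shutdown_workers(processes: list) -> None:
    """Terminate workers, then wait so no zombies are left behind (sketch)."""
    for process in processes:
        process.terminate()
        try:
            process.wait(timeout=5)
        except subprocess.TimeoutExpired:
            process.kill()  # escalate if SIGTERM is ignored
            process.wait()


# Spawn a couple of stand-in "workers" that would otherwise run for a minute.
procs = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    for _ in range(2)
]
shutdown_workers(procs)
print(all(p.returncode is not None for p in procs))  # → True: all reaped
```

Without the `wait()` calls, each terminated child would remain a zombie until the parent exits, holding its PID and exit status in the kernel's process table.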
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2293d50 and 60ef08b.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/start-spider-worker.py (1 hunks)
  • tools/deployment/package/docker-compose-all.yaml (5 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Applied to files:

  • components/clp-py-utils/clp_py_utils/start-spider-worker.py
  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-20T21:05:30.417Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
🔇 Additional comments (7)
tools/deployment/package/docker-compose-all.yaml (2)

238-238: Verify intentionality of compression-scheduler stop_grace_period increase.

The compression-scheduler service now has stop_grace_period: "300s" instead of the previous 60 seconds. Confirm this longer timeout is necessary for Spider integration and document the reason if it persists across the codebase.


42-45: Volume anchor consolidation applied successfully.

The new volume_clp_tmp anchor eliminates duplication across compression-worker and spider-compression-worker services. This aligns with earlier review suggestions and improves maintainability.

components/clp-py-utils/clp_py_utils/start-spider-worker.py (5)

1-16: LGTM!

The imports and logging setup are appropriate and follow standard Python practices.


35-40: LGTM!

The argument parsing and concurrency validation correctly ensure at least one worker is spawned. This addresses the previous review feedback about validating the concurrency parameter.


43-47: LGTM!

The worker path resolution and existence check are well-implemented and address previous feedback about validating the executable before attempting to spawn subprocesses.


63-70: LGTM!

The process monitoring and failure handling correctly track and report worker failures, ensuring the script exits with a non-zero status if any worker fails. This addresses previous feedback about handling non-zero exit codes.


73-74: LGTM!

Standard Python entry point.

@LinZhihao-723 (Member) left a comment
Overall lgtm.

@coderabbitai bot (Contributor) left a comment
Actionable comments posted: 1

♻️ Duplicate comments (1)
components/clp-py-utils/clp_py_utils/start-spider-worker.py (1)

66-68: Add wait() after terminate() to prevent zombie processes.

When an error occurs during worker startup, the script terminates already-started processes but doesn't wait for them to complete cleanup. This can leave zombie processes.

Apply this diff to ensure proper cleanup:

     except OSError as e:
         logger.error(f"Failed to start Spider worker: {e}")
         for process in processes:
             process.terminate()
+            process.wait()
         exit(1)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74a3149 and 01c60a2.

📒 Files selected for processing (1)
  • components/clp-py-utils/clp_py_utils/start-spider-worker.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Applied to files:

  • components/clp-py-utils/clp_py_utils/start-spider-worker.py
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/start-spider-worker.py (2)
components/clp-py-utils/clp_py_utils/create-db-tables.py (1)
  • main (21-72)
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1)
  • main (223-286)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: rust-checks
🔇 Additional comments (4)
components/clp-py-utils/clp_py_utils/start-spider-worker.py (4)

1-16: LGTM! Logger setup is appropriate.

The imports are correct and the logger configuration follows Python best practices. The script consistently uses the logger instead of print statements throughout.


38-40: Concurrency validation is correct.

The check concurrency >= 1 is appropriate because running a single worker (concurrency=1) is a valid use case. There's no need to require multiple workers (>1).


43-54: LGTM! Proper error handling and validation.

The hostname resolution includes appropriate exception handling for socket errors, and the script validates the spider_worker executable exists before attempting to spawn processes. This prevents common failure modes.


70-77: LGTM! Exit code handling is well-implemented.

The script correctly waits for all worker processes, logs non-zero exit codes, and returns an appropriate exit status to indicate failure. This follows best practices for process orchestration.
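The spawn-wait-and-propagate pattern the review praises can be sketched as follows. This is an illustrative reduction of the behaviour, not the PR's `start-spider-worker.py`; the helper name `run_workers` is invented:

```python
import subprocess
import sys


def run_workers(cmd: list, concurrency: int) -> int:
    """Spawn `concurrency` copies of cmd, wait for all, return 1 on any failure."""
    processes = [subprocess.Popen(cmd) for _ in range(concurrency)]
    exit_code = 0
    for process in processes:
        process.wait()
        if process.returncode != 0:
            print(f"worker {process.pid} exited with code {process.returncode}")
            exit_code = 1  # remember the failure, but keep reaping the rest
    return exit_code


ok = run_workers([sys.executable, "-c", "pass"], concurrency=2)
bad = run_workers([sys.executable, "-c", "import sys; sys.exit(3)"], concurrency=1)
print(ok, bad)  # → 0 1
```

The key detail is that the loop waits on every worker before returning, so one early failure neither orphans the remaining processes nor masks their exit codes.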

@coderabbitai bot (Contributor) left a comment
Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01c60a2 and 3c502d2.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-19T18:28:26.747Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 1169
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:463-469
Timestamp: 2025-09-19T18:28:26.747Z
Learning: In the compression scheduler (components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py), the SIGTERM handling uses a coarse-grained approach that only checks received_sigterm before calling search_and_schedule_new_tasks. The maintainers consider the race condition where SIGTERM arrives mid-execution to be benign, as scheduling a few extra tasks during shutdown is acceptable and will be handled by cleanup mechanisms.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (6)
  • OrchestrationType (117-119)
  • load_credentials_from_env (254-259)
  • load_credentials_from_env (280-285)
  • load_credentials_from_env (363-367)
  • load_credentials_from_env (428-433)
  • get_container_url (276-278)
components/job-orchestration/job_orchestration/scheduler/compress/task_manager/spider_task_manager.py (1)
  • SpiderTaskManager (14-48)
components/job-orchestration/job_orchestration/scheduler/compress/task_manager/celery_task_manager.py (1)
  • CeleryTaskManager (12-27)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: rust-checks
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)

20-20: LGTM!

The new imports for OrchestrationType and SpiderTaskManager are correctly added and necessary for the Spider orchestration integration.

Also applies to: 37-37


523-523: LGTM!

The forward declaration of task_manager with a union type annotation improves scope clarity and addresses the suggestion from the previous review.

Comment on lines +526 to +528
elif clp_config.compression_scheduler.type == OrchestrationType.spider:
    clp_config.spider_db.load_credentials_from_env()
    task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())

⚠️ Potential issue | 🔴 Critical

Wrap Spider credential loading in exception handling.

The call to clp_config.spider_db.load_credentials_from_env() at line 527 is outside the try-except block (lines 509-518), which means if required environment variables are missing, the program will crash with an unhandled ValueError instead of logging an error and exiting gracefully. This is inconsistent with the database credential loading at line 511, which is properly wrapped in exception handling.

Apply this diff to add proper exception handling:

 logger.info(f"Starting {COMPRESSION_SCHEDULER_COMPONENT_NAME}")
 sql_adapter = SqlAdapter(clp_config.database)

 task_manager: CeleryTaskManager | SpiderTaskManager
 if clp_config.compression_scheduler.type == OrchestrationType.celery:
     task_manager = CeleryTaskManager()
 elif clp_config.compression_scheduler.type == OrchestrationType.spider:
-    clp_config.spider_db.load_credentials_from_env()
-    task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
+    try:
+        clp_config.spider_db.load_credentials_from_env()
+        task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
+    except Exception as err:
+        logger.error(f"Failed to initialize Spider task manager: {err}")
+        return -1
 else:
     logger.error(
         f"Unsupported compression scheduler type: {clp_config.compression_scheduler.type}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
elif clp_config.compression_scheduler.type == OrchestrationType.spider:
    clp_config.spider_db.load_credentials_from_env()
    task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
elif clp_config.compression_scheduler.type == OrchestrationType.spider:
    try:
        clp_config.spider_db.load_credentials_from_env()
        task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
    except Exception as err:
        logger.error(f"Failed to initialize Spider task manager: {err}")
        return -1
🤖 Prompt for AI Agents
In
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
around lines 526 to 528, the call to
clp_config.spider_db.load_credentials_from_env() must be wrapped in the same
try/except used earlier for DB credential loading: enclose the
load_credentials_from_env() call in a try block and catch ValueError (or the
same exception type used at lines 509-518), log a clear error with the existing
logger, and terminate the process (use the same exit path as the surrounding
code) so missing Spider env credentials do not raise an unhandled exception.
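The log-and-exit pattern being requested can be sketched in isolation. A minimal, hypothetical example — the env var names and messages here are assumptions based on the discussion above, not the scheduler's actual code:

```python
import os
import sys


def load_spider_credentials() -> tuple:
    """Read Spider DB credentials, raising ValueError if any are missing."""
    try:
        return os.environ["SPIDER_DB_USER"], os.environ["SPIDER_DB_PASS"]
    except KeyError as err:
        raise ValueError(f"Missing required environment variable: {err}") from err


def main() -> int:
    # Mirror the requested pattern: log the failure and return an error code
    # instead of letting the ValueError propagate as an unhandled exception.
    try:
        user, password = load_spider_credentials()
    except ValueError as err:
        print(f"Failed to initialize Spider task manager: {err}", file=sys.stderr)
        return -1
    print(f"Connecting as {user}", file=sys.stderr)
    return 0
```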

LinZhihao-723
LinZhihao-723 previously approved these changes Nov 25, 2025
Member

@LinZhihao-723 LinZhihao-723 left a comment

The orchestration part lgtm.
Leave the rest to @junhaoliao

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
components/clp-package-utils/clp_package_utils/general.py (1)

509-513: Add None check for spider_db to prevent AttributeError.

The function validate_and_load_spider_db_credentials_file calls clp_config.spider_db.load_credentials_from_file() without verifying that spider_db is not None. If called when spider_db is unconfigured, this will raise an AttributeError.

Apply this diff to add defensive None handling:

 def validate_and_load_spider_db_credentials_file(
     clp_config: ClpConfig, clp_home: pathlib.Path, generate_default_file: bool
 ):
     validate_credentials_file_path(clp_config, clp_home, generate_default_file)
+    if clp_config.spider_db is None:
+        raise ValueError("Spider DB configuration must be present to load credentials.")
     clp_config.spider_db.load_credentials_from_file(clp_config.credentials_file_path)

Alternatively, if callers are expected to check spider_db presence before calling, add a docstring clarifying this precondition.

components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1)

269-271: Critical: Parameterize the Spider DB password in CREATE USER statement.

The CREATE USER statement embeds spider_db_password directly into an f-string, which:

  1. Breaks correctness for passwords containing quotes or special characters
  2. Bypasses parameterized SQL practice established in this codebase

Based on learnings, SQL queries should use parameterized queries with placeholders and pass values as a tuple to prevent injection and ensure proper escaping.

Apply this diff to parameterize the password:

             db_cursor.execute(
-                f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY '{spider_db_password}'"""
+                f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY %s""",
+                (spider_db_password,)
             )

Note: The database/user names remain interpolated since they are validated by _validate_name() which restricts to alphanumeric, underscore, and hyphen characters.
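The failure mode is easy to reproduce outside MariaDB. A minimal sketch using the standard library's sqlite3 (qmark placeholders rather than the MariaDB driver's %s; the table and values are made up for illustration):

```python
import sqlite3

# A password containing a quote breaks naive f-string interpolation but is
# handled safely by a placeholder.
password = "pa'ss"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")

# Broken: the embedded quote terminates the SQL string literal early.
try:
    conn.execute(f"INSERT INTO users VALUES ('spider-user', '{password}')")
except sqlite3.OperationalError as err:
    print(f"interpolation failed: {err}")

# Safe: the driver escapes the value, so any password round-trips intact.
conn.execute("INSERT INTO users VALUES (?, ?)", ("spider-user", password))
stored = conn.execute("SELECT password FROM users").fetchone()[0]
assert stored == password
```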

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c312c19 and bd9a78d.

📒 Files selected for processing (10)
  • components/clp-package-utils/clp_package_utils/controller.py (7 hunks)
  • components/clp-package-utils/clp_package_utils/general.py (3 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (8 hunks)
  • components/clp-py-utils/clp_py_utils/create-db-tables.py (2 hunks)
  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1 hunks)
  • components/clp-py-utils/clp_py_utils/sql_adapter.py (2 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1 hunks)
  • components/package-template/src/etc/clp-config.yaml (1 hunks)
  • components/package-template/src/etc/credentials.template.yaml (1 hunks)
  • tools/deployment/package/docker-compose-all.yaml (5 hunks)
🧰 Additional context used
🧠 Learnings (23)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-11-28T15:12:53.506Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1655
File: components/clp-package-utils/clp_package_utils/controller.py:183-189
Timestamp: 2025-11-28T15:12:53.506Z
Learning: In the y-scope/clp project, root database credentials (root username and password) are required configuration items for all deployments. The `credentials[ClpDbUserType.ROOT]` entry is guaranteed to exist and can be accessed directly without optional handling.

Applied to files:

  • components/clp-py-utils/clp_py_utils/sql_adapter.py
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Applied to files:

  • components/clp-py-utils/clp_py_utils/sql_adapter.py
  • components/clp-package-utils/clp_package_utils/controller.py
  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-27T07:07:37.901Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
  • components/clp-package-utils/clp_package_utils/controller.py
  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-04-17T16:55:23.796Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:23.796Z
Learning: In the CLP project, SQL queries should use parameterized queries with placeholders (%s) and pass values as a tuple to `db_cursor.execute()` to prevent SQL injection, rather than directly interpolating values into the query string.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-01-23T17:08:55.566Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
  • components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-08-17T16:10:38.722Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.

Applied to files:

  • components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
  • components/clp-py-utils/clp_py_utils/clp_config.py
  • components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
  • components/clp-py-utils/clp_py_utils/clp_config.py
  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:02:31.113Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-25T06:29:59.610Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:14:12.225Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1466
File: .github/workflows/clp-rust-checks.yaml:14-15
Timestamp: 2025-10-22T21:14:12.225Z
Learning: Repository y-scope/clp: In GitHub Actions workflows (e.g., .github/workflows/clp-rust-checks.yaml), YAML anchors/aliases are acceptable and preferred to avoid duplication; if actionlint flags an alias node (e.g., on push.paths) as an error, treat it as a tool limitation and do not require inlining unless the team asks to silence the warning.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-20T21:05:30.417Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Applied to files:

  • tools/deployment/package/docker-compose-all.yaml
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/create-db-tables.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
  • ClpConfig (774-1050)
  • load_credentials_from_env (326-345)
  • load_credentials_from_env (372-377)
  • load_credentials_from_env (459-463)
  • load_credentials_from_env (524-529)
components/clp-py-utils/clp_py_utils/core.py (1)
  • read_yaml_config_file (58-64)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: rust-checks
  • GitHub Check: uv-checks (ubuntu-24.04)
🔇 Additional comments (21)
components/package-template/src/etc/credentials.template.yaml (1)

16-20: Template addition for Spider DB credentials looks good.

The new commented section follows the established pattern and remains inactive (all entries commented). Ensure the example username and password are consistent with the setup scripts and initialization tooling mentioned in the PR objectives.

Please verify that the example credentials "spider-user" and "pass" match what is used in the Spider DB setup scripts added elsewhere in this PR.

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)

618-623: LGTM! Consistent serialization in error path.

The addition of .model_dump() ensures that both the error path and success path (line 676) return a serialized dictionary, which aligns with the PR's objective of switching task payloads to raw bytes.
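The shape contract can be illustrated with the standard library's dataclasses; the real code uses pydantic's model_dump(), which plays the same role of producing a plain dict on both paths (the TaskResult fields here are hypothetical):

```python
from dataclasses import asdict, dataclass
from typing import Optional


# Hypothetical stand-in for the task's result model; the real class is a
# pydantic model and is serialized with .model_dump() instead of asdict().
@dataclass
class TaskResult:
    task_id: int
    status: str
    error_message: Optional[str] = None


ok = TaskResult(task_id=1, status="succeeded")
failed = TaskResult(task_id=2, status="failed", error_message="bad input")

# Both paths hand the caller the same primitive structure, so downstream
# serialization (e.g. msgpack or JSON) never receives a model object.
payloads = [asdict(ok), asdict(failed)]
assert all(isinstance(p, dict) for p in payloads)
```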

tools/deployment/package/docker-compose-all.yaml (4)

42-45: ✅ Volume anchor for clp-tmp is well-structured.

The new volume definition follows the established pattern and eliminates repetition across services.


96-97: ✅ db-table-creator environment variables correctly include Spider credentials.

The SPIDER_DB_PASS and SPIDER_DB_USER variables enable proper Spider database initialization during setup.


227-248: ✅ spider-scheduler service is correctly configured with proper dependencies and port exposure.

The service properly depends on db-table-creator and exposes port 6000. The JDBC storage URL is correctly constructed with required credentials.


299-299: I apologize, but I'm unable to complete the verification of this review comment due to persistent access issues in the sandbox environment. Despite multiple attempts using different approaches (direct shell access, GitHub CLI, web search), I cannot:

  1. Access the docker-compose-all.yaml file to verify the claimed duplicate hostname at lines 284 and 319
  2. Examine the start-spider-worker.py script to understand how scheduler connection parameters are handled (via CLI flags, environment variables, or defaults)
  3. Validate the specific claims about missing --host and --port arguments
  4. Confirm the proposed diff is accurate and complete

Without access to the actual codebase, I cannot determine whether:

  • The duplicate hostname issue actually exists
  • The spider-compression-worker service truly lacks scheduler connection parameters
  • The suggested fix is complete and addresses the root cause
  • There are any undocumented defaults or environment variable mechanisms that make the explicit flags unnecessary

Recommendation: Please verify this review comment manually by:

  1. Checking lines 284 and 319 of tools/deployment/package/docker-compose-all.yaml for the hostname values
  2. Examining start-spider-worker.py to understand its parameter expectations
  3. Comparing the spider-scheduler and spider-compression-worker service definitions for consistency

components/package-template/src/etc/clp-config.yaml (1)

29-38: LGTM! Spider configuration scaffolding is complete.

The commented configuration blocks for compression_scheduler.type, spider_db, and spider_scheduler correctly match the corresponding model definitions in clp_config.py. The port field for spider_scheduler now correctly defaults to 6000, aligning with SpiderScheduler.DEFAULT_PORT.

components/clp-py-utils/clp_py_utils/sql_adapter.py (1)

73-73: LGTM! Return type annotations improved for better type safety.

The updated return type annotations correctly reflect the actual return types from mysql.connector.connect() and mariadb.connect(). This aligns with the MariaDB driver version bump (>=1.1.14) in this PR.

Also applies to: 154-154

components/clp-py-utils/clp_py_utils/initialize-spider-db.py (2)

245-249: Good: Root connection approach is correct.

The script correctly uses SqlAdapter(clp_config.database) with ClpDbUserType.ROOT to obtain root privileges for database/user creation. This is the expected pattern since Spider DB runs on the same MariaDB instance as the CLP database, sharing root credentials.


26-220: LGTM! Spider database schema is comprehensive.

The DDL statements define a complete schema for Spider task orchestration including:

  • Core tables: drivers, schedulers, jobs, tasks
  • Relationship tables: input_tasks, output_tasks, task_dependencies
  • Data storage: data, task_inputs, task_outputs
  • Operational tables: task_instances, scheduler_leases, data_locality
  • KV storage: client_kv_data, task_kv_data

Foreign key constraints with appropriate ON DELETE CASCADE behaviour ensure referential integrity.
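The cascade behaviour can be exercised with a toy two-table version of the jobs/tasks relationship (sqlite here purely for illustration; the actual schema targets MariaDB):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite disables FK enforcement by default

# Toy parent/child pair mirroring the jobs -> tasks relationship.
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY)")
conn.execute(
    """CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        job_id INTEGER NOT NULL REFERENCES jobs(id) ON DELETE CASCADE
    )"""
)

conn.execute("INSERT INTO jobs VALUES (1)")
conn.executemany("INSERT INTO tasks VALUES (?, 1)", [(10,), (11,)])

# Deleting the parent job removes its tasks automatically.
conn.execute("DELETE FROM jobs WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
assert remaining == 0
```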

components/clp-py-utils/clp_py_utils/create-db-tables.py (1)

53-68: LGTM! Conditional Spider DB initialization is correctly implemented.

The logic properly:

  1. Loads CLP configuration and credentials
  2. Checks if spider_db is configured before attempting Spider DB initialization
  3. Logs an informative message and exits successfully when Spider DB is not configured
  4. Only runs initialize-spider-db when Spider DB configuration is present

The broad Exception catch at line 59 is consistent with existing patterns in this file and provides adequate error reporting for configuration loading failures.

components/clp-package-utils/clp_package_utils/general.py (1)

463-463: LGTM! Spider DB credentials generation follows established pattern.

The Spider DB entry correctly mirrors the existing credential generation pattern with a distinct username (spider-user) and appropriately sized password (8 characters, consistent with the database user).

components/clp-package-utils/clp_package_utils/controller.py (4)

330-350: LGTM! Spider DB environment setup is correctly implemented.

The _set_up_env_for_spider_db method properly:

  • Uses the SPIDER_DB_COMPONENT_NAME constant for logging
  • Sets SPIDER_DB_NAME from configuration
  • Uses the imported environment variable name constants for credentials

The credentials access pattern (spider_db.credentials.username/password) correctly matches the SpiderDb model's credentials: DbUserCredentials field.


352-369: LGTM! Spider scheduler environment setup follows established patterns.

The _set_up_env_for_spider_scheduler method correctly:

  • Uses SPIDER_SCHEDULER_COMPONENT_NAME for logging
  • Resolves the hostname to IP using _get_ip_from_hostname
  • Sets both SPIDER_SCHEDULER_HOST and SPIDER_SCHEDULER_PORT

916-922: LGTM! Conditional environment setup prevents errors when components are unconfigured.

The conditional checks properly guard against AttributeError when queue, redis, or Spider components are not configured:

  • Queue/Redis are now Optional and only set up when present
  • Spider components are only set up when OrchestrationType.spider is active

1010-1016: LGTM! Docker Compose file selection correctly handles Spider orchestration.

The logic properly selects the appropriate Docker Compose file based on deployment type and orchestration:

  • BASE + Spider → docker-compose-spider-base.yaml
  • BASE + Celery → docker-compose.base.yaml
  • FULL + Spider → docker-compose-spider.yaml
  • FULL + Celery → docker-compose.yaml
components/clp-py-utils/clp_py_utils/clp_config.py (5)

352-394: SpiderDb implementation is well-structured with proper credential separation.

The SpiderDb class correctly:

  • Extends Database with Spider-specific defaults
  • Enforces MariaDB via validate_type with proper return value
  • Uses separate DbUserCredentials instead of the dict-based credentials
  • Implements dedicated load_credentials_from_env/file methods reading Spider-specific env vars and config keys

396-405: LGTM! SpiderScheduler follows established patterns.

The SpiderScheduler class correctly implements:

  • DEFAULT_PORT ClassVar for consistency
  • DomainStr type for host
  • transform_for_container() method for container orchestration

997-1021: LGTM! Orchestration validators enforce correct configuration combinations.

The validators properly ensure:

  • Spider orchestration requires both spider_db and spider_scheduler, plus MariaDB for the main database
  • Celery orchestration requires both queue and redis

This prevents runtime errors from misconfigured orchestration setups.
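The invariants can be restated as a small standalone check, independent of pydantic (the component objects and names here are placeholders for the actual config models):

```python
def validate_orchestration(
    orchestration,
    spider_db=None,
    spider_scheduler=None,
    db_type="mariadb",
    queue=None,
    redis=None,
):
    """Raise ValueError when an orchestration type lacks its required components."""
    if orchestration == "spider":
        if spider_db is None or spider_scheduler is None:
            raise ValueError("Spider orchestration requires spider_db and spider_scheduler.")
        if db_type != "mariadb":
            raise ValueError("Spider orchestration requires a MariaDB main database.")
    elif orchestration == "celery":
        if queue is None or redis is None:
            raise ValueError("Celery orchestration requires queue and redis.")


# A misconfigured Spider setup fails fast at config-load time.
try:
    validate_orchestration("spider")
except ValueError as err:
    print(err)

# A complete Spider setup passes.
validate_orchestration("spider", spider_db=object(), spider_scheduler=object())
```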


978-983: LGTM! dump_to_primitive_dict correctly handles optional fields.

The updated serialization:

  • Includes spider_db in the credential-redacting custom fields
  • Properly handles None values for optional components (queue, redis, spider_db)

This prevents credential leakage while supporting the new optional field patterns.


1038-1045: LGTM! Container transformation properly guards optional components.

The conditional calls to transform_for_container() prevent AttributeError when optional components (queue, redis, spider_db, spider_scheduler) are None.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)

456-466: Handle Spider DB credential / task-manager initialisation errors gracefully

The Spider branch currently calls clp_config.spider_db.load_credentials_from_env() and SpiderTaskManager(...) without protection. If Spider DB env vars are missing or get_container_url() fails, the scheduler will crash with an uncaught exception instead of logging and exiting cleanly, unlike the earlier DB-credential path.

Wrap the Spider initialisation in a try/except and return -1 on failure, e.g.:

-    task_manager: CeleryTaskManager | SpiderTaskManager
-    if clp_config.compression_scheduler.type == OrchestrationType.celery:
-        task_manager = CeleryTaskManager()
-    elif clp_config.compression_scheduler.type == OrchestrationType.spider:
-        clp_config.spider_db.load_credentials_from_env()
-        task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
-    else:
-        logger.error(
-            f"Unsupported compression scheduler type: {clp_config.compression_scheduler.type}"
-        )
-        return -1
+    task_manager: CeleryTaskManager | SpiderTaskManager
+    if clp_config.compression_scheduler.type == OrchestrationType.celery:
+        task_manager = CeleryTaskManager()
+    elif clp_config.compression_scheduler.type == OrchestrationType.spider:
+        try:
+            clp_config.spider_db.load_credentials_from_env()
+            task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
+        except Exception:
+            logger.exception("Failed to initialise Spider task manager.")
+            return -1
+    else:
+        logger.error(
+            "Unsupported compression scheduler type: %s",
+            clp_config.compression_scheduler.type,
+        )
+        return -1

This keeps Spider initialisation behaviour aligned with the existing configuration/DB error handling and avoids uncaught ValueErrors on missing Spider env vars.

components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1)

247-279: Parameterise Spider DB password in CREATE USER DDL

spider_db_password is interpolated directly into the CREATE USER statement:

db_cursor.execute(
    f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY '{spider_db_password}'"""
)

Now that password validation allows arbitrary characters, this is brittle (breaks on quotes) and bypasses the project’s practice of using parameterised SQL for values. Identifiers here are already validated via _validate_name, so the remaining risk/value is the password.

Use a %s placeholder for the password and let the driver escape it:

-            db_cursor.execute(f"""CREATE DATABASE IF NOT EXISTS `{spider_db_name}`""")
+            db_cursor.execute(f"CREATE DATABASE IF NOT EXISTS `{spider_db_name}`")

             if spider_db_password is None:
                 logger.error("Password must be set for Spider database user.")
                 return -1
             db_cursor.execute(
-                f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY '{spider_db_password}'"""
+                f"CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY %s",
+                (spider_db_password,),
             )

This keeps identifier handling as-is while ensuring passwords with special characters work correctly and follow the repo’s parameterisation guidelines. Based on learnings.
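The value of parameterisation can be demonstrated with a driver-agnostic sketch. Here sqlite3 (with `?` placeholders) stands in for the MariaDB driver's `%s` placeholders, and the table and credentials are purely illustrative:

```python
# Demonstration (sqlite3 as a stand-in driver; MariaDB-style %s
# placeholders behave analogously): parameterised values survive
# quotes and other characters that break f-string interpolation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")

password = "p@ss'word\"--"  # would corrupt a naively interpolated statement
conn.execute("INSERT INTO users VALUES (?, ?)", ("spider", password))

(stored,) = conn.execute(
    "SELECT password FROM users WHERE name = ?", ("spider",)
).fetchone()
assert stored == password  # round-trips intact; the driver handles escaping
```

The same principle applies to the `CREATE USER ... IDENTIFIED BY %s` statement above: the driver, not string formatting, is responsible for quoting the value.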

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd9a78d and 5135fe8.

📒 Files selected for processing (5)
  • components/clp-package-utils/clp_package_utils/controller.py (7 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (8 hunks)
  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1 hunks)
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3 hunks)
  • components/package-template/src/etc/clp-config.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (20)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
  • components/package-template/src/etc/clp-config.yaml
  • components/clp-py-utils/clp_py_utils/clp_config.py
  • components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
  • components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-09-19T18:28:26.747Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 1169
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:463-469
Timestamp: 2025-09-19T18:28:26.747Z
Learning: In the compression scheduler (components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py), the SIGTERM handling uses a coarse-grained approach that only checks received_sigterm before calling search_and_schedule_new_tasks. The maintainers consider the race condition where SIGTERM arrives mid-execution to be benign, as scheduling a few extra tasks during shutdown is acceptable and will be handled by cleanup mechanisms.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-10-27T07:07:37.901Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
  • components/clp-py-utils/clp_py_utils/clp_config.py
  • components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-04-17T16:55:23.796Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:23.796Z
Learning: In the CLP project, SQL queries should use parameterized queries with placeholders (%s) and pass values as a tuple to `db_cursor.execute()` to prevent SQL injection, rather than directly interpolating values into the query string.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-01-23T17:08:55.566Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.

Applied to files:

  • components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
  • components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:02:31.113Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-25T06:29:59.610Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:14:12.225Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1466
File: .github/workflows/clp-rust-checks.yaml:14-15
Timestamp: 2025-10-22T21:14:12.225Z
Learning: Repository y-scope/clp: In GitHub Actions workflows (e.g., .github/workflows/clp-rust-checks.yaml), YAML anchors/aliases are acceptable and preferred to avoid duplication; if actionlint flags an alias node (e.g., on push.paths) as an error, treat it as a tool limitation and do not require inlining unless the team asks to silence the warning.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.

Applied to files:

  • components/clp-package-utils/clp_package_utils/controller.py
🧬 Code graph analysis (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
  • OrchestrationType (130-132)
  • load_credentials_from_env (327-346)
  • load_credentials_from_env (382-387)
  • load_credentials_from_env (472-476)
  • load_credentials_from_env (537-542)
components/job-orchestration/job_orchestration/scheduler/compress/task_manager/spider_task_manager.py (1)
  • SpiderTaskManager (14-48)
components/job-orchestration/job_orchestration/scheduler/compress/task_manager/celery_task_manager.py (1)
  • CeleryTaskManager (12-27)
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
  • ClpDbUserType (189-193)
  • load_credentials_from_env (327-346)
  • load_credentials_from_env (382-387)
  • load_credentials_from_env (472-476)
  • load_credentials_from_env (537-542)
components/clp-py-utils/clp_py_utils/core.py (1)
  • read_yaml_config_file (58-64)
components/clp-py-utils/clp_py_utils/sql_adapter.py (1)
  • create_connection (69-85)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: package-image
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: rust-checks
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: build-macos (macos-15, true)
🔇 Additional comments (6)
components/package-template/src/etc/clp-config.yaml (2)

32-39: Verify spider_db port configuration to avoid conflicts.

Both the primary database (line 23) and the new spider_db (line 34) are configured with port 3306 by default. This will cause a conflict if both services are bundled in the same deployment. Confirm whether spider_db should use a different default port or whether the configuration logic ensures these services run on separate hosts.
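One way to surface such collisions early is a small pre-deployment check. This is a hypothetical helper (the service names and endpoints are illustrative, not the project's actual config model):

```python
# Hypothetical helper: detect host:port collisions between configured
# services before deployment starts.
def find_port_conflicts(
    services: dict[str, tuple[str, int]],
) -> list[tuple[str, str]]:
    seen: dict[tuple[str, int], str] = {}
    conflicts: list[tuple[str, str]] = []
    for name, endpoint in services.items():
        if endpoint in seen:
            conflicts.append((seen[endpoint], name))
        else:
            seen[endpoint] = name
    return conflicts

conflicts = find_port_conflicts({
    "database": ("localhost", 3306),
    "spider_db": ("localhost", 3306),
})
assert conflicts == [("database", "spider_db")]
```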


30-30: Approve compression_scheduler type addition.

The addition of type: "celery" to the compression_scheduler section aligns with the PR objective of integrating Spider compression orchestration. The configuration template now properly reflects the Celery-based scheduler setup.

components/clp-package-utils/clp_package_utils/controller.py (2)

13-50: Env wiring for queue/redis and Spider components looks consistent

  • Using CLP_QUEUE_* and CLP_REDIS_PASS_ENV_VAR_NAME to expose queue/redis credentials keeps env names centralised in clp_config and matches the corresponding load_credentials_from_env implementations.
  • New _set_up_env_for_spider_db() / _set_up_env_for_spider_scheduler() helpers follow the same pattern as other components and are only invoked when compression_scheduler.type == OrchestrationType.spider, which pairs well with ClpConfig.validate_spider_config.
  • Guarding _set_up_env_for_queue() / _set_up_env_for_redis() behind queue/redis being non-None makes the Optional orchestration config safe.

No functional issues spotted in these changes.

Also applies to: 248-375, 915-928
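The credential-loading pattern described above can be sketched as follows. The environment variable names here are assumptions for illustration, not the project's actual constants:

```python
# Minimal sketch (env var names are illustrative): load Spider DB
# credentials from the environment and fail loudly when one is missing.
import os


def load_credentials_from_env(
    user_var: str = "SPIDER_DB_USER",
    pass_var: str = "SPIDER_DB_PASSWORD",
) -> tuple[str, str]:
    try:
        return os.environ[user_var], os.environ[pass_var]
    except KeyError as e:
        raise ValueError(
            f"Missing required environment variable: {e.args[0]}"
        ) from None


os.environ["SPIDER_DB_USER"] = "spider"
os.environ["SPIDER_DB_PASSWORD"] = "secret"
assert load_credentials_from_env() == ("spider", "secret")
```

Raising a descriptive error (rather than letting a bare `KeyError` propagate) is what makes the scheduler-side `try/except` wrapping discussed earlier produce useful logs.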


1015-1022: Docker Compose file selection for Spider orchestration is sane

Deriving compression_scheduler_type once and branching on both deployment type and orchestration type:

  • Uses docker-compose-spider-base.yaml / docker-compose-spider.yaml when Spider orchestration is enabled.
  • Falls back to the existing base/full compose files for Celery.

The control flow is straightforward and avoids ambiguity between base/full + spider/Celery combinations.
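The branching can be reduced to a small pure function. The Spider file names are taken from the review text; the Celery-side names are assumptions for illustration:

```python
# Illustrative selection logic (Celery file names are assumed):
# pick a compose file from deployment type and orchestration type.
def select_compose_file(deployment: str, orchestration: str) -> str:
    if orchestration == "spider":
        if deployment == "base":
            return "docker-compose-spider-base.yaml"
        return "docker-compose-spider.yaml"
    if deployment == "base":
        return "docker-compose-base.yaml"
    return "docker-compose.yaml"


assert select_compose_file("base", "spider") == "docker-compose-spider-base.yaml"
assert select_compose_file("full", "celery") == "docker-compose.yaml"
```

Keeping the decision in one function avoids scattering base/full and Spider/Celery checks across the controller.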

components/clp-py-utils/clp_py_utils/clp_config.py (2)

130-136: Orchestration types, Spider config, and container transforms are wired coherently

  • OrchestrationType and CompressionScheduler.type give a clear, enum-backed switch between Celery and Spider orchestration.
  • SpiderDb:
    • Enforces MariaDB via a field validator.
    • Uses dedicated env var names (SPIDER_DB_USER_ENV_VAR_NAME, SPIDER_DB_PASS_ENV_VAR_NAME) and credentials loading helpers.
    • Is redacted correctly in ClpConfig.dump_to_primitive_dict(), matching database/queue/redis behaviour.
  • SpiderScheduler:
    • Follows existing patterns (DEFAULT_PORT, DomainStr host, transform_for_container() using SPIDER_SCHEDULER_COMPONENT_NAME).
  • ClpConfig:
    • Makes queue and redis optional while validate_celery_config() enforces their presence for Celery.
    • validate_spider_config() enforces spider_db/spider_scheduler presence and a MariaDB main metadata DB for Spider, which matches the deployment model.
    • transform_for_container() now guards optional queue/redis and transforms Spider components when present, so container deployments see consistent hostnames.

These pieces fit together cleanly and should make the Celery/Spider split explicit and robust.

Also applies to: 353-415, 787-807, 991-1035, 1036-1058


353-381: Based on my verification attempts, I encountered limitations that prevent full confirmation of the review comment:

Web search findings:

  • Generic JDBC URL format confirmed: jdbc:subprotocol:subname
  • Standard JDBC practice: username/password passed as separate parameters (properties dict) rather than embedded in the URL query string
  • No authoritative documentation found for the specific spider_py package's Driver class

Codebase access:

  • Repository clone failed; unable to verify actual usage patterns, tests, or deployment context for get_container_url()
  • Cannot confirm whether spider_py actually expects credentials in the query string format used here, or if they're handled through a separate parameters dict

Key insight:
The review comment's concern about URL-encoding is theoretically valid from first principles (special characters like &, ?, = in raw credentials would corrupt a query string). However, the standard JDBC pattern suggests credentials should be passed separately, which would make this entire code path potentially incorrect—not just missing encoding. Without access to spider_py specifics or actual codebase context, I cannot determine:

  1. Whether this code is actively used
  2. Whether spider_py expects this exact URL format
  3. Whether credentials with special characters actually fail in practice

Recommendation:
The original review comment's request for verification against spider_py.Driver's expected URL format remains appropriate and necessary before deciding on a fix approach.
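If credentials do end up embedded in a URL query string, the encoding concern is easy to demonstrate. This sketch is not spider_py's actual API; it only shows the percent-encoding step:

```python
# Illustrative only (not spider_py's API): percent-encode credentials
# so characters like '&', '?', and '=' cannot corrupt the query string.
from urllib.parse import quote_plus


def build_db_url(host: str, port: int, db: str, user: str, password: str) -> str:
    return (
        f"jdbc:mariadb://{host}:{port}/{db}"
        f"?user={quote_plus(user)}&password={quote_plus(password)}"
    )


url = build_db_url("spider-db", 3306, "spider", "clp", "p&ss=word?")
assert "p&ss=word?" not in url      # raw password never appears in the URL
assert "p%26ss%3Dword%3F" in url    # its encoded form does
```

If spider_py instead accepts credentials through a separate parameters dict (the standard JDBC pattern), no encoding is needed and this code path should be restructured rather than patched.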

Comment on lines +228 to +237
config_path = pathlib.Path(parsed_args.config)
try:
clp_config = ClpConfig.model_validate(read_yaml_config_file(config_path))
clp_config.database.load_credentials_from_env()
clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.CLP)
clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.ROOT)
if clp_config.spider_db is None:
logger.error("Spider database configuration not found in CLP configuration.")
return -1
clp_config.spider_db.load_credentials_from_env()

🧹 Nitpick | 🔵 Trivial

Avoid redundant CLP DB credential loads

You call clp_config.database.load_credentials_from_env() for the default user and then immediately call it again for ClpDbUserType.CLP and ClpDbUserType.ROOT. The first call is redundant:

-        clp_config = ClpConfig.model_validate(read_yaml_config_file(config_path))
-        clp_config.database.load_credentials_from_env()
-        clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.CLP)
-        clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.ROOT)
+        clp_config = ClpConfig.model_validate(read_yaml_config_file(config_path))
+        clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.CLP)
+        clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.ROOT)

Not a correctness issue, but it simplifies the intent.

🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/initialize-spider-db.py around lines 228
to 237, the unparameterized call clp_config.database.load_credentials_from_env()
is redundant because you immediately call the same method for ClpDbUserType.CLP
and ClpDbUserType.ROOT; remove the first call and keep the explicit user_type
calls (or replace the three calls with a single loop over the required user
types) so credentials are loaded only once per intended user, preserving the
spider_db null check and its subsequent load_credentials_from_env() call.

Comment on lines +37 to +39
#spider_scheduler:
# host: "localhost"
# port: 6000

🧹 Nitpick | 🔵 Trivial

Consider adding logging_level field to spider_scheduler for consistency.

The compression_scheduler (line 29) and query_scheduler (line 46) both include logging_level configuration, but spider_scheduler omits it. For consistency and to allow users to easily configure logging verbosity for the Spider scheduler, consider adding a logging_level field (e.g., logging_level: "INFO").

Apply this diff to add the logging_level field:

 #spider_scheduler:
 #  host: "localhost"
 #  port: 6000
+#  logging_level: "INFO"
🤖 Prompt for AI Agents
In components/package-template/src/etc/clp-config.yaml around lines 37 to 39,
the commented-out spider_scheduler block lacks a logging_level key for
consistency with compression_scheduler and query_scheduler; restore or update
the spider_scheduler block to include a logging_level field (for example
logging_level: "INFO") alongside host and port so users can configure verbosity,
ensuring indentation and quoting match the surrounding entries.
