feat(package)!: Add Spider compression orchestration in docker compose package. #1606
base: main
Conversation
…se plugin versions to requirements list.
This reverts commit 7357a7e.
…ng about plan for replacement.
# Conflicts: # docs/src/dev-docs/building-package.md
…lp-package Dockerfile (fixes y-scope#1379); Reduce layers using multi-line ENV (fixes y-scope#1378).
…r` and `spider_worker`) in the `clp-package` image; Set `mariadb-connector-cpp` build type to `Release` (fixes y-scope#1410).
…t working directory.
Actionable comments posted: 2
♻️ Duplicate comments (2)
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (2)
269-271: Critical: Parameterize Spider DB password in CREATE USER statement.

The password is embedded directly in the f-string, creating two concrete problems:
- Correctness: passwords containing quotes or special characters will break the SQL statement.
- Security hygiene: this bypasses the project's standard practice of using parameterized SQL with `%s` placeholders and tuple arguments to prevent injection risks. Based on learnings

The database and user names are already validated by `_validate_name`, so they remain safely interpolated. The password must be passed as a bound parameter. Apply this diff:

```diff
 db_cursor.execute(
-    f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY '{spider_db_password}'"""
+    f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY %s""",
+    (spider_db_password,)
 )
```

This lets the MariaDB driver handle escaping the secret value correctly.
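The parameter-binding principle is easy to verify in isolation. Below is a minimal, hypothetical sketch using Python's stdlib `sqlite3` as a stand-in for the MariaDB connector (which uses `%s` placeholders instead of `?`); the table and values are illustrative only:

```python
import sqlite3

# A secret full of SQL metacharacters that would break naive f-string interpolation.
password = "p@ss'word--\"123"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")

# The driver binds the value; quoting and escaping are handled for us.
conn.execute("INSERT INTO users VALUES (?, ?)", ("spider", password))

row = conn.execute(
    "SELECT password FROM users WHERE name = ?", ("spider",)
).fetchone()
assert row[0] == password  # the secret round-trips intact, quotes and all
```

The same shape applies to the MariaDB driver: pass the statement with a placeholder and the values as a tuple to `execute()`.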
245-250: Add clarifying comment on SqlAdapter initialization for future maintainability.

The `SqlAdapter` is initialized with `clp_config.database` because root operations (CREATE DATABASE, CREATE USER, GRANT) must execute against the CLP database server where root credentials are available. This is correct for the current single-MariaDB-instance Docker setup where Spider DB shares the same server. However, a clarifying comment would help prevent confusion during future refactors that might introduce separate Spider DB hosts.

Add a comment above line 246:

```diff
+# Note: SqlAdapter uses clp_config.database to connect with root credentials
+# for CREATE DATABASE/USER/GRANT operations. In the current Docker setup,
+# Spider DB resides on the same MariaDB instance as CLP metadata DB.
 sql_adapter = SqlAdapter(clp_config.database)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
- components/clp-py-utils/clp_py_utils/create-db-tables.py (2 hunks)
- components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1 hunks)
- components/package-template/src/etc/clp-config.yaml (1 hunks)
- components/package-template/src/etc/credentials.template.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Applied to files:
components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-04-17T16:55:23.796Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:23.796Z
Learning: In the CLP project, SQL queries should use parameterized queries with placeholders (%s) and pass values as a tuple to `db_cursor.execute()` to prevent SQL injection, rather than directly interpolating values into the query string.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-01-23T17:08:55.566Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.py
🧬 Code graph analysis (2)
components/clp-py-utils/clp_py_utils/create-db-tables.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (6)
- ClpConfig (682-952)
- StorageEngine (101-103)
- load_credentials_from_env (254-259)
- load_credentials_from_env (284-289)
- load_credentials_from_env (367-371)
- load_credentials_from_env (432-437)
components/clp-py-utils/clp_py_utils/core.py (1)
- read_yaml_config_file (58-64)
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (4)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
- ClpConfig (682-952)
- load_credentials_from_env (254-259)
- load_credentials_from_env (284-289)
- load_credentials_from_env (367-371)
- load_credentials_from_env (432-437)
components/clp-py-utils/clp_py_utils/core.py (1)
- read_yaml_config_file (58-64)
components/clp-py-utils/clp_py_utils/sql_adapter.py (2)
- SqlAdapter (60-145)
- create_root_mariadb_connection (109-119)
components/clp-py-utils/clp_py_utils/create-db-tables.py (1)
- main (21-72)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build-macos (macos-15, true)
🔇 Additional comments (4)
components/clp-py-utils/clp_py_utils/create-db-tables.py (1)
7-8: LGTM! The new imports are correctly scoped and necessary for Spider DB initialization logic.
components/package-template/src/etc/credentials.template.yaml (1)
14-18: LGTM! The Spider DB credentials template follows the same pattern as existing credential sections and provides appropriate scaffolding for Spider-based deployments.
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (2)
228-241: LGTM! Configuration loading and credential validation are correctly implemented. The Spider-specific credential loader properly reads the `SPIDER_DB_USER` and `SPIDER_DB_PASS` environment variables, and exception handling appropriately distinguishes between validation errors and unexpected failures.
289-298: LGTM! The name validation pattern correctly restricts database and user identifiers to alphanumeric characters, underscores, and hyphens, preventing SQL injection through identifier interpolation. This aligns with the team's preference for sanitizing even trusted inputs for robustness. Based on learnings
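As a hedged illustration of that identifier check (the exact regex inside `_validate_name` is an assumption inferred from the description above, not the PR's actual code):

```python
import re

# Allow only alphanumerics, underscores, and hyphens, per the review's description.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")

def validate_name(name: str) -> None:
    """Reject identifiers that could smuggle SQL metacharacters into a query."""
    if not _NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid identifier: {name!r}")

validate_name("spider-db")       # accepted
validate_name("spider_user_01")  # accepted
try:
    validate_name("db'; DROP TABLE users; --")
except ValueError as e:
    print(e)  # the injection attempt is rejected before any SQL is built
```

Because validated identifiers cannot contain quotes or whitespace, interpolating them into DDL statements is safe even though identifiers cannot be bound as parameters.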
Actionable comments posted: 3
♻️ Duplicate comments (3)
tools/deployment/package/docker-compose-all.yaml (2)
305-305: Rename hostname to avoid ambiguity with compression-worker.

Both the `compression-worker` (line 270) and `spider-compression-worker` (line 305) services use the same hostname `"compression_worker"`, which creates ambiguity in logs and metrics. Change the spider-compression-worker hostname to a distinct value.

```diff
 spider-compression-worker:
   <<: *service_defaults
-  hostname: "compression_worker"
+  hostname: "spider_compression_worker"
```
303-334: Restore Spider scheduler connection arguments in spider-compression-worker command.

The `spider-compression-worker` service is missing the `--host` and `--port` arguments required to connect to the Spider scheduler for task placement and coordination. Previous review iterations explicitly flagged this inconsistency, and the spider-scheduler service (lines 227–228) properly demonstrates the pattern using environment variables.

Additionally, `SPIDER_SCHEDULER_HOST` and `SPIDER_SCHEDULER_PORT` are not defined in the environment section (lines 306–313), even though the command needs them.

Apply this diff to add the missing scheduler connection arguments and environment variables:

```diff
 spider-compression-worker:
   <<: *service_defaults
   hostname: "spider_compression_worker"
   environment:
     CLP_CONFIG_PATH: "/etc/clp-config.yaml"
     CLP_HOME: "/opt/clp"
     CLP_LOGGING_LEVEL: "${CLP_COMPRESSION_WORKER_LOGGING_LEVEL:-INFO}"
     CLP_LOGS_DIR: "/var/log/compression_worker"
     CLP_WORKER_LOG_PATH: "/var/log/compression_worker/worker.log"
     PYTHONPATH: "/opt/clp/lib/python3/site-packages"
     SPIDER_LOG_DIR: "/var/log/compression_worker"
+    SPIDER_SCHEDULER_HOST: "${SPIDER_SCHEDULER_HOST:?Please set a value.}"
+    SPIDER_SCHEDULER_PORT: "${SPIDER_SCHEDULER_PORT:?Please set a value.}"
   volumes:
     - *volume_clp_config_readonly
     - *volume_clp_logs
     - *volume_clp_tmp
     - "${CLP_ARCHIVE_OUTPUT_DIR_HOST:-empty}:/var/data/archives"
     - "${CLP_AWS_CONFIG_DIR_HOST:-empty}:/opt/clp/.aws:ro"
     - "${CLP_LOGS_INPUT_DIR_HOST:-empty}:${CLP_LOGS_INPUT_DIR_CONTAINER:-/mnt/logs}"
     - "${CLP_STAGED_ARCHIVE_OUTPUT_DIR_HOST:-empty}:/var/data/staged-archives"
   depends_on:
     db-table-creator:
       condition: "service_completed_successfully"
   command: [
     "python3", "-u",
     "/opt/clp/lib/python3/site-packages/clp_py_utils/start-spider-worker.py",
     "--concurrency", "${CLP_COMPRESSION_WORKER_CONCURRENCY:-1}",
     "--storage-url",
     "jdbc:mariadb://database:${CLP_DB_PORT:-3306}/\
     ${SPIDER_DB_NAME:-spider-db}?\
     user=${SPIDER_DB_USER:?Please set a value.}\
     &password=${SPIDER_DB_PASS:?Please set a value.}",
+    "--host", "${SPIDER_SCHEDULER_HOST}",
+    "--port", "${SPIDER_SCHEDULER_PORT}",
   ]
```

components/clp-py-utils/clp_py_utils/start-spider-worker.py (1)
57-61: Wait for terminated processes to prevent zombies.The exception handler terminates processes but doesn't wait for them to finish cleanup, which can leave zombie processes. This issue was previously flagged in past reviews and remains unresolved.
Apply this diff to ensure proper cleanup:

```diff
 except OSError as e:
     logger.error(f"Failed to start spider worker: {e}")
     for process in processes:
         process.terminate()
+        try:
+            process.wait(timeout=5)
+        except subprocess.TimeoutExpired:
+            process.kill()
+            process.wait()
     exit(1)
```
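The terminate-then-reap pattern can be exercised standalone. A minimal sketch with stand-in workers (the sleep commands are illustrative, not CLP's actual spider_worker invocation):

```python
import subprocess
import sys

# Spawn a few stand-in "workers" that would otherwise run for a long time.
processes = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(600)"])
    for _ in range(3)
]

# On a startup failure, terminate each worker and reap it so no zombies linger.
for process in processes:
    process.terminate()
    try:
        process.wait(timeout=5)   # reap; raises if the worker ignores SIGTERM
    except subprocess.TimeoutExpired:
        process.kill()            # escalate to SIGKILL, then reap unconditionally
        process.wait()

# Every process has been reaped: returncode is populated for all of them.
assert all(p.returncode is not None for p in processes)
```

Without the `wait()` calls, the terminated children would remain as zombies until the parent exits and init reaps them.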
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
- components/clp-py-utils/clp_py_utils/start-spider-worker.py (1 hunks)
- tools/deployment/package/docker-compose-all.yaml (5 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.
Applied to files:
components/clp-py-utils/clp_py_utils/start-spider-worker.py
tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.
Applied to files:
tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-20T21:05:30.417Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).
Applied to files:
tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
Applied to files:
tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.
Applied to files:
tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.
Applied to files:
tools/deployment/package/docker-compose-all.yaml
🔇 Additional comments (7)
tools/deployment/package/docker-compose-all.yaml (2)
238-238: Verify intentionality of compression-scheduler stop_grace_period increase.

The `compression-scheduler` service now has `stop_grace_period: "300s"` instead of the previous 60 seconds. Confirm this longer timeout is necessary for Spider integration and document the reason if it persists across the codebase.
42-45: Volume anchor consolidation applied successfully.

The new `volume_clp_tmp` anchor eliminates duplication across the `compression-worker` and `spider-compression-worker` services. This aligns with earlier review suggestions and improves maintainability.

components/clp-py-utils/clp_py_utils/start-spider-worker.py (5)
1-16: LGTM! The imports and logging setup are appropriate and follow standard Python practices.

35-40: LGTM! The argument parsing and concurrency validation correctly ensure at least one worker is spawned. This addresses the previous review feedback about validating the concurrency parameter.

43-47: LGTM! The worker path resolution and existence check are well-implemented and address previous feedback about validating the executable before attempting to spawn subprocesses.

63-70: LGTM! The process monitoring and failure handling correctly track and report worker failures, ensuring the script exits with a non-zero status if any worker fails. This addresses previous feedback about handling non-zero exit codes.

73-74: LGTM! Standard Python entry point.
LinZhihao-723
left a comment
Overall lgtm.
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
Co-authored-by: Lin Zhihao <[email protected]>
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/clp-py-utils/clp_py_utils/start-spider-worker.py (1)
66-68: Add wait() after terminate() to prevent zombie processes.When an error occurs during worker startup, the script terminates already-started processes but doesn't wait for them to complete cleanup. This can leave zombie processes.
Apply this diff to ensure proper cleanup:

```diff
 except OSError as e:
     logger.error(f"Failed to start Spider worker: {e}")
     for process in processes:
         process.terminate()
+        process.wait()
     exit(1)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
- components/clp-py-utils/clp_py_utils/start-spider-worker.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.
Applied to files:
components/clp-py-utils/clp_py_utils/start-spider-worker.py
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/start-spider-worker.py (2)
components/clp-py-utils/clp_py_utils/create-db-tables.py (1)
- main (21-72)
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1)
- main (223-286)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: rust-checks
🔇 Additional comments (4)
components/clp-py-utils/clp_py_utils/start-spider-worker.py (4)
1-16: LGTM! Logger setup is appropriate.

The imports are correct and the logger configuration follows Python best practices. The script consistently uses the logger instead of print statements throughout.

38-40: Concurrency validation is correct.

The check `concurrency >= 1` is appropriate because running a single worker (concurrency=1) is a valid use case. There's no need to require multiple workers (>1).

43-54: LGTM! Proper error handling and validation.

The hostname resolution includes appropriate exception handling for socket errors, and the script validates the `spider_worker` executable exists before attempting to spawn processes. This prevents common failure modes.

70-77: LGTM! Exit code handling is well-implemented.

The script correctly waits for all worker processes, logs non-zero exit codes, and returns an appropriate exit status to indicate failure. This follows best practices for process orchestration.
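That orchestration shape, waiting on every worker, logging non-zero exits, and propagating failure, can be sketched as follows (the worker commands and counts are illustrative, not the script's actual code):

```python
import subprocess
import sys

# Stand-in workers: two succeed, one exits with a non-zero status.
snippets = ["import sys; sys.exit(0)", "import sys; sys.exit(3)", "import sys; sys.exit(0)"]
processes = [subprocess.Popen([sys.executable, "-c", s]) for s in snippets]

num_failures = 0
for process in processes:
    returncode = process.wait()  # block until this worker finishes
    if returncode != 0:
        num_failures += 1
        print(f"worker pid={process.pid} exited with {returncode}")

# Report failure to the supervisor (docker compose, systemd, ...) via exit status.
exit_status = 1 if num_failures > 0 else 0
```

A single failing worker is enough to make the orchestrator exit non-zero, which lets the container runtime apply its restart policy.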
Co-authored-by: Lin Zhihao <[email protected]>
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
- components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-19T18:28:26.747Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 1169
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:463-469
Timestamp: 2025-09-19T18:28:26.747Z
Learning: In the compression scheduler (components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py), the SIGTERM handling uses a coarse-grained approach that only checks received_sigterm before calling search_and_schedule_new_tasks. The maintainers consider the race condition where SIGTERM arrives mid-execution to be benign, as scheduling a few extra tasks during shutdown is acceptable and will be handled by cleanup mechanisms.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧬 Code graph analysis (1)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (6)
- OrchestrationType (117-119)
- load_credentials_from_env (254-259)
- load_credentials_from_env (280-285)
- load_credentials_from_env (363-367)
- load_credentials_from_env (428-433)
- get_container_url (276-278)
components/job-orchestration/job_orchestration/scheduler/compress/task_manager/spider_task_manager.py (1)
- SpiderTaskManager (14-48)
components/job-orchestration/job_orchestration/scheduler/compress/task_manager/celery_task_manager.py (1)
- CeleryTaskManager (12-27)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: rust-checks
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (2)
20-20: LGTM! The new imports for `OrchestrationType` and `SpiderTaskManager` are correctly added and necessary for the Spider orchestration integration.

Also applies to: 37-37

523-523: LGTM! The forward declaration of `task_manager` with a union type annotation improves scope clarity and addresses the suggestion from the previous review.
```python
elif clp_config.compression_scheduler.type == OrchestrationType.spider:
    clp_config.spider_db.load_credentials_from_env()
    task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
```
Wrap Spider credential loading in exception handling.
The call to clp_config.spider_db.load_credentials_from_env() at line 527 is outside the try-except block (lines 509-518), which means if required environment variables are missing, the program will crash with an unhandled ValueError instead of logging an error and exiting gracefully. This is inconsistent with the database credential loading at line 511, which is properly wrapped in exception handling.
Apply this diff to add proper exception handling:
```diff
 logger.info(f"Starting {COMPRESSION_SCHEDULER_COMPONENT_NAME}")
 sql_adapter = SqlAdapter(clp_config.database)

 task_manager: CeleryTaskManager | SpiderTaskManager
 if clp_config.compression_scheduler.type == OrchestrationType.celery:
     task_manager = CeleryTaskManager()
 elif clp_config.compression_scheduler.type == OrchestrationType.spider:
-    clp_config.spider_db.load_credentials_from_env()
-    task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
+    try:
+        clp_config.spider_db.load_credentials_from_env()
+        task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
+    except (ValueError, Exception) as err:
+        logger.error(f"Failed to initialize Spider task manager: {err}")
+        return -1
 else:
     logger.error(
         f"Unsupported compression scheduler type: {clp_config.compression_scheduler.type}"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
elif clp_config.compression_scheduler.type == OrchestrationType.spider:
    try:
        clp_config.spider_db.load_credentials_from_env()
        task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
    except (ValueError, Exception) as err:
        logger.error(f"Failed to initialize Spider task manager: {err}")
        return -1
```
🤖 Prompt for AI Agents
In
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
around lines 526 to 528, the call to
clp_config.spider_db.load_credentials_from_env() must be wrapped in the same
try/except used earlier for DB credential loading: enclose the
load_credentials_from_env() call in a try block and catch ValueError (or the
same exception type used at lines 509-518), log a clear error with the existing
logger, and terminate the process (use the same exit path as the surrounding
code) so missing Spider env credentials do not raise an unhandled exception.
LinZhihao-723
left a comment
The orchestration part lgtm.
Leave the rest to @junhaoliao
Actionable comments posted: 1
♻️ Duplicate comments (2)
components/clp-package-utils/clp_package_utils/general.py (1)
509-513: Add a `None` check for `spider_db` to prevent `AttributeError`. The function `validate_and_load_spider_db_credentials_file` calls `clp_config.spider_db.load_credentials_from_file()` without verifying that `spider_db` is not `None`. If called when `spider_db` is unconfigured, this will raise an `AttributeError`. Apply this diff to add defensive `None` handling:

```diff
 def validate_and_load_spider_db_credentials_file(
     clp_config: ClpConfig, clp_home: pathlib.Path, generate_default_file: bool
 ):
     validate_credentials_file_path(clp_config, clp_home, generate_default_file)
+    if clp_config.spider_db is None:
+        raise ValueError("Spider DB configuration must be present to load credentials.")
     clp_config.spider_db.load_credentials_from_file(clp_config.credentials_file_path)
```

Alternatively, if callers are expected to check `spider_db` presence before calling, add a docstring clarifying this precondition.

components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1)
269-271: Critical: Parameterize the Spider DB password in the `CREATE USER` statement. The `CREATE USER` statement embeds `spider_db_password` directly in an f-string, which:

- Breaks correctness for passwords containing quotes or special characters
- Bypasses the parameterized SQL practice established in this codebase

Based on learnings, SQL queries should use parameterized queries with placeholders and pass values as a tuple to prevent injection and ensure proper escaping.

Apply this diff to parameterize the password:

```diff
 db_cursor.execute(
-    f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY '{spider_db_password}'"""
+    f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY %s""",
+    (spider_db_password,)
 )
```

Note: The database/user names remain interpolated since they are validated by `_validate_name()`, which restricts them to alphanumeric, underscore, and hyphen characters.
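The effect of parameterisation can be shown without a live database. `build_create_user_stmt` below is a hypothetical helper, not part of the PR; it separates the SQL text from the bound parameters the way the MariaDB/MySQL Python drivers expect (note that a literal `%`, here the wildcard host, must be doubled to `%%` when parameters are supplied under the format paramstyle):

```python
def build_create_user_stmt(user: str, password: str) -> tuple[str, tuple[str, ...]]:
    """Build a CREATE USER statement with the password as a bound %s
    parameter; the user name is assumed pre-validated (alphanumeric,
    underscore, and hyphen only), so it may be interpolated directly."""
    # '%%' is a literal percent sign (the wildcard host) under the
    # driver's format paramstyle; %s is the placeholder for the password.
    sql = f"CREATE USER IF NOT EXISTS '{user}'@'%%' IDENTIFIED BY %s"
    return sql, (password,)
```

A cursor call would then be `db_cursor.execute(sql, params)`, letting the driver escape a password such as `pa'ss` correctly.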
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (10)
- components/clp-package-utils/clp_package_utils/controller.py (7 hunks)
- components/clp-package-utils/clp_package_utils/general.py (3 hunks)
- components/clp-py-utils/clp_py_utils/clp_config.py (8 hunks)
- components/clp-py-utils/clp_py_utils/create-db-tables.py (2 hunks)
- components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1 hunks)
- components/clp-py-utils/clp_py_utils/sql_adapter.py (2 hunks)
- components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1 hunks)
- components/package-template/src/etc/clp-config.yaml (1 hunks)
- components/package-template/src/etc/credentials.template.yaml (1 hunks)
- tools/deployment/package/docker-compose-all.yaml (5 hunks)
🧰 Additional context used
🧠 Learnings (23)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-11-28T15:12:53.506Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1655
File: components/clp-package-utils/clp_package_utils/controller.py:183-189
Timestamp: 2025-11-28T15:12:53.506Z
Learning: In the y-scope/clp project, root database credentials (root username and password) are required configuration items for all deployments. The `credentials[ClpDbUserType.ROOT]` entry is guaranteed to exist and can be accessed directly without optional handling.
Applied to files:
components/clp-py-utils/clp_py_utils/sql_adapter.py
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.
Applied to files:
components/clp-py-utils/clp_py_utils/sql_adapter.py
components/clp-package-utils/clp_package_utils/controller.py
tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-27T07:07:37.901Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.py
components/clp-package-utils/clp_package_utils/controller.py
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-04-17T16:55:23.796Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:23.796Z
Learning: In the CLP project, SQL queries should use parameterized queries with placeholders (%s) and pass values as a tuple to `db_cursor.execute()` to prevent SQL injection, rather than directly interpolating values into the query string.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-01-23T17:08:55.566Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-08-17T16:10:38.722Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Applied to files:
components/clp-package-utils/clp_package_utils/controller.py
components/clp-py-utils/clp_py_utils/clp_config.py
components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
Applied to files:
components/clp-package-utils/clp_package_utils/controller.py
components/clp-py-utils/clp_py_utils/clp_config.py
tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:02:31.113Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-25T06:29:59.610Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:14:12.225Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1466
File: .github/workflows/clp-rust-checks.yaml:14-15
Timestamp: 2025-10-22T21:14:12.225Z
Learning: Repository y-scope/clp: In GitHub Actions workflows (e.g., .github/workflows/clp-rust-checks.yaml), YAML anchors/aliases are acceptable and preferred to avoid duplication; if actionlint flags an alias node (e.g., on push.paths) as an error, treat it as a tool limitation and do not require inlining unless the team asks to silence the warning.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.
Applied to files:
tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-10-20T21:05:30.417Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1413
File: tools/docker-images/clp-package/Dockerfile:22-24
Timestamp: 2025-10-20T21:05:30.417Z
Learning: In the clp repository's Dockerfiles, ENV directives should be consolidated into multi-line ENV statements when possible to reduce image layers. ENV statements should only be split into separate commands when consolidation is not possible due to dependencies (e.g., when later variables must reference earlier ones that need to be set first, or when PATH must be modified sequentially).
Applied to files:
tools/deployment/package/docker-compose-all.yaml
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.
Applied to files:
tools/deployment/package/docker-compose-all.yaml
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/create-db-tables.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
- `ClpConfig` (774-1050)
- `load_credentials_from_env` (326-345)
- `load_credentials_from_env` (372-377)
- `load_credentials_from_env` (459-463)
- `load_credentials_from_env` (524-529)

components/clp-py-utils/clp_py_utils/core.py (1)
- `read_yaml_config_file` (58-64)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: rust-checks
- GitHub Check: uv-checks (ubuntu-24.04)
🔇 Additional comments (21)
components/package-template/src/etc/credentials.template.yaml (1)
16-20: Template addition for Spider DB credentials looks good. The new commented section follows the established pattern and remains inactive (all entries commented). Ensure the example username and password are consistent with the setup scripts and initialization tooling mentioned in the PR objectives.
Please verify that the example credentials "spider-user" and "pass" match what is used in the Spider DB setup scripts added elsewhere in this PR.
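For context, the kind of credential generation these template entries pair with can be sketched with the standard library; the helper name, length default, and alphabet are illustrative assumptions, not the package's actual implementation:

```python
import secrets
import string


def generate_credential(length: int = 8) -> str:
    """Generate a random alphanumeric secret of the given length using a
    cryptographically secure source (secrets, not random)."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
```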
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
618-623: LGTM! Consistent serialization in error path. The addition of `.model_dump()` ensures that both the error path and the success path (line 676) return a serialized dictionary, which aligns with the PR's objective of switching task payloads to raw bytes.

tools/deployment/package/docker-compose-all.yaml (4)
42-45: ✅ Volume anchor for `clp-tmp` is well-structured. The new volume definition follows the established pattern and eliminates repetition across services.
96-97: ✅ db-table-creator environment variables correctly include Spider credentials. The `SPIDER_DB_PASS` and `SPIDER_DB_USER` variables enable proper Spider database initialization during setup.
227-248: ✅ spider-scheduler service is correctly configured with proper dependencies and port exposure. The service properly depends on db-table-creator and exposes port 6000. The JDBC storage URL is correctly constructed with the required credentials.
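The JDBC storage URL mentioned above can be assembled from the same host/port/database/credential pieces. The function below is a hedged illustration; the exact query-parameter names the Spider scheduler expects should be checked against its documentation:

```python
def build_spider_jdbc_url(
    host: str, port: int, database: str, user: str, password: str
) -> str:
    """Assemble a MariaDB JDBC URL with credentials passed as query
    parameters; values are assumed URL-safe here."""
    return f"jdbc:mariadb://{host}:{port}/{database}?user={user}&password={password}"
```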
299-299: I apologize, but I'm unable to complete the verification of this review comment due to persistent access issues in the sandbox environment. Despite multiple attempts using different approaches (direct shell access, GitHub CLI, web search), I cannot:
- Access the docker-compose-all.yaml file to verify the claimed duplicate hostname at lines 284 and 319
- Examine the start-spider-worker.py script to understand how scheduler connection parameters are handled (via CLI flags, environment variables, or defaults)
- Validate the specific claims about missing `--host` and `--port` arguments
- Confirm the proposed diff is accurate and complete
Without access to the actual codebase, I cannot determine whether:
- The duplicate hostname issue actually exists
- The spider-compression-worker service truly lacks scheduler connection parameters
- The suggested fix is complete and addresses the root cause
- There are any undocumented defaults or environment variable mechanisms that make the explicit flags unnecessary
Recommendation: Please verify this review comment manually by:
- Checking lines 284 and 319 of `tools/deployment/package/docker-compose-all.yaml` for the hostname values
- Examining `start-spider-worker.py` to understand its parameter expectations
- Comparing the spider-scheduler and spider-compression-worker service definitions for consistency
components/package-template/src/etc/clp-config.yaml (1)
29-38: LGTM! Spider configuration scaffolding is complete. The commented configuration blocks for `compression_scheduler.type`, `spider_db`, and `spider_scheduler` correctly match the corresponding model definitions in `clp_config.py`. The port field for `spider_scheduler` now correctly defaults to 6000, aligning with `SpiderScheduler.DEFAULT_PORT`.

components/clp-py-utils/clp_py_utils/sql_adapter.py (1)
73-73: LGTM! Return type annotations improved for better type safety. The updated return type annotations correctly reflect the actual return types from `mysql.connector.connect()` and `mariadb.connect()`. This aligns with the MariaDB driver version bump (>=1.1.14) in this PR.

Also applies to: 154-154

components/clp-py-utils/clp_py_utils/initialize-spider-db.py (2)
245-249: Good: Root connection approach is correct. The script correctly uses `SqlAdapter(clp_config.database)` with `ClpDbUserType.ROOT` to obtain root privileges for database/user creation. This is the expected pattern since Spider DB runs on the same MariaDB instance as the CLP database, sharing root credentials.
26-220: LGTM! Spider database schema is comprehensive. The DDL statements define a complete schema for Spider task orchestration including:

- Core tables: `drivers`, `schedulers`, `jobs`, `tasks`
- Relationship tables: `input_tasks`, `output_tasks`, `task_dependencies`
- Data storage: `data`, `task_inputs`, `task_outputs`
- Operational tables: `task_instances`, `scheduler_leases`, `data_locality`
- KV storage: `client_kv_data`, `task_kv_data`

Foreign key constraints with appropriate `ON DELETE CASCADE` behaviour ensure referential integrity.

components/clp-py-utils/clp_py_utils/create-db-tables.py (1)
53-68: LGTM! Conditional Spider DB initialization is correctly implemented. The logic properly:

- Loads CLP configuration and credentials
- Checks if `spider_db` is configured before attempting Spider DB initialization
- Logs an informative message and exits successfully when Spider DB is not configured
- Only runs `initialize-spider-db` when Spider DB configuration is present

The broad `Exception` catch at line 59 is consistent with existing patterns in this file and provides adequate error reporting for configuration loading failures.

components/clp-package-utils/clp_package_utils/general.py (1)

463-463: LGTM! Spider DB credentials generation follows the established pattern. The Spider DB entry correctly mirrors the existing credential generation pattern with a distinct username (`spider-user`) and an appropriately sized password (8 characters, consistent with the database user).

components/clp-package-utils/clp_package_utils/controller.py (4)
330-350: LGTM! Spider DB environment setup is correctly implemented. The `_set_up_env_for_spider_db` method properly:

- Uses the `SPIDER_DB_COMPONENT_NAME` constant for logging
- Sets `SPIDER_DB_NAME` from configuration
- Uses the imported environment variable name constants for credentials

The credentials access pattern (`spider_db.credentials.username`/`password`) correctly matches the `SpiderDb` model's `credentials: DbUserCredentials` field.
352-369: LGTM! Spider scheduler environment setup follows established patterns. The `_set_up_env_for_spider_scheduler` method correctly:

- Uses `SPIDER_SCHEDULER_COMPONENT_NAME` for logging
- Resolves the hostname to an IP using `_get_ip_from_hostname`
- Sets both `SPIDER_SCHEDULER_HOST` and `SPIDER_SCHEDULER_PORT`
916-922: LGTM! Conditional environment setup prevents errors when components are unconfigured. The conditional checks properly guard against `AttributeError` when `queue`, `redis`, or Spider components are not configured:

- Queue/Redis are now `Optional` and only set up when present
- Spider components are only set up when `OrchestrationType.spider` is active
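The guard pattern in the 916-922 comment amounts to checking each optional component for `None` before touching its attributes. A minimal sketch, with placeholder attribute and environment-variable names rather than the controller's real ones:

```python
def set_up_optional_env(clp_config, env: dict) -> dict:
    """Populate env vars only for components that are configured,
    avoiding AttributeError on None optional fields."""
    if clp_config.queue is not None:
        env["QUEUE_HOST"] = clp_config.queue.host
    if clp_config.redis is not None:
        env["REDIS_HOST"] = clp_config.redis.host
    if clp_config.spider_db is not None:
        env["SPIDER_DB_NAME"] = clp_config.spider_db.name
    return env
```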
1010-1016: LGTM! Docker Compose file selection correctly handles Spider orchestration. The logic properly selects the appropriate Docker Compose file based on deployment type and orchestration:

- BASE + Spider → `docker-compose-spider-base.yaml`
- BASE + Celery → `docker-compose.base.yaml`
- FULL + Spider → `docker-compose-spider.yaml`
- FULL + Celery → `docker-compose.yaml`

components/clp-py-utils/clp_py_utils/clp_config.py (5)
352-394: `SpiderDb` implementation is well-structured with proper credential separation. The `SpiderDb` class correctly:

- Extends `Database` with Spider-specific defaults
- Enforces MariaDB via `validate_type` with a proper return value
- Uses a separate `DbUserCredentials` instead of the dict-based credentials
- Implements dedicated `load_credentials_from_env`/`load_credentials_from_file` methods reading Spider-specific env vars and config keys
396-405: LGTM! `SpiderScheduler` follows established patterns. The `SpiderScheduler` class correctly implements:

- a `DEFAULT_PORT` `ClassVar` for consistency
- the `DomainStr` type for the host
- a `transform_for_container()` method for container orchestration
997-1021: LGTM! Orchestration validators enforce correct configuration combinations. The validators properly ensure:

- Spider orchestration requires both `spider_db` and `spider_scheduler`, plus MariaDB for the main database
- Celery orchestration requires both `queue` and `redis`

This prevents runtime errors from misconfigured orchestration setups.
978-983: LGTM! `dump_to_primitive_dict` correctly handles optional fields. The updated serialization:

- Includes `spider_db` in the credential-redacting custom fields
- Properly handles `None` values for optional components (`queue`, `redis`, `spider_db`)

This prevents credential leakage while supporting the new optional-field patterns.
1038-1045: LGTM! Container transformation properly guards optional components. The conditional calls to `transform_for_container()` prevent `AttributeError` when optional components (`queue`, `redis`, `spider_db`, `spider_scheduler`) are `None`.
Actionable comments posted: 2
♻️ Duplicate comments (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)
456-466: Handle Spider DB credential / task-manager initialisation errors gracefully. The Spider branch currently calls `clp_config.spider_db.load_credentials_from_env()` and `SpiderTaskManager(...)` without protection. If Spider DB env vars are missing or `get_container_url()` fails, the scheduler will crash with an uncaught exception instead of logging and exiting cleanly, unlike the earlier DB-credential path.

Wrap the Spider initialisation in a `try`/`except` and return `-1` on failure, e.g.:

```diff
-    task_manager: CeleryTaskManager | SpiderTaskManager
-    if clp_config.compression_scheduler.type == OrchestrationType.celery:
-        task_manager = CeleryTaskManager()
-    elif clp_config.compression_scheduler.type == OrchestrationType.spider:
-        clp_config.spider_db.load_credentials_from_env()
-        task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
-    else:
-        logger.error(
-            f"Unsupported compression scheduler type: {clp_config.compression_scheduler.type}"
-        )
-        return -1
+    task_manager: CeleryTaskManager | SpiderTaskManager
+    if clp_config.compression_scheduler.type == OrchestrationType.celery:
+        task_manager = CeleryTaskManager()
+    elif clp_config.compression_scheduler.type == OrchestrationType.spider:
+        try:
+            clp_config.spider_db.load_credentials_from_env()
+            task_manager = SpiderTaskManager(clp_config.spider_db.get_container_url())
+        except Exception:
+            logger.exception("Failed to initialise Spider task manager.")
+            return -1
+    else:
+        logger.error(
+            "Unsupported compression scheduler type: %s",
+            clp_config.compression_scheduler.type,
+        )
+        return -1
```

This keeps Spider initialisation behaviour aligned with the existing configuration/DB error handling and avoids uncaught `ValueError`s on missing Spider env vars.

components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1)
247-279: Parameterise the Spider DB password in the `CREATE USER` DDL. `spider_db_password` is interpolated directly into the `CREATE USER` statement:

```python
db_cursor.execute(
    f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY '{spider_db_password}'"""
)
```

Now that password validation allows arbitrary characters, this is brittle (breaks on quotes) and bypasses the project's practice of using parameterised SQL for values. Identifiers here are already validated via `_validate_name`, so the remaining risk is the password.

Use a `%s` placeholder for the password and let the driver escape it:

```diff
-    db_cursor.execute(f"""CREATE DATABASE IF NOT EXISTS `{spider_db_name}`""")
+    db_cursor.execute(f"CREATE DATABASE IF NOT EXISTS `{spider_db_name}`")
     if spider_db_password is None:
         logger.error("Password must be set for Spider database user.")
         return -1
     db_cursor.execute(
-        f"""CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY '{spider_db_password}'"""
+        f"CREATE USER IF NOT EXISTS '{spider_db_user}'@'%' IDENTIFIED BY %s",
+        (spider_db_password,),
     )
```

This keeps identifier handling as-is while ensuring passwords with special characters work correctly and follows the repo's parameterisation guidelines. Based on learnings.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
- components/clp-package-utils/clp_package_utils/controller.py (7 hunks)
- components/clp-py-utils/clp_py_utils/clp_config.py (8 hunks)
- components/clp-py-utils/clp_py_utils/initialize-spider-db.py (1 hunks)
- components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3 hunks)
- components/package-template/src/etc/clp-config.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (20)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-08T06:59:42.436Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR #1152 discussion.
Applied to files:
- components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
- components/package-template/src/etc/clp-config.yaml
- components/clp-py-utils/clp_py_utils/clp_config.py
- components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Applied to files:
- components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
- components/package-template/src/etc/clp-config.yaml
📚 Learning: 2025-09-19T18:28:26.747Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 1169
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:463-469
Timestamp: 2025-09-19T18:28:26.747Z
Learning: In the compression scheduler (components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py), the SIGTERM handling uses a coarse-grained approach that only checks received_sigterm before calling search_and_schedule_new_tasks. The maintainers consider the race condition where SIGTERM arrives mid-execution to be benign, as scheduling a few extra tasks during shutdown is acceptable and will be handled by cleanup mechanisms.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
📚 Learning: 2025-10-27T07:07:37.901Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.
Applied to files:
- components/clp-py-utils/clp_py_utils/initialize-spider-db.py
- components/clp-py-utils/clp_py_utils/clp_config.py
- components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-04-17T16:55:23.796Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 831
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:0-0
Timestamp: 2025-04-17T16:55:23.796Z
Learning: In the CLP project, SQL queries should use parameterized queries with placeholders (%s) and pass values as a tuple to `db_cursor.execute()` to prevent SQL injection, rather than directly interpolating values into the query string.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.py
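A minimal, self-contained illustration of that pattern, using the stdlib `sqlite3` driver and its `?` placeholders in place of the MariaDB connector's `%s` (the table and values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (name TEXT, pw TEXT)")

# Values are bound via placeholders and passed as a tuple, so quotes or
# other special characters in the password cannot break out of the SQL.
password = "p'ass&word"
cur.execute("INSERT INTO users (name, pw) VALUES (?, ?)", ("spider", password))

cur.execute("SELECT pw FROM users WHERE name = ?", ("spider",))
print(cur.fetchone()[0])  # the password round-trips intact
```

The same principle applies with the MariaDB driver: pass the secret as a bound parameter rather than interpolating it into the statement string.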
📚 Learning: 2025-01-23T17:08:55.566Z
Learnt from: gibber9809
Repo: y-scope/clp PR: 672
File: components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:30-38
Timestamp: 2025-01-23T17:08:55.566Z
Learning: SQL identifiers (table names, column names) in the codebase are currently from trusted sources and directly interpolated into queries. However, the team prefers to implement sanitization for robustness, even for trusted inputs.
Applied to files:
components/clp-py-utils/clp_py_utils/initialize-spider-db.py
📚 Learning: 2025-11-10T05:19:56.600Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1575
File: components/clp-py-utils/clp_py_utils/clp_config.py:602-607
Timestamp: 2025-11-10T05:19:56.600Z
Learning: In the y-scope/clp repository, the `ApiServer` class in `components/clp-py-utils/clp_py_utils/clp_config.py` does not need a `transform_for_container()` method because no other containerized service depends on the API server - it's only accessed from the host, so no docker-network communication is expected.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
Applied to files:
- components/clp-py-utils/clp_py_utils/clp_config.py
- components/clp-package-utils/clp_package_utils/controller.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:02:31.113Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-08-25T06:29:59.610Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-22T21:14:12.225Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1466
File: .github/workflows/clp-rust-checks.yaml:14-15
Timestamp: 2025-10-22T21:14:12.225Z
Learning: Repository y-scope/clp: In GitHub Actions workflows (e.g., .github/workflows/clp-rust-checks.yaml), YAML anchors/aliases are acceptable and preferred to avoid duplication; if actionlint flags an alias node (e.g., on push.paths) as an error, treat it as a tool limitation and do not require inlining unless the team asks to silence the warning.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.
Applied to files:
components/clp-package-utils/clp_package_utils/controller.py
🧬 Code graph analysis (2)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
- OrchestrationType (130-132)
- load_credentials_from_env (327-346)
- load_credentials_from_env (382-387)
- load_credentials_from_env (472-476)
- load_credentials_from_env (537-542)

components/job-orchestration/job_orchestration/scheduler/compress/task_manager/spider_task_manager.py (1)

- SpiderTaskManager (14-48)

components/job-orchestration/job_orchestration/scheduler/compress/task_manager/celery_task_manager.py (1)

- CeleryTaskManager (12-27)
components/clp-py-utils/clp_py_utils/initialize-spider-db.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (5)
- ClpDbUserType (189-193)
- load_credentials_from_env (327-346)
- load_credentials_from_env (382-387)
- load_credentials_from_env (472-476)
- load_credentials_from_env (537-542)

components/clp-py-utils/clp_py_utils/core.py (1)

- read_yaml_config_file (58-64)

components/clp-py-utils/clp_py_utils/sql_adapter.py (1)

- create_connection (69-85)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: package-image
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: rust-checks
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: lint-check (macos-15)
- GitHub Check: build-macos (macos-15, true)
🔇 Additional comments (6)
components/package-template/src/etc/clp-config.yaml (2)
32-39: Verify spider_db port configuration to avoid conflicts.

Both the primary `database` (line 23) and the new `spider_db` (line 34) are configured with port `3306` by default. This will cause a conflict if both services are bundled in the same deployment. Confirm whether spider_db should use a different default port, or whether the configuration logic ensures these services run on separate hosts.
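If both services can run on one host, one way to avoid the clash would be to give `spider_db` a non-default port. The snippet below is illustrative only, not taken from the PR's template:

```yaml
database:
  host: "localhost"
  port: 3306

spider_db:
  host: "localhost"
  port: 3307  # illustrative non-default port to avoid colliding with `database`
```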
30-30: Approve compression_scheduler type addition.

The addition of `type: "celery"` to the `compression_scheduler` section aligns with the PR objective of integrating Spider compression orchestration. The configuration template now properly reflects the Celery-based scheduler setup.

components/clp-package-utils/clp_package_utils/controller.py (2)
13-50: Env wiring for queue/redis and Spider components looks consistent

- Using `CLP_QUEUE_*` and `CLP_REDIS_PASS_ENV_VAR_NAME` to expose queue/redis credentials keeps env names centralised in `clp_config` and matches the corresponding `load_credentials_from_env` implementations.
- New `_set_up_env_for_spider_db()` / `_set_up_env_for_spider_scheduler()` helpers follow the same pattern as other components and are only invoked when `compression_scheduler.type == OrchestrationType.spider`, which pairs well with `ClpConfig.validate_spider_config`.
- Guarding `_set_up_env_for_queue()` / `_set_up_env_for_redis()` behind `queue`/`redis` being non-None makes the Optional orchestration config safe.

No functional issues spotted in these changes.
Also applies to: 248-375, 915-928
1015-1022: Docker Compose file selection for Spider orchestration is sane

Deriving `compression_scheduler_type` once and branching on both deployment type and orchestration type:

- Uses `docker-compose-spider-base.yaml` / `docker-compose-spider.yaml` when Spider orchestration is enabled.
- Falls back to the existing base/full compose files for Celery.
The control flow is straightforward and avoids ambiguity between base/full + spider/Celery combinations.
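That branching can be sketched roughly as follows. The Spider compose file names come from the review above; the Celery fallback file names and the `"base"` deployment type string are assumptions for illustration, not taken verbatim from the PR:

```python
def select_compose_file(deployment_type: str, scheduler_type: str) -> str:
    """Pick a compose file from deployment type and orchestration type."""
    if scheduler_type == "spider":
        # Spider orchestration uses dedicated compose files.
        if deployment_type == "base":
            return "docker-compose-spider-base.yaml"
        return "docker-compose-spider.yaml"
    # Celery orchestration falls back to the existing compose files
    # (names assumed here).
    if deployment_type == "base":
        return "docker-compose-base.yaml"
    return "docker-compose.yaml"

print(select_compose_file("full", "spider"))  # docker-compose-spider.yaml
```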
components/clp-py-utils/clp_py_utils/clp_config.py (2)
130-136: Orchestration types, Spider config, and container transforms are wired coherently
`OrchestrationType` and `CompressionScheduler.type` give a clear, enum-backed switch between Celery and Spider orchestration.

`SpiderDb`:

- Enforces MariaDB via a field validator.
- Uses dedicated env var names (`SPIDER_DB_USER_ENV_VAR_NAME`, `SPIDER_DB_PASS_ENV_VAR_NAME`) and credentials loading helpers.
- Is redacted correctly in `ClpConfig.dump_to_primitive_dict()`, matching database/queue/redis behaviour.

`SpiderScheduler`:

- Follows existing patterns (`DEFAULT_PORT`, `DomainStr` host, `transform_for_container()` using `SPIDER_SCHEDULER_COMPONENT_NAME`).

`ClpConfig`:

- Makes `queue` and `redis` optional while `validate_celery_config()` enforces their presence for Celery.
- `validate_spider_config()` enforces `spider_db`/`spider_scheduler` presence and a MariaDB main metadata DB for Spider, which matches the deployment model.
- `transform_for_container()` now guards optional queue/redis and transforms Spider components when present, so container deployments see consistent hostnames.

These pieces fit together cleanly and should make the Celery/Spider split explicit and robust.
Also applies to: 353-415, 787-807, 991-1035, 1036-1058
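The presence checks described above can be sketched with stdlib types only. The real code uses pydantic models and validators; the class and field names below are simplified stand-ins:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class OrchestrationType(str, Enum):
    CELERY = "celery"
    SPIDER = "spider"

@dataclass
class ClpConfigSketch:
    scheduler_type: OrchestrationType
    queue: Optional[str] = None      # stand-in for the queue config object
    redis: Optional[str] = None      # stand-in for the redis config object
    spider_db: Optional[str] = None  # stand-in for the SpiderDb config

    def validate(self) -> None:
        # Celery requires queue + redis; Spider requires spider_db.
        if self.scheduler_type is OrchestrationType.CELERY:
            if self.queue is None or self.redis is None:
                raise ValueError("Celery orchestration requires queue and redis.")
        elif self.spider_db is None:
            raise ValueError("Spider orchestration requires spider_db.")

# A Spider config with spider_db set passes validation.
ClpConfigSketch(OrchestrationType.SPIDER, spider_db="spider-db").validate()
```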
353-381: Based on my verification attempts, I encountered limitations that prevent full confirmation of the review comment.

Web search findings:

- Generic JDBC URL format confirmed: `jdbc:subprotocol:subname`
- Standard JDBC practice: username/password passed as separate parameters (a properties dict) rather than embedded in the URL query string
- No authoritative documentation found for the specific `spider_py` package's Driver class

Codebase access:

- Repository clone failed; unable to verify actual usage patterns, tests, or deployment context for `get_container_url()`
- Cannot confirm whether `spider_py` actually expects credentials in the query string format used here, or whether they're handled through a separate parameters dict

Key insight:

The review comment's concern about URL-encoding is theoretically valid from first principles (special characters like `&`, `?`, and `=` in raw credentials would corrupt a query string). However, the standard JDBC pattern suggests credentials should be passed separately, which would make this entire code path potentially incorrect, not just missing encoding. Without access to spider_py specifics or actual codebase context, I cannot determine:
- Whether this code is actively used
- Whether spider_py expects this exact URL format
- Whether credentials with special characters actually fail in practice
Recommendation:
The original review comment's request for verification against `spider_py.Driver`'s expected URL format remains appropriate and necessary before deciding on a fix approach.
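If credentials do end up embedded in a URL query string, percent-encoding them is the usual safeguard. A quick stdlib illustration with hypothetical credentials (whether `spider_py` accepts this URL shape is exactly what remains unverified above):

```python
from urllib.parse import parse_qs, urlencode

user = "clp-user"
password = "p@ss&word=1?"  # contains characters meaningful in a query string

# urlencode percent-encodes each value, so '&', '=', and '?' in the raw
# password cannot corrupt the query string.
query = urlencode({"user": user, "password": password})
url = f"jdbc:mariadb://spider-db:3306/spider-storage?{query}"

# Parsing the encoded query recovers the original values intact.
parsed = parse_qs(url.split("?", 1)[1])
print(parsed["password"][0])  # p@ss&word=1?
```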
```python
config_path = pathlib.Path(parsed_args.config)
try:
    clp_config = ClpConfig.model_validate(read_yaml_config_file(config_path))
    clp_config.database.load_credentials_from_env()
    clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.CLP)
    clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.ROOT)
    if clp_config.spider_db is None:
        logger.error("Spider database configuration not found in CLP configuration.")
        return -1
    clp_config.spider_db.load_credentials_from_env()
```
🧹 Nitpick | 🔵 Trivial
Avoid redundant CLP DB credential loads
You call clp_config.database.load_credentials_from_env() for the default user and then immediately call it again for ClpDbUserType.CLP and ClpDbUserType.ROOT. The first call is redundant:
```diff
- clp_config = ClpConfig.model_validate(read_yaml_config_file(config_path))
- clp_config.database.load_credentials_from_env()
- clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.CLP)
- clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.ROOT)
+ clp_config = ClpConfig.model_validate(read_yaml_config_file(config_path))
+ clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.CLP)
+ clp_config.database.load_credentials_from_env(user_type=ClpDbUserType.ROOT)
```

Not a correctness issue, but it simplifies the intent.
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/initialize-spider-db.py around lines 228
to 237, the unparameterized call clp_config.database.load_credentials_from_env()
is redundant because you immediately call the same method for ClpDbUserType.CLP
and ClpDbUserType.ROOT; remove the first call and keep the explicit user_type
calls (or replace the three calls with a single loop over the required user
types) so credentials are loaded only once per intended user, preserving the
spider_db null check and its subsequent load_credentials_from_env() call.
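If the per-user-type calls grow further, the same intent can be expressed as a loop. A stand-in sketch — the classes below are illustrative stand-ins, not the real `clp_py_utils` types:

```python
import os
from enum import Enum

class ClpDbUserType(Enum):
    CLP = "CLP"
    ROOT = "ROOT"

class DatabaseConfig:
    """Illustrative stand-in for the real database config object."""

    def __init__(self) -> None:
        self.credentials = {}

    def load_credentials_from_env(self, user_type: ClpDbUserType) -> None:
        # The real code reads user/password env vars per user type;
        # the env var name here is hypothetical.
        env_var = f"CLP_DB_{user_type.value}_PASS"
        self.credentials[user_type] = os.environ.get(env_var, "")

db = DatabaseConfig()
for user_type in (ClpDbUserType.CLP, ClpDbUserType.ROOT):
    db.load_credentials_from_env(user_type=user_type)
```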
```yaml
#spider_scheduler:
#  host: "localhost"
#  port: 6000
```
🧹 Nitpick | 🔵 Trivial
Consider adding logging_level field to spider_scheduler for consistency.
The compression_scheduler (line 29) and query_scheduler (line 46) both include logging_level configuration, but spider_scheduler omits it. For consistency and to allow users to easily configure logging verbosity for the Spider scheduler, consider adding a logging_level field (e.g., logging_level: "INFO").
Apply this diff to add the logging_level field:
```diff
 #spider_scheduler:
 #  host: "localhost"
 #  port: 6000
+#  logging_level: "INFO"
```

🤖 Prompt for AI Agents
In components/package-template/src/etc/clp-config.yaml around lines 37 to 39,
the commented-out spider_scheduler block lacks a logging_level key for
consistency with compression_scheduler and query_scheduler; restore or update
the spider_scheduler block to include a logging_level field (for example
logging_level: "INFO") alongside host and port so users can configure verbosity,
ensuring indentation and quoting match the surrounding entries.
Description
This PR integrates Spider compression orchestration into the docker compose package, driven by `ClpConfig`. Minor cleanup for the unnecessary `dump_to_primitive_type`.

This PR also:

- Updates the `mariadb` python dependency to `>=1.1.14`.
- Updates the `spider-py` version to `0.2.0`.
- Updates `Spider` to the latest version for a fix in starting python workers.

This PR is a breaking change because it updates the dependency versions.
The doc change will be put in #1647.
Checklist
Validation performed
Summary by CodeRabbit
New Features
Improved Behaviour
Chores
Removed