2 changes: 1 addition & 1 deletion .yamllint.yaml
@@ -37,7 +37,7 @@ rules:

   line-length:
     max: 100
-    allow-non-breakable-words: false
+    allow-non-breakable-words: true
Member Author

This has to be set to `true`; otherwise, `clp-config.yaml`, which contains a long URL, would fail the rule.

@coderabbitai Create an issue to track this lint rule change. I will submit the change itself in a separate PR.
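
For context, here is a simplified Python sketch of how a line-length check with an `allow-non-breakable-words` option can behave. It is an approximation written for illustration, not yamllint's actual implementation, and `line_too_long` and its parameters are hypothetical names. The idea is that an over-long line is tolerated when, after indentation and a leading `#` or `- ` marker, the remainder contains no spaces, so there is nowhere to break it.

```python
def line_too_long(line: str, max_len: int = 100, allow_non_breakable_words: bool = True) -> bool:
    """Return True if `line` should be reported as too long (simplified model)."""
    if len(line) <= max_len:
        return False
    if not allow_non_breakable_words:
        return True

    # Skip indentation.
    rest = line.lstrip(" ")
    # Skip a comment marker ("#", "##", ...) or a list-item dash, plus the space after it.
    if rest.startswith("#"):
        rest = rest.lstrip("#")[1:]
    elif rest.startswith("-"):
        rest = rest[2:]

    # With no space left, the line is one unbreakable "word" (e.g., a long URL): tolerate it.
    return " " in rest


# The commented URL line added to clp-config.yaml in this PR.
url_comment = (
    "## https://docs.yscope.com/clp/main/user-docs/guides-external-database.html"
    "#configuring-clp-to-use-external-databases"
)

print(line_too_long(url_comment, allow_non_breakable_words=False))
print(line_too_long(url_comment, allow_non_breakable_words=True))
```

Under this model, the URL comment is reported only when the option is `false`, which matches the reason given above for flipping the setting.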

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create a GitHub issue to track the yamllint rule change
gh issue create \
  --repo y-scope/clp \
  --title "Update yamllint configuration to allow non-breakable words in long lines" \
  --body "## Context

This issue tracks the yamllint configuration change to set \`allow-non-breakable-words: true\` in the \`line-length\` rule. This change is necessary because \`clp-config.yaml\` contains long URLs that would fail the yamllint check with the current setting.

## Background

While working on PR #1681 (support for external third-party services configuration), it was identified that the yamllint \`allow-non-breakable-words\` setting needs to be changed from \`false\` to \`true\` to accommodate long URLs in configuration files.

## Related

- PR: https://github.com/y-scope/clp/pull/1681
- Comment: https://github.com/y-scope/clp/pull/1681#discussion_r2569846542
- Requested by: @junhaoliao

## Proposed Change

In \`.yamllint.yaml\`:
\`\`\`yaml
line-length:
  max: 100
  allow-non-breakable-words: true  # Changed from false
\`\`\`" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao, I've created issue #1683 to track this yamllint configuration change. The issue includes the context about why this change is needed (to accommodate long URLs in clp-config.yaml) and references this PR and comment as backlinks. The issue has been assigned to you for the follow-up PR.


🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1466
File: .github/workflows/clp-rust-checks.yaml:14-15
Timestamp: 2025-10-22T21:14:12.225Z
Learning: Repository y-scope/clp: In GitHub Actions workflows (e.g., .github/workflows/clp-rust-checks.yaml), YAML anchors/aliases are acceptable and preferred to avoid duplication; if actionlint flags an alias node (e.g., on push.paths) as an error, treat it as a tool limitation and do not require inlining unless the team asks to silence the warning.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.


   octal-values:
     forbid-implicit-octal: true
12 changes: 8 additions & 4 deletions components/clp-py-utils/clp_py_utils/clp_config.py
@@ -867,10 +867,14 @@ def transform_for_container(self):
         self.archive_output.storage.transform_for_container()
         self.stream_output.storage.transform_for_container()

-        self.database.transform_for_container()
-        self.queue.transform_for_container()
-        self.redis.transform_for_container()
-        self.results_cache.transform_for_container()
+        if BundledService.DATABASE in self.bundled:
+            self.database.transform_for_container()
+        if BundledService.QUEUE in self.bundled:
+            self.queue.transform_for_container()
+        if BundledService.REDIS in self.bundled:
+            self.redis.transform_for_container()
+        if BundledService.RESULTS_CACHE in self.bundled:
+            self.results_cache.transform_for_container()
         self.query_scheduler.transform_for_container()
         self.reducer.transform_for_container()
         if self.package.query_engine == QueryEngine.PRESTO and self.presto is not None:
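
The gating in this hunk can be sketched in isolation. The following is a minimal, self-contained approximation, not the actual `clp_config.py` code: the `Service` class, the string values given to `BundledService`, and the behavior of `transform_for_container` (rewriting the host to the Compose service name) are assumptions made for illustration. Only the member names and the `in self.bundled` checks come from the diff.

```python
from enum import Enum


class BundledService(str, Enum):
    # Member names are from the diff; the string values are assumed to match
    # the entries of the `bundled` list in clp-config.yaml.
    DATABASE = "database"
    QUEUE = "queue"
    REDIS = "redis"
    RESULTS_CACHE = "results_cache"


class Service:
    """Hypothetical stand-in for a service config section with a host field."""

    def __init__(self, name: str, host: str):
        self.name = name
        self.host = host

    def transform_for_container(self) -> None:
        # Assumption: a bundled service is reached via its Compose service name.
        self.host = self.name


def transform_for_container(services: dict, bundled: list) -> None:
    """Rewrite hosts only for bundled services; external ones keep their configured host."""
    for member in BundledService:
        if member in bundled:
            services[member.value].transform_for_container()


services = {
    "database": Service("database", "db.example.internal"),
    "queue": Service("queue", "localhost"),
    "redis": Service("redis", "localhost"),
    "results_cache": Service("results_cache", "mongo.example.internal"),
}
transform_for_container(services, bundled=["queue", "redis"])
for name, service in services.items():
    print(name, "->", service.host)
```

With `database` and `results_cache` left out of `bundled`, their user-supplied hosts survive the transform, which is the point of the change: an unconditional transform would overwrite the external hosts.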
3 changes: 3 additions & 0 deletions components/package-template/src/etc/clp-config.yaml
@@ -1,4 +1,5 @@
# yaml-language-server: $schema=../usr/share/config-schemas/clp-config.schema.json
#
Contributor

Suggested change
#

#package:
# storage_engine: "clp-s"
# query_engine: "clp-s"
@@ -15,6 +16,8 @@
## File containing credentials for services
#credentials_file_path: "etc/credentials.yaml"
#
## Remove any bundled services below if you wish to use your own. For more information, see
## https://docs.yscope.com/clp/main/user-docs/guides-external-database.html#configuring-clp-to-use-external-databases
#bundled: ["database", "queue", "redis", "results_cache"]
Contributor

To stay consistent with the docs:

Suggested change
-#bundled: ["database", "queue", "redis", "results_cache"]
+#bundled:
+#  - "database"
+#  - "queue"
+#  - "redis"
+#  - "results_cache"

#
#database:
59 changes: 39 additions & 20 deletions docs/src/user-docs/guides-external-database.md
Contributor

You might need to run the linter on this file:

`npx markdownlint-cli2 --config <root_dir>/tools/yscope-dev-utils/exports/lint-configs/.markdownlint-cli2.yaml guides-external-database.md --fix`

@@ -173,28 +173,47 @@ When using AWS DocumentDB or MongoDB Atlas:

 ## Configuring CLP to use external databases

-After setting up your external databases, configure CLP to use them by editing `etc/clp-config.yaml`:
-
-```yaml
-database:
-  host: "<mariadb-hostname-or-ip>"
-  port: 3306
-  name: "clp-db"
-  # Credentials will be set in etc/credentials.yaml
-
-results_cache:
-  host: "<mongodb-hostname-or-ip>"
-  port: 27017
-  name: "clp-query-results"
-```
+After setting up your external databases, configure CLP to use them:

-Set the credentials in `etc/credentials.yaml`:
+1. Edit `etc/clp-config.yaml` to specify which services are bundled (managed by the `clp-package`
+   Docker Compose project):

-```yaml
-database:
-  username: "clp-user"
-  password: "<your-mariadb-password>"
-```
+   ```yaml
+   # Remove "database" and "results_cache" from this list to use external instances
+   bundled:
+     # - "database"
+     - "queue"
+     - "redis"
+     # - "results_cache"
+   ```
+
+2. Configure the connection details for your external databases in `etc/clp-config.yaml`:
+
+   ```yaml
+   database:
+     host: "<mariadb-hostname-or-ip>"
+     port: <mariadb-port>
+
+   results_cache:
+     host: "<mongodb-hostname-or-ip>"
+     port: <mongodb-port>
+   ```
+
+   :::{important}
+   If your external service is running on the host machine, **do not use `127.0.0.1`** as the host.
+   The CLP components run in containers, so `127.0.0.1` would refer to the container itself.
+   Instead, use:
+   - `host.docker.internal` to refer to the host machine.
+   - The host's actual network IP address (e.g., `192.168.1.10`)
+   :::
Comment on lines +202 to +208
Contributor @coderabbitai (bot), Nov 27, 2025

⚠️ Potential issue | 🟠 Major

Host networking guidance needs platform-specific clarification.

The note about 127.0.0.1 and host.docker.internal is helpful, but host.docker.internal is primarily available on Docker Desktop (Mac/Windows). For Linux Docker hosts without Docker Desktop, users need different guidance—typically using the actual host IP or DNS hostname. Consider adding a platform-specific note.

Suggested addition (after line 207):

   - `host.docker.internal` to refer to the host machine.
   - The host's actual network IP address (e.g., `192.168.1.10`)
+
+  ::::{note}
+  **For Linux Docker hosts**: `host.docker.internal` is not available. Use the actual host IP address or hostname instead.
+  ::::

Committable suggestion skipped: line range outside the PR's diff.

Contributor @hoophalab, Nov 27, 2025

Not sure whether this CodeRabbit comment is valid.

Contributor

🧩 Analysis chain

🌐 Web query:

host.docker.internal support Linux Docker 2024 2025

💡 Result:

Short answer: Yes — but with caveats.

  • Docker Desktop (macOS, Windows — and Docker Desktop for Linux) provides host.docker.internal by default so containers can resolve the host localhost. [1][2][6]
  • On plain Docker Engine on Linux you must create the mapping yourself (common options):
    • docker run --add-host=host.docker.internal:host-gateway (or in Compose: extra_hosts: - "host.docker.internal:host-gateway"). The dockerd daemon exposes the special host-gateway value and can be configured with host-gateway-ip. [1][3][5]
    • Or map the host gateway/docker0 IP directly (e.g. 172.17.0.1) in /etc/hosts or via extra_hosts. [4][2]

References

  • Docker docs — dockerd / host-gateway and --add-host examples. [1]
  • Docker docs — host.docker.internal behavior / Networking how‑tos. [2]
  • Docker Engine 20.10 release notes (adds Linux support hooks). [3]
  • Community notes / common workarounds (docker0 / extra_hosts). [4]

If you want, I can show the exact docker run or docker‑compose lines to use for your setup.


@hoophalab The comment is valid. On Docker Desktop (Mac/Windows/Docker Desktop for Linux), host.docker.internal is available by default, but on plain Docker Engine on Linux, you must create the mapping yourself.

For Linux Docker Engine users, they need to use docker run --add-host=host.docker.internal:host-gateway or configure it in Docker Compose with extra_hosts: - "host.docker.internal:host-gateway".

Suggested documentation improvement:

Update the important block in the documentation (around line 203-207) to include platform-specific guidance:

   :::{important}
   If your external service is running on the host machine, **do not use `127.0.0.1`** as the host.
   The CLP components run in containers, so `127.0.0.1` would refer to the container itself.
   Instead, use:
   - `host.docker.internal` to refer to the host machine (available by default on Docker Desktop)
   - The host's actual network IP address (e.g., `192.168.1.10`)

   **For Linux Docker Engine users** (not Docker Desktop): `host.docker.internal` requires manual
   configuration. Add the following to your `docker-compose-all.yaml` services that need to
   connect to the host:
   ```yaml
   extra_hosts:
     - "host.docker.internal:host-gateway"
   ```

   Alternatively, use the host's actual IP address instead.
   :::


This will help Linux users avoid confusion and provide them with a clear path forward.

---

<details>
<summary>🧠 Learnings used</summary>

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).


</details>



Member Author

I think this requires more thought and investigation. Let's put this review on hold.

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!


+3. Set the credentials in `etc/credentials.yaml`:
+
+   ```yaml
+   database:
+     username: "clp-user"
+     password: "<your-mariadb-password>"
+   ```

:::{note}
When using external databases in a multi-host deployment, you do **not** need to start the
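
The `127.0.0.1` warning in the guide lends itself to a pre-flight check. The sketch below is hypothetical and not part of CLP: it assumes `clp-config.yaml` has already been parsed into a plain `dict`, that an absent `bundled` key means all four services are bundled (as the commented default in the template suggests), and the function names are made up. It flags external services whose `host` is a loopback address, which would resolve to the container itself.

```python
import ipaddress

ALL_SERVICES = ("database", "queue", "redis", "results_cache")


def is_loopback(host: str) -> bool:
    """Return True for `localhost` or a loopback IP literal such as 127.0.0.1 or ::1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (e.g., a DNS name such as host.docker.internal).
        return False


def find_loopback_external_hosts(config: dict) -> list:
    """List external (non-bundled) services configured with a loopback host."""
    bundled = config.get("bundled", list(ALL_SERVICES))
    problems = []
    for name in ALL_SERVICES:
        if name in bundled:
            continue
        host = config.get(name, {}).get("host", "")
        if is_loopback(host):
            problems.append(name)
    return problems


config = {
    "bundled": ["queue", "redis"],
    "database": {"host": "127.0.0.1", "port": 3306},
    "results_cache": {"host": "192.168.1.10", "port": 27017},
}
print(find_loopback_external_hosts(config))  # ['database']
```

A check like this only catches loopback literals. Whether `host.docker.internal` resolves still depends on the Docker setup, as discussed in the review thread above.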
43 changes: 13 additions & 30 deletions docs/src/user-docs/guides-multi-host.md
@@ -37,9 +37,7 @@ In a multi-host cluster:
 To configure CLP for multi-host deployment, you'll need to:

 1. [configure and run CLP's environment setup scripts](#clp-environment-setup).
-2. [update CLP's *generated* configuration to support a multi-host deployment](
-   #updating-clps-generated-configuration).
-3. [distribute and configure the CLP package on all hosts in your cluster](
+2. [distribute and configure the CLP package on all hosts in your cluster](
    #distributing-the-set-up-package).

 ### CLP environment setup
@@ -54,6 +52,18 @@ To configure CLP for multi-host deployment, you'll need to:
 3. Edit CLP's configuration file:

    * Open `etc/clp-config.yaml`.
+   * Configure which services should be bundled (managed by the `clp-package` Docker Compose
+     project) vs. external.
+
+     ```yaml
+     bundled:
+       # Remove services you want to run on specific hosts or use external instances
+       - database # Remove if running on a dedicated host or using external MySQL-compatible DB
+       - queue # Remove if running on a dedicated host or using external RabbitMQ
+       - redis # Remove if running on a dedicated host or using external Redis
+       - results_cache # Remove if running on a dedicated host or using external MongoDB
+     ```
+
    * For each service, set the `host` and `port` fields to the actual hostname/IP and port where you
      plan to run the specific service.
    * When using local filesystem storage (i.e., not S3), set `logs_input.storage.directory`,
@@ -74,33 +84,6 @@ To configure CLP for multi-host deployment, you'll need to:
    * Create `var/log/.clp-config.yaml` (the container-specific configuration file)
    * Create `var/www/webui/server/dist/settings.json` (the `webui` server's configuration file)

-### Updating CLP's generated configuration
-
-The last step in the previous section (`sbin/start-clp.sh --setup-only`) will generate any necessary
-configuration files, but they're unsuitable for use across multiple hosts (they're designed for use
-on a single host).
-
-:::{note}
-As mentioned at the beginning of this guide, this setup will be made simpler in a future release.
-:::
-
-To update the generated configuration files for use across multiple hosts:
-
-1. Edit `var/log/.clp-config.yaml`:
-
-   * Update all `host` fields to use the actual hostname or IP address where each service will run
-     (matching what you configured in `etc/clp-config.yaml`).
-   * Similarly, update any `port` fields.
-   * For example, if your database runs on `192.168.1.10:3306`, ensure `database.host` is set to
-     `192.168.1.10` and `database.port` is `3306`.
-
-2. Edit `var/www/webui/server/dist/settings.json`:
-
-   * Update `SqlDbHost` to the actual hostname or IP address of your database service.
-   * Update `SqlDbPort` if you changed the database port.
-   * Update `MongoDbHost` to the actual hostname or IP address of your results cache service.
-   * Update `MongoDbPort` if you changed the results cache port.
-
 ### Distributing the set-up package

 With the package set up, we can now distribute it to all hosts in the cluster:
6 changes: 4 additions & 2 deletions docs/src/user-docs/quick-start/clp-json.md
@@ -18,8 +18,10 @@ sbin/start-clp.sh
 ```

 :::{tip}
-To validate configuration and prepare directories without launching services, add the
-`--setup-only` flag (e.g., `sbin/start-clp.sh --setup-only`).
+To validate configuration and prepare directories without launching services, add the `--setup-only`
+flag (e.g., `sbin/start-clp.sh --setup-only`). To use external databases or other third-party
+services instead of bundled services, see the
+[external database guide](../guides-external-database.md).
 :::

 :::{note}
6 changes: 4 additions & 2 deletions docs/src/user-docs/quick-start/clp-text.md
@@ -20,8 +20,10 @@ sbin/start-clp.sh
 ```

 :::{tip}
-To validate configuration and prepare directories without launching services, add the
-`--setup-only` flag (e.g., `sbin/start-clp.sh --setup-only`).
+To validate configuration and prepare directories without launching services, add the `--setup-only`
+flag (e.g., `sbin/start-clp.sh --setup-only`). To use external databases or other third-party
+services instead of bundled services, see the
+[external database guide](../guides-external-database.md).
 :::

 :::{note}
27 changes: 18 additions & 9 deletions tools/deployment/package/docker-compose-all.yaml
@@ -212,7 +212,8 @@ services:
       "python3",
       "-u",
       "-m", "clp_py_utils.initialize-results-cache",
-      "--uri", "mongodb://results_cache:27017/${CLP_RESULTS_CACHE_DB_NAME:-clp-query-results}",
+      "--uri", "mongodb://${CLP_RESULTS_CACHE_HOST:-results-cache}:${CLP_RESULTS_CACHE_PORT:-27017}\
+        /${CLP_RESULTS_CACHE_DB_NAME:-clp-query-results}",
       "--stream-collection", "${CLP_RESULTS_CACHE_STREAM_COLLECTION_NAME:-stream-files}",
     ]

@@ -222,13 +223,15 @@ services:
     stop_grace_period: "300s"
     environment:
       BROKER_URL: "amqp://${CLP_QUEUE_USER:?Please set a value.}\
-        :${CLP_QUEUE_PASS:?Please set a value.}@queue:5672"
+        :${CLP_QUEUE_PASS:?Please set a value.}@${CLP_QUEUE_HOST:-queue}\
+        :${CLP_QUEUE_PORT:-5672}"
       CLP_DB_PASS: "${CLP_DB_PASS:?Please set a value.}"
       CLP_DB_USER: "${CLP_DB_USER:?Please set a value.}"
       CLP_LOGGING_LEVEL: "${CLP_COMPRESSION_SCHEDULER_LOGGING_LEVEL:-INFO}"
       CLP_LOGS_DIR: "/var/log/compression_scheduler"
       PYTHONPATH: "/opt/clp/lib/python3/site-packages"
-      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}@redis:6379\
+      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}\
+        @${CLP_REDIS_HOST:-redis}:${CLP_REDIS_PORT:-6379}\
         /${CLP_REDIS_BACKEND_DB_COMPRESSION:-1}"
     volumes:
       - *volume_clp_config_readonly
@@ -254,14 +257,16 @@ services:
     hostname: "compression_worker"
     environment:
       BROKER_URL: "amqp://${CLP_QUEUE_USER:?Please set a value.}\
-        :${CLP_QUEUE_PASS:?Please set a value.}@queue:5672"
+        :${CLP_QUEUE_PASS:?Please set a value.}@${CLP_QUEUE_HOST:-queue}\
+        :${CLP_QUEUE_PORT:-5672}"
       CLP_CONFIG_PATH: "/etc/clp-config.yaml"
       CLP_HOME: "/opt/clp"
       CLP_LOGGING_LEVEL: "${CLP_COMPRESSION_WORKER_LOGGING_LEVEL:-INFO}"
       CLP_LOGS_DIR: "/var/log/compression_worker"
       CLP_WORKER_LOG_PATH: "/var/log/compression_worker/worker.log"
       PYTHONPATH: "/opt/clp/lib/python3/site-packages"
-      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}@redis:6379\
+      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}\
+        @${CLP_REDIS_HOST:-redis}:${CLP_REDIS_PORT:-6379}\
         /${CLP_REDIS_BACKEND_DB_COMPRESSION:-1}"
     volumes:
       - *volume_clp_config_readonly
@@ -369,13 +374,15 @@ services:
     stop_grace_period: "10s"
     environment:
       BROKER_URL: "amqp://${CLP_QUEUE_USER:?Please set a value.}\
-        :${CLP_QUEUE_PASS:?Please set a value.}@queue:5672"
+        :${CLP_QUEUE_PASS:?Please set a value.}@${CLP_QUEUE_HOST:-queue}\
+        :${CLP_QUEUE_PORT:-5672}"
       CLP_DB_PASS: "${CLP_DB_PASS:?Please set a value.}"
       CLP_DB_USER: "${CLP_DB_USER:?Please set a value.}"
       CLP_LOGGING_LEVEL: "${CLP_QUERY_SCHEDULER_LOGGING_LEVEL:-INFO}"
       CLP_LOGS_DIR: "/var/log/query_scheduler"
       PYTHONPATH: "/opt/clp/lib/python3/site-packages"
-      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}@redis:6379\
+      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}\
+        @${CLP_REDIS_HOST:-redis}:${CLP_REDIS_PORT:-6379}\
         /${CLP_REDIS_BACKEND_DB_QUERY:-0}"
     volumes:
       - *volume_clp_config_readonly
@@ -407,14 +414,16 @@ services:
     hostname: "query_worker"
     environment:
       BROKER_URL: "amqp://${CLP_QUEUE_USER:?Please set a value.}\
-        :${CLP_QUEUE_PASS:?Please set a value.}@queue:5672"
+        :${CLP_QUEUE_PASS:?Please set a value.}@${CLP_QUEUE_HOST:-queue}\
+        :${CLP_QUEUE_PORT:-5672}"
       CLP_CONFIG_PATH: "/etc/clp-config.yaml"
       CLP_HOME: "/opt/clp"
       CLP_LOGGING_LEVEL: "${CLP_QUERY_WORKER_LOGGING_LEVEL:-INFO}"
       CLP_LOGS_DIR: "/var/log/query_worker"
       CLP_WORKER_LOG_PATH: "/var/log/query_worker/worker.log"
       PYTHONPATH: "/opt/clp/lib/python3/site-packages"
-      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}@redis:6379\
+      RESULT_BACKEND: "redis://default:${CLP_REDIS_PASS:?Please set a value.}\
+        @${CLP_REDIS_HOST:-redis}:${CLP_REDIS_PORT:-6379}\
         /${CLP_REDIS_BACKEND_DB_QUERY:-0}"
     volumes:
       - *volume_clp_config_readonly
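
The Compose changes above rely on two interpolation forms: `${VAR:-default}` (use the default when the variable is unset or empty) and `${VAR:?message}` (fail when it is unset or empty). The Python sketch below mimics just those two forms to show how `BROKER_URL` resolves for a bundled versus an external queue. It is an illustration, not Docker Compose's implementation: `interpolate` is a hypothetical helper, nested defaults are not handled, and the credentials and external host are made-up values. How the `CLP_*_HOST` and `CLP_*_PORT` variables get populated is not shown in this diff.

```python
import re

# Matches ${NAME}, ${NAME:-default}, and ${NAME:?message}.
_PATTERN = re.compile(r"\$\{(\w+)(?::([-?])([^}]*))?\}")


def interpolate(template: str, env: dict) -> str:
    """Expand the subset of Compose-style variable references described above."""

    def replace(match):
        name, op, arg = match.group(1), match.group(2), match.group(3)
        value = env.get(name, "")
        if value:
            return value
        if op == "-":
            return arg  # fall back to the default
        if op == "?":
            raise ValueError(f"{name}: {arg}")  # required variable is missing
        return ""

    return _PATTERN.sub(replace, template)


# BROKER_URL from the diff, with the YAML line continuations joined.
BROKER_URL = (
    "amqp://${CLP_QUEUE_USER:?Please set a value.}"
    ":${CLP_QUEUE_PASS:?Please set a value.}"
    "@${CLP_QUEUE_HOST:-queue}:${CLP_QUEUE_PORT:-5672}"
)

creds = {"CLP_QUEUE_USER": "clp-user", "CLP_QUEUE_PASS": "secret"}  # made-up values
bundled_url = interpolate(BROKER_URL, creds)
external_url = interpolate(
    BROKER_URL,
    {**creds, "CLP_QUEUE_HOST": "mq.example.internal", "CLP_QUEUE_PORT": "5673"},
)
print(bundled_url)   # amqp://clp-user:secret@queue:5672
print(external_url)  # amqp://clp-user:secret@mq.example.internal:5673
```

Because the defaults reproduce the previously hard-coded `queue:5672` and `redis:6379`, deployments that set none of the new variables should resolve to the same URLs as before the change.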