
Relay improvements: configurable quota, structured logging, health endpoint, metrics #661

@elkimek

Context

I maintain getbased, a blood work dashboard that uses Evolu for cross-device sync. I self-host the relay at sync.getbased.health using the official Docker image. I've been running it in production for a few days and filed #660 about logging and silent failures after a DB wipe.

Since then, I built a standalone wrapper project (getbased-relay) that wraps @evolu/nodejs with fixes for several operational issues I hit. The project is MIT-licensed and I'm happy to contribute any of these improvements upstream. Sharing them here as structured feedback.

Issues and solutions

1. 1MB default quota is too small for real-world use

The official relay's isOwnerWithinQuota uses maxBytes = 1024 * 1024 (1MB). A single user with 3 profiles and chat data exceeds this quickly because CRDT ops accumulate over time. When the quota is exceeded, the relay returns ProtocolQuotaError, but the client caches it silently — sync just stops with no user feedback.

The 1MB default is hardcoded in apps/relay/src/index.ts with no way to change it without rebuilding the image.

Suggestion: Make the quota configurable via environment variable (e.g., QUOTA_PER_OWNER_MB, defaulting to something more generous like 10MB). Our wrapper does this via src/lib/config.ts — all settings come from env vars with sane defaults.
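Roughly the shape of the change (function and variable names here are illustrative, not the wrapper's actual exports — only the QUOTA_PER_OWNER_MB env var name is from our config):

```typescript
// Sketch: read a megabyte-denominated quota from an environment
// variable, falling back to a default when unset or invalid.
function readMbEnv(name: string, defaultMb: number): number {
  const raw = process.env[name];
  const parsed = raw === undefined ? NaN : Number(raw);
  const mb = Number.isFinite(parsed) && parsed > 0 ? parsed : defaultMb;
  return mb * 1024 * 1024; // convert MB to bytes
}

// 10MB default instead of the hardcoded 1024 * 1024.
const quotaPerOwnerBytes = readMbEnv("QUOTA_PER_OWNER_MB", 10);
```

The relay could then compare `storedBytes` against `quotaPerOwnerBytes` in `isOwnerWithinQuota` instead of the literal.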

2. Logging is all-or-nothing

The relay has two modes: enableLogging: false (zero visibility — only the startup line) or enableLogging: true (dumps raw SQL queries via createRelayLogger, which is extremely noisy). There's no middle ground for operations.

Connection events, subscribe/unsubscribe, errors, broadcasts — all invisible unless you enable the SQL firehose. When sync breaks, there's nothing to diagnose (as described in #660).

Suggestion: The relay logger already emits structured events (connection, close, subscribe, broadcast, storage errors, etc.) via createRelayLogger. The issue is that all of these are gated behind a single enableLogging boolean. A leveled approach would help — e.g., connection lifecycle at info, message details at debug, errors always. Our wrapper implements this with a custom Console that intercepts relay logger calls and emits structured JSON at configurable levels (LOG_LEVEL=info|debug|warn|error). The key trick: we lock console.enabled = true via Object.defineProperty so all events reach our filter, then apply levels ourselves (src/lib/logger.ts).
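A stripped-down sketch of the trick (this is not the wrapper's src/lib/logger.ts verbatim — the event classification is elided, and only LOG_LEVEL and the Object.defineProperty lock are from the actual implementation):

```typescript
type Level = "debug" | "info" | "warn" | "error";
const ORDER: Record<Level, number> = { debug: 0, info: 1, warn: 2, error: 3 };

// Validate LOG_LEVEL, defaulting to "info".
const raw = process.env.LOG_LEVEL;
const threshold: Level =
  raw === "debug" || raw === "info" || raw === "warn" || raw === "error"
    ? raw
    : "info";

function shouldLog(level: Level): boolean {
  return ORDER[level] >= ORDER[threshold];
}

// A Console-like sink whose `enabled` flag is locked to true so every
// relay logger event reaches our filter; mapping individual events to
// levels (connection -> info, messages -> debug, ...) is app-specific
// and omitted here.
function makeRelaySink() {
  const sink = {
    log(...args: unknown[]): void {
      if (shouldLog("debug")) {
        console.log(JSON.stringify({ level: "debug", msg: args }));
      }
    },
  };
  Object.defineProperty(sink, "enabled", { value: true, writable: false });
  return sink as typeof sink & { readonly enabled: boolean };
}
```

Because `enabled` is non-writable, nothing downstream can flip the relay logger back off; the level filter becomes the single switch.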

3. No health check endpoint

The relay's HTTP server has zero request handlers — it only handles WebSocket upgrades. Health check probes (including the Docker image's own HEALTHCHECK, which just does a TCP connect) can't distinguish between "relay is running" and "relay is healthy and accepting sync."

Our client-side checkRelayConnection() tries to open a WebSocket to /ping, which causes WS_ERR_EXPECTED_MASK errors in the relay logs because there's no handler for it.

Suggestion: A simple /health HTTP endpoint on the relay port (or a separate admin port) returning {"status":"ok","uptime":...}. Our wrapper runs a separate admin HTTP server on a configurable port with /health (unauthenticated, for uptime monitors) and /metrics (token-gated). See src/lib/admin-server.ts.
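A minimal sketch of what such an endpoint could look like (ADMIN_PORT is an assumed env var name; the handler shape is illustrative, not our exact admin-server code):

```typescript
import http from "node:http";

// JSON body for the probe: status plus Node process uptime in seconds.
function healthBody(): string {
  return JSON.stringify({ status: "ok", uptime: process.uptime() });
}

const server = http.createServer((req, res) => {
  if (req.url === "/health") {
    res.writeHead(200, { "content-type": "application/json" });
    res.end(healthBody());
  } else {
    res.writeHead(404);
    res.end();
  }
});
// server.listen(Number(process.env.ADMIN_PORT ?? 8081));
```

The Docker HEALTHCHECK could then curl /health instead of doing a bare TCP connect, and client-side probes like our checkRelayConnection() would stop generating WS_ERR_EXPECTED_MASK noise.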

4. No usage metrics

The evolu_usage table tracks per-owner storedBytes, but there's no way to query it without direct DB access. For relay operators, basic questions are unanswerable: how many owners use the relay? How much storage does each use? What's the total DB size? Is the relay approaching disk limits?

Suggestion: Expose read-only metrics via an authenticated endpoint. Our wrapper opens the relay DB in readonly mode via better-sqlite3 and serves per-owner usage, owner count, total stored bytes, and DB file size via /metrics (src/lib/metrics.ts).

5. isOwnerAllowed side-effect: rejects all non-ownerId connections

Providing isOwnerAllowed (even one that always returns true) activates parseOwnerIdFromOwnerWebSocketTransportUrl in the upgrade handler. This rejects every WebSocket connection that doesn't have a valid ownerId in the URL path — including health check probes and monitoring tools — with a 400 Bad Request. The side effect of rejecting all non-owner connections is surprising when all you want is activity tracking.

Suggestion: Allow owner tracking without requiring ownerId in the URL. Our wrapper tracks owners via subscribe events emitted by the relay logger instead of the isOwnerAllowed hook (src/lib/owner-tracker.ts).
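In sketch form (the class and method names are illustrative; the real owner-tracker is fed by the log sink described in section 2):

```typescript
// Track owner activity from relay logger "subscribe" events rather
// than via isOwnerAllowed, so connections without an ownerId in the
// URL (health probes, monitors) are never rejected as a side effect.
class OwnerTracker {
  private lastSeen = new Map<string, number>();

  // Call from the log sink whenever a subscribe event names an owner.
  recordSubscribe(ownerId: string, now: number = Date.now()): void {
    this.lastSeen.set(ownerId, now);
  }

  ownerCount(): number {
    return this.lastSeen.size;
  }

  // Owners active at or after the given timestamp.
  seenSince(cutoff: number): string[] {
    return [...this.lastSeen]
      .filter(([, t]) => t >= cutoff)
      .map(([id]) => id);
  }
}
```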

6. No global disk quota

isOwnerWithinQuota handles per-owner limits, but nothing prevents the total relay storage from filling the disk. If a relay serves many owners, each within their individual quota, the aggregate can still exhaust disk space.

Suggestion: A global quota check alongside the per-owner check. Our wrapper checks total stored bytes against a configurable QUOTA_GLOBAL_MB in the same isOwnerWithinQuota callback (src/lib/quota.ts).
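The combined check is a one-liner in spirit (parameter names are mine; only QUOTA_GLOBAL_MB is the wrapper's actual env var):

```typescript
// An owner fits only if both its own usage and the relay-wide total
// stay under their respective limits; suitable for use inside an
// isOwnerWithinQuota-style callback.
function withinQuotas(
  ownerBytes: number,
  totalBytes: number,
  perOwnerMaxBytes: number,
  globalMaxBytes: number,
): boolean {
  return ownerBytes <= perOwnerMaxBytes && totalBytes <= globalMaxBytes;
}
```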

7. subscribeSyncState / getSyncState commented out

In the Evolu client, both subscribeSyncState and getSyncState are commented out with "TODO: Update it for the owner-api". This means clients can't check whether they're actually syncing. I had to add a 30-second polling safety net because subscribeQuery doesn't reliably fire for remote changes in all cases. Having sync state observability would let clients show meaningful UI (syncing/synced/error/disconnected) instead of guessing.

Not filing this as a separate issue since it's marked as a TODO — just flagging it as something that would significantly help client-side UX.
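For anyone hitting the same gap, the polling safety net is trivial but annoying to need (the refetch callback is app-specific; names are illustrative):

```typescript
// Client-side stopgap until subscribeSyncState/getSyncState return:
// periodically re-run live queries so remote changes surface even when
// subscribeQuery misses them. Returns a function that stops the poll.
function startSyncPoll(
  refetch: () => Promise<void>, // app-specific: re-run live queries
  intervalMs = 30_000,
): () => void {
  const timer = setInterval(() => {
    void refetch();
  }, intervalMs);
  return () => clearInterval(timer);
}
```

With real sync-state subscriptions this whole mechanism (and its wasted queries) would disappear.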

8. Docker file ownership (minor)

The official Dockerfile creates data dirs as the evolu user (UID 1001). If you mount a host volume with different ownership, the relay can't write and fails with SQLITE_READONLY — but there's no error message, sync just silently stops.
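A startup write-probe turns this silent failure into a loud one — a sketch of the kind of integrity check our wrapper runs before handing the data dir to the relay (function name is illustrative):

```typescript
import fs from "node:fs";
import path from "node:path";

// Fail fast with a clear message if the data dir is not writable by
// the current UID, instead of letting SQLite fail silently later with
// SQLITE_READONLY.
function assertWritable(dataDir: string): void {
  const probe = path.join(dataDir, ".write-probe");
  try {
    fs.writeFileSync(probe, "");
    fs.unlinkSync(probe);
  } catch {
    throw new Error(
      `Data dir ${dataDir} is not writable by uid ${process.getuid?.()}; ` +
        `check host volume ownership (the image runs as UID 1001).`,
    );
  }
}
```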

What could be upstreamed

Good candidates for upstream PRs:

  • Configurable quota via env vars (trivial change to apps/relay/src/index.ts)
  • Leveled logging (or at minimum, a LOG_LEVEL env var that filters existing relay logger events)
  • /health endpoint on the relay server
  • Re-enabling subscribeSyncState / getSyncState

Pattern worth documenting:

  • Separate admin server with metrics endpoint
  • Owner activity tracking via sidecar file
  • Global disk quota
  • DB startup integrity checks

Reference implementation

All of the above is implemented in getbased-relay (~800 lines of TypeScript across 8 modules). It's a standalone TypeScript project that installs @evolu/nodejs from npm and wraps createNodeJsRelay — no fork of the Evolu monorepo needed.

Happy to open PRs for any of the upstream-friendly items if that would be useful. Thanks for building Evolu — the local-first CRDT layer is excellent, these are just operational rough edges from running it in production.
