
Relay improvements: configurable quota, structured logging, health endpoint, metrics #661

@elkimek

Context

I maintain getbased, a blood work dashboard that uses Evolu for cross-device sync. I self-host the relay at sync.getbased.health using the official Docker image. I've been running it in production for a few days and filed #660 about logging and silent failures after a DB wipe.

Since then, I built a standalone wrapper project (getbased-relay) that wraps @evolu/nodejs with fixes for several operational issues I hit. The project is MIT-licensed and I'm happy to contribute any of these improvements upstream. Sharing them here as structured feedback.

Issues and solutions

1. 1MB default quota is too small for real-world use

The official relay's isOwnerWithinQuota uses maxBytes = 1024 * 1024 (1MB). A single user with 3 profiles and chat data exceeds this quickly because CRDT ops accumulate over time. When the quota is exceeded, the relay returns ProtocolQuotaError, but the client caches it silently — sync just stops with no user feedback.

The 1MB default is hardcoded in apps/relay/src/index.ts with no way to change it without rebuilding the image.

Suggestion: Make the quota configurable via environment variable (e.g., QUOTA_PER_OWNER_MB, defaulting to something more generous like 10MB). Our wrapper does this via src/lib/config.ts — all settings come from env vars with sane defaults.
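Roughly the shape of the change (function and variable names here are illustrative, not the wrapper's actual exports — only the QUOTA_PER_OWNER_MB env var name is from our config):

```typescript
// Sketch: read a megabyte-denominated quota from an environment
// variable, falling back to a default when unset or invalid.
function readMbEnv(name: string, defaultMb: number): number {
  const raw = process.env[name];
  const parsed = raw === undefined ? NaN : Number(raw);
  const mb = Number.isFinite(parsed) && parsed > 0 ? parsed : defaultMb;
  return mb * 1024 * 1024; // convert MB to bytes
}

// 10MB default instead of the hardcoded 1024 * 1024.
const quotaPerOwnerBytes = readMbEnv("QUOTA_PER_OWNER_MB", 10);
```

The relay could then compare `storedBytes` against `quotaPerOwnerBytes` in `isOwnerWithinQuota` instead of the literal.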

2. Logging is all-or-nothing

The relay has two modes: enableLogging: false (zero visibility — only the startup line) or enableLogging: true (dumps raw SQL queries via createRelayLogger, which is extremely noisy). There's no middle ground for operations.

Connection events, subscribe/unsubscribe, errors, broadcasts — all invisible unless you enable the SQL firehose. When sync breaks, there's nothing to diagnose (as described in #660).

Suggestion: The relay logger already emits structured events (connection, close, subscribe, broadcast, storage errors, etc.) via createRelayLogger. The issue is that all of these are gated behind a single enableLogging boolean. A leveled approach would help — e.g., connection lifecycle at info, message details at debug, errors always. Our wrapper implements this with a custom Console that intercepts relay logger calls and emits structured JSON at configurable levels (LOG_LEVEL=info|debug|warn|error). The key trick: we lock console.enabled = true via Object.defineProperty so all events reach our filter, then apply levels ourselves (src/lib/logger.ts).
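A stripped-down sketch of the trick (this is not the wrapper's src/lib/logger.ts verbatim — the event classification is elided, and only LOG_LEVEL and the Object.defineProperty lock are from the actual implementation):

```typescript
type Level = "debug" | "info" | "warn" | "error";
const ORDER: Record<Level, number> = { debug: 0, info: 1, warn: 2, error: 3 };

// Validate LOG_LEVEL, defaulting to "info".
const raw = process.env.LOG_LEVEL;
const threshold: Level =
  raw === "debug" || raw === "info" || raw === "warn" || raw === "error"
    ? raw
    : "info";

function shouldLog(level: Level): boolean {
  return ORDER[level] >= ORDER[threshold];
}

// A Console-like sink whose `enabled` flag is locked to true so every
// relay logger event reaches our filter; mapping individual events to
// levels (connection -> info, messages -> debug, ...) is app-specific
// and omitted here.
function makeRelaySink() {
  const sink = {
    log(...args: unknown[]): void {
      if (shouldLog("debug")) {
        console.log(JSON.stringify({ level: "debug", msg: args }));
      }
    },
  };
  Object.defineProperty(sink, "enabled", { value: true, writable: false });
  return sink as typeof sink & { readonly enabled: boolean };
}
```

Because `enabled` is non-writable, nothing downstream can flip the relay logger back off; the level filter becomes the single switch.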

3. No health check endpoint

The relay's HTTP server has zero request handlers — it only handles WebSocket upgrades. Health check probes (including the Docker image's own HEALTHCHECK, which just does a TCP connect) can't distinguish between "relay is running" and "relay is healthy and accepting sync."

Our client-side checkRelayConnection() tries to open a WebSocket to /ping, which causes WS_ERR_EXPECTED_MASK errors in the relay logs because there's no handler for it.

Suggestion: A simple /health HTTP endpoint on the relay port (or a separate admin port) returning {"status":"ok","uptime":...}. Our wrapper runs a separate admin HTTP server on a configurable port with /health (unauthenticated, for uptime monitors) and /metrics (token-gated). See src/lib/admin-server.ts.
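A minimal sketch of what such an endpoint could look like (ADMIN_PORT is an assumed env var name; the handler shape is illustrative, not our exact admin-server code):

```typescript
import http from "node:http";

// JSON body for the probe: status plus Node process uptime in seconds.
function healthBody(): string {
  return JSON.stringify({ status: "ok", uptime: process.uptime() });
}

const server = http.createServer((req, res) => {
  if (req.url === "/health") {
    res.writeHead(200, { "content-type": "application/json" });
    res.end(healthBody());
  } else {
    res.writeHead(404);
    res.end();
  }
});
// server.listen(Number(process.env.ADMIN_PORT ?? 8081));
```

The Docker HEALTHCHECK could then curl /health instead of doing a bare TCP connect, and client-side probes like our checkRelayConnection() would stop generating WS_ERR_EXPECTED_MASK noise.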

4. No usage metrics

The evolu_usage table tracks per-owner storedBytes, but there's no way to query it without direct DB access. For relay operators, basic questions are unanswerable: how many owners use the relay? How much storage does each use? What's the total DB size? Is the relay approaching disk limits?

Suggestion: Expose read-only metrics via an authenticated endpoint. Our wrapper opens the relay DB in readonly mode via better-sqlite3 and serves per-owner usage, owner count, total stored bytes, and DB file size via /metrics (src/lib/metrics.ts).

5. isOwnerAllowed side-effect: rejects all non-ownerId connections

Providing isOwnerAllowed (even one that always returns true) activates parseOwnerIdFromOwnerWebSocketTransportUrl in the upgrade handler. This rejects every WebSocket connection that doesn't have a valid ownerId in the URL path — including health check probes and monitoring tools — with a 400 Bad Request. The side effect of rejecting all non-owner connections is surprising when all you want is activity tracking.

Suggestion: Allow owner tracking without requiring ownerId in the URL. Our wrapper tracks owners via subscribe events emitted by the relay logger instead of the isOwnerAllowed hook (src/lib/owner-tracker.ts).
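In sketch form (the class and method names are illustrative; the real owner-tracker is fed by the log sink described in section 2):

```typescript
// Track owner activity from relay logger "subscribe" events rather
// than via isOwnerAllowed, so connections without an ownerId in the
// URL (health probes, monitors) are never rejected as a side effect.
class OwnerTracker {
  private lastSeen = new Map<string, number>();

  // Call from the log sink whenever a subscribe event names an owner.
  recordSubscribe(ownerId: string, now: number = Date.now()): void {
    this.lastSeen.set(ownerId, now);
  }

  ownerCount(): number {
    return this.lastSeen.size;
  }

  // Owners active at or after the given timestamp.
  seenSince(cutoff: number): string[] {
    return [...this.lastSeen]
      .filter(([, t]) => t >= cutoff)
      .map(([id]) => id);
  }
}
```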

6. No global disk quota

isOwnerWithinQuota handles per-owner limits, but nothing prevents the total relay storage from filling the disk. If a relay serves many owners, each within their individual quota, the aggregate can still exhaust disk space.

Suggestion: A global quota check alongside the per-owner check. Our wrapper checks total stored bytes against a configurable QUOTA_GLOBAL_MB in the same isOwnerWithinQuota callback (src/lib/quota.ts).
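The combined check is a one-liner in spirit (parameter names are mine; only QUOTA_GLOBAL_MB is the wrapper's actual env var):

```typescript
// An owner fits only if both its own usage and the relay-wide total
// stay under their respective limits; suitable for use inside an
// isOwnerWithinQuota-style callback.
function withinQuotas(
  ownerBytes: number,
  totalBytes: number,
  perOwnerMaxBytes: number,
  globalMaxBytes: number,
): boolean {
  return ownerBytes <= perOwnerMaxBytes && totalBytes <= globalMaxBytes;
}
```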

7. subscribeSyncState / getSyncState commented out

In the Evolu client, both subscribeSyncState and getSyncState are commented out with "TODO: Update it for the owner-api". This means clients can't check whether they're actually syncing. I had to add a 30-second polling safety net because subscribeQuery doesn't reliably fire for remote changes in all cases. Having sync state observability would let clients show meaningful UI (syncing/synced/error/disconnected) instead of guessing.

Not filing this as a separate issue since it's marked as a TODO — just flagging it as something that would significantly help client-side UX.
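For anyone hitting the same gap, the polling safety net is trivial but annoying to need (the refetch callback is app-specific; names are illustrative):

```typescript
// Client-side stopgap until subscribeSyncState/getSyncState return:
// periodically re-run live queries so remote changes surface even when
// subscribeQuery misses them. Returns a function that stops the poll.
function startSyncPoll(
  refetch: () => Promise<void>, // app-specific: re-run live queries
  intervalMs = 30_000,
): () => void {
  const timer = setInterval(() => {
    void refetch();
  }, intervalMs);
  return () => clearInterval(timer);
}
```

With real sync-state subscriptions this whole mechanism (and its wasted queries) would disappear.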

8. Docker file ownership (minor)

The official Dockerfile creates data dirs as the evolu user (UID 1001). If you mount a host volume with different ownership, the relay can't write and fails with SQLITE_READONLY — but there's no error message, sync just silently stops.
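A startup write-probe turns this silent failure into a loud one — a sketch of the kind of integrity check our wrapper runs before handing the data dir to the relay (function name is illustrative):

```typescript
import fs from "node:fs";
import path from "node:path";

// Fail fast with a clear message if the data dir is not writable by
// the current UID, instead of letting SQLite fail silently later with
// SQLITE_READONLY.
function assertWritable(dataDir: string): void {
  const probe = path.join(dataDir, ".write-probe");
  try {
    fs.writeFileSync(probe, "");
    fs.unlinkSync(probe);
  } catch {
    throw new Error(
      `Data dir ${dataDir} is not writable by uid ${process.getuid?.()}; ` +
        `check host volume ownership (the image runs as UID 1001).`,
    );
  }
}
```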

What could be upstreamed

Good candidates for upstream PRs:

  • Configurable quota via env vars (trivial change to apps/relay/src/index.ts)
  • Leveled logging (or at minimum, a LOG_LEVEL env var that filters existing relay logger events)
  • /health endpoint on the relay server
  • Re-enabling subscribeSyncState / getSyncState

Pattern worth documenting:

  • Separate admin server with metrics endpoint
  • Owner activity tracking via sidecar file
  • Global disk quota
  • DB startup integrity checks

Reference implementation

All of the above is implemented in getbased-relay (~800 lines of TypeScript across 8 modules). It's a standalone TypeScript project that installs @evolu/nodejs from npm and wraps createNodeJsRelay — no fork of the Evolu monorepo needed.

Happy to open PRs for any of the upstream-friendly items if that would be useful. Thanks for building Evolu — the local-first CRDT layer is excellent, these are just operational rough edges from running it in production.
