runr — Process Supervisor for Stateful Container Pods

Process supervisor for stateful container pods, designed for PostgreSQL, ClickHouse, OpenSearch, and other rich data systems where one container hosts a small fleet of cooperating processes. systemd-compatible .service/.timer configuration format; single static Rust binary.


Why runr exists

Running multiple processes in one container is an anti-pattern. Docker's own best-practices guide says it directly: "Each container should have only one concern." For stateless services that rule holds, and runr is not for you. Use a minimal base image, set ENTRYPOINT, ship.

There is one settled exception: stateful systems that are multi-process by design (PostgreSQL, ClickHouse, OpenSearch, MongoDB), together with the helper agents that turn them into a service: metrics exporter, connection pooler, backup agent, scheduled vacuums, log collector. runr was built at Ozon for that exact case (database pods on Kubernetes) and open-sourced for anyone with the same problem. It is not an Ozon-wide platform recommendation; it's a focused tool for the case described below.

Splitting that fleet into sidecars has known costs: per-container cgroup limits cause OOM kills with data loss (Crunchy Data on the Linux Assassin), and sidecars receive SIGKILL before flushing during graceful pod termination (Kubernetes sidecar containers docs). The Kubernetes project's own blog (April 2025) is direct about it: "While the sidecar pattern can be useful in many cases, it is generally not the preferred approach unless the use case justifies it."

In practice, production systems run cooperating processes together in one container under a real init:

  • Spilo (Zalando): Patroni + PostgreSQL bundled in one container, supervised by runit (runsvdir -P /etc/service)
  • CloudNativePG (CNCF Sandbox, donated by EDB): operator-managed PostgreSQL with an in-container supervisor process
  • Crunchy Data PGO (Crunchy Data, acquired by Snowflake in June 2025): operator-based PostgreSQL with HA and backup tooling
  • Phusion baseimage-docker (9.1k★): the general-purpose runit + /sbin/my_init image behind the original "your container needs a real init system" argument

If your workload looks like one of theirs (a database pod, a search cluster, a stateful system with helper agents), runr is built for that case. systemd-compatible .service and .timer files, cgroup v2 pools, integrated syslog server, log rotation, on-demand tasks, journalctl-style inspection, single static binary.

When NOT to use runr

  • One stateless process per container (web servers, API gateways, queue workers, single-binary CLIs), or stock PostgreSQL/ClickHouse/Elasticsearch images without an operational layer around them. The standard Kubernetes model fits; runr only adds complexity.
  • Init-only need: zombie reaping and SIGTERM forwarding for a single child. tini or docker run --init is enough.

If neither matches, and your container looks like a stateful system surrounded by helper agents, keep reading.


Quick start

# 1. Put the binary in the container image (Dockerfile)
COPY runr /usr/bin/runr

# 2. Create a service file
cat > /etc/runr/my-app.service << 'EOF'
[Service]
ExecStart=/usr/bin/my-app --config /etc/my-app/config.toml
Restart=always
RestartSec=5s

[Log]
Directory=/var/log/my-app
Sink=stdout
EOF

# 3. Run as PID 1 (Dockerfile)
ENTRYPOINT ["/usr/bin/runr", "supervisor"]

# Control from inside the container
runr status                        # all services
runr log my-app -f                 # follow logs
runr restart my-app                # restart
runr daemon-reload                 # pick up new/changed .service files

Security: HTTP API exposure

runr listens on 127.0.0.1:8010 by default. The API has no authentication — anyone who can reach the port can start/stop services, run arbitrary commands via background run, and shut down the supervisor.

Inside a standard Kubernetes pod or Docker container with bridge networking, 127.0.0.1 is only reachable from within the container. The risk appears when:

  • docker run --network host — container shares the host network stack, port 8010 is reachable from other hosts
  • --http-listen-api 0.0.0.0:8010 — explicitly binds to all interfaces
  • hostNetwork: true in a Kubernetes pod spec

In these cases, any process on the host (or on the network, if no firewall) can execute commands inside the container through the API.

If you need the API accessible outside the container, restrict access at the network level (firewall rules, NetworkPolicy in Kubernetes, or a reverse proxy with authentication).
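
A deployment script can guard against the explicit-bind case by refusing non-loopback values before they reach --http-listen-api. The following is a minimal sketch; is_loopback_listen is a hypothetical helper, not part of runr. It is a purely syntactic check and cannot detect host-network exposure, where even a loopback bind is reachable from host processes.

```shell
# Hypothetical helper (not part of runr): succeeds if a host:port
# listen address names a loopback host.
is_loopback_listen() {
  host=${1%:*}                      # strip the trailing :port
  case "$host" in
    127.[0-9]*|localhost|"[::1]") return 0 ;;
    *) return 1 ;;
  esac
}

# Example: refuse to start with an exposed API address.
# is_loopback_listen "$LISTEN_ADDR" || { echo "refusing $LISTEN_ADDR" >&2; exit 1; }
```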


Why not the existing tools?

The space is well-explored; runr fills a specific gap.

  • supervisord: Python runtime in the image, no timers, no forking-daemon support (no PIDFile), no cgroup integration. Predates cgroup v2.
  • runit: minimal and reliable, but requires shell-script unit files, no timers, no on-demand tasks, no log inspection CLI.
  • s6 / s6-overlay: same shell-script burden, no calendar timers, no cgroup management, no forking daemons. Excellent as a general-purpose base-image init; lighter on database-specific features.
  • systemd: the right abstraction (declarative .service files, calendar timers, journalctl, restart semantics) but too heavy for a container: D-Bus, large dependency surface.

runr takes systemd's configuration format and lifecycle model and packages them for the container case: single static binary, cgroup v2 pools, integrated syslog server, log rotation, on-demand background tasks, journalctl-style inspection.


Feature comparison

| Feature | runit | s6 / s6-overlay | supervisord | systemd | runr |
|---|---|---|---|---|---|
| Timers / Cron | - | - | - | + | + |
| On-Demand Tasks | - | - | - | + | + |
| Shared Cgroups v2 | - | - | - | + | + |
| Syslog Server + Rotation | - | partial (s6-socklog) | - | + (journald) | + |
| Log Rotation | + (svlogd) | + (s6-log) | - | + (journald) | + |
| Log Inspection (follow/tail) | - | - | partial | + (journalctl) | + |
| Forking Daemons (PIDFile) | - | - | - | + | + |
| Dynamic Reload | - | - | partial | + | + |
| KillMode (control-group/mixed/process/none) | - | - | - | + | + |
| PID 1 Zombie Reaping | + | + | - | + | + |
| Service Dependencies | - | + (s6-rc) | - | + | - |
| HTTP API | - | - | partial (XML-RPC) | - (D-Bus) | + |
| Declarative Config (.service/.timer) | - | - | + (INI) | + | + |
| systemctl/journalctl Compat | - | - | - | native | + |
| Resource Overhead | minimal | minimal | medium | heavy | minimal |
| Container Image Impact | none | ~1MB | Python runtime | systemd + deps | none |

Features

Forking daemons

PostgreSQL, MySQL, and most traditional databases fork on startup. The postmaster forks into the background, writes its PID to a file, and the parent exits. supervisord, runit, and s6 see the parent exit and conclude that the service crashed.

runr tracks the forked daemon via Type=forking + PIDFile. After the parent exits, it reads the PID file, verifies the daemon is alive via /proc/<pid>, and monitors it with PID reuse detection (comparing /proc/<pid>/cmdline snapshots). On daemon exit, restart policy kicks in.

For services that fork without writing PID files, runr tracks the process group (PGID). The service is alive as long as at least one process in the group exists.
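
The PID-file check can be approximated in shell to show the idea. This is an illustration only, not runr's implementation. The function names are hypothetical, and the sketch assumes the PID sits on the first line of the file, as it does in PostgreSQL's postmaster.pid.

```shell
# Print the PID stored in a PID file if that process exists under /proc.
pidfile_alive() {
  [ -r "$1" ] || return 1
  pid=$(head -n 1 "$1" | tr -d '[:space:]')   # first line holds the PID
  case "$pid" in
    ''|*[!0-9]*) return 1 ;;                  # empty or not a number
  esac
  [ -d "/proc/$pid" ] || return 1
  printf '%s\n' "$pid"
}

# Snapshot /proc/<pid>/cmdline as one string (NUL separators become spaces).
cmdline_snapshot() {
  tr '\000' ' ' < "/proc/$1/cmdline"
}

# Succeed if the PID still runs the command captured in an earlier snapshot.
# A mismatch suggests the PID was reused by an unrelated process.
same_process() {
  [ "$(cmdline_snapshot "$1" 2>/dev/null)" = "$2" ]
}
```

A monitor would take the snapshot once after the parent exits, then call same_process periodically and treat a failure as daemon exit.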

Graceful shutdown

When Kubernetes sends SIGTERM to the pod, you have terminationGracePeriodSeconds to shut down cleanly. For a database that means: stop accepting connections, drain active queries, flush WAL, checkpoint, exit. That can take 30 seconds on a quiet instance and 5 minutes under heavy write load.

Shutdown sequence:

  1. Run ExecStop if configured
  2. Send SIGTERM to process/group (respecting KillMode)
  3. Wait up to TimeoutStopSec for the process to exit
  4. If still alive, escalate to SIGKILL

KillMode controls signal scope:

  • control-group (default) — SIGTERM to the entire process group
  • mixed — SIGTERM to main process, SIGKILL to group on timeout
  • process — signal only the main process, leave children alone
  • none — skip SIGTERM, only run ExecStop; safety SIGKILL as last resort

$MAINPID, $PGID, $LAST_PID are available in ExecStop.
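
Steps 2 to 4 of the shutdown sequence can be sketched for a single PID as follows. This is a simplified illustration, not runr's code: stop_with_timeout is a hypothetical name, and ExecStop and KillMode handling are left out. Note that kill -0 also succeeds for a zombie that has not been reaped yet, so the loop relies on the parent reaping its children.

```shell
# Send SIGTERM, wait up to $2 seconds for exit, then escalate to SIGKILL.
# Returns 0 on a clean stop, 1 if SIGKILL was needed.
stop_with_timeout() {
  pid=$1
  timeout=$2
  waited=0
  kill -TERM "$pid" 2>/dev/null || return 0    # process already gone
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      kill -KILL "$pid" 2>/dev/null || true    # last resort
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

In this sketch the timeout argument plays the role of TimeoutStopSec.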

Timers

Backups, vacuum, statistics refresh, log cleanup, partition management. On a VM that's cron. In a container there is no cron.

# /etc/runr/backup.timer
[Timer]
OnCalendar=*-*-* 02:00:00
RandomizedDelaySec=600

Trigger types:

  • OnCalendar — calendar expressions (Mon..Fri 03:00, *-*-* *:0/5, hourly, daily)
  • OnStartupSec — fire once after runr starts
  • OnUnitInactiveSec — fire after the target service finishes (repeating, paired with OnStartupSec)
  • RandomizedDelaySec — jitter to prevent thundering herd

If the previous run is still going when the next trigger fires, the new trigger is skipped.
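
The skip-if-still-running rule resembles the lock-directory guard often wrapped around cron jobs. The sketch below is an analogy for readers coming from cron, not a description of how runr tracks running services. run_if_idle and the exit code 75 are choices made for this example.

```shell
# Run a command unless a previous run still holds the lock directory.
# mkdir is atomic, so only one caller can create the lock.
run_if_idle() {
  lock=$1
  shift
  if ! mkdir "$lock" 2>/dev/null; then
    echo "skipped: previous run still active" >&2
    return 75
  fi
  "$@" && rc=0 || rc=$?      # keep the command's exit status
  rmdir "$lock"
  return "$rc"
}
```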

Shared cgroups

In the sidecar model, each process gets its own container with separate cgroup limits. If PostgreSQL needs 3.8GB of your 4GB pod during a heavy query, but the exporter's cgroup caps it at 256MB, the exporter is OOM-killed even though the pod has headroom.

runr puts multiple services into a shared cgroup v2 pool. PostgreSQL uses the headroom when the exporter idles; the exporter borrows from idle capacity during metric scrapes. One cgroup instead of three, less kubelet overhead.

I/O limits work with any block device type: device-mapper, RBD (Ceph), NVMe, SCSI. Device major:minor numbers are resolved automatically.
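
For orientation, the cgroup v2 cpu.max file holds a quota and a period in microseconds, so a percentage such as the CpuMax=200% used elsewhere in this README corresponds to 200000 100000 with the kernel's default period. The conversion can be sketched as below. How runr performs it internally is an assumption, and cpu_max_from_percent is a hypothetical name.

```shell
# Convert a CPU percentage (for example "200%") into a cgroup v2
# cpu.max value: "<quota_us> <period_us>".
cpu_max_from_percent() {
  pct=${1%"%"}                 # drop the trailing percent sign
  period=100000                # kernel default period, in microseconds
  printf '%s %s\n' "$((pct * period / 100))" "$period"
}
```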

On-demand tasks

Some work is one-off rather than scheduled: pg_basebackup, pg_dump, REINDEX, schema migrations, data exports. With runr, these run as background tasks:

runr background run -- /usr/local/bin/migrate-db --target 42
runr background run --env PGDATABASE=analytics --kill-timeout 600 \
  -- pg_dump -Fc -f /backup/analytics.dump
runr background list
runr background log a1b2c3

Background services are identified by UUID, persist state to disk (survive runr restarts), and stop via SIGTERM → timeout → SIGKILL. systemd-run compatibility mode also works.

Log management

PostgreSQL logs to a file, pg_doorman logs to syslog, the exporter logs to stderr. Without a log collector, you need logrotate (requires cron), rsyslog (requires a daemon), and manual cleanup scripts.

Managed process logs with automatic rotation:

[Log]
Directory=/var/log/postgresql
FileSize=100M
FileCount=7
RotateEvery=24h
Sink=stdout
Prefix=[%T %s pid=%p]

Rotation by size and time, gzip compression, configurable retention. Line prefixes with dynamic placeholders: %T (timestamp), %s (service name), %p (PID), %R (restart count), %S (stdout/stderr), %d (date), %I (ISO 8601), %U (unix epoch). Services continue running when disk fills up — log writes are discarded, the process isn't killed.
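
The retention side of rotation (FileCount) can be pictured with a small shell sketch. It assumes a numeric-suffix naming scheme (app.log.1, app.log.2, ...) and skips gzip compression. Both are assumptions made for illustration, since this README does not describe runr's on-disk naming, and rotate_log is a hypothetical helper.

```shell
# Rotate $1, keeping at most $2 old files named $1.1 .. $1.$2
# (the highest number is the oldest).
rotate_log() {
  file=$1
  keep=$2
  rm -f "$file.$keep"                       # drop the oldest file
  i=$((keep - 1))
  while [ "$i" -ge 1 ]; do
    if [ -e "$file.$i" ]; then
      mv "$file.$i" "$file.$((i + 1))"      # shift older files up by one
    fi
    i=$((i - 1))
  done
  mv "$file" "$file.1"
  : > "$file"                               # start a fresh, empty log
}
```

A supervisor that owns the write side reopens the new file itself. A standalone script like this would also have to signal the writing process to reopen its log.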

Integrated syslog server for legacy applications:

# /etc/runr/syslog.conf
[server]
listen = /dev/log

[pg_doorman]
appname = pg_doorman
directory = /var/log/syslog/pg_doorman
max_size = 50M
max_count = 10

Applications that only speak syslog work without rsyslog, logrotate, or cron in the image.

Log inspection:

runr log postgresql -n 50        # last 50 lines
runr log postgresql -f           # follow (tail -f)
journalctl -u postgresql -f      # systemd compatibility mode

PID 1 behavior

When runr is PID 1, it handles what the kernel expects from init:

  • Zombie reaping. Calls waitpid(-1, WNOHANG) every 200ms. Prevents zombie accumulation from PostgreSQL backends orphaned during connection termination.
  • Signal forwarding. SIGTERM from kubelet triggers per-service shutdown sequences instead of leaving every process to be SIGKILLed when the grace period expires.
  • Cgroup v2 initialization. Creates /sys/fs/cgroup/runr, moves itself there, enables cpu, io, memory controllers. Services can override placement with Cgroup=.
  • Subreaper. Sets PR_SET_CHILD_SUBREAPER so orphaned grandchild processes are reparented to runr.

Dynamic reconfiguration

cat > /etc/runr/new-exporter.service << 'EOF'
[Service]
ExecStart=/usr/bin/node_exporter
Restart=always
Autostart=yes
EOF

runr daemon-reload    # new service detected and started, existing services untouched

Hot-reload applies changes at different lifecycle points:

| When applied | Fields |
|---|---|
| Next start | ExecStart, ExecStartPre, WorkingDirectory, User, Group, Environment, Nice, LimitNOFILE, CapabilityBoundingSet, TimeoutStartSec, PIDFile |
| Next stop | KillMode, ExecStop, TimeoutStopSec |
| Next reload | ExecReload |
| Next exit/crash | Restart, RestartSec |
| Immediately | MaxMemoryRSS |
| Requires restart | Type (different actor) |

Services in Failed state with Restart=always or on-failure get restarted automatically on reload.

Systemd compatibility modes

Create symlinks to activate the drop-in replacements:

ln -s /usr/bin/runr /usr/local/bin/systemctl
ln -s /usr/bin/runr /usr/local/bin/journalctl
ln -s /usr/bin/runr /usr/local/bin/systemd-run

runr detects the binary name at startup and switches CLI parsing:

  • systemctl status|start|stop|reload|kill|show|cat|daemon-reload <unit>
  • journalctl -u <unit> [-n N] [-f] [-e] (multiple -u for merged output)
  • systemd-run [--unit <name>] [--wait] [-q] -- <command>

Existing scripts and Ansible playbooks work without changes.
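
Dispatching on the invoked name is the same multi-call pattern BusyBox uses. In shell it can be sketched as below. dispatch_mode is a hypothetical function that only illustrates the idea and does not mirror runr's Rust implementation.

```shell
# Pick a CLI personality from the name the binary was invoked as.
dispatch_mode() {
  case "$(basename "$1")" in
    systemctl)   echo systemctl ;;
    journalctl)  echo journalctl ;;
    systemd-run) echo systemd-run ;;
    *)           echo runr ;;          # default: native runr CLI
  esac
}

# Inside a script this would be called as: dispatch_mode "$0"
```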


Real-world examples

A PostgreSQL pod, sketched:

Container (PID 1 = runr)
├── postgresql              Type=forking, PIDFile=/var/run/postgresql/postmaster.pid
├── pg_exporter             Type=simple, Restart=always
├── pg_doorman              Type=simple, Cgroup=infra
├── wal-g-backup.timer      OnCalendar=*-*-* 02:00:00
├── pg_stat_monitor.timer   OnUnitInactiveSec=5min
└── [on-demand: runr background run -- pg_basebackup ...]

PostgreSQL pod

# /etc/runr/postgresql.service
[Service]
Type=forking
User=postgres
Group=postgres
WorkingDirectory=~
ExecStartPre=-/usr/local/bin/pg-preflight-check.sh
ExecStart=/usr/bin/pg_ctl start -D /pgdata -l /dev/null
ExecStop=/usr/bin/pg_ctl stop -D /pgdata -m fast
ExecReload=/usr/bin/pg_ctl reload -D /pgdata
PIDFile=/var/run/postgresql/postmaster.pid
TimeoutStartSec=120s
TimeoutStopSec=300s
Restart=on-failure
RestartSec=10s
Cgroup=/sys/fs/cgroup/pg-pool/cgroup.procs

[Log]
Directory=/var/log/postgresql
Sink=stdout
Prefix=[%T postgresql]
FileSize=100M
RotateEvery=24h

# /etc/runr/pg-exporter.service
[Service]
User=postgres
EnvironmentFile=/etc/pg_exporter/env
ExecStart=/usr/bin/postgres_exporter --web.listen-address=:9187
Restart=always
RestartSec=5s
Cgroup=/sys/fs/cgroup/pg-pool/cgroup.procs

[Log]
Sink=stdout
Prefix=[%T pg-exporter]

# /etc/runr/backup.service
[Service]
Type=oneshot
User=postgres
ExecStart=/usr/local/bin/wal-g backup-push /pgdata
Autostart=no
Restart=no
TimeoutStartSec=7200s

[Log]
Directory=/var/log/backup
Prefix=[%T backup]

# /etc/runr/backup.timer
[Timer]
OnCalendar=*-*-* 02:00:00
RandomizedDelaySec=600

# /etc/runr/pg-pool.cgroup
[Cgroup]
Name=pg-pool
MemoryMax=8G
CpuMax=400%
IOMax=/pgdata write_bps:200M read_bps:500M

ClickHouse pod

# /etc/runr/clickhouse.service
[Service]
Type=forking
User=clickhouse
ExecStart=/usr/bin/clickhouse-server --config-file=/etc/clickhouse-server/config.xml --daemon
ExecStop=/bin/kill -TERM $MAINPID
PIDFile=/var/run/clickhouse-server/clickhouse-server.pid
TimeoutStopSec=120s
Restart=on-failure
MaxMemoryRSS=16G
LimitNOFILE=262144

[Log]
Directory=/var/log/clickhouse-server
FileSize=200M
FileCount=10

# /etc/runr/ch-keeper.service
[Service]
Type=forking
User=clickhouse
ExecStart=/usr/bin/clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper-config.xml --daemon
PIDFile=/var/run/clickhouse-keeper/clickhouse-keeper.pid
Restart=always

[Log]
Directory=/var/log/clickhouse-keeper

# /etc/runr/cleanup.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/ch-cleanup-old-parts.sh
Autostart=no
Restart=no

# /etc/runr/cleanup.timer
[Timer]
OnStartupSec=1h
OnUnitInactiveSec=6h

Configuration reference

Service file (.service)

[Unit]
Description=PostgreSQL Database Server

[Service]
# Service type
Type=simple|forking|oneshot           # default: simple

# Execution
ExecStart=/usr/bin/postgres -D /pgdata
ExecStartPre=/usr/bin/check-disk-space.sh  # pre-start hook (prefix with - to ignore failure)
ExecStop=/usr/bin/pg_ctl stop -D /pgdata   # custom stop command
ExecReload=/usr/bin/pg_ctl reload -D /pgdata

# Identity
User=postgres
Group=postgres
WorkingDirectory=/var/lib/postgresql

# Restart policy
Restart=no|on-failure|always|halt     # default: always
RestartSec=5s                         # delay between restarts (default: 2s)

# Timeouts
TimeoutStartSec=90s                   # max startup time (default: 90s)
TimeoutStopSec=300s                   # max shutdown time (default: 90s)
TimeoutSec=90s                        # shorthand for both

# Process control
KillMode=control-group|mixed|process|none  # signal scope (default: control-group)
PIDFile=/var/run/postgresql/postmaster.pid  # for Type=forking
Autostart=yes|no                            # start on daemon-reload (default: yes)

# Resource limits
MaxMemoryRSS=2G                       # software OOM: kill if RSS exceeds limit
LimitNOFILE=65536                     # max open file descriptors
Nice=5                                # process priority (-20..19)
CapabilityBoundingSet=CAP_CHOWN CAP_KILL  # Linux capabilities to keep

# Environment
Environment=PGDATA=/pgdata
Environment=PGPORT=5432
EnvironmentFile=/etc/postgresql/env       # load from file
EnvironmentFile=-/etc/postgresql/env.local # prefix - = optional (ignore if missing)

# Cgroup
Cgroup=/sys/fs/cgroup/pg-pool/cgroup.procs  # join shared cgroup
# (alias: CgroupProcPidsFile)

[Log]
Directory=/var/log/postgresql
Sink=stdout|none                      # mirror to supervisor stdout (default: none)
FileSize=100M                         # rotate at size (default: 100M)
FileCount=7                           # keep N rotated files (default: 7)
RotateEvery=24h                       # rotate by time (default: 24h)
Prefix=[%T %s]                        # line prefix for all output
PrefixSink=[%T %s pid=%p]            # override prefix for stdout/stderr mirror
PrefixFile=[%d %T]                   # override prefix for log file

Prefix placeholders: %s (service name), %T (HH:MM:SS.mmm), %d (YYYY-MM-DD), %I (ISO 8601), %U (unix epoch.ms), %S (O=stdout, E=stderr), %p (PID), %R (restart count), %% (literal %)
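
To make the placeholder semantics concrete, here is a small sketch that expands the non-time placeholders (%s, %p, %R, %%) in a prefix template. expand_prefix is a hypothetical helper written for this example. Time placeholders such as %T are passed through unchanged so the output stays deterministic.

```shell
# Expand %s (service), %p (PID), %R (restart count) and %% in a template.
# Usage: expand_prefix TEMPLATE SERVICE PID RESTARTS
expand_prefix() {
  awk -v tpl="$1" -v svc="$2" -v pid="$3" -v restarts="$4" 'BEGIN {
    out = ""
    n = length(tpl)
    for (i = 1; i <= n; i++) {
      c = substr(tpl, i, 1)
      if (c != "%" || i == n) { out = out c; continue }  # literal character
      i++
      k = substr(tpl, i, 1)
      if      (k == "s") out = out svc
      else if (k == "p") out = out pid
      else if (k == "R") out = out restarts
      else if (k == "%") out = out "%"
      else               out = out "%" k   # other placeholders left as-is
    }
    print out
  }'
}
```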

Timer file (.timer)

[Timer]
OnCalendar=*-*-* 02:00:00            # calendar expression
OnStartupSec=30s                      # fire once after runr starts
OnUnitInactiveSec=5min                # fire after target service stops (requires OnStartupSec)
RandomizedDelaySec=600                # jitter to prevent thundering herd
Unit=backup                           # target service (default: timer name)
Autostart=yes|no                      # auto-start timer (default: yes)

Calendar syntax: Mon..Fri 03:00, *-*-* *:0/5 (every 5 min), *-*-1..7 18:00, hourly, daily, weekly, monthly, yearly

Rules:

  • At least one of OnCalendar or OnStartupSec required
  • OnUnitInactiveSec requires OnStartupSec, cannot be combined with OnCalendar
  • Target service for OnUnitInactiveSec should be Type=oneshot with Restart=no (recommended, not enforced)
  • Double-run prevention: if target service is still running, the trigger is skipped
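
The first two rules are easy to check before shipping a .timer file, for example in CI. The sketch below uses hypothetical helper names and a simple line-based key match. It is not a runr command, and runr's own parser may be stricter.

```shell
# Succeed if file $1 sets key $2 (commented-out lines do not count).
has_key() {
  grep -Eq "^[[:space:]]*$2[[:space:]]*=" "$1"
}

# Check a .timer file against the trigger rules listed above.
validate_timer() {
  if ! has_key "$1" OnCalendar && ! has_key "$1" OnStartupSec; then
    echo "$1: need OnCalendar or OnStartupSec" >&2
    return 1
  fi
  if has_key "$1" OnUnitInactiveSec; then
    if ! has_key "$1" OnStartupSec; then
      echo "$1: OnUnitInactiveSec requires OnStartupSec" >&2
      return 1
    fi
    if has_key "$1" OnCalendar; then
      echo "$1: OnUnitInactiveSec cannot be combined with OnCalendar" >&2
      return 1
    fi
  fi
  return 0
}
```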

Cgroup file (.cgroup)

[Cgroup]
Name=pg-pool                          # creates /sys/fs/cgroup/pg-pool
# (or Path=/sys/fs/cgroup/custom-name for explicit path)
MemoryMax=4G                          # memory.max
CpuMax=200%                           # cpu.max (200% = 2 cores)
IOMax=/pgdata read_bps:500M write_bps:200M read_iops:10000 write_iops:5000

Syslog configuration

[server]
listen = /dev/log                     # unix socket path

[default]
directory = /var/log/syslog/default
max_size = 10M
max_count = 7
max_age = 168h

[pg_doorman]
appname = pg_doorman                   # route by syslog appname
directory = /var/log/syslog/pg_doorman
max_size = 50M
max_count = 10
max_age = 720h

Restart policies

| Policy | Behavior |
|---|---|
| no | Never restart. Service transitions to Stopped. |
| on-failure | Restart if exit code != 0 or killed by signal (except SIGTERM). |
| always | Restart on any exit. |
| halt | On crash (non-zero exit or signal): halt the supervisor and all services. On clean exit (code 0): transition to Stopped without halt. |

Restart=halt is for critical services where local recovery is impossible. If PostgreSQL's postmaster can't start (corrupted pg_control, missing data directory), the pod needs a fresh start on a different node.
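
The policy table can be read as a small decision function. The sketch below encodes it in shell for illustration. restart_action is a hypothetical name and the three output words are labels chosen for this example, not runr state names.

```shell
# Decide what happens after a service exits.
# $1 = policy, $2 = exit code, $3 = terminating signal name ("" if none)
# Prints one of: restart, stop, halt
restart_action() {
  policy=$1
  code=$2
  sig=${3:-}
  case "$policy" in
    no)     echo stop ;;
    always) echo restart ;;
    on-failure)
      if [ -n "$sig" ] && [ "$sig" != TERM ]; then
        echo restart                 # killed by a signal other than SIGTERM
      elif [ -z "$sig" ] && [ "$code" -ne 0 ]; then
        echo restart                 # non-zero exit code
      else
        echo stop
      fi ;;
    halt)
      if [ -n "$sig" ] || [ "$code" -ne 0 ]; then
        echo halt                    # crash: stop supervisor and all services
      else
        echo stop                    # clean exit
      fi ;;
    *) return 2 ;;                   # unknown policy
  esac
}
```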


CLI

# Service control
runr start|stop|restart|reload <name>
runr kill -s TERM|KILL|HUP|INT <name>
runr enable|disable [--now] <name>

# Status
runr status [<name>]
runr list-services [--state running,failed]
runr list-timers
runr list-units [--type service|timer]
runr is-active|is-failed|is-enabled <name>

# Logs
runr log <name> [-n 50] [-f]

# Configuration
runr daemon-reload                    # reload all unit files from disk
runr cat <name>                       # show unit file contents
runr show <name>                      # show properties (Key=Value format)

# Background on-demand tasks
runr background run [--env K=V] [--kill-timeout 120] -- <command>
runr background list|status|log|stop|remove <uuid>

# Daemon
runr info                             # version, PID, memory, CPU, uptime
runr healthz|readyz                   # probe endpoints

# Completions
runr completion bash|zsh|fish

Output control

  • --json — machine-readable output
  • --no-header — suppress table headers (for scripting)
  • --quiet — suppress all output
  • --color auto|always|never

Building

# glibc (default, requires matching glibc at runtime)
cargo build --release

# musl (fully static, runs on any Linux)
cargo build --release --target x86_64-unknown-linux-musl

The musl build produces a statically linked binary with no runtime dependencies. Copy it into any Linux container image (Alpine, Debian, Ubuntu, Fedora, scratch) and it works. No glibc version matching, no shared library hunt.

Release profile: codegen-units = 1, LTO, opt-level = "s", symbols stripped, panic = "abort". Uses jemalloc as global allocator.

Container images

glibc:

FROM rust:slim-bookworm AS builder
# Build in /app so the artifact path matches the COPY in the final stage
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
COPY --from=builder /app/target/release/runr /usr/bin/runr
ENTRYPOINT ["/usr/bin/runr", "supervisor"]

musl (static):

FROM rust:slim-bookworm AS builder
# Build in /app so the artifact path matches the COPY in the final stage
WORKDIR /app
COPY . .
RUN rustup target add x86_64-unknown-linux-musl && \
    cargo build --release --target x86_64-unknown-linux-musl

FROM alpine:3.20
COPY --from=builder /app/target/x86_64-unknown-linux-musl/release/runr /usr/bin/runr
ENTRYPOINT ["/usr/bin/runr", "supervisor"]

Testing

BDD tests with Cucumber covering 47 feature files:

make cucumber                          # all BDD tests locally
make cucumber FEATURE=cli.feature      # specific feature
make cucumber TAGS=@smoke              # by tag

Tags: @linux-only (skipped on macOS), @pid1-only (Docker-only), @smoke, @critical
