
feat(reconciler): persistent worker pools + bulk operations for decoupled ingestion processing #526

Open
mfiedorowicz wants to merge 23 commits into develop from prs/ingestion-throughput

Conversation

@mfiedorowicz (Member)

Summary

Rework the reconciler's ingestion path from per-message channel/worker dispatch into a Postgres-backed inbox model with persistent worker pools. The Redis consume loop now drains at Postgres COPY speed and no longer blocks on NetBox HTTP; plan and apply phases run in their own decoupled processors with independent concurrency.

What changes

Consume loop is now thin. handleStreamMessage decodes the batch, bulk-COPYs new ingestion_logs rows (state=QUEUED), XACK/XDELs, and returns. No worker channels, no NetBox calls, no waiting on apply.
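
A rough sketch of the thin consume path, assuming go-redis v9; the helper and ops names here are illustrative stand-ins, not the exact symbols in the diff:

```go
package reconciler

import (
	"context"

	"github.com/redis/go-redis/v9"
)

type ingestionLog struct{ /* decoded entity fields elided */ }

// bulkOps is a stand-in for the ops layer; the return shape is an assumption.
type bulkOps interface {
	BulkCreateIngestionLogs(ctx context.Context, logs []ingestionLog) (int64, error)
}

// handleStreamMessage: decode, bulk-COPY, ack, return. No NetBox calls and
// no per-message worker dispatch happen here.
func handleStreamMessage(ctx context.Context, rdb *redis.Client, ops bulkOps,
	stream, group string, msg redis.XMessage) error {
	logs, err := decodeEntities(msg) // illustrative decode helper
	if err != nil {
		return err
	}
	// One COPY per batch; rows land in state=QUEUED for the processors to claim.
	if _, err := ops.BulkCreateIngestionLogs(ctx, logs); err != nil {
		return err
	}
	// Ack and delete only after the rows are durably queued in Postgres.
	if err := rdb.XAck(ctx, stream, group, msg.ID).Err(); err != nil {
		return err
	}
	return rdb.XDel(ctx, stream, msg.ID).Err()
}

func decodeEntities(msg redis.XMessage) ([]ingestionLog, error) {
	_ = msg // decoding elided in this sketch
	return nil, nil
}
```

Ordering matters in this sketch: XACK/XDEL happen only after the COPY succeeds, so a crash mid-handler re-delivers the batch rather than dropping it.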

Two new long-running processors:

  • IngestionLogProcessor — polls QUEUED rows via FOR UPDATE SKIP LOCKED, dispatches to a persistent worker pool. Calls GenerateChangeSet (plan) at its own pace. Configurable concurrency, batch size, and Redis-stream-length backpressure threshold.
  • ApplyProcessor / AutoApplyProcessor — apply phase split out from generation for independent tuning. AutoApplyProcessor uses a new combined /bulk-plan-apply HTTP call when AUTO_APPLY_CHANGESETS=true to halve NetBox round-trips per entity.

Bulk HTTP clients to the NetBox plugin: BulkPlan, BulkApply, BulkPlanApply (gated by BULK_OPERATIONS_ENABLED, default false for compatibility with older plugin versions).

Bulk DB writes: BulkCreateIngestionLogs (sqlc :copyfrom) and BulkPersistChangeSets replace per-row INSERT loops in the hot paths.

Background default-branch refresher owns the NetBox HTTP call that previously sat in the consume hot path. Hot-path callers read a cache; a goroutine refreshes on a fixed interval and retains the last-known value on transient failure, so consume keeps draining if NetBox/auth is briefly unreachable. Ops.Start(ctx) is idempotent via sync.Once.
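
A minimal sketch of that refresher shape; the field and method names are illustrative, not the PR's actual Ops struct:

```go
package reconciler

import (
	"context"
	"log/slog"
	"sync"
	"time"
)

type Ops struct {
	startOnce sync.Once

	mu            sync.RWMutex
	defaultBranch *string // nil until the first successful refresh

	fetchDefaultBranch func(ctx context.Context) (*string, error) // the NetBox HTTP call
	refreshInterval    time.Duration
	logger             *slog.Logger
}

// Start launches the refresher exactly once; later calls are true no-ops.
func (o *Ops) Start(ctx context.Context) {
	o.startOnce.Do(func() { go o.refreshLoop(ctx) })
}

func (o *Ops) refreshLoop(ctx context.Context) {
	ticker := time.NewTicker(o.refreshInterval)
	defer ticker.Stop()
	for {
		if branch, err := o.fetchDefaultBranch(ctx); err != nil {
			// Retain the last-known value; the consume loop keeps draining.
			o.logger.Warn("default branch refresh failed", "error", err)
		} else {
			o.mu.Lock()
			o.defaultBranch = branch
			o.mu.Unlock()
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// DefaultBranch is the cache-only hot-path read: (nil, nil) on a cold cache.
func (o *Ops) DefaultBranch() (*string, error) {
	o.mu.RLock()
	defer o.mu.RUnlock()
	return o.defaultBranch, nil
}
```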

Proto / state machine: new APPLYING state for the apply-in-flight window.

Cleanup: drops the dead GenerateChangeSetConcurrency / ApplyChangeSetConcurrency config (left over from the removed in-line worker dispatch).

Notes for review

  • 21 commits, broken up by logical step (bulk-COPY paths, persistent worker pool, apply split, bulk-plan-apply, default-branch decoupling, dead-code cleanup). Reviewing commit-by-commit may be easier than the full diff.
  • Behaviour is unchanged when BULK_OPERATIONS_ENABLED=false and AUTO_APPLY_CHANGESETS=false (the defaults).
  • New tests: ingestion_log_processor_test.go, expanded ops_test.go. Existing ingestion_processor tests trimmed to reflect the slimmer consume loop.

Test plan

  • make lint clean
  • make test passes (race detector on)
  • Manual: NetBox briefly stopped → consume loop keeps draining; warning logged once per refresh interval; resumes cleanly on NetBox restart
  • Manual: end-to-end load test against local stack, both with BULK_OPERATIONS_ENABLED=true and the legacy path

Strip handleStreamMessage to decode → CreateIngestionLogs → XACK → XDEL.
Remove channel/worker/allDone machinery that spawned unbounded goroutine
pools per message. Consume loop now returns immediately after Postgres
bulk insert, draining Redis at ~17K entities/sec (vs ~62/sec before).

Changeset generation and application will move to a separate inbox
processor that polls ingestion_logs independently.

Polls ingestion_logs for QUEUED rows using FOR UPDATE SKIP LOCKED,
dispatches to a persistent worker pool via buffered channel. Workers
call GenerateChangeSet and ApplyChangeSet at their own pace, decoupled
from the Redis consume loop. Also fixes MaxIdleConnsPerHost (was Go
default of 2) to match MaxIdleConns for HTTP connection reuse.
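
A hedged sketch of the claim-and-dispatch shape this commit describes; the SQL and names are assumptions, not the generated sqlc code:

```go
package reconciler

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// Hypothetical claim query; state values here are placeholders, the real
// transition is defined by the proto enum.
const claimQueuedIngestionLogs = `
UPDATE ingestion_logs SET state = 2 -- hypothetical in-progress state
WHERE id IN (
    SELECT id FROM ingestion_logs
    WHERE state = 1 -- QUEUED
    ORDER BY id
    LIMIT $1
    FOR UPDATE SKIP LOCKED
)
RETURNING id`

func pollLoop(ctx context.Context, pool *pgxpool.Pool, batchSize, concurrency int, idle time.Duration) {
	batches := make(chan []int64, concurrency) // buffered dispatch channel
	for i := 0; i < concurrency; i++ {
		go func() { // persistent worker: plans at its own pace
			for batch := range batches {
				_ = batch // GenerateChangeSet per claimed row, elided here
			}
		}()
	}
	for ctx.Err() == nil {
		rows, err := pool.Query(ctx, claimQueuedIngestionLogs, batchSize)
		if err != nil {
			time.Sleep(idle)
			continue
		}
		var ids []int64
		for rows.Next() {
			var id int64
			if rows.Scan(&id) == nil {
				ids = append(ids, id)
			}
		}
		rows.Close()
		if len(ids) == 0 {
			time.Sleep(idle) // nothing queued; back off one idle interval
			continue
		}
		batches <- ids
	}
	close(batches)
}
```

The MaxIdleConnsPerHost fix matters under this shape: with Go's default of 2 idle connections per host, concurrent workers hitting the same NetBox host would close and re-dial most connections instead of reusing them.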
… processor

Adds BulkPlan HTTP client for the plugin's /bulk-plan endpoint,
replacing N per-entity /generate-diff calls with one batched call per
claimed batch. The ingestion log processor now calls
BulkGenerateChangeSets which builds the bulk request, processes
per-entity results, and persists change sets individually.

Extracted handleGenerateChangeSetFailure and persistChangeSet helpers
from GenerateChangeSet for reuse by the bulk path. Moved
BulkGenerateChangeSetResult to reconciler/ops to avoid import cycle
with reconciler/mocks.

Add INGESTION_LOG_PROCESSOR_CONCURRENCY config to run multiple poll
workers within the ingestion log processor. Each worker independently
claims and processes batches via FOR UPDATE SKIP LOCKED, enabling
parallel processing without contention.
- Add BulkPersistChangeSets: single-transaction COPY for change_sets and
  changes rows with pre-allocated sequence IDs, replacing per-item
  persistChangeSet calls in BulkGenerateChangeSets
- Add BulkUpdateIngestionLogStates: unnest-based batch state update (see the sketch after this list)
- Add BulkCreateChangeSets and BulkTruncateChangeSets queries
- Add partial index on ingestion_logs(id) WHERE state = 1 for
  ClaimQueuedIngestionLogs, drop redundant entity_hash and request_id
  indexes
- Add stripNoopOnlyChanges to clear noop-only changesets before persist
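
For reference, a hypothetical shape of the unnest-based update and the partial claim index; the column names are assumptions based on this PR's description, not the actual schema:

```go
package postgres

// Bulk state update: one statement for the whole batch instead of N UPDATEs.
const bulkUpdateIngestionLogStates = `
UPDATE ingestion_logs AS il
SET state = u.state,
    error = u.error
FROM (
    SELECT unnest($1::bigint[]) AS id,
           unnest($2::int[])    AS state,
           unnest($3::text[])   AS error
) AS u
WHERE il.id = u.id`

// Partial index so ClaimQueuedIngestionLogs scans only QUEUED (state = 1) rows.
const createClaimIndex = `
CREATE INDEX IF NOT EXISTS idx_ingestion_logs_claim
ON ingestion_logs (id)
WHERE state = 1`
```
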
…ades

Default to per-item differ.Diff calls (BULK_OPERATIONS_ENABLED=false)
so existing deployments with older NetBox plugin versions continue
working after reconciler upgrade. When enabled, uses BulkPlan HTTP
endpoint for batched diffing. Both paths share the same bulk-persist
DB layer.
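
A minimal sketch of that gate, with illustrative names standing in for the real config and methods (this is a shape, not the diff's code):

```go
package reconciler

import "context"

type IngestionLog struct{ ID int64 }

type planOps struct {
	bulkEnabled bool                                         // BULK_OPERATIONS_ENABLED
	planOne     func(context.Context, IngestionLog) error    // per-item differ.Diff path
	planBatch   func(context.Context, []IngestionLog) error  // /bulk-plan path
}

func (o *planOps) generateChangeSets(ctx context.Context, logs []IngestionLog) error {
	if !o.bulkEnabled { // default false keeps older plugin versions working
		for _, l := range logs {
			if err := o.planOne(ctx, l); err != nil {
				continue // per-row failure recorded elsewhere; keep draining
			}
		}
		return nil
	}
	return o.planBatch(ctx, logs) // one batched HTTP call per claimed batch
}
```
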
Add ClaimOpenIngestionLogs (state OPEN→APPLYING with FOR UPDATE SKIP
LOCKED), ResetApplyingIngestionLogs for crash recovery, and
loadLatestChangeSets that loads changesets with NOOP changes filtered
out at the SQL level.

New types: OpenIngestionLog, BulkApplyItem, BulkApplyResult.
New config: ApplyProcessorBatchSize, ApplyProcessorConcurrency.

Add NetBox plugin client method for POST /bulk-apply/ endpoint and
BulkApplyChangeSets applier that sends multiple changesets in a single
HTTP call. Refactor toApplyRequest to share serialization between
singular and bulk paths.

ApplyProcessor polls OPEN ingestion logs (claimed via state 8), loads
their changesets from the DB, and applies them to NetBox via
BulkApplyChangeSets. Runs as a separate component alongside the plan
processor with configurable concurrency.

Also adds BulkApplyChangeSets to Ops and the IngestionProcessorOps
interface, and registers the apply processor in cmd/reconciler/main.go
with crash recovery (ResetApplyingIngestionLogs on startup).

Remove ApplyChangeSet calls and AutoApplyChangesets branching from the
plan processor — applying is now handled by the separate ApplyProcessor.
Remove obsolete tests (ApplyChangeSetError, SkipsApplyWhenDisabled) and
clean up ProcessesItems test expectations.

Add State_APPLYING = 8 used by the ApplyProcessor to claim OPEN rows
via FOR UPDATE SKIP LOCKED before applying them to NetBox.

Both IngestionLogProcessor.pollWorker and ApplyProcessor.pollWorker
declare an error return, but every path returns nil — errors are logged
and the loop continues. The single-worker dispatch in pollLoop returned
the always-nil error; the multi-worker dispatch silently discarded it
with _ = p.pollWorker(ctx). The asymmetry was misleading.

Drop the error return entirely. Start/pollLoop still return error for
the Component interface, but the internal worker no longer pretends to
have a failure path it doesn't use.

The IngestionProcessor.GenerateChangeSet/ApplyChangeSet worker pool
methods are no longer wired into handleStreamMessage after the
consume-loop decoupling. With them gone, GENERATE_CHANGESET_CONCURRENCY
and APPLY_CHANGESET_CONCURRENCY no longer control anything — remove
them along with IngestionLogToProcess and the test that exercised the
dead worker pool.

The IngestionProcessorOps.GenerateChangeSet/ApplyChangeSet interface
methods stay because the gRPC reconciler service in diode-pro still
calls them for the singular RPC handlers.

Adds the Go-side client for the new diode-netbox-plugin endpoint that
combines plan + apply in a single HTTP round-trip. The endpoint returns
a per-entity result with the generated change_set (always populated when
plan succeeded) plus split plan/apply error fields, so the reconciler
can persist the change_set for audit/retry regardless of apply outcome.

HTTP 200 and 207 are both successful responses here — the caller
inspects per-result Errors to attribute plan vs apply failures.
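
Sketch of the status handling this implies, assuming a conventional results envelope; the field names are guesses, not the plugin's actual schema:

```go
package netboxdiodeplugin

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type planApplyResult struct {
	ChangeSet  json.RawMessage `json:"change_set"` // populated whenever plan succeeded
	PlanError  string          `json:"plan_error,omitempty"`
	ApplyError string          `json:"apply_error,omitempty"`
}

func decodeBulkPlanApply(resp *http.Response) ([]planApplyResult, error) {
	defer resp.Body.Close()
	// 200 (all succeeded) and 207 (mixed per-entity outcomes) are both
	// successful at the HTTP layer; callers attribute failures per result.
	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusMultiStatus {
		return nil, fmt.Errorf("bulk-plan-apply: unexpected status %d", resp.StatusCode)
	}
	var out struct {
		Results []planApplyResult `json:"results"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Results, nil
}
```
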
…an-apply

Replaces the two-processor design (IngestionLogProcessor + ApplyProcessor),
in which both consumed NetBox capacity in parallel and competed for the
shared rate limiter, with two mutually exclusive processors per tenant,
routed by AUTO_APPLY_CHANGESETS:

  AUTO_APPLY_CHANGESETS=false -> IngestionLogProcessor only
                                  (plan-only, leaves rows in OPEN for
                                  manual review)

  AUTO_APPLY_CHANGESETS=true  -> AutoApplyProcessor only
                                  (combined plan + apply in one
                                  /bulk-plan-apply HTTP call; rows
                                  go QUEUED -> APPLYING -> APPLIED)

Eliminates the plan-vs-apply contention on the rate limiter: only one
processor consumes NetBox capacity per tenant. Halves HTTP overhead per
ingested entity (one round-trip instead of two) and avoids the
NetBox-side double-validation of the split flow.

Schema changes:
  - New SQL ClaimQueuedForAutoApply transitions QUEUED (1) -> APPLYING
    (8) atomically under FOR UPDATE SKIP LOCKED.
  - ResetApplyingIngestionLogs now resets APPLYING -> QUEUED (was
    APPLYING -> OPEN to feed the removed ApplyProcessor).
  - Drops the no-longer-used ClaimOpenIngestionLogs query and the
    idx_ingestion_logs_claim_apply partial index for state=2.

Go changes:
  - New AutoApplyProcessor with the same poll-loop / worker-pool shape
    as IngestionLogProcessor (configurable AUTO_APPLY_PROCESSOR_*
    env vars).
  - New Ops.BulkPlanApply orchestrates the call, persists returned
    change_sets in bulk, and writes the per-entity terminal state.
  - New differ.ConvertBulkPlanApplyResult splits the plan/apply error
    fields from the new endpoint's response shape.
  - cmd/reconciler/main.go starts exactly one of the two processors
    based on AUTO_APPLY_CHANGESETS; ResetApplyingIngestionLogs runs at
    startup only on the auto-apply path.
  - Removed ApplyProcessor, ops.BulkApplyChangeSets, ops.BulkApplyItem,
    ops.BulkApplyResult, ops.OpenIngestionLog, and the dead
    loadLatestChangeSets / loadedChangeSet helpers.
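
A hypothetical rendering of the claim and crash-recovery queries described above; the real queries likely select and return more columns:

```go
package postgres

// Atomic QUEUED -> APPLYING transition under FOR UPDATE SKIP LOCKED.
const claimQueuedForAutoApply = `
UPDATE ingestion_logs SET state = 8 -- APPLYING
WHERE id IN (
    SELECT id FROM ingestion_logs
    WHERE state = 1 -- QUEUED
    ORDER BY id
    LIMIT $1
    FOR UPDATE SKIP LOCKED
)
RETURNING id`

// Crash recovery at startup (auto-apply path only): anything left in
// APPLYING by a crashed worker goes back to QUEUED to be re-claimed.
const resetApplyingIngestionLogs = `
UPDATE ingestion_logs SET state = 1 -- QUEUED
WHERE state = 8 -- APPLYING`
```
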
…anApply

BulkGenerateChangeSets has long persisted a failure-placeholder change_set
when generate_changeset raises — so audit, retry, and (in diode-pro)
deviation-type association have a change_set row to attach to. The new
BulkPlanApply path was missing this: plan failures just flipped the
ingestion_log to FAILED without producing a placeholder change_set, which
in diode-pro silently skipped the deviation-type association step.

Add persistPlanApplyFailurePlaceholder (mirroring
handleGenerateChangeSetFailure): on plan error, record the FAILED state
with the error message AND insert a FailedDiffChangeSet so the caller's
BulkPlanApplyResult has a ChangeSetID populated. Used for both the
whole-batch HTTP error path and the per-entity missing-result path.

Removes the now-redundant handlePlanApplyBatchFailure helper.
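
A contract-level sketch of the placeholder path; the store hooks here are stand-ins for the real persistence calls, not the diff's signatures:

```go
package reconciler

import "context"

type failureStores struct {
	setFailed             func(ctx context.Context, logID int64, msg string) error
	insertFailedChangeSet func(ctx context.Context, logID int64, msg string) (string, error)
}

// On plan error: record the FAILED state with the error message AND insert a
// placeholder change_set, so the caller's result always carries a ChangeSetID.
// Used for both the whole-batch HTTP error path and per-entity missing results.
func persistPlanApplyFailurePlaceholder(ctx context.Context, s failureStores, logID int64, planErr error) (string, error) {
	if err := s.setFailed(ctx, logID, planErr.Error()); err != nil {
		return "", err
	}
	return s.insertFailedChangeSet(ctx, logID, planErr.Error())
}
```
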
Without backpressure, AutoApplyProcessor immediately starts claiming
QUEUED rows during a burst ingest — competing with the Redis consume
loop for Postgres connections and CPU while also consuming NetBox
capacity that's already the slowest stage. Net effect under load: both
stages run slower than if AutoApply yielded to the consume loop first.

Add the same BackpressureFunc parameter IngestionLogProcessor already
takes. In the poll loop, before claiming a batch, check the function
and if it returns true, sleep one idle interval and retry. The
diode-server main.go wires it to redisStreamClient.XLen() >
INGESTION_LOG_PROCESSOR_BACKPRESSURE_THRESHOLD — same threshold as the
plan-only path uses, since the two processors are mutually exclusive
and the variable describes "log processing should defer to the consume
loop".
CreateIngestionLogs was calling RefreshDefaultBranch on every Redis
stream batch, which evicted the 60s LRU cache entry and forced a fresh
NetBox HTTP call. Under burst ingest the consume loop fired
tens of default-branch requests per second, saturating NetBox
plugin workers and starving /bulk-plan-apply.

Switch to the cached DefaultBranch lookup. A short staleness
window on a value that changes rarely is acceptable, and matches
what other ingestion-log paths already do.

Background goroutine (Ops.Start) owns NetBox HTTP for the default-branch
lookup; hot-path callers read from cache only. Consume loop keeps
draining Redis when NetBox/hydra is unreachable.

Failed refresh retains the last-known value. Cold cache returns
(nil, nil) and callers degrade to no-branch-filter lookups.

The doc comment claimed "subsequent calls are no-ops", but the
implementation launched a new goroutine on every call. With current
callers this never happens, but the sync.Once guard is defensive
against future misuse.
github-actions Bot added the documentation and go labels on May 13, 2026

github-actions Bot commented May 13, 2026

Vulnerability Scan: Passed — diode-reconciler

Image: diode-reconciler:scan

No vulnerabilities found.

Commit: 16a5ec8


github-actions Bot commented May 13, 2026

Vulnerability Scan: Passed — diode-auth

Image: diode-auth:scan

| Source | Library | CVE | Severity | Installed | Fixed | Title |
| --- | --- | --- | --- | --- | --- | --- |
| usr/bin/hydra | github.com/docker/docker | CVE-2026-34040 | 🟠 HIGH | v28.3.3+incompatible | 29.3.1 | Moby: Moby: Authorization bypass vulnerability |
| usr/bin/hydra | github.com/docker/docker | CVE-2026-33997 | 🟡 MEDIUM | v28.3.3+incompatible | 29.3.1 | moby: docker: github.com/moby/moby: Moby: Privilege validation bypass during plu |
| usr/bin/hydra | github.com/go-jose/go-jose/v3 | CVE-2026-34986 | 🟠 HIGH | v3.0.4 | 3.0.5 | github.com/go-jose/go-jose/v3: github.com/go-jose/go-jose/v4: Go JOSE: Denial of |
| usr/bin/hydra | github.com/jackc/pgx/v5 | CVE-2026-33816 | 🔴 CRITICAL | v5.7.5 | 5.9.0 | github.com/jackc/pgx/v5: github.com/jackc/pgx: Memory-safety vulnerability |
| usr/bin/hydra | github.com/jackc/pgx/v5 | CVE-2026-41889 | ⚪ LOW | v5.7.5 | 5.9.2 | pgx is a PostgreSQL driver and toolkit for Go. Prior to version 5.9.2, ... |
| usr/bin/hydra | go.opentelemetry.io/otel | CVE-2026-29181 | 🟠 HIGH | v1.40.0 | 1.41.0 | OpenTelemetry-Go: multi-value baggage header extraction causes excessive alloc |
| usr/bin/hydra | go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp | CVE-2026-39882 | 🟡 MEDIUM | v1.37.0 | 1.43.0 | OpenTelemetry-Go is the Go implementation of OpenTelemetry. Prior to 1 ... |
| usr/bin/hydra | go.opentelemetry.io/otel/sdk | CVE-2026-39883 | 🟠 HIGH | v1.40.0 | 1.43.0 | opentelemetry-go: BSD kenv command not using absolute path enables PATH hijackin |
| usr/bin/hydra | stdlib | CVE-2026-25679 | 🟠 HIGH | v1.26.0 | 1.25.8, 1.26.1 | net/url: Incorrect parsing of IPv6 host literals in net/url |
| usr/bin/hydra | stdlib | CVE-2026-27137 | 🟠 HIGH | v1.26.0 | 1.26.1 | crypto/x509: Incorrect enforcement of email constraints in crypto/x509 |
| usr/bin/hydra | stdlib | CVE-2026-32280 | 🟠 HIGH | v1.26.0 | 1.25.9, 1.26.2 | crypto/x509: crypto/tls: golang: Go: Denial of Service vulnerability in certific |
| usr/bin/hydra | stdlib | CVE-2026-32281 | 🟠 HIGH | v1.26.0 | 1.25.9, 1.26.2 | crypto/x509: golang: Go crypto/x509: Denial of Service via inefficient certifica |
| usr/bin/hydra | stdlib | CVE-2026-32283 | 🟠 HIGH | v1.26.0 | 1.25.9, 1.26.2 | crypto/tls: golang: Go crypto/tls: Denial of Service via multiple TLS 1.3 key up |
| usr/bin/hydra | stdlib | CVE-2026-33810 | 🟠 HIGH | v1.26.0 | 1.26.2 | crypto/x509: golang: Go crypto/x509: Certificate validation bypass due to incorr |
| usr/bin/hydra | stdlib | CVE-2026-33811 | 🟠 HIGH | v1.26.0 | 1.25.10, 1.26.3 | When using LookupCNAME with the cgo DNS resolver, a very long CNAME re ... |
| usr/bin/hydra | stdlib | CVE-2026-33814 | 🟠 HIGH | v1.26.0 | 1.25.10, 1.26.3 | When processing HTTP/2 SETTINGS frames, transport will enter an infini ... |
| usr/bin/hydra | stdlib | CVE-2026-39820 | 🟠 HIGH | v1.26.0 | 1.25.10, 1.26.3 | Well-crafted inputs reaching ParseAddress, ParseAddressList, and Parse ... |
| usr/bin/hydra | stdlib | CVE-2026-39836 | 🟠 HIGH | v1.26.0 | 1.25.10, 1.26.3 | Panic in Dial and LookupPort when handling NUL byte on Windows in net |
| usr/bin/hydra | stdlib | CVE-2026-42499 | 🟠 HIGH | v1.26.0 | 1.25.10, 1.26.3 | Pathological inputs could cause DoS through consumePhrase when parsing ... |
| usr/bin/hydra | stdlib | CVE-2026-27142 | 🟡 MEDIUM | v1.26.0 | 1.25.8, 1.26.1 | html/template: URLs in meta content attribute actions are not escaped in html/te |
| usr/bin/hydra | stdlib | CVE-2026-32282 | 🟡 MEDIUM | v1.26.0 | 1.25.9, 1.26.2 | golang: internal/syscall/unix: Root.Chmod can follow symlinks out of the root |
| usr/bin/hydra | stdlib | CVE-2026-32288 | 🟡 MEDIUM | v1.26.0 | 1.25.9, 1.26.2 | archive/tar: golang: Go's archive/tar package: Denial of Service via maliciously |
| usr/bin/hydra | stdlib | CVE-2026-32289 | 🟡 MEDIUM | v1.26.0 | 1.25.9, 1.26.2 | html/template: golang: html/template: Cross-Site Scripting (XSS) via improper co |
| usr/bin/hydra | stdlib | CVE-2026-39823 | 🟡 MEDIUM | v1.26.0 | 1.25.10, 1.26.3 | CVE-2026-27142 fixed a vulnerability in which URLs were not correctly ... |
| usr/bin/hydra | stdlib | CVE-2026-39825 | 🟡 MEDIUM | v1.26.0 | 1.25.10, 1.26.3 | ReverseProxy can forward queries containing parameters not visible to ... |
| usr/bin/hydra | stdlib | CVE-2026-39826 | 🟡 MEDIUM | v1.26.0 | 1.25.10, 1.26.3 | If a trusted template author were to write a `<script>` tag containing a ... |
| usr/bin/hydra | stdlib | CVE-2026-27138 | ⚪ LOW | v1.26.0 | 1.26.1 | crypto/x509: Panic in name constraint checking for malformed certificates in cry |
| usr/bin/hydra | stdlib | CVE-2026-27139 | ⚪ LOW | v1.26.0 | 1.25.8, 1.26.1 | os: FileInfo can escape from a Root in golang os module |

Commit: 16a5ec8


github-actions Bot commented May 13, 2026

Vulnerability Scan: Passed — diode-ingester

Image: diode-ingester:scan

No vulnerabilities found.

Commit: 16a5ec8


github-actions Bot commented May 13, 2026

Go test coverage

| STATUS | ELAPSED | PACKAGE | COVER | PASS | FAIL | SKIP |
| --- | --- | --- | --- | --- | --- | --- |
| 🟢 PASS | 1.43s | github.com/netboxlabs/diode/diode-server/auth | 44.7% | 42 | 0 | 0 |
| 🟢 PASS | 0.95s | github.com/netboxlabs/diode/diode-server/auth/cli | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | 1.02s | github.com/netboxlabs/diode/diode-server/authutil | 82.8% | 5 | 0 | 0 |
| 🟢 PASS | 0.21s | github.com/netboxlabs/diode/diode-server/dbstore/postgres | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | 1.12s | github.com/netboxlabs/diode/diode-server/entityhash | 79.2% | 13 | 0 | 0 |
| 🟢 PASS | 1.14s | github.com/netboxlabs/diode/diode-server/entitymatcher | 82.8% | 97 | 0 | 0 |
| 🟢 PASS | 0.09s | github.com/netboxlabs/diode/diode-server/errors | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | 1.16s | github.com/netboxlabs/diode/diode-server/graph | 52.0% | 81 | 0 | 0 |
| 🟢 PASS | 1.42s | github.com/netboxlabs/diode/diode-server/ingester | 85.4% | 66 | 0 | 0 |
| 🟢 PASS | 1.11s | github.com/netboxlabs/diode/diode-server/matching | 94.1% | 66 | 0 | 0 |
| 🟢 PASS | 1.08s | github.com/netboxlabs/diode/diode-server/migrator | 70.4% | 4 | 0 | 0 |
| 🟢 PASS | 6.28s | github.com/netboxlabs/diode/diode-server/netboxdiodeplugin | 49.2% | 40 | 0 | 0 |
| 🟢 PASS | 0.17s | github.com/netboxlabs/diode/diode-server/pprof | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | 4.26s | github.com/netboxlabs/diode/diode-server/reconciler | 50.7% | 86 | 0 | 0 |
| 🟢 PASS | 1.02s | github.com/netboxlabs/diode/diode-server/reconciler/applier | 40.0% | 1 | 0 | 0 |
| 🟢 PASS | 0.09s | github.com/netboxlabs/diode/diode-server/reconciler/changeset | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | 1.09s | github.com/netboxlabs/diode/diode-server/reconciler/differ | 41.1% | 6 | 0 | 0 |
| 🟢 PASS | 1.02s | github.com/netboxlabs/diode/diode-server/server | 85.7% | 14 | 0 | 0 |
| 🟢 PASS | 1.01s | github.com/netboxlabs/diode/diode-server/strcase | 100.0% | 24 | 0 | 0 |
| 🟢 PASS | 1.02s | github.com/netboxlabs/diode/diode-server/telemetry | 28.0% | 26 | 0 | 0 |
| 🟢 PASS | 1.02s | github.com/netboxlabs/diode/diode-server/telemetry/otel | 90.2% | 25 | 0 | 0 |
| 🟢 PASS | 0.10s | github.com/netboxlabs/diode/diode-server/tls | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | 1.01s | github.com/netboxlabs/diode/diode-server/version | 100.0% | 2 | 0 | 0 |

Total coverage: 50.1%

…rker

Both AutoApplyProcessor.pollWorker and IngestionLogProcessor.pollWorker
opened each loop iteration with a non-blocking select on ctx.Done() to
exit on cancellation. The check is defensive but redundant: every
operation that follows in the body (backpressure(ctx), claim queries,
HTTP calls, the time.After fallback) is itself ctx-aware and will
surface a cancelled context within a single tick. Dropping the early
selects shortens the hot loop without changing cancellation semantics.

Per PR #526 review feedback from @MicahParks.
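
Shape-only illustration of the change (bodies elided; not the actual code):

```go
package reconciler

import "context"

// Before: a redundant non-blocking ctx check opened each iteration.
func pollWorkerBefore(ctx context.Context, step func(context.Context) bool) {
	for {
		select { // early exit check, redundant with the ctx-aware body
		case <-ctx.Done():
			return
		default:
		}
		if !step(ctx) { // backpressure, claim query, HTTP calls: all ctx-aware
			return
		}
	}
}

// After: the ctx-aware steps surface cancellation within one tick anyway.
func pollWorkerAfter(ctx context.Context, step func(context.Context) bool) {
	for step(ctx) {
	}
}
```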
