
fix(core): Resolve multi-main startup race condition in AuthRolesService #26176

Merged
afitzek merged 3 commits into master from iam-298-multi-main-startup-race-condition-in
Feb 24, 2026

Conversation

@afitzek
Contributor

@afitzek afitzek commented Feb 24, 2026

Summary

Fixes a multi-main startup race condition where concurrent n8n instances running AuthRolesService.syncScopes() against the same Postgres database hit duplicate key constraint violations.

  • Introduces a centralized DbLockService in @n8n/db that wraps Postgres advisory locks (pg_advisory_xact_lock) inside transactions, with a DbLock enum to prevent lock ID collisions
  • Supports an optional timeoutMs on withLock (via SET LOCAL lock_timeout) and a non-blocking tryWithLock (via pg_try_advisory_xact_lock) that fails fast if the lock is already held
  • On SQLite, advisory locks are skipped — the transaction alone provides serialization
  • Refactors AuthRolesService.init() to use DbLockService.withLock(DbLock.AUTH_ROLES_SYNC, ...) instead of repository-level calls
  • Removes the leader-only guard in start.ts — all main instances now call AuthRolesService.init(), and the advisory lock serializes them safely
  • Lock contention throws OperationalError (timeout exceeded or try-lock busy)
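
The bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Tx` and `Db` interfaces, the lock ID value `1001`, the local `OperationalError` class, and the mapping of SQLSTATE `55P03` to a timeout error are all assumptions made for the example. Only the SQL functions (`pg_advisory_xact_lock`, `pg_try_advisory_xact_lock`) and `SET LOCAL lock_timeout` come from the summary.

```typescript
// Hypothetical transaction handle; stands in for whatever the ORM passes to callbacks.
interface Tx {
  query(sql: string, params?: unknown[]): Promise<Array<Record<string, unknown>>>;
}

// Hypothetical database handle that runs a callback inside a single transaction.
interface Db {
  type: 'postgres' | 'sqlite';
  transaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T>;
}

// Stand-in for the DbLock enum: one central place for lock IDs (value is illustrative).
const DbLock = { AUTH_ROLES_SYNC: 1001 } as const;
type DbLockId = (typeof DbLock)[keyof typeof DbLock];

// Local stand-in for the OperationalError named in the summary.
class OperationalError extends Error {}

// SQLSTATE that Postgres reports when lock_timeout expires (lock_not_available).
const LOCK_NOT_AVAILABLE = '55P03';

class DbLockService {
  private readonly db: Db;

  constructor(db: Db) {
    this.db = db;
  }

  // Waits for the lock, then runs fn in the same transaction.
  // A transaction-level advisory lock is released when the transaction ends.
  async withLock<T>(
    lock: DbLockId,
    fn: (tx: Tx) => Promise<T>,
    options: { timeoutMs?: number } = {},
  ): Promise<T> {
    return this.db.transaction(async (tx) => {
      if (this.db.type === 'postgres') {
        const { timeoutMs } = options;
        if (timeoutMs !== undefined) {
          if (!Number.isInteger(timeoutMs) || timeoutMs < 0) {
            throw new RangeError('timeoutMs must be a non-negative integer');
          }
          // SET does not take bind parameters, so the validated integer is inlined.
          await tx.query(`SET LOCAL lock_timeout = ${timeoutMs}`);
        }
        try {
          await tx.query('SELECT pg_advisory_xact_lock($1)', [lock]);
        } catch (error) {
          if ((error as { code?: string } | null)?.code === LOCK_NOT_AVAILABLE) {
            throw new OperationalError('Timed out waiting for database lock');
          }
          throw error;
        }
      }
      // On SQLite no advisory lock is taken; the transaction is relied on instead.
      return fn(tx);
    });
  }

  // Fails fast with OperationalError if another session already holds the lock.
  async tryWithLock<T>(lock: DbLockId, fn: (tx: Tx) => Promise<T>): Promise<T> {
    return this.db.transaction(async (tx) => {
      if (this.db.type === 'postgres') {
        const rows = await tx.query('SELECT pg_try_advisory_xact_lock($1) AS acquired', [lock]);
        if (rows[0]?.acquired !== true) {
          throw new OperationalError('Database lock is held by another session');
        }
      }
      return fn(tx);
    });
  }
}
```

Because the lock is scoped to the transaction, there is no explicit unlock call in this sketch: committing or rolling back releases it, including when the callback throws.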

Test plan

  • 10 unit tests for DbLockService (withLock + tryWithLock, Postgres/SQLite, timeout, error propagation)
  • 8 integration tests against live database (transaction behavior, advisory lock serialization, timeout, try-lock contention)
  • 27 existing AuthRolesService unit tests updated and passing
  • 4 existing start.ts unit tests updated (non-leader now calls init)
  • Typecheck and lint clean on both @n8n/db and cli packages

Related Linear tickets, Github issues, and Community forum posts

closes https://linear.app/n8n/issue/IAM-298

Review / Merge checklist

  • PR title and summary are descriptive. (conventions)
  • Docs updated or follow-up ticket created.
  • Tests included.
  • PR Labeled with release/backport (if the PR is an urgent fix that needs to be backported)

@n8n-assistant n8n-assistant bot added the labels core (Enhancement outside /nodes-base and /editor-ui) and n8n team (Authored by the n8n team) Feb 24, 2026
@codecov

codecov bot commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 57.89474% with 24 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
...ackages/@n8n/db/src/services/auth.roles.service.ts | 8.00% | 23 Missing ⚠️
packages/@n8n/db/src/services/index.ts | 0.00% | 1 Missing ⚠️



@afitzek afitzek marked this pull request as ready for review February 24, 2026 12:44
@afitzek afitzek requested review from a team, BGZStephen, cstuncsik, guillaumejacquart and phyllis-noester and removed request for a team February 24, 2026 12:45
Contributor

@guillaumejacquart guillaumejacquart left a comment


Just a few questions

this.logger.debug('Initializing AuthRolesService...');
await this.syncScopes();
await this.syncRoles();
await this.dbLockService.withLock(DbLock.AUTH_ROLES_SYNC, async (tx) => {
Contributor


Why not use tryWithLock? I feel like it would be faster, and it would fail only if another main instance is doing the job?
Also, why not catch the OperationalError in case of a timeout? Do we want this to prevent the instance from starting?

Contributor Author


The OperationalError is only thrown when the timeout parameter is set, which it is not in this call. The reason for not using tryWithLock is that I want all instances to sync their changes, to avoid any potential misbehavior. The downside is that instances might wait for each other if they reach this point at the same time, but since this takes only a few ms, it should be fine during the bootstrap process.
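
To make the trade-off concrete, here is a small in-memory simulation of the two strategies. It is only a sketch: `InMemoryLock`, `startWithBlockingLock`, `startWithTryLock`, and the instance names are hypothetical and stand in for the Postgres advisory lock and the n8n main instances. No database is involved.

```typescript
// In-memory stand-in for a single advisory lock; illustrative only.
class InMemoryLock {
  private held = false;
  private waiters: Array<() => void> = [];

  // Blocking acquire: resolves once the lock is free, in arrival order.
  async acquire(): Promise<void> {
    if (!this.held) {
      this.held = true;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  // Non-blocking acquire: reports immediately whether the lock was taken.
  tryAcquire(): boolean {
    if (this.held) return false;
    this.held = true;
    return true;
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // hand the lock directly to the next waiter
    } else {
      this.held = false;
    }
  }
}

// Blocking strategy: every instance waits its turn, so every instance syncs.
async function startWithBlockingLock(instances: string[]): Promise<string[]> {
  const lock = new InMemoryLock();
  const synced: string[] = [];
  await Promise.all(
    instances.map(async (name) => {
      await lock.acquire();
      try {
        await Promise.resolve(); // simulated sync work
        synced.push(name);
      } finally {
        lock.release();
      }
    }),
  );
  return synced;
}

// Try-lock strategy: instances that find the lock busy skip the sync entirely.
async function startWithTryLock(
  instances: string[],
): Promise<{ synced: string[]; skipped: string[] }> {
  const lock = new InMemoryLock();
  const synced: string[] = [];
  const skipped: string[] = [];
  await Promise.all(
    instances.map(async (name) => {
      if (!lock.tryAcquire()) {
        skipped.push(name);
        return;
      }
      try {
        await Promise.resolve(); // simulated sync work
        synced.push(name);
      } finally {
        lock.release();
      }
    }),
  );
  return { synced, skipped };
}
```

With three instances arriving together, the blocking variant lets all three sync one after another, while the try-lock variant lets only the first one sync and the other two skip, which is the behavior the reply above wants to avoid.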

await this.syncScopes();
await this.syncRoles();
await this.dbLockService.withLock(DbLock.AUTH_ROLES_SYNC, async (tx) => {
await this.syncScopes(tx);
Contributor


just to make sure I understand: the reason why we want to do this whenever a new instance starts is that the instance start could be related to a deployment that either added or removed scopes. correct?

Contributor Author


Yes, correct. This approach saves us from adding a DB migration for every scope that we want to add.
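
As a rough illustration of why a startup sync can replace per-scope migrations, the sketch below compares the scopes declared in code with those stored in the database and derives what to insert and delete. How `syncScopes()` actually works is not shown in this conversation; `diffScopes` and the scope names in the usage note are hypothetical.

```typescript
// Hypothetical helper: compare scopes declared in code with slugs stored in the DB.
function diffScopes(
  declared: readonly string[],
  stored: readonly string[],
): { toInsert: string[]; toDelete: string[] } {
  const declaredSet = new Set(declared);
  const storedSet = new Set(stored);
  return {
    // Declared in this version but not yet in the database.
    toInsert: declared.filter((scope) => !storedSet.has(scope)),
    // Still in the database but no longer declared by this version.
    toDelete: stored.filter((scope) => !declaredSet.has(scope)),
  };
}
```

For example, with declared scopes `['workflow:read', 'workflow:update']` and stored scopes `['workflow:read', 'workflow:legacy']`, the helper reports `workflow:update` to insert and `workflow:legacy` to delete. Running it again after applying those changes yields two empty lists, so repeating the sync on every startup is harmless.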

Contributor


given that we don't do rolling updates, there is no scenario where there could be a race condition and the scopes remain in the old state right?
I was trying to think of edge cases e.g.:

  • a container restarts for a non-deployment-related reason
  • we deploy (milliseconds after)
  • container restart locks the table
  • deploy does not sync roles
    => scopes remain in old state

but I assume that would only be possible if we did rolling updates and even then it would be super unlikely

Contributor Author


If the container restarts, it acquires the advisory lock and syncs its roles. The deployment (the new version) reaches this point and waits for the advisory lock to be released. One of the new instances then acquires the lock and syncs its roles while the others keep waiting. So they pass this point one after another.

Contributor


ah I see, that's essentially the answer to Guillaume's question :)

@afitzek afitzek added this pull request to the merge queue Feb 24, 2026
Merged via the queue into master with commit 5a85a4f Feb 24, 2026
79 checks passed
@afitzek afitzek deleted the iam-298-multi-main-startup-race-condition-in branch February 24, 2026 18:29
@n8n-assistant n8n-assistant bot mentioned this pull request Mar 2, 2026
This was referenced Mar 3, 2026
@n8n-assistant
Contributor

n8n-assistant bot commented Mar 3, 2026

Got released with n8n@2.11.0


Labels

core (Enhancement outside /nodes-base and /editor-ui), n8n team (Authored by the n8n team), Released


3 participants