fix(core): Resolve multi-main startup race condition in AuthRolesService #26176
Conversation
guillaumejacquart left a comment:
Just a few questions
    this.logger.debug('Initializing AuthRolesService...');
    await this.syncScopes();
    await this.syncRoles();
    await this.dbLockService.withLock(DbLock.AUTH_ROLES_SYNC, async (tx) => {
Why not use `tryWithLock`? I feel like it would be faster, and it would fail only if another main instance is already doing the job.
Also, why not catch the `OperationalError` in case of timeout? Do we want this to prevent the instance from starting?
The `OperationalError` is only thrown when the timeout parameter is set, which it is not in this call. The thinking behind not using `tryWithLock` is that I want all instances to sync their changes to avoid any potential misbehavior. The downside is that instances might wait for each other if they reach this point at the same time, but since that only takes a few ms, it should be fine during the bootstrap process.
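The difference between the two variants can be sketched with an in-memory stand-in. Only the `withLock`/`tryWithLock` names come from this PR; the class and demo below are hypothetical, not n8n's actual `DbLockService`. `withLock` queues behind the current holder, while `tryWithLock` gives up immediately:

```typescript
// In-memory stand-in for the two lock variants discussed above (hypothetical).
class FakeLock {
  private queue: Promise<void> = Promise.resolve();
  private held = false;

  // Blocking variant: wait for the current holder, like pg_advisory_xact_lock.
  async withLock<T>(fn: () => Promise<T>): Promise<T> {
    const run = this.queue.then(async () => {
      this.held = true;
      try {
        return await fn();
      } finally {
        this.held = false;
      }
    });
    this.queue = run.then(
      () => undefined,
      () => undefined,
    );
    return run;
  }

  // Non-blocking variant: fail fast if held, like pg_try_advisory_xact_lock.
  async tryWithLock<T>(fn: () => Promise<T>): Promise<T | null> {
    if (this.held) return null; // another "instance" is doing the job
    return this.withLock(fn);
  }
}

// Two "instances" reach the sync point at the same time.
async function demo(): Promise<{ order: string[]; skipped: boolean }> {
  const lock = new FakeLock();
  const order: string[] = [];
  const first = lock.withLock(async () => {
    await new Promise<void>((resolve) => setTimeout(resolve, 20));
    order.push('first sync');
  });
  await new Promise<void>((resolve) => setTimeout(resolve, 0)); // let it acquire
  const tried = await lock.tryWithLock(async () => order.push('skipped'));
  const second = lock.withLock(async () => {
    order.push('second sync');
  });
  await Promise.all([first, second]);
  return { order, skipped: tried === null };
}
```

With `tryWithLock`, the second instance would skip the sync entirely; with `withLock`, both instances run it, one after the other, which is the behavior the author wanted here.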
    await this.syncScopes();
    await this.syncRoles();
    await this.dbLockService.withLock(DbLock.AUTH_ROLES_SYNC, async (tx) => {
        await this.syncScopes(tx);
Just to make sure I understand: the reason we want to do this whenever a new instance starts is that the instance start could be related to a deployment that either added or removed scopes. Correct?
Yes, correct. This approach saves us from adding a DB migration for every scope that we want to add.
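A toy sketch of why a startup sync makes per-scope migrations unnecessary. Everything here (the `Set`-based table, the `reconcileScopes` name) is a hypothetical illustration, not n8n's actual `syncScopes`: the scopes defined in the deployed code are reconciled against whatever the database currently holds, so adding or removing a scope in code is picked up on the next boot.

```typescript
// Hypothetical reconciliation of code-defined scopes vs. the DB's current rows.
type ScopeTable = Set<string>;

function reconcileScopes(
  db: ScopeTable,
  codeDefined: string[],
): { added: string[]; removed: string[] } {
  const wanted = new Set(codeDefined);
  const added = codeDefined.filter((s) => !db.has(s));
  const removed = [...db].filter((s) => !wanted.has(s));
  for (const s of added) db.add(s); // like INSERT ... ON CONFLICT DO NOTHING
  for (const s of removed) db.delete(s); // drop scopes only the old deployment defined
  return { added, removed };
}
```

Because the reconciliation is idempotent, it is safe for every starting instance to run it; the advisory lock's job is only to stop two instances from inserting the same missing scope at the same time.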
Given that we don't do rolling updates, there is no scenario where there could be a race condition and the scopes remain in the old state, right?
I was trying to think of edge cases, e.g.:
- a container restarts for a non-deployment-related reason
- we deploy (milliseconds after)
- the container restart locks the table
- the deploy does not sync roles
=> scopes remain in the old state
But I assume that would only be possible if we did rolling updates, and even then it would be super unlikely.
If the container restarts, it acquires the advisory lock and syncs its roles. The deployment (the new versions) reaches this point and waits for the advisory lock to be released. One of the new instances then acquires the lock and syncs its roles while the others keep waiting. So the instances pass this point one after another.
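The waiting behaviour comes from Postgres itself: `pg_advisory_xact_lock` blocks until the lock is free and releases it automatically when the enclosing transaction ends. A hedged sketch of what such a wrapper could issue (the `QueryRunner` interface and `withAdvisoryLock` helper are hypothetical, not the PR's actual `DbLockService` code; the two SQL functions and `SET LOCAL lock_timeout` are real Postgres features):

```typescript
// Hypothetical wrapper around a transaction-scoped Postgres advisory lock.
interface QueryRunner {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

async function withAdvisoryLock<T>(
  tx: QueryRunner, // must already be inside BEGIN ... COMMIT
  lockId: number,
  fn: (tx: QueryRunner) => Promise<T>,
  timeoutMs?: number,
): Promise<T> {
  if (timeoutMs !== undefined) {
    // SET LOCAL lasts only until the enclosing transaction ends; exceeding the
    // timeout makes Postgres raise an error instead of waiting forever.
    await tx.query(`SET LOCAL lock_timeout = '${timeoutMs}ms'`);
  }
  // Blocks until the lock is free; released automatically at COMMIT/ROLLBACK.
  await tx.query('SELECT pg_advisory_xact_lock($1)', [lockId]);
  return fn(tx);
}
```

Because the lock is transaction-scoped, a crashed instance can never leave it dangling: its transaction rolls back and the next waiter proceeds.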
Ah, I see, that's essentially the answer to Guillaume's question :)
Got released.
Summary
Fixes a multi-main startup race condition where concurrent n8n instances running `AuthRolesService.syncScopes()` against the same Postgres database hit duplicate key constraint violations.

- Adds a `DbLockService` in `@n8n/db` that wraps Postgres advisory locks (`pg_advisory_xact_lock`) inside transactions, with a `DbLock` enum to prevent lock ID collisions
- Supports an optional `timeoutMs` on `withLock` (via `SET LOCAL lock_timeout`) and a non-blocking `tryWithLock` (via `pg_try_advisory_xact_lock`) that fails fast if the lock is already held
- Updates `AuthRolesService.init()` to use `DbLockService.withLock(DbLock.AUTH_ROLES_SYNC, ...)` instead of repository-level calls
- Updates `start.ts`: all main instances now call `AuthRolesService.init()`, and the advisory lock serializes them safely
- Failures surface as `OperationalError` (timeout exceeded or try-lock busy)

Test plan
- Unit tests for `DbLockService` (`withLock` + `tryWithLock`, Postgres/SQLite, timeout, error propagation)
- `AuthRolesService` unit tests updated and passing
- `start.ts` unit tests updated (non-leader now calls init)
- Test runs for the `@n8n/db` and `cli` packages

Related Linear tickets, Github issues, and Community forum posts
closes https://linear.app/n8n/issue/IAM-298
Review / Merge checklist
- PR labeled with `release/backport` (if the PR is an urgent fix that needs to be backported)