fix: Cap WorkerLock timeout intervals to 15 minutes (#19394)
jason-famedly wants to merge 6 commits into element-hq:develop
Conversation
…es, continue logging at durations greater than 10 minutes
changelog.d/19394.bugfix
Outdated
```diff
@@ -0,0 +1 @@
+Prevent excessively long numbers for the retry interval of `WorkerLock`s. Contributed by Famedly.
```
In #19390 (comment) (another Famedly PR), it was stated:
"I am submitting this PR as an employee of Famedly, who has signed the corporate CLA, and used my company email in the commit."
I assume the same applies here?
Yes, this is correct. Will we have to state this each time we upstream changes?
To explain here, we didn't have a corporate CLA signed from Famedly at the time, but we got confirmation today that this has now happened ⏩
The whole https://github.com/famedly org is now exempt from the CLA check, but you need to make your org membership public in order for the check to work.
synapse/handlers/worker_lock.py
Outdated
```python
self._retry_interval = min(Duration(minutes=15).as_secs(), next * 2)
if self._retry_interval > Duration(minutes=10).as_secs():  # >12 iterations
```
It would be nice to have these as constants `WORKER_LOCK_MAX_RETRY_INTERVAL` and `WORKER_LOCK_WARN_RETRY_INTERVAL` (perhaps a better name exists) so we can better describe and share these values.
I had actually considered that, before also considering that a more flexible approach for different locks may be worth exploring. For example, when a lock is taken because an event is being persisted, the retry interval could be capped to a much smaller value, and likewise for the logging of excessive times; whereas a lock for purging a room might start with a longer retry interval but keep the same cap.
Perhaps these could serve as defaults, if that exploration bears fruit. I shall add that to my notes for more work in this area, but I would rather do it separately.
Sounds good to tackle that in a separate PR.
In the meantime, I think my original suggestion still makes sense. We can also assert that `WORKER_LOCK_MAX_RETRY_INTERVAL > WORKER_LOCK_WARN_RETRY_INTERVAL` so we always get a warning about an excessive timeout.
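A minimal sketch of what this suggestion could look like. The constant names come from the review comment above; the values (in seconds) mirror `Duration(minutes=15)` and `Duration(minutes=10)` in the diff, but the exact placement in Synapse is an assumption:

```python
# Hypothetical module-level constants (names from the review suggestion);
# values in seconds, matching Duration(minutes=N).as_secs() in the diff.
WORKER_LOCK_MAX_RETRY_INTERVAL: float = 15 * 60   # cap retries at 15 minutes
WORKER_LOCK_WARN_RETRY_INTERVAL: float = 10 * 60  # warn past 10 minutes

# Guarantee a warning is always emitted before the interval hits the cap.
assert WORKER_LOCK_MAX_RETRY_INTERVAL > WORKER_LOCK_WARN_RETRY_INTERVAL
```

The assert makes the relationship between the two values self-documenting: misordering them would fail at import time rather than silently suppressing the warning.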
Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>
After the issue occurred again in our prod: my hypothesis would be that the issue is not primarily about the size of the growing timeout, but about the timeout being ignored at all? At least the logged timeout is not reflected in the timestamp deltas of the log lines.
Yes, there is more than one thing going on here. This fix (switch…
…only increase it when a timeout occurs
Added some additional work onto this:
This allows that a normal notification of another lock being released does not increment the timeout when a timeout has not actually occurred. It should cut down on what may otherwise end up being excessive log spam about locks having a long timeout duration when that is not true. The jitter value is maintained for the timeout, to help avoid a "thundering herd" situation when all locks may time out at the same time.
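The capped doubling plus jitter described here can be sketched as follows. This is an illustrative standalone function, not Synapse's actual implementation; the cap and warning threshold are assumed from the PR title and the diff:

```python
import random

MAX_TIMEOUT_SECS = 15 * 60   # assumed cap, per the PR title
WARN_TIMEOUT_SECS = 10 * 60  # assumed warning threshold, per the diff

def increment_timeout_interval(current: float) -> float:
    """Double the timeout, cap it, and apply +/-10% jitter."""
    nxt = min(MAX_TIMEOUT_SECS, current * 2)
    if nxt > WARN_TIMEOUT_SECS:
        print(f"Lock timeout is getting excessive: {nxt}s. There may be a deadlock.")
    # Jitter spreads out retries so workers whose timeouts expired together
    # do not all wake and contend at the same instant ("thundering herd").
    return nxt * random.uniform(0.9, 1.1)
```

Calling this only on a genuine timeout (not on every release notification) is what keeps the interval from growing spuriously.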
Title changed from "WorkerLock retry intervals to 15 minutes" to "WorkerLock timeout intervals to 15 minutes"
```python
# Only increment the timeout interval if this was an actual timeout
self._timeout_interval = self._increment_timeout_interval()
```
We should explain why in the comments:
Adjusted so that a timeout interval is only increased if an actual timeout was reached.
This allows that a normal notification of another lock being released does not increment the timeout when a timeout has not actually occurred. It should cut down on what may otherwise end up being excessive log spam about locks having a long timeout duration when that is not true.
-- @jason-famedly, #19394 (comment)
```python
def _increment_timeout_interval(self) -> float:
    next = self._timeout_interval
    next = min(Duration(minutes=15).as_secs(), next * 2)
    if next > Duration(minutes=10).as_secs():  # >12 iterations
        logger.warning(
            "Lock timeout is getting excessive: %ss. There may be a deadlock.",
            self._retry_interval,
            next,
        )
    return next * random.uniform(0.9, 1.1)
```
We should probably just keep track of the new `self._timeout_interval` in here instead of assigning it outside of this.
I think this makes even more sense now that this is named `_increment_timeout_interval`.
```python
def _increment_timeout_interval(self) -> float:
    next = self._timeout_interval
    next = min(Duration(minutes=15).as_secs(), next * 2)
    if next > Duration(minutes=10).as_secs():  # >12 iterations
        logger.warning(
            "Lock timeout is getting excessive: %ss. There may be a deadlock.",
            self._retry_interval,
            next,
        )
    return next * random.uniform(0.9, 1.1)
```
We should explain this in the comments:
The jitter value is maintained for the timeout, to help avoid a "thundering herd" situation when all locks may time out at the same time.
```python
        self._timeout_interval = self._increment_timeout_interval()
except Exception as e:
    logger.warning(
        "Caught an exception while waiting on WaitingLock: %r", e
```
For better context, we can also add the `self.lock_name` and `self.lock_key`:
```diff
-    "Caught an exception while waiting on WaitingLock: %r", e
+    "Caught an exception while waiting on WaitingLock(%s, %s): %r", self.lock_name, self.lock_key, e
```
```diff
 try:
-    # Wait until the we get notified the lock might have been
+    # Wait until the notification the lock might have been
```
```diff
-    # Wait until the notification the lock might have been
+    # Wait until the notification that the lock might have been
```
```python
# Wrap the WaitingLock object, so we can detect if the timeouts are being hit
with patch.object(
    lock2,
    "_increment_timeout_interval",
    wraps=lock2._increment_timeout_interval,
) as wrapped_lock2_increment_timeout_interval_method:
    d2 = defer.ensureDeferred(lock2.__aenter__())
    self.assertNoResult(d2)

    # The lock should not time out here
    wrapped_lock2_increment_timeout_interval_method.assert_not_called()
    self.get_success(lock1.__aexit__(None, None, None))

    self.get_success(d2)
    self.get_success(lock2.__aexit__(None, None, None))
```
The previous version of this test is only concerned that lock2 isn't acquired until lock1 releases.
I don't think we need to concern ourselves with internal details of timeouts here. We could have a separate test for this kind of thing but it doesn't have a ton of value.
```python
logger.warning(
    "Lock timeout is getting excessive: %ss. There may be a deadlock.",
    self._retry_interval,
    next,
)
```
This already exists but we can improve this message:
```diff
 logger.warning(
-    "Lock timeout is getting excessive: %ss. There may be a deadlock.",
+    "WaitingLock(%s, %s): We are having to wait a long time for the lock. Wait timeout is getting excessive: %ss. There may be a deadlock.",
     self._retry_interval,
     next,
 )
```
I think we could do one step better and instead track the time when we start trying to acquire the lock in `__aenter__`, so we can compare against the actual time we have been waiting overall. Then we can be more specific and say "We have been waiting over 10 minutes (excessive) to acquire the lock (%s, %s). There may be a deadlock."
Then it doesn't matter what's causing us to wait, and we get a more accurate picture.
Feel free to push this off to the future.
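A minimal sketch of the idea, using a hypothetical stand-in class rather than Synapse's real `WaitingLock`; the 10-minute threshold and the method names here are assumptions for illustration:

```python
import time
from typing import Optional

WARN_AFTER_SECS = 10 * 60  # assumed threshold (10 minutes)

class WaitingLockSketch:
    """Illustrative stand-in, not Synapse's actual WaitingLock."""

    def __init__(self, lock_name: str, lock_key: str) -> None:
        self.lock_name = lock_name
        self.lock_key = lock_key
        self._wait_started_at: Optional[float] = None

    def start_waiting(self) -> None:
        # Record when we first started trying to acquire the lock
        # (this would go at the top of __aenter__).
        self._wait_started_at = time.monotonic()

    def waited_too_long(self) -> bool:
        # Compare total elapsed wait, regardless of what caused each wake-up.
        elapsed = time.monotonic() - self._wait_started_at
        if elapsed > WARN_AFTER_SECS:
            print(
                "We have been waiting over 10 minutes (excessive) to acquire "
                f"the lock ({self.lock_name}, {self.lock_key}). There may be a deadlock."
            )
            return True
        return False
```

Measuring total elapsed wall-clock wait sidesteps the question of whether a given wake-up was a timeout or a notification; only the overall wait matters for the warning.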
```python
# Should be timed out 6 times, but do not fail on that exact count
wrapped_lock2_increment_timeout_interval_method.assert_called()
self.get_success(lock1.__aexit__(None, None, None))
```
Probably better to test whether we `try_acquire_lock` multiple times (what we actually care about).
Testing the internals isn't that useful.
Even more ideally, we wouldn't patch anything; instead, acquire and release a lock silently (without notification) and see whether the second lock is still able to acquire it some time later.
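The shape of such a notification-free test can be sketched with a plain `asyncio.Lock` standing in for Synapse's lock (the real test would use Synapse's test harness instead):

```python
import asyncio

async def demo() -> list:
    # asyncio.Lock standing in for Synapse's lock; the second acquirer polls
    # with short sleeps, mimicking retry-after-timeout when no release
    # notification ever arrives.
    lock = asyncio.Lock()
    events = []

    async def holder():
        async with lock:
            events.append("lock1 acquired")
            await asyncio.sleep(0.05)
        events.append("lock1 released silently")

    async def waiter():
        while lock.locked():       # retry loop instead of a notification
            await asyncio.sleep(0.01)
        async with lock:
            events.append("lock2 acquired")

    t1 = asyncio.create_task(holder())
    await asyncio.sleep(0)         # let the holder grab the lock first
    t2 = asyncio.create_task(waiter())
    await asyncio.gather(t1, t2)
    return events

events = asyncio.run(demo())
```

The observable behaviour (second acquirer eventually succeeds after a silent release) is asserted, without patching any internal method.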
Fixes the symptoms of #19315 but not the underlying reason causing the number to grow so large in the first place.
Copied from the original pull request on Famedly's Synapse repo (with some edits):
Basing the time interval around 5 seconds leaves a big window of waiting, especially as this window is doubled on each retry, while another worker could be making progress but cannot.
Right now, the retry interval in seconds looks like
[0.2, 5, 10, 20, 40, 80, 160, 320, (continues to double)]
after which logging should start about excessive times and (relatively quickly) end up with an extremely large retry interval with an unrealistic expectation past the heat death of the universe (1 year in seconds = 31,536,000). With this change, retry intervals in seconds should look more like:
Further suggested work in this area could be to define the cap, the retry interval starting point, and the multiplier depending on how frequently the lock should be checked. See the data below for reasons why. Increasing the jitter range may also be a good idea.
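The capped doubling described in this PR can be sketched numerically (jitter omitted for clarity; the 5-second starting point follows the sequence quoted above, and the cap comes from the PR title):

```python
CAP_SECS = 15 * 60  # the 15-minute cap this PR introduces

def next_interval(current: float) -> float:
    # Double the retry interval, but never beyond the cap.
    return min(CAP_SECS, current * 2)

intervals = []
current = 5.0
for _ in range(12):
    intervals.append(current)
    current = next_interval(current)

print(intervals)
# Without the cap, the same doubling from 5 s exceeds one year
# (31,536,000 s) after roughly 23 doublings.
```

After eight doublings the interval saturates at 900 seconds and stays there, instead of growing without bound.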