
fix: Cap WorkerLock timeout intervals to 15 minutes#19394

Open
jason-famedly wants to merge 6 commits intoelement-hq:developfrom
famedly:jason/cap-worker-locks

Conversation

@jason-famedly

@jason-famedly jason-famedly commented Jan 20, 2026

Fixes the symptoms of #19315 but not the underlying reason causing the number to grow so large in the first place.

ValueError: Exceeds the limit (4300 digits) for integer string conversion; use sys.set_int_max_str_digits() to increase the limit

Copied from the original pull request on Famedly's Synapse repo (with some edits):

Basing the time interval around 5 seconds leaves a big window of waiting, especially as this window is doubled on each retry, during which another worker could be making progress but cannot.

Right now, the retry interval in seconds looks like [0.2, 5, 10, 20, 40, 80, 160, 320, (continues to double)], after which logging about excessive times should start and (relatively quickly) the retry interval grows to an extremely large value, unrealistically far past the heat death of the universe. For scale, 1 year in seconds = 31,536,000.

With this change, retry intervals in seconds should look more like:

[
0.2, 
0.4, 
0.8, 
1.6, 
3.2, 
6.4, 
12.8, 
25.6, 
51.2, 
102.4,  # 1.7 minutes
204.8,  # 3.41 minutes
409.6,  # 6.83 minutes
819.2,  # 13.65 minutes  < logging about excessive times will start here, 13th iteration
900,  # 15 minutes < never goes higher than this
]
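The capped schedule above can be reproduced with a short sketch. The 0.2 s starting interval and the 15-minute (900 s) cap are the values from this PR; `backoff_schedule` itself is a hypothetical helper, not real Synapse code:

```python
START_SECS = 0.2
MAX_TIMEOUT_SECS = 15 * 60  # 900 s; the interval never grows past this


def backoff_schedule(steps: int = 14) -> list[float]:
    """Return the first `steps` timeout intervals: doubling, capped at 900 s."""
    intervals = []
    current = START_SECS
    for _ in range(steps):
        intervals.append(current)
        # Double each time, but never exceed the cap.
        current = min(MAX_TIMEOUT_SECS, current * 2)
    return intervals
```

Jitter (±10% in the PR) is left out here so the capped doubling is easy to see.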

Further suggested work in this area could be to define the cap, the retry interval starting point, and the multiplier depending on how frequently this lock should be checked. See data below for reasons why. Increasing the jitter range may also be a good idea.

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

@jason-famedly jason-famedly requested a review from a team as a code owner January 20, 2026 12:42
@CLAassistant

CLAassistant commented Jan 20, 2026

CLA assistant check
All committers have signed the CLA.

@@ -0,0 +1 @@
Prevent excessively long numbers for the retry interval of `WorkerLock`s. Contributed by Famedly.
Contributor


In #19390 (comment) (another Famedly PR),

I am submitting this PR as an employee of Famedly, who has signed the corporate CLA, and used my company email in the commit.

I assume the same applies here?

Author


Yes, this is correct. Will we have to state such each time we upstream changes?

Contributor


To explain here, we didn't have any corporate CLA signed from Famedly at the time but just got confirmation today that this has now happened ⏩

The whole https://github.com/famedly org is now exempt from the CLA check, but you need to make your org membership public for the check to work.

Comment on lines +278 to +279
self._retry_interval = min(Duration(minutes=15).as_secs(), next * 2)
if self._retry_interval > Duration(minutes=10).as_secs(): # >12 iterations
Contributor


It would be nice to have these as constants, WORKER_LOCK_MAX_RETRY_INTERVAL and WORKER_LOCK_WARN_RETRY_INTERVAL (perhaps a better name), so we can share and better describe these values.

Author


I had actually considered that, before also considering that a more flexible approach for different locks may be worth exploring. For example: when a lock is taken because an event is being persisted, the retry interval could be capped to a much smaller value, and the same for the logging of excessive times. Whereas a lock for purging a room might start with a longer retry interval but keep the cap the same.

Perhaps these constants could serve as defaults, if that exploration bears fruit. I shall add that to my notes for more work in this area, but I would rather do it separately.
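The per-lock tuning floated here could look something like the following sketch. Every name and value below is hypothetical, not existing Synapse code; it only illustrates the shape of "defaults plus per-lock overrides":

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockBackoffConfig:
    """Hypothetical per-lock backoff tuning knobs."""

    start_secs: float = 0.2       # initial timeout interval
    multiplier: float = 2.0       # growth factor per timeout
    cap_secs: float = 15 * 60     # hard ceiling on the interval
    warn_secs: float = 10 * 60    # start warning about excessive waits here


# e.g. event persistence might want a much tighter cap and earlier warning:
EVENT_PERSIST_BACKOFF = LockBackoffConfig(cap_secs=60, warn_secs=30)
```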

Contributor


Sounds good to tackle that in a separate PR.

In the meantime, I think my original suggestion still makes sense. And we can also assert that WORKER_LOCK_MAX_RETRY_INTERVAL > WORKER_LOCK_WARN_RETRY_INTERVAL so we always get a warning about excessive timeout.
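A minimal sketch of that suggestion, assuming the 15-minute cap and 10-minute warning threshold from the diff (the constant names come from this review thread, not from existing Synapse code):

```python
# Hard cap on how long a worker will wait before retrying the lock.
WORKER_LOCK_MAX_RETRY_INTERVAL = 15 * 60  # seconds

# Threshold past which we log "Lock timeout is getting excessive".
WORKER_LOCK_WARN_RETRY_INTERVAL = 10 * 60  # seconds

# Guarantee the warning threshold is reachable before the cap, so we
# always get a warning about an excessive timeout.
assert WORKER_LOCK_MAX_RETRY_INTERVAL > WORKER_LOCK_WARN_RETRY_INTERVAL
```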

Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>
@denzs

denzs commented Jan 23, 2026

After the issue occurred again in our prod:

2026-01-22 13:36:53.725 synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 5120s. There may be a deadlock.
2026-01-22 13:36:53.981 synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 10240s. There may be a deadlock.
2026-01-22 13:36:54.560 synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 20480s. There may be a deadlock.
2026-01-22 13:36:54.798 synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 40960s. There may be a deadlock.
2026-01-22 13:36:56.342 synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 81920s. There may be a deadlock.

My hypothesis would be: the issue is not primarily about the size of the growing timeout, but about the timeout being ignored altogether?

At least the logged timeout is not reflected in the timestamp deltas of the log lines?!

@jason-famedly
Author

After the issue occurred again in our prod:

2026-01-22 13:36:53.725 synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 5120s. There may be a deadlock.
2026-01-22 13:36:53.981 synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 10240s. There may be a deadlock.
2026-01-22 13:36:54.560 synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 20480s. There may be a deadlock.
2026-01-22 13:36:54.798 synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 40960s. There may be a deadlock.
2026-01-22 13:36:56.342 synapse.handlers.worker_lock - 280 - WARNING - sentinel - Lock timeout is getting excessive: 81920s. There may be a deadlock.

My hypothesis would be: the issue is not primarily about the size of the growing timeout, but about the timeout being ignored altogether?

At least the logged timeout is not reflected in the timestamp deltas of the log lines?!

Yes, there is more than one thing going on here. This fix (switching max() to min() and adjusting the iteration assumptions) only stops the obnoxiously long strings of numbers, trying to reach infinity, from being introduced in the first place. The underlying cause is something else: timeouts seem to not be honored, and it is unclear what the request that triggers the situation is doing to cause the locks to pile up and not make progress.

@jason-famedly
Author

Added some additional work onto this:

  1. Adjusted for the fact that the retry interval was actually for timeouts, by renaming it
  2. Adjusted so that a timeout interval is only increased if an actual timeout was reached
  3. Added a warning-level log to the generic Exception handler to see if it is even being hit
  4. Added/adjusted tests to check for timeouts being hit (there is a typo in one that I will fix after the first test series is run)

This ensures that a normal notification of another lock being released does not increment the timeout interval when a timeout has not actually occurred. It should cut down on what may otherwise end up being excessive log spam about locks having a long timeout duration when that is not true. The jitter value is maintained for the timeout, to help avoid a "thundering herd" situation where all locks may time out at the same time.
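As a hedged illustration of that last point, the ±10% jitter works like this (the range matches the `random.uniform(0.9, 1.1)` call in the PR; the helper function itself is ours):

```python
import random


def jittered(interval: float) -> float:
    # Spread retries over +/-10% of the nominal interval, so workers whose
    # timeouts expire at the same moment do not all retry in lockstep.
    return interval * random.uniform(0.9, 1.1)
```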

@jason-famedly jason-famedly changed the title fix: Cap WorkerLock retry intervals to 15 minutes fix: Cap WorkerLock timeout intervals to 15 minutes Jan 26, 2026
Comment on lines +254 to +255
# Only increment the timeout interval if this was an actual timeout
self._timeout_interval = self._increment_timeout_interval()
Contributor


We should explain why in the comments:

Adjusted that a timeout interval should only be increased if an actual timeout was reached

This allows that a normal notification of another lock being released should not increment a timeout, when a time out has not actually occurred. It should cut down on what may end up otherwise being excessive log spam about locks having a long timeout duration when such is not true.

-- @jason-famedly, #19394 (comment)


Comment on lines +280 to 288
def _increment_timeout_interval(self) -> float:
    next = self._timeout_interval
    next = min(Duration(minutes=15).as_secs(), next * 2)
    if next > Duration(minutes=10).as_secs():  # >12 iterations
        logger.warning(
            "Lock timeout is getting excessive: %ss. There may be a deadlock.",
            next,
        )
    return next * random.uniform(0.9, 1.1)
Contributor


We should probably just keep track of the new self._timeout_interval in here instead of assigning outside of this.

I think this makes even more sense now that this is named _increment_timeout_interval

Comment on lines +280 to 288
def _increment_timeout_interval(self) -> float:
    next = self._timeout_interval
    next = min(Duration(minutes=15).as_secs(), next * 2)
    if next > Duration(minutes=10).as_secs():  # >12 iterations
        logger.warning(
            "Lock timeout is getting excessive: %ss. There may be a deadlock.",
            next,
        )
    return next * random.uniform(0.9, 1.1)
Contributor


We should explain this in the comments:

The jitter value is maintained for the timeout, to help avoid a "thundering herd" situation when all locks may time out at the same time.

            self._timeout_interval = self._increment_timeout_interval()
        except Exception as e:
            logger.warning(
                "Caught an exception while waiting on WaitingLock: %r", e
Contributor


For better context, we can also add the self.lock_name and self.lock_key

Suggested change
"Caught an exception while waiting on WaitingLock: %r", e
"Caught an exception while waiting on WaitingLock(%s, %s): %r", self.lock_name, self.lock_key, e


try:
# Wait until the we get notified the lock might have been
# Wait until the notification the lock might have been
Contributor


Suggested change
# Wait until the notification the lock might have been
# Wait until the notification that the lock might have been


Comment on lines +52 to +66
# Wrap the WaitingLock object, so we can detect if the timeouts are being hit
with patch.object(
    lock2,
    "_increment_timeout_interval",
    wraps=lock2._increment_timeout_interval,
) as wrapped_lock2_increment_timeout_interval_method:
    d2 = defer.ensureDeferred(lock2.__aenter__())
    self.assertNoResult(d2)

    # The lock should not time out here
    wrapped_lock2_increment_timeout_interval_method.assert_not_called()
    self.get_success(lock1.__aexit__(None, None, None))

    self.get_success(d2)
    self.get_success(lock2.__aexit__(None, None, None))
Contributor


The previous version of this test is only concerned that lock2 isn't acquired until lock1 releases.

I don't think we need to concern ourselves with internal details of timeouts here. We could have a separate test for this kind of thing but it doesn't have a ton of value.

Comment on lines 284 to 287
logger.warning(
    "Lock timeout is getting excessive: %ss. There may be a deadlock.",
    next,
)
Contributor


This already exists but we can improve this message:

Suggested change
logger.warning(
"Lock timeout is getting excessive: %ss. There may be a deadlock.",
self._retry_interval,
next,
)
"WaitingLock(%s, %s): We are having to wait a long time for the lock. Wait timeout is getting excessive: %ss. There may be a deadlock.",

Contributor


I think we could do one step better and instead track the time when we start trying to acquire the lock in __aenter__ and be able to compare to the actual time we have been waiting overall. Then we can be more specific and say "We have been waiting over 10 minutes (excessive) to acquire the lock (%s, %s). There may be a deadlock."

Then it doesn't matter what's causing us to wait and we can get a more accurate picture.

Feel free to push this off to the future.
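A rough sketch of that idea, under stated assumptions: the 10-minute threshold mirrors the warning threshold in the diff, and the class and all names below are hypothetical, not existing Synapse code:

```python
import time

EXCESSIVE_WAIT_SECS = 10 * 60  # hypothetical "excessive wait" threshold


class WaitTracker:
    """Record when we started trying to acquire the lock, so total elapsed
    wait can be checked regardless of how many individual timeouts or
    wake-up notifications happened in between."""

    def __init__(self) -> None:
        # monotonic() is immune to wall-clock adjustments, so the measured
        # wait cannot go negative or jump.
        self._started_at = time.monotonic()

    def waited_too_long(self) -> bool:
        # True once the total wall-clock wait exceeds the threshold.
        return time.monotonic() - self._started_at > EXCESSIVE_WAIT_SECS
```

This would let the warning say how long the caller has actually been waiting, rather than reporting the size of the next timeout interval.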

Comment on lines +87 to +89
# Should be timed out 6 times, but do not fail on that exact count
wrapped_lock2_increment_timeout_interval_method.assert_called()
self.get_success(lock1.__aexit__(None, None, None))
Contributor


Probably better to test whether we try_acquire_lock multiple times (what we actually care about).

Testing the internals isn't that useful.

Even more ideally, we wouldn't patch anything and instead acquire and release a lock silently (without notification) and see whether the second lock is still able to acquire some time later.
