fix: Cap WorkerLock timeout intervals to 15 minutes (#19394)
jason-famedly wants to merge 6 commits into element-hq:develop
Conversation
…es, continue logging at durations greater than 10 minutes
changelog.d/19394.bugfix
Outdated
```diff
@@ -0,0 +1 @@
+Prevent excessively long numbers for the retry interval of `WorkerLock`s. Contributed by Famedly.
```
In #19390 (comment) (another Famedly PR), it was stated:
"I am submitting this PR as an employee of Famedly, who has signed the corporate CLA, and used my company email in the commit."
I assume the same applies here?
Yes, this is correct. Will we have to state this each time we upstream changes?
To explain here, we didn't have a corporate CLA signed from Famedly at the time, but we got confirmation today that this has now happened ⏩
The whole https://github.com/famedly org is now exempt from the CLA check, but you need to make your org membership public in order for the check to work.
synapse/handlers/worker_lock.py
Outdated
```python
self._retry_interval = min(Duration(minutes=15).as_secs(), next * 2)
if self._retry_interval > Duration(minutes=10).as_secs():  # >12 iterations
```
It would be nice to have these as constants `WORKER_LOCK_MAX_RETRY_INTERVAL` and `WORKER_LOCK_WARN_RETRY_INTERVAL` (perhaps a better name exists) so we can better describe and share these values.
I had actually considered that, before also considering that a more flexible approach for different locks may be worth exploring. For example, when a lock is taken because an event is being persisted, the retry interval could be capped to a much smaller value, and likewise for the logging of excessive times; whereas a lock for purging a room might start with a longer retry interval but keep the same cap.
Perhaps these could serve as defaults, if that exploration bears fruit. I shall add that to my notes for more work in this area, but I would rather do it separately.
Sounds good to tackle that in a separate PR.
In the meantime, I think my original suggestion still makes sense. We can also assert that `WORKER_LOCK_MAX_RETRY_INTERVAL > WORKER_LOCK_WARN_RETRY_INTERVAL` so we always get a warning about an excessive timeout.
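A minimal sketch of what this suggestion could look like. The constant names come from the review comment above; the values (in seconds) mirror `Duration(minutes=15)` and `Duration(minutes=10)` in the diff, but the exact placement in Synapse is an assumption:

```python
# Hypothetical module-level constants (names from the review suggestion);
# values in seconds, matching Duration(minutes=N).as_secs() in the diff.
WORKER_LOCK_MAX_RETRY_INTERVAL: float = 15 * 60   # cap retries at 15 minutes
WORKER_LOCK_WARN_RETRY_INTERVAL: float = 10 * 60  # warn past 10 minutes

# Guarantee a warning is always emitted before the interval hits the cap.
assert WORKER_LOCK_MAX_RETRY_INTERVAL > WORKER_LOCK_WARN_RETRY_INTERVAL
```

The assert makes the relationship between the two values self-documenting: misordering them would fail at import time rather than silently suppressing the warning.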
Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>
After the issue occurred again in our prod: my hypothesis would be that the issue is not primarily about the size of the growing timeout, but about the timeout being ignored at all? At least the logged timeout is not reflected in the timestamp deltas of the log lines.
Yes, there is more than one thing going on here. This fix (switch…
…only increase it when a timeout occurs
Added some additional work onto this:
This allows that a normal notification of another lock being released does not increment the timeout when a timeout has not actually occurred. It should cut down on what may otherwise end up being excessive log spam about locks having a long timeout duration when that is not true. The jitter value is maintained for the timeout, to help avoid a "thundering herd" situation when all locks may time out at the same time.
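The capped doubling plus jitter described here can be sketched as follows. This is an illustrative standalone function, not Synapse's actual implementation; the cap and warning threshold are assumed from the PR title and the diff:

```python
import random

MAX_TIMEOUT_SECS = 15 * 60   # assumed cap, per the PR title
WARN_TIMEOUT_SECS = 10 * 60  # assumed warning threshold, per the diff

def increment_timeout_interval(current: float) -> float:
    """Double the timeout, cap it, and apply +/-10% jitter."""
    nxt = min(MAX_TIMEOUT_SECS, current * 2)
    if nxt > WARN_TIMEOUT_SECS:
        print(f"Lock timeout is getting excessive: {nxt}s. There may be a deadlock.")
    # Jitter spreads out retries so workers whose timeouts expired together
    # do not all wake and contend at the same instant ("thundering herd").
    return nxt * random.uniform(0.9, 1.1)
```

Calling this only on a genuine timeout (not on every release notification) is what keeps the interval from growing spuriously.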
Title changed from "WorkerLock retry intervals to 15 minutes" to "WorkerLock timeout intervals to 15 minutes"
```python
# Only increment the timeout interval if this was an actual timeout
self._timeout_interval = self._increment_timeout_interval()
```
We should explain why in the comments:
Adjusted so that a timeout interval is only increased if an actual timeout was reached.
This allows that a normal notification of another lock being released does not increment the timeout when a timeout has not actually occurred. It should cut down on what may otherwise end up being excessive log spam about locks having a long timeout duration when that is not true.
-- @jason-famedly, #19394 (comment)
```python
def _increment_timeout_interval(self) -> float:
    next = self._timeout_interval
    next = min(Duration(minutes=15).as_secs(), next * 2)
    if next > Duration(minutes=10).as_secs():  # >12 iterations
        logger.warning(
            "Lock timeout is getting excessive: %ss. There may be a deadlock.",
            self._retry_interval,
            next,
        )
    return next * random.uniform(0.9, 1.1)
```
We should probably just keep track of the new `self._timeout_interval` in here instead of assigning it outside of this.
I think this makes even more sense now that this is named `_increment_timeout_interval`.
```python
def _increment_timeout_interval(self) -> float:
    next = self._timeout_interval
    next = min(Duration(minutes=15).as_secs(), next * 2)
    if next > Duration(minutes=10).as_secs():  # >12 iterations
        logger.warning(
            "Lock timeout is getting excessive: %ss. There may be a deadlock.",
            self._retry_interval,
            next,
        )
    return next * random.uniform(0.9, 1.1)
```
We should explain this in the comments:
The jitter value is maintained for the timeout, to help avoid a "thundering herd" situation when all locks may time out at the same time.
```python
        self._timeout_interval = self._increment_timeout_interval()
except Exception as e:
    logger.warning(
        "Caught an exception while waiting on WaitingLock: %r", e
```
For better context, we can also add the `self.lock_name` and `self.lock_key`:
```diff
-    "Caught an exception while waiting on WaitingLock: %r", e
+    "Caught an exception while waiting on WaitingLock(%s, %s): %r", self.lock_name, self.lock_key, e
```
```diff
 try:
-    # Wait until the we get notified the lock might have been
+    # Wait until the notification the lock might have been
```
```diff
-    # Wait until the notification the lock might have been
+    # Wait until the notification that the lock might have been
```
```python
# Wrap the WaitingLock object, so we can detect if the timeouts are being hit
with patch.object(
    lock2,
    "_increment_timeout_interval",
    wraps=lock2._increment_timeout_interval,
) as wrapped_lock2_increment_timeout_interval_method:
    d2 = defer.ensureDeferred(lock2.__aenter__())
    self.assertNoResult(d2)

    # The lock should not time out here
    wrapped_lock2_increment_timeout_interval_method.assert_not_called()
    self.get_success(lock1.__aexit__(None, None, None))

    self.get_success(d2)
    self.get_success(lock2.__aexit__(None, None, None))
```
The previous version of this test is only concerned that lock2 isn't acquired until lock1 releases.
I don't think we need to concern ourselves with internal details of timeouts here. We could have a separate test for this kind of thing but it doesn't have a ton of value.
```python
logger.warning(
    "Lock timeout is getting excessive: %ss. There may be a deadlock.",
    self._retry_interval,
    next,
)
```
This already exists but we can improve this message:
```diff
 logger.warning(
-    "Lock timeout is getting excessive: %ss. There may be a deadlock.",
+    "WaitingLock(%s, %s): We are having to wait a long time for the lock. Wait timeout is getting excessive: %ss. There may be a deadlock.",
     self._retry_interval,
     next,
 )
```
I think we could do one step better and instead track the time when we start trying to acquire the lock in `__aenter__`, so we can compare against the actual time we have been waiting overall. Then we can be more specific and say "We have been waiting over 10 minutes (excessive) to acquire the lock (%s, %s). There may be a deadlock."
Then it doesn't matter what's causing us to wait, and we get a more accurate picture.
Feel free to push this off to the future.
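A minimal sketch of the idea, using a hypothetical stand-in class rather than Synapse's real `WaitingLock`; the 10-minute threshold and the method names here are assumptions for illustration:

```python
import time
from typing import Optional

WARN_AFTER_SECS = 10 * 60  # assumed threshold (10 minutes)

class WaitingLockSketch:
    """Illustrative stand-in, not Synapse's actual WaitingLock."""

    def __init__(self, lock_name: str, lock_key: str) -> None:
        self.lock_name = lock_name
        self.lock_key = lock_key
        self._wait_started_at: Optional[float] = None

    def start_waiting(self) -> None:
        # Record when we first started trying to acquire the lock
        # (this would go at the top of __aenter__).
        self._wait_started_at = time.monotonic()

    def waited_too_long(self) -> bool:
        # Compare total elapsed wait, regardless of what caused each wake-up.
        elapsed = time.monotonic() - self._wait_started_at
        if elapsed > WARN_AFTER_SECS:
            print(
                "We have been waiting over 10 minutes (excessive) to acquire "
                f"the lock ({self.lock_name}, {self.lock_key}). There may be a deadlock."
            )
            return True
        return False
```

Measuring total elapsed wall-clock wait sidesteps the question of whether a given wake-up was a timeout or a notification; only the overall wait matters for the warning.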
```python
# Should be timed out 6 times, but do not fail on that exact count
wrapped_lock2_increment_timeout_interval_method.assert_called()
self.get_success(lock1.__aexit__(None, None, None))
```
Probably better to test whether we `try_acquire_lock` multiple times (what we actually care about).
Testing the internals isn't that useful.
Even more ideally, we wouldn't patch anything; instead, acquire and release a lock silently (without notification) and see whether the second lock is still able to acquire it some time later.
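The shape of such a notification-free test can be sketched with a plain `asyncio.Lock` standing in for Synapse's lock (the real test would use Synapse's test harness instead):

```python
import asyncio

async def demo() -> list:
    # asyncio.Lock standing in for Synapse's lock; the second acquirer polls
    # with short sleeps, mimicking retry-after-timeout when no release
    # notification ever arrives.
    lock = asyncio.Lock()
    events = []

    async def holder():
        async with lock:
            events.append("lock1 acquired")
            await asyncio.sleep(0.05)
        events.append("lock1 released silently")

    async def waiter():
        while lock.locked():       # retry loop instead of a notification
            await asyncio.sleep(0.01)
        async with lock:
            events.append("lock2 acquired")

    t1 = asyncio.create_task(holder())
    await asyncio.sleep(0)         # let the holder grab the lock first
    t2 = asyncio.create_task(waiter())
    await asyncio.gather(t1, t2)
    return events

events = asyncio.run(demo())
```

The observable behaviour (second acquirer eventually succeeds after a silent release) is asserted, without patching any internal method.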
Fixes the symptoms of #19315 but not the underlying reason causing the number to grow so large in the first place.
Copied from the original pull request on Famedly's Synapse repo (with some edits):
Basing the time interval around 5 seconds leaves a big window of waiting, especially as this window is doubled on each retry, while another worker could be making progress but cannot.
Right now, the retry interval in seconds looks like
[0.2, 5, 10, 20, 40, 80, 160, 320, (continues to double)]
after which logging should start about excessive times and (relatively quickly) end up with an extremely large retry interval with an unrealistic expectation past the heat death of the universe (1 year in seconds = 31,536,000). With this change, retry intervals in seconds should look more like:
Further suggested work in this area could be to define the cap, the retry interval starting point, and the multiplier depending on how frequently the lock should be checked. See the data below for reasons why. Increasing the jitter range may also be a good idea.
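The capped doubling described in this PR can be sketched numerically (jitter omitted for clarity; the 5-second starting point follows the sequence quoted above, and the cap comes from the PR title):

```python
CAP_SECS = 15 * 60  # the 15-minute cap this PR introduces

def next_interval(current: float) -> float:
    # Double the retry interval, but never beyond the cap.
    return min(CAP_SECS, current * 2)

intervals = []
current = 5.0
for _ in range(12):
    intervals.append(current)
    current = next_interval(current)

print(intervals)
# Without the cap, the same doubling from 5 s exceeds one year
# (31,536,000 s) after roughly 23 doublings.
```

After eight doublings the interval saturates at 900 seconds and stays there, instead of growing without bound.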