fix: Adjust timing on WorkerLocks to allow faster resolution of the lock #221
jason-famedly wants to merge 1 commit into master
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff           @@
##           master     #221   +/-   ##
=======================================
  Coverage   80.14%   80.14%
=======================================
  Files         498      498
  Lines       71133    71133
  Branches    10683    10683
=======================================
+ Hits        57007    57009    +2
+ Misses      10886    10885    -1
+ Partials     3240     3239    -1
```

... and 3 files with indirect coverage changes
synapse/handlers/worker_lock.py (Outdated)
```diff
 def _get_next_retry_interval(self) -> float:
     next = self._retry_interval
-    self._retry_interval = max(5, next * 2)
+    self._retry_interval = min(5.0, next * 2)
```
With this change, retry intervals in seconds should look more like [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 5.0, 5.0, 5.0, (stays at 5.0)]
Any interest in upstreaming this change?
It was mentioned as part of the fix for element-hq/synapse#19315 to prevent getting to the point that the number grows so large that we see `ValueError: Exceeds the limit (4300 digits) for integer string conversion; use sys.set_int_max_str_digits() to increase the limit`.
The only concern with a low value of 5.0 seconds is that it increases the amount of database activity from all of the workers checking the lock more often. It's unclear whether the increased database activity is even a problem in practice. At the very least, we could introduce some high maximum like 15 minutes, which is higher than the 10-minute sanity check we have, so we could still notice this failure mode:
synapse/handlers/worker_lock.py, lines 279 to 283 at b958daf
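A minimal sketch of that compromise (class and constant names here are assumptions for illustration, not Synapse's actual implementation): keep doubling the interval but cap it at a high ceiling like 15 minutes, which stays above the 10-minute sanity threshold so excessively long waits are still observable:

```python
import logging

logger = logging.getLogger(__name__)

MAX_RETRY_INTERVAL = 15 * 60.0  # 15 minutes, the suggested high cap
SANITY_CHECK_SECS = 10 * 60.0   # the existing 10-minute "excessive wait" check


class CappedBackoff:
    """Illustrative doubling backoff with a ceiling (hypothetical API)."""

    def __init__(self, initial: float = 0.1) -> None:
        self._retry_interval = initial
        self._total_wait = 0.0

    def _get_next_retry_interval(self) -> float:
        current = self._retry_interval
        # Double the interval each attempt, but never exceed the cap.
        self._retry_interval = min(MAX_RETRY_INTERVAL, current * 2)
        self._total_wait += current
        # Because the cap (15 min) is above the sanity threshold (10 min),
        # a long wait can still trip the excessive-wait warning.
        if self._total_wait > SANITY_CHECK_SECS:
            logger.warning("Lock wait exceeded %s seconds", SANITY_CHECK_SECS)
        return current
```

This keeps the unbounded-growth failure mode impossible while leaving the warning path reachable.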
Hello! I think it would be great to upstream this. I have compiled some information on the declarations of WorkerLocks, as I agree that a ceiling of 5 seconds is much too low in general. I will leave the information here in another comment, but we are still investigating whether it is possible to submit to upstream from this repo.
```python
NEW_EVENT_DURING_PURGE_LOCK_NAME = "new_event_during_purge_lock"
PURGE_PAGINATION_LOCK_NAME = "purge_pagination_lock"
ONE_TIME_KEY_UPLOAD = "one_time_key_upload_lock"
```
- `federation/federation_server.FederationServer._process_incoming_pdus_in_room_inner()`
  - lock: `acquire_read_write_lock()`
  - task: `NEW_EVENT_DURING_PURGE_LOCK_NAME`
  - lock_key: room_id
  - write: False
  - note: a `LockStore.Lock(_INBOUND_EVENT_HANDLING_LOCK_NAME, room_id)` is taken before this is called (from `_handle_received_pdu()`). I believe this is a background task
- `handlers/e2e_keys.E2eKeysHandler._upload_one_time_keys_for_user()`
  - lock: `acquire_lock()`
  - task: `ONE_TIME_KEY_UPLOAD`
  - lock_key: `f"{user_id}_{device_id}"`
- `handlers/federation.FederationHandler.maybe_backfill()`
  - lock: `acquire_read_write_lock()`
  - task: `PURGE_PAGINATION_LOCK_NAME`
  - lock_key: room_id
  - write: False
  - note: Inside a `Linearizer` context manager keyed to room_id
- `handlers/message.EventCreationHandler.create_and_send_nonmember_event()`
  - lock: `acquire_read_write_lock()`
  - task: `NEW_EVENT_DURING_PURGE_LOCK_NAME`
  - lock_key: room_id
  - write: False
- `handlers/message.EventCreationHandler._send_dummy_events_to_fill_extremities()`
  - lock: `acquire_read_write_lock()`
  - task: `NEW_EVENT_DURING_PURGE_LOCK_NAME`
  - lock_key: room_id
  - write: False
- `handlers/pagination.PaginationHandler.purge_history()`
  - lock: `acquire_read_write_lock()`
  - task: `PURGE_PAGINATION_LOCK_NAME`
  - lock_key: room_id
  - write: True
- `handlers/pagination.PaginationHandler.purge_room()`
  - lock: `acquire_multi_read_write_lock()`
  - tasks: `PURGE_PAGINATION_LOCK_NAME`, `NEW_EVENT_DURING_PURGE_LOCK_NAME`
  - lock_key: room_id (for both tasks)
  - write: True
- `handlers/room_member.RoomMemberHandler.update_membership()`
  - lock: `acquire_read_write_lock()`
  - task: `NEW_EVENT_DURING_PURGE_LOCK_NAME`
  - lock_key: room_id
  - write: False
  - note: This is inside two `Linearizer`s, one for a given room and another limiting appservice member changes to no more than 10 simultaneous
- `rest/client/room_upgrade_rest_servlet.RoomUpgradeRestServlet.on_POST()`
  - lock: `acquire_read_write_lock()`
  - task: `NEW_EVENT_DURING_PURGE_LOCK_NAME`
  - lock_key: room_id
  - write: False
- `storage/controllers/persist_events.EventsPersistenceStorageController._process_event_persist_queue_task()`
  - lock: `acquire_read_write_lock()`
  - task: `NEW_EVENT_DURING_PURGE_LOCK_NAME`
  - lock_key: room_id
  - write: False
Of those, these are more likely to be high-frequency attempts:

- `federation/federation_server.FederationServer._process_incoming_pdus_in_room_inner()`
- `handlers/e2e_keys.E2eKeysHandler._upload_one_time_keys_for_user()`
- `handlers/message.EventCreationHandler.create_and_send_nonmember_event()`
- `handlers/message.EventCreationHandler._send_dummy_events_to_fill_extremities()`
- `handlers/room_member.RoomMemberHandler.update_membership()`
- `storage/controllers/persist_events.EventsPersistenceStorageController._process_event_persist_queue_task()`

These are possibly between medium and high frequency:

- `handlers/federation.FederationHandler.maybe_backfill()`
- `rest/client/room_upgrade_rest_servlet.RoomUpgradeRestServlet.on_POST()`

These are possibly low volume:

- `handlers/pagination.PaginationHandler.purge_history()`
- `handlers/pagination.PaginationHandler.purge_room()`
It would probably be helpful to allow a customizable retry interval to be passed into the WorkerLock on creation.
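A rough sketch of that suggestion (the parameter names `initial_retry_interval`, `max_retry_interval`, and `multiplier` are invented here for illustration and do not exist in Synapse): each lock declaration could tune its own backoff, so a high-frequency lock like `ONE_TIME_KEY_UPLOAD` could keep a low cap while a purge lock tolerates a much higher one.

```python
import random


class WaitingLock:
    """Hypothetical per-lock backoff configuration (not Synapse's real API)."""

    def __init__(
        self,
        initial_retry_interval: float = 0.1,
        max_retry_interval: float = 900.0,  # 15 minutes, per this PR's cap
        multiplier: float = 2.0,
    ) -> None:
        self._max = max_retry_interval
        self._multiplier = multiplier
        self._retry_interval = initial_retry_interval

    def _get_next_retry_interval(self) -> float:
        current = self._retry_interval
        # Grow by the configured multiplier, bounded by the configured cap.
        self._retry_interval = min(self._max, current * self._multiplier)
        # Jitter so many waiting workers don't all hit the database at once.
        return current * random.uniform(0.9, 1.1)
```

A frequently contended lock might then be declared with, say, `WaitingLock(initial_retry_interval=0.05, max_retry_interval=1.0)`, while `purge_room` could use the defaults.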
…ch attempt and cap at 15 minutes
Pushing this to upstream instead

For easier reference -> element-hq/synapse#19394 (linked in the breadcrumbs here but easy to miss)
Basing the retry interval around 5 seconds leaves a big window of waiting, especially as this window doubles on each retry, during which another worker could be making progress but cannot.
Right now, the retry interval in seconds looks like [0.2, 5, 10, 20, 40, 80, 160, 320, (continues to double)], after which logging should start about excessive wait times.
With this change, retry intervals in seconds should look more like [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 5.0, 5.0, 5.0, (stays at 5.0)].
Further suggested work in this area could be to make the cap, the retry-interval starting point, and the multiplier configurable depending on how frequently the lock should be checked; see the data below for reasons why. Increasing the jitter range may also be a good idea.
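The two interval sequences quoted above can be reproduced with a short simulation. The starting values (0.2 s for the old rule, 0.1 s for the new one) are taken from the sequences in this thread and are assumptions about the real defaults:

```python
# Quick simulation of the two backoff rules discussed in this PR.

def simulate(start, step, n):
    """Collect n intervals produced by repeatedly applying `step`."""
    out, interval = [], start
    for _ in range(n):
        out.append(interval)
        interval = step(interval)
    return out

# Old rule: max(5, next * 2) jumps straight to 5 s and doubles without bound.
old = simulate(0.2, lambda x: max(5, x * 2), 8)
# Fixed rule: min(5.0, next * 2) doubles, then stays at the 5 s cap.
new = simulate(0.1, lambda x: min(5.0, x * 2), 10)

print(old)  # [0.2, 5, 10, 20, 40, 80, 160, 320]
print(new)  # [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 5.0, 5.0, 5.0, 5.0]
```

The simulation makes the bug obvious: `max` was clearly meant to be `min`, since `max(5, next * 2)` turns the intended ceiling into a floor.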