Fix sliding sync performance slow down for long lived connections. #19206
erikjohnston merged 75 commits into develop from
Conversation
We then filter them out before sending to the client, but it is unnecessary to do so and interferes with later changes.
This is so that clients know if they can use a cached `/members` response or not.
Force-pushed from f67e114 to 0d6ccbe
This ensures that the set of required state doesn't keep growing as we add and remove member state. We then only load them from the DB when needed, rather than all state for all rooms when we get a request.
It was thinking the table name was `IN`, as it matched `connection_positi(on IS) NULL`.
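The mis-parse described above is the classic missing-word-boundary problem: a pattern that looks for a bare SQL keyword can match inside an identifier. A minimal illustration (this is not Synapse's actual parser code, just a demonstration of the failure mode):

```python
import re

CLAUSE = "connection_position IS NULL"

# Without a word boundary, "on IS" matches inside the identifier
# "connection_position", so a naive keyword scan picks up a bogus match.
assert re.search(r"on IS", CLAUSE) is not None

# Anchoring the keyword with \b prevents the mid-identifier match.
assert re.search(r"\bon IS\b", CLAUSE) is None
```

The `\b` assertion fails inside `connection_position` because both neighbouring characters are word characters, so no boundary exists there.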
Force-pushed from 0d6ccbe to 4984858
MadLittleMods left a comment
I haven't fully onboarded onto the concept and details to be confident in the approach.
synapse/storage/schema/main/delta/93/02_sliding_sync_members.sql
    else:
        # For non-limited timelines we always return all
        # membership changes. This is so that clients
        # who have fetched the full membership list
        # already can continue to maintain it for
        # non-limited syncs.
        #
        # This assumes that for non-limited syncs there
        # won't be many membership changes that wouldn't
        # have been included already (this can only
        # happen if membership state was rolled back due
        # to state resolution anyway).
        required_state_types.append((EventTypes.Member, None))
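The branch quoted above can be summarised as: for limited timelines, only lazy-load members who appear in the truncated timeline; for non-limited timelines, request all membership changes so clients that already hold the full member list can keep it current. A hedged sketch (function name and shape are illustrative, not Synapse's actual code):

```python
# Stand-in for EventTypes.Member in Synapse.
EVENT_TYPE_MEMBER = "m.room.member"

def member_state_filters(timeline_senders, limited):
    """Illustrative sketch: which (type, state_key) member filters to
    request, depending on whether the timeline was limited."""
    required_state_types = []
    if limited:
        # Only members who actually appear in the timeline.
        for user_id in sorted(set(timeline_senders)):
            required_state_types.append((EVENT_TYPE_MEMBER, user_id))
    else:
        # All membership changes, including state-reset rollbacks.
        required_state_types.append((EVENT_TYPE_MEMBER, None))
    return required_state_types
```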
This seems like a bigger behavioral change.
I think this fixes #18782 🤔 - If so, we should add a test.
Ah, I did mean to factor that out, but it sneaked in as it needs to be accounted for in the lazy loading stuff.
Added with test_lazy_load_state_reset ✅
Actually, this only fixes it for non-limited syncs. I think we should also return state reset membership in limited timeline scenarios as well.
We should at least leave a FIXME with a link to the issue in the if-block above.
I don't think we want to return all membership changes when it is limited? Only the ones for users that appear in the timeline / required_state?
When limited, we should do this: if the state reset/rollback happened in the timeline range, we should give an update. If we don't want to do that in this PR, we should a) fix it properly, b) leave a FIXME behind, or c) consider any state rollback as relevant regardless (because sending more state is not wrong).
Why should we do that? In the limited scenario the client knows it has missed some membership updates, and so will need to requery them if needed.
Because we're sending membership state for whatever is relevant in the timeline when lazy-loading. State rollbacks for membership can be just as relevant to the timeline.
We probably need to hop on a call for this.
In the limited case if there is a state rollback for a user who has sent a message in the timeline, then that will get included? We only won't include a state rollback if that user is not referenced in the timeline?
We only won't include a state rollback if that user is not referenced in the timeline?
Isn't that possible, and ideally shouldn't it be included?
Related MSC discussion, matrix-org/matrix-spec-proposals#4186 (comment)
…iously_returned in tests
Co-authored-by: Eric Eastwood <[email protected]>
When fetching previously sent lazy members we didn't filter by room, which meant that we didn't send down member events in a room if we'd previously sent that user's member event in another room.
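The bug described above can be reproduced with a toy schema: without a `room_id` filter, a member event sent for one room looks "already sent" for every other room on the connection. A runnable illustration using an illustrative table layout (not Synapse's exact schema):

```python
import sqlite3

# Minimal stand-in for the lazy-members tracking table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lazy_members (connection_key INT, room_id TEXT, user_id TEXT)"
)
# Alice's member event was previously sent down for !roomA only.
conn.execute("INSERT INTO lazy_members VALUES (1, '!roomA', '@alice:example.org')")

def already_sent(user_id, room_id, filter_by_room):
    """Has this member event already been sent on connection 1?"""
    sql = "SELECT 1 FROM lazy_members WHERE connection_key = 1 AND user_id = ?"
    args = [user_id]
    if filter_by_room:
        sql += " AND room_id = ?"
        args.append(room_id)
    return conn.execute(sql, args).fetchone() is not None

# Buggy query (no room filter): Alice wrongly appears already-sent in !roomB,
# so her member event in !roomB would be skipped.
assert already_sent("@alice:example.org", "!roomB", filter_by_room=False)
# Fixed query scopes the lookup to the room being synced.
assert not already_sent("@alice:example.org", "!roomB", filter_by_room=True)
```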
Force-pushed from fe94608 to ec45e00
synapse/storage/schema/main/delta/93/02_sliding_sync_members.sql
    -- When invalidating rows, we can just delete them. Technically this could
    -- invalidate for a forked position, but this is acceptable as equivalent to a
    -- cache eviction.
    CREATE TABLE sliding_sync_connection_lazy_members (
I think the current iteration doesn't explain the problem well. We try to share rows in sliding_sync_connection_required_state across as many rooms in a list as possible. With lazy-loading room members, sliding_sync_connection_required_state constantly churns for each room individually and they can no longer be shared. And since sliding_sync_connection_required_state stores a big JSON list of state of all of the required state for each room, it's not efficient. We can instead store a single row for each user in each room in this new table sliding_sync_connection_lazy_members, etc.
(not very good words)
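The row-per-member design described above can be contrasted with the old JSON-blob approach in miniature. Column names here are guesses for illustration; the real schema lives in the `02_sliding_sync_members.sql` migration:

```python
import sqlite3

# Illustrative sketch of the new table: one small row per
# (connection, room, member), instead of one big JSON list of all
# required state per room in sliding_sync_connection_required_state.
DDL = """
CREATE TABLE sliding_sync_connection_lazy_members (
    connection_position BIGINT NOT NULL,
    room_id TEXT NOT NULL,
    user_id TEXT NOT NULL
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# Recording "we sent Alice's membership for this room on this
# connection" is a single cheap insert; invalidation is a delete.
conn.execute(
    "INSERT INTO sliding_sync_connection_lazy_members VALUES (?, ?, ?)",
    (42, "!room:example.org", "@alice:example.org"),
)
```

Because each member is its own row, churn in lazy-loaded membership touches individual rows rather than rewriting a shared JSON blob, so rows for non-member required state can stay shared across rooms.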
    @@ -0,0 +1 @@
    Fix sliding sync performance slow down for long lived connections.
In terms of the optimizations being applied on top of Sliding Sync, we already had a pretty high complexity in this area and now it's being multiplied again.
I fear for anyone else who has to try to understand and adapt this further. It's hard enough for me as the one familiar with all of the Sliding Sync code and being witness to all of it growing over time.
We do have decent tests and comments explaining the decisions here if you want to move this forward ⏩
This PR is complex, but I think from a high-level PoV makes more sense. The concept is simple: we need to cache which memberships we've sent down when lazy-loading and we do that by storing it in a table. The actual implementation is definitely a bit finicky. If we were doing this from scratch I'd also factor out the optimisation for remembering what other state we've sent down too, as that is a great source of complexity.
Either way, we need to fix this bug ASAP as it's causing bad perf regressions for users.
Co-authored-by: Eric Eastwood <[email protected]>
Thanks for all the reviews @MadLittleMods ! ❤️

🎉
Fixes #19175
This PR moves the tracking of which lazy-loaded memberships we've sent for each room out of the required state table. This stops that table from continuously growing, which massively helps performance, as we pull out all matching rows for the connection when we receive a request.
The new table is only read when we have data in a room to send, so we end up reading a lot fewer rows from the DB. Though we now read from that table for every room we have events to return in, rather than once at the start of the request.
For an explanation of how the new table works, see the comment on the table schema.
The table is designed so that we can later prune old entries if we wish, but that is not implemented in this PR.
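The read pattern described above can be sketched as follows (illustrative names, not Synapse's actual code): previously-sent required state is still loaded up front, but the lazy-members table is consulted only for rooms that actually have events to return.

```python
def members_to_send(events_by_room, load_sent_members):
    """For each room with events to return, work out which senders'
    membership events still need sending.

    ``load_sent_members(room_id)`` is a stand-in for a read of the new
    sliding_sync_connection_lazy_members table; returns the set of
    user IDs whose member events were already sent for that room.
    """
    to_send = {}
    reads = 0
    for room_id, events in events_by_room.items():
        if not events:
            continue  # quiet rooms cost no DB reads at all
        reads += 1
        already_sent = load_sent_members(room_id)
        senders = {event["sender"] for event in events}
        to_send[room_id] = senders - already_sent
    return to_send, reads
```

This is the trade-off the description mentions: one read per active room instead of one big read at the start of the request, which wins whenever most rooms on a long-lived connection are quiet.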
Reviewable commit-by-commit.