Fix sliding sync performance slow down for long lived connections. #19206

Merged

erikjohnston merged 75 commits into develop from erikj/sss_better_membership_storage2 on Dec 12, 2025
Conversation

@erikjohnston
Member

@erikjohnston erikjohnston commented Nov 20, 2025

Fixes #19175

This PR moves tracking of which lazy-loaded memberships we've sent to each room out of the required state table. This stops that table from growing continuously, which massively helps performance, as we pull out all matching rows for the connection when we receive a request.

The new table is only read when we have data to send for a room, so we end up reading far fewer rows from the DB overall, though we now read from that table once per room with events to return, rather than once at the start of the request.

For an explanation of how the new table works, see the comment on the table schema.

The table is designed so that we can later prune old entries if we wish, but that is not implemented in this PR.
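To illustrate the shape of the idea (the table, column, and function names below are hypothetical, not Synapse's actual schema or code): keying one row per (connection, room, user) means re-sending a member's membership replaces an existing row rather than appending to an ever-growing blob, and reads can be scoped to a single room. A minimal in-memory sketch:

```python
# Hypothetical sketch of per-connection lazy-member tracking.
# Table/column/function names are illustrative, not Synapse's actual schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE sliding_sync_connection_lazy_members (
        connection_key INTEGER NOT NULL,
        room_id TEXT NOT NULL,
        user_id TEXT NOT NULL,
        membership_event_id TEXT NOT NULL,
        PRIMARY KEY (connection_key, room_id, user_id)
    )
    """
)

def record_sent_membership(conn_key, room_id, user_id, event_id):
    # One row per (connection, room, user): re-sending a member's
    # membership replaces the old row instead of growing the table.
    db.execute(
        "INSERT OR REPLACE INTO sliding_sync_connection_lazy_members "
        "VALUES (?, ?, ?, ?)",
        (conn_key, room_id, user_id, event_id),
    )

def already_sent(conn_key, room_id):
    # Read only the rows for the room we're about to send data for,
    # rather than all required-state rows for the whole connection.
    rows = db.execute(
        "SELECT user_id, membership_event_id "
        "FROM sliding_sync_connection_lazy_members "
        "WHERE connection_key = ? AND room_id = ?",
        (conn_key, room_id),
    )
    return dict(rows)

record_sent_membership(1, "!room:a", "@alice:a", "$ev1")
record_sent_membership(1, "!room:a", "@alice:a", "$ev2")  # replaces, no growth
print(already_sent(1, "!room:a"))  # {'@alice:a': '$ev2'}
```

Pruning old entries (as the PR description mentions) then becomes a simple `DELETE` scoped by connection or age.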

Reviewable commit-by-commit.

We then filter them out before sending to the client, but it is
unnecessary to do so and interferes with later changes.
This is so that clients know if they can use a cached `/members`
response or not.
@erikjohnston erikjohnston force-pushed the erikj/sss_better_membership_storage2 branch from f67e114 to 0d6ccbe on November 20, 2025 13:43
This ensures that the set of required state doesn't keep growing as we
add and remove member state. We then only load them from the DB when
needed, rather than all state for all rooms when we get a request.
It was thinking the table name was `IN`, as it matched
`connection_positi(on IS) NULL`.
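The commit above describes a class of bug worth spelling out: a loose substring or regex match can fire inside an identifier, so `connection_position IS NULL` gets misparsed because the column name happens to end in `on`. A toy illustration (not Synapse's actual parsing code):

```python
import re

sql = "DELETE FROM foo WHERE connection_position IS NULL"

# Naive pattern: matches the "on" at the end of "connection_position",
# then misreads the keyword that follows as an identifier.
naive = re.search(r"on\s+(\w+)", sql)
print(naive.group(1))  # 'IS'

# Requiring word boundaries stops "on" matching inside another word.
fixed = re.search(r"\bon\b\s+(\w+)", sql)
print(fixed)  # None
```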
@erikjohnston erikjohnston force-pushed the erikj/sss_better_membership_storage2 branch from 0d6ccbe to 4984858 on November 20, 2025 13:52
@erikjohnston erikjohnston marked this pull request as ready for review November 20, 2025 15:52
@erikjohnston erikjohnston requested a review from a team as a code owner November 20, 2025 15:52
Contributor

@MadLittleMods MadLittleMods left a comment

I haven't fully onboarded onto the concept and details to be confident in the approach.

Comment on lines +1089 to +1101
```python
else:
    # For non-limited timelines we always return all
    # membership changes. This is so that clients
    # who have fetched the full membership list
    # already can continue to maintain it for
    # non-limited syncs.
    #
    # This assumes that for non-limited syncs there
    # won't be many membership changes that wouldn't
    # have been included already (this can only
    # happen if membership state was rolled back due
    # to state resolution anyway).
    required_state_types.append((EventTypes.Member, None))
```
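For context on what appending `(EventTypes.Member, None)` does: in required-state matching, a `None` state key acts as a wildcard, so this entry admits membership events for every user rather than only the lazily-requested ones. A simplified sketch (event shape and helper name are illustrative, not Synapse internals):

```python
# Illustrative sketch of required-state matching with a wildcard
# state_key (None). Helper name and shapes are not Synapse internals.
M_ROOM_MEMBER = "m.room.member"

def matches_required_state(required, event_type, state_key):
    return any(
        t == event_type and (sk is None or sk == state_key)
        for t, sk in required
    )

required_state_types = [(M_ROOM_MEMBER, "@alice:example.org")]
# Lazy loading: only Alice's membership is requested.
assert not matches_required_state(
    required_state_types, M_ROOM_MEMBER, "@bob:example.org"
)

# Non-limited timeline: the wildcard admits every member's membership change.
required_state_types.append((M_ROOM_MEMBER, None))
assert matches_required_state(
    required_state_types, M_ROOM_MEMBER, "@bob:example.org"
)
```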
Contributor

This seems like a bigger behavioral change.

I think this fixes #18782 🤔 - If so, we should add a test.

Member Author

Ah, I did mean to factor that out, but it sneaked in as it needs to be accounted for in the lazy-loading stuff.

Contributor

Added with `test_lazy_load_state_reset`

Contributor

Actually, this only fixes it for non-limited syncs. I think we should also return state reset membership in limited timeline scenarios as well.

We should at least leave a FIXME with a link to the issue in the if-block above.

Member Author

I don't think we want to return all membership changes when it is limited? Only the ones for users that appear in the timeline / required_state?

Member Author

When limited, we should do this:

If the state reset/rollback happened in the timeline range, we should give an update.

If we don't want to do that in this PR, we should either a) fix it properly, b) leave a FIXME behind, or c) consider any state rollback as relevant regardless (since sending more state is not wrong).

Why should we do that? In the limited scenario the client knows it has missed some membership updates, and so will need to requery them if needed.

Contributor

Because we're sending membership state for whatever is relevant in the timeline when lazy-loading. State rollbacks for membership can be just as relevant to the timeline.

We probably need to hop on a call for this.

Member Author

In the limited case, if there is a state rollback for a user who has sent a message in the timeline, then that will get included? We would only omit a state rollback if that user is not referenced in the timeline?

Contributor

@MadLittleMods MadLittleMods Dec 30, 2025

> We would only omit a state rollback if that user is not referenced in the timeline?

Isn't that possible, and shouldn't it ideally be included?


@erikjohnston erikjohnston force-pushed the erikj/sss_better_membership_storage2 branch from fe94608 to ec45e00 on November 25, 2025 11:12
```sql
-- When invalidating rows, we can just delete them. Technically this could
-- invalidate for a forked position, but this is acceptable as equivalent to a
-- cache eviction.
CREATE TABLE sliding_sync_connection_lazy_members (
```
Contributor

-> aa2c426

```sql
-- When invalidating rows, we can just delete them. Technically this could
-- invalidate for a forked position, but this is acceptable as equivalent to a
-- cache eviction.
CREATE TABLE sliding_sync_connection_lazy_members (
```
Contributor

I think the current iteration doesn't explain the problem well. We try to share rows in `sliding_sync_connection_required_state` across as many rooms in a list as possible. With lazy-loading room members, `sliding_sync_connection_required_state` churns constantly for each room individually, so the rows can no longer be shared. And since `sliding_sync_connection_required_state` stores one big JSON list of all of the required state for each room, that isn't efficient. We can instead store a single row per user per room in the new `sliding_sync_connection_lazy_members` table, etc.

(not very good words)
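The row-sharing point above can be made concrete with a toy model (not Synapse code): if each room's required state is stored as one serialized blob, identical blobs can share a single row, but lazy-loaded members make every room's blob unique.

```python
import json

# Toy model: each room maps to its required-state list; rooms with
# identical serialized blobs could share one stored row.
def distinct_rows(room_state):
    return {json.dumps(sorted(state)) for state in room_state.values()}

base = [("m.room.topic", ""), ("m.room.name", "")]
shared = {f"!room{i}": base for i in range(100)}
print(len(distinct_rows(shared)))  # 1 row serves all 100 rooms

# With lazy-loaded members, each room accumulates its own set of
# (m.room.member, user) entries, so no two blobs match and every
# room needs, and keeps rewriting, its own row.
lazy = {
    f"!room{i}": base + [("m.room.member", f"@user{i}:x")] for i in range(100)
}
print(len(distinct_rows(lazy)))  # 100 rows
```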

@@ -0,0 +1 @@
Fix sliding sync performance slow down for long lived connections.
Contributor

In terms of the optimizations being applied on top of Sliding Sync, we already had pretty high complexity in this area, and now it's being multiplied again.

I fear for anyone else who has to try to understand and adapt this further. It's hard enough for me, as the one familiar with all of the Sliding Sync code who witnessed it grow over time.

We do have decent tests and comments explaining the decisions here if you want to move this forward ⏩

Member Author

This PR is complex, but I think from a high-level PoV it makes sense. The concept is simple: we need to cache which memberships we've sent down when lazy-loading, and we do that by storing them in a table. The actual implementation is definitely a bit finicky. If we were doing this from scratch, I'd also factor out the optimisation for remembering what other state we've sent down, as that is a great source of complexity.

Either way, we need to fix this bug ASAP as it's causing bad perf regressions for users.

@erikjohnston
Member Author

Thanks for all the reviews @MadLittleMods ! ❤️

@erikjohnston erikjohnston merged commit dfd00a9 into develop Dec 12, 2025
77 of 80 checks passed
@erikjohnston erikjohnston deleted the erikj/sss_better_membership_storage2 branch December 12, 2025 10:02
@Hywan
Member

Hywan commented Dec 16, 2025

🎉


Development

Successfully merging this pull request may close these issues.

Slow sliding sync when connection metadata gets large

3 participants