This issue has been migrated from #14211.
We have have a problem where because events are persisted in a queue in a client_reader worker, there is no guarantee that they are available to read on other workers. So when we fire off a backfill request from /messages, those backfilled messages aren't necessarily available to paginate with after the backfill completes (even on the worker that put them in the persister queue).
CI failure: https://github.com/matrix-org/synapse/actions/runs/3182998161/jobs/5189731097#step:6:15343 (from discussion). This specific CI flake was addressed in matrix-org/complement#492
WORKERS=1 POSTGRES=1 COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestJumpToDateEndpoint/parallel/federation/can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint
Here is what happens:
serverB has event1 stored as an outlier from previous requests (specifically from MSC3030 jump to date pulling in a missing prev_event after backfilling)
- Client on
serverB calls /messages?dir=b
serverB:client_reader1 accepts the request and drives things
serverB:client_reader1 has some backward extremities in range and requests /backfill from serverA
serverB:client_reader1 processes the events from backfill including event1 and puts them in the _event_persist_queue
serverB:master picks up the events from the _event_persist_queue and persists them to the database, de-outliers event1 and invalidates its own cache and sends them over replication
serverB:client_reader1 starts assembling the /messages response and gets event1 out of the stale cache still as an outlier
serverB:client_reader1 responds to the /messages request without event1 because outliers are filtered out
serverB:client_reader1 finally gets the replication data and invalidates its own cache for event1 (too late, we already got the events from the stale cache and responded)
In a nutshell, we've written the test expecting "read-after-write consistency" but we don't have that.
-- matrix-org/synapse#13185 (comment)
It's exactly this but it really sucks that calling /messages doesn't include events we just backfilled for that request. This is a general problem with Synapse though, see issues labeled with https://github.com/matrix-org/synapse/labels/Z-Read-After-Write. In this case, it's all within the same /messages request so it's a little more insidious.
Having this be possible makes it even more of a reason that we should indicate gaps in /messages, MSC3871
This issue has been migrated from #14211.
We have have a problem where because events are persisted in a queue in a
client_readerworker, there is no guarantee that they are available to read on other workers. So when we fire off a backfill request from/messages, those backfilled messages aren't necessarily available to paginate with after the backfill completes (even on the worker that put them in the persister queue).CI failure: https://github.com/matrix-org/synapse/actions/runs/3182998161/jobs/5189731097#step:6:15343 (from discussion). This specific CI flake was addressed in matrix-org/complement#492
Here is what happens:
serverBhasevent1stored as anoutlierfrom previous requests (specifically from MSC3030 jump to date pulling in a missingprev_eventafter backfilling)serverBcalls/messages?dir=bserverB:client_reader1accepts the request and drives thingsserverB:client_reader1has some backward extremities in range and requests/backfillfromserverAserverB:client_reader1processes the events from backfill includingevent1and puts them in the_event_persist_queueserverB:masterpicks up the events from the_event_persist_queueand persists them to the database, de-outliersevent1and invalidates its own cache and sends them over replicationserverB:client_reader1starts assembling the/messagesresponse and getsevent1out of the stale cache still as anoutlierserverB:client_reader1responds to the/messagesrequest withoutevent1becauseoutliersare filtered outserverB:client_reader1finally gets the replication data and invalidates its own cache forevent1(too late, we already got the events from the stale cache and responded)It's exactly this but it really sucks that calling
/messagesdoesn't include events we just backfilled for that request. This is a general problem with Synapse though, see issues labeled with https://github.com/matrix-org/synapse/labels/Z-Read-After-Write. In this case, it's all within the same/messagesrequest so it's a little more insidious.Having this be possible makes it even more of a reason that we should indicate gaps in
/messages, MSC3871