replication timeouts due to message retention purge jobs

This issue has been migrated from [#16489](https://github.com/matrix-org/synapse/issues/16489).

---

### Description

Presumably due to https://github.com/matrix-org/synapse/pull/13632, the master process cannot handle replication requests from workers because of the load from purge jobs. It keeps happily logging purge job state updates while clients can no longer connect.

### Steps to reproduce

- enable message retention
- possibly be a big instance (not verified)
- wait for the scheduled job to execute


### Homeserver

tchncs.de

### Synapse Version

1.94.0

### Installation Method

pip (from PyPI)

### Database

PostgreSQL

### Workers

Multiple workers

### Platform

Debian GNU/Linux 12 (bookworm), dedicated

### Configuration

draupnir module, presence, retention

### Relevant log output

```shell
synapse.replication.tcp.client - 352 - INFO - _process_incoming_pdus_in_room_inner-124023-$fbrT_6mck678v_gNV527V0f5Jp4kvbDiQVSeHOmiN2E - Finished waiting for repl stream 'events' to reach 361593234 (event_persister1)
synapse.http.client - 923 - INFO - PUT-890470 - Received response to POST synapse-replication://master/_synapse/replication/fed_send_edu/m.receipt/IjFSBKBxIa: 200
synapse.replication.tcp.client - 332 - INFO - PUT-890470 - Waiting for repl stream 'caches' to reach 416737455 (master); currently at: 416710210
synapse.replication.tcp.client - 342 - WARNING - PUT-890464 - Timed out waiting for repl stream 'caches' to reach 416737417 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.http._base - 300 - WARNING - GET-2559861 - presence_set_state request timed out; retrying
synapse.replication.http._base - 312 - WARNING - PUT-899550 - fed_send_edu request connection failed; retrying in 1s: ConnectError(<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>)
synapse.http.client - 932 - INFO - PUT-901284 - Error sending request to  POST synapse-replication://master/_synapse/replication/fed_send_edu/m.presence/WCoECfmCdH: ConnectError [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
]
```
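To put a number on how far the worker lags behind, the "waiting for repl stream" lines can be parsed and the target position compared with the current one. The following is a minimal sketch that assumes the log format shown in the excerpt above; `ReplLag` and `parse_repl_lag` are hypothetical helper names, not part of Synapse.

```python
import re
from typing import NamedTuple, Optional

# Matches the "Waiting for" / "Timed out waiting for" lines logged by
# synapse.replication.tcp.client, as they appear in the excerpt above.
# The pattern is an assumption based on that excerpt only.
_REPL_WAIT = re.compile(
    r"repl stream '(?P<stream>[^']+)' to reach (?P<target>\d+) "
    r"\((?P<writer>[^)]+)\); currently at: (?P<current>\d+)"
)


class ReplLag(NamedTuple):
    stream: str
    writer: str
    target: int
    current: int

    @property
    def lag(self) -> int:
        # Number of stream positions the worker still has to catch up on.
        return self.target - self.current


def parse_repl_lag(line: str) -> Optional[ReplLag]:
    """Return the lag described by a log line, or None if it does not match."""
    m = _REPL_WAIT.search(line)
    if m is None:
        return None
    return ReplLag(m["stream"], m["writer"], int(m["target"]), int(m["current"]))


sample = (
    "synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - "
    "Timed out waiting for repl stream 'caches' to reach 416737422 (master); "
    "currently at: 416710510"
)
entry = parse_repl_lag(sample)
print(entry.stream, entry.writer, entry.lag)  # caches master 26912
```

Applied to the excerpt, the `caches` stream position on the worker stays at 416710510 across the repeated timeouts, about 27,000 positions behind what the master has already issued.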

<img width="2545" alt="SCR-20231014-kdvv" src="https://github.com/matrix-org/synapse/assets/10698140/9fb50058-46e5-405f-b907-de18e6bc5201">

