Federation reader stops processing incoming requests after database crash #8470
Description
Following my postgres instance being OOM-killed (a presumably unrelated issue), my federation reader worker stops processing incoming events (or processes them extremely slowly).
Here's the database server's memory usage chart showing the time at which the crash occurred:

Stacked up with the requests-in-flight (dark red is PUT FederationSendServlet on my federation_reader worker):

And the age of the last processed event (the new events that do come in are probably due to local activity):

(I can provide other metrics graphs for this period upon request)
Note that the rest of the server continued working fine: it could exchange local messages and sync with clients without issues.
A log excerpt from the time of the crash is attached. Note that the worker appears to recover: the logs continue as if it were processing incoming requests, but this isn't reflected in the graphs above (or in the observed behavior that messages from other servers stop coming in).
federation_reader.log.txt
Steps to reproduce
(Note: I haven't attempted to reproduce this in isolation, but it has happened multiple times in situ with my current configuration.)
- Set up the homeserver with a postgres database and a separate synapse.app.generic_worker handling the ^/_matrix/federation/v1/send/ endpoint, with redis replication. (My worker config: federation_reader.yaml.txt)
- Kill postgres
Expected: possibly a few requests error out, but the worker should recover after the database comes back up.
Actual: the worker stops processing requests until it is killed and restarted.
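For reference, a minimal sketch of the worker setup described in step 1. This is illustrative only (the worker name, port numbers, and replication settings are assumptions for the 1.20-era worker config format); the actual configuration is in the attached federation_reader.yaml.txt:

```yaml
# federation_reader.yaml -- illustrative sketch, not the attached config
worker_app: synapse.app.generic_worker
worker_name: federation_reader

# Replication back to the main process (host/port are assumptions)
worker_replication_host: 127.0.0.1
worker_replication_http_port: 9093

worker_listeners:
  - type: http
    port: 8083               # assumed listener port
    resources:
      - names: [federation]  # serves ^/_matrix/federation/v1/send/ among others
```

The reverse proxy would then route ^/_matrix/federation/v1/send/ to this listener, and the main homeserver.yaml would have redis enabled for replication.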
Version information
- Homeserver: matrix.cybre.space
If not matrix.org:
- Version:
{
  "python_version": "3.6.8",
  "server_version": "1.20.1 (b=master,86a72d1)"
}
- Install method: pip
- Platform: Ubuntu 18.04 VPS, not containerized.