
Cleanup of postgres-based messaging blocks workers from actually executing other tasks #20644

@ruifung

Description


Describe the bug

Across 3 worker instances, tasks don't seem to be executed efficiently. In my postgres database I've observed a bunch of connections all executing what appears to be the same DELETE statement.

This looks really inefficient: multiple "remove expired objects" tasks end up in the task queue, and all 3 workers can apparently pick up the same job.

And I'm pretty sure this is a complete waste of my self-hosted compute resources, since the task queue is slowly growing while the 3 workers contend with each other to delete the same things.

How to reproduce

As far as I'm aware, it's just my installation with multiple workers and, admittedly, a not very fast database.

I only noticed it when checking the active connections from authentik in my postgres installation.
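The report doesn't include the exact inspection query used, but something along these lines against `pg_stat_activity` should surface the duplicate DELETEs. This is a hypothetical sketch; the database name `authentik` is an assumption about this deployment.

```sql
-- Hypothetical inspection query, not from the report: list backends in the
-- authentik database that are currently running the message-cleanup DELETE.
-- The database name 'authentik' is an assumption.
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE datname = 'authentik'
  AND query ILIKE 'DELETE FROM django_channels_postgres_message%';
```

If several rows show the same query with `wait_event_type = 'Lock'`, the workers are blocking each other on the same rows, matching the behavior described here.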

Expected behavior

I expect the scheduler not to assign ALL the workers simultaneously to cleaning up expired objects. What's the point of running multiple workers when they can get stuck doing the same thing together?

Also, deleting each message row individually is really bad when my installation is apparently generating, on average, a thousand messages per minute. (Although a brief look through that table makes it look like they're all identical messages?)

Screenshots

Image

Also, look at all of them blocking on each other when I run my own query to mass-delete expired messages.

Image

Additional context

Also, over time I've observed authentik (collectively) holding a significant number of idle connections in my postgres DB. Because of that, I enabled the idle session timeout and idle-in-transaction session timeout on my database (it was literally keeling over from authentik holding 140+ connections when I had a 200-connection limit).
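The report doesn't say which timeout values were chosen; a sketch of how these two settings can be enabled server-side (the ten- and five-minute values are assumptions, and `idle_session_timeout` requires PostgreSQL 14+):

```sql
-- Assumed mitigation values, not from the report: terminate sessions that
-- sit idle, so leaked/held connections get reclaimed.
ALTER SYSTEM SET idle_session_timeout = '10min';                -- PostgreSQL 14+
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';  -- PostgreSQL 9.6+
SELECT pg_reload_conf();
```

Both settings can also be applied per role (`ALTER ROLE authentik SET ...`) to avoid affecting other database users.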

Also, there are 7,860,119 messages in that django_channels_postgres_message table (I just truncated it a few hours ago!), which the 3 workers seem to be clearing very inefficiently, since all of them appear to be deleting one row at a time, on the same message, across all 3 workers.

These instances are all connecting directly to the primary postgres instance, with no connection pooler in between.

For now I've somewhat mitigated it by using pg_cron to run a pair of cleanup queries specifically targeting the django_channels_postgres_message and django_channels_postgres_groupchannel tables to remove expired entries.
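The exact pg_cron queries used aren't shared in the report; a hypothetical sketch of what such a pair of jobs could look like. The expiry column name `expire` is an assumption about the channels-postgres schema, and the five-minute schedule is an arbitrary choice:

```sql
-- Hypothetical pg_cron jobs, not the reporter's actual queries.
-- Assumes the pg_cron extension is installed and the expiry column
-- on both tables is named "expire".
SELECT cron.schedule('purge-channel-messages', '*/5 * * * *',
  $$DELETE FROM django_channels_postgres_message WHERE expire < NOW()$$);

SELECT cron.schedule('purge-group-channels', '*/5 * * * *',
  $$DELETE FROM django_channels_postgres_groupchannel WHERE expire < NOW()$$);
```

A single set-based DELETE like this removes all expired rows in one statement, instead of 3 workers contending to delete the same rows one at a time.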

Deployment Method

Kubernetes

Version

2026.2.0

Relevant log output

Labels

bug (Something isn't working), triage (Add this label to issues that need to be triaged)
