Skip to content

Bundle controller hits massive update conflicts removing finalizers from shared content resources at scale #4251

@manno

Description

@manno

Noticed this during my last scaling experiment.

When a single content resource is shared by a large number of bundledeployments (e.g., thousands), the bundle reconciler runs into a race condition when trying to remove finalizers from that shared content resource.

This results in a constant stream of update conflicts. The reconciler's exponential backoff kicks in (?), and the queue effectively grinds to a halt. I was seeing only a handful of reconciles completing every few minutes.
This stops all reconciles for that bundle of course. So it probably affects not only deletion, but also updates.

This is the spot in the code I was looking at: https://github.com/manno/fleet/blob/a3c2ba411c3f284f6140cdeb8c82861ca61eb106/internal/cmd/controller/reconciler/bundle_controller.go#L495-L497

Restarting the fleet-controller pod fixes the issue almost immediately.
We probably need to find a way to make this finalizer removal less "chatty" or handle the fan-out scenario more gracefully.

Metadata

Metadata

Assignees

Type

Projects

Status

👀 In review

Relationships

None yet

Development

No branches or pull requests

Issue actions