Long story short
When using `kopf.index()` with `BatchingSettings(worker_limit=WORKER_LIMIT)`, operators can deadlock if `WORKER_LIMIT` is less than the total number of objects of the indexed resource kind (e.g. an index on Pod, a cluster with 100 Pods, and `WORKER_LIMIT=50`). Change-detection handlers never trigger, and the operator appears to hang after partial index population.
Root Cause
This is a deadlock caused by kopf's operator readiness mechanism:
- Each resource type has its own `Scheduler` with the configured `worker_limit`
- During startup, kopf spawns one async worker per object to perform the initial indexing
- After indexing completes, each worker blocks waiting for global operator readiness (all resources and objects indexed)
- With `worker_limit=1`, only 1 worker can run per resource kind
- Deadlock: if we have 2 objects of the same kind to index, Worker #1 is blocked waiting for Worker #2 to complete indexing, but Worker #2 can't start because Worker #1 occupies the only available slot
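The deadlock described above can be reproduced outside kopf with a minimal asyncio sketch (hypothetical names: `slots` stands in for the Scheduler's worker limit, `all_indexed` for the operator-readiness toggle). The key property is that each worker keeps holding its concurrency slot while waiting for an event that only fires once every object has been indexed:

```python
import asyncio

async def simulate(worker_limit: int, n_objects: int) -> str:
    """Simulate the startup-indexing pattern and report whether it
    completes or deadlocks (detected via a timeout)."""
    slots = asyncio.Semaphore(worker_limit)   # the Scheduler's worker slots
    all_indexed = asyncio.Event()             # "operator readiness" toggle
    indexed = 0

    async def worker(obj: int) -> None:
        nonlocal indexed
        async with slots:                     # slot held for the task's whole life
            indexed += 1                      # initial indexing of this object is done
            if indexed == n_objects:
                all_indexed.set()
            await all_indexed.wait()          # blocks while still holding the slot

    tasks = [asyncio.create_task(worker(i)) for i in range(n_objects)]
    try:
        await asyncio.wait_for(asyncio.gather(*tasks), timeout=0.5)
        return "completed"
    except asyncio.TimeoutError:
        for t in tasks:
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        return "deadlocked"

print(asyncio.run(simulate(worker_limit=1, n_objects=2)))  # "deadlocked"
print(asyncio.run(simulate(worker_limit=2, n_objects=2)))  # "completed"
```

With `worker_limit >= n_objects` the last worker sets the event and everyone proceeds; with any smaller limit the running workers wait forever for objects that can never be scheduled.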
Code Location (kopf internals)
The blocking occurs at `kopf/_core/reactor/processing.py:106`:

```python
await operator_indexed.wait_for(True)  # Blocks until ALL objects are indexed
```
But the `Scheduler` prevents spawning new workers at `kopf/_cogs/aiokits/aiotasks.py:347-349`:

```python
def _can_spawn(self) -> bool:
    return (not self._pending_coros.empty() and
            (self._limit is None or len(self._running_tasks) < self._limit))
```
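To make the interaction concrete, here is a toy sketch (a hypothetical `MiniScheduler` class, not kopf's actual implementation) that mirrors the `_can_spawn` gate quoted above, showing why a blocked-but-running worker starves the pending queue:

```python
import queue

class MiniScheduler:
    """Toy model of the spawn gate: pending work is only picked up
    while the number of running tasks is below the limit."""
    def __init__(self, limit):
        self._limit = limit
        self._pending_coros = queue.Queue()
        self._running_tasks = set()

    def _can_spawn(self) -> bool:
        return (not self._pending_coros.empty() and
                (self._limit is None or len(self._running_tasks) < self._limit))

s = MiniScheduler(limit=1)
s._pending_coros.put("index object #2")     # Worker #2 is queued
print(s._can_spawn())                       # True: a slot is free

s._running_tasks.add("worker-1")            # Worker #1 runs, then blocks on readiness
print(s._can_spawn())                       # False: the only slot stays occupied
```

Because Worker #1 never finishes (it is awaiting readiness), `_running_tasks` never shrinks, `_can_spawn()` stays `False`, and the queued indexing work for Worker #2 is never started.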
Kopf version
1.40.0
Kubernetes version
1.27.11
Python version
No response
Code
Logs
Additional information
This PR is trying to fix issue #1218.
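As a stopgap until a fix is released, the limit can be raised or removed at operator startup. A configuration sketch, assuming the standard kopf settings API (`settings.batching.worker_limit`); verify the attribute against your kopf version:

```python
import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Workaround: ensure the limit covers all objects of any indexed
    # kind, or remove it entirely (None = unlimited workers), so that
    # initial indexing can finish and readiness can be reached.
    settings.batching.worker_limit = None
```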