Description
Checklist:
- I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- I've included steps to reproduce the bug.
- I've pasted the output of argocd version.
Describe the bug
Some applications get stuck in the refresh operation indefinitely, until the application controller restarts. Our Argo CD deployment runs 3 replicas of the application controller and 2 replicas of argocd-server.
After some debugging, it looks like one of the application controller replicas does not complete the invalidation of the live state cache and its reinitialization. Once a live state cache invalidation is triggered, the problematic replica stops the automatic reconciliation of the applications it is responsible for. From then on the replica logs very little, its memory consumption stays constant, and its CPU usage drops to almost zero. The symptoms point to a possible deadlock during the reinitialization of the cluster cache.
The problem is only seen with applications on specific clusters, namely the clusters handled by the problematic application controller replica.
Restarting the application controller StatefulSet seems to resolve the issue.
A similar issue was reported in #8116.
To Reproduce
Restarting or deleting one of the two argocd-server pods seems to trigger the invalidation of the cluster cache in all application controller replicas, after which one of the replicas shows the hang symptoms described above.
Expected behavior
No application should be stuck in the refresh operation, and automatic reconciliation of applications should keep running without issues.
Invalidation of the cluster cache and its reinitialization should complete without issues on all application controller replicas.
Screenshots
Version
v2.4.12
argocd-server: v2.4.12+28b8fea.dirty
BuildDate: 2022-09-19T03:28:31Z
GitCommit: 28b8fea2e68a931543b05e988e78241eb9487058
GitTreeState: dirty
GoVersion: go1.18.6
Compiler: gc
Platform: linux/amd64
Kustomize Version: unknown 1970-01-01T00:00:00Z
Helm Version: v3.9.0+g7ceeda6
Kubectl Version: v0.23.1
Jsonnet Version: v0.18.0
Logs
Here are the logs from the time of the cache invalidation, with cluster URLs redacted for privacy reasons.
Logs from another replica, where the cluster cache was invalidated and reinitialized properly
Note: To validate the reinitialization of the cache, look for a "live state cache invalidated" log and then "Start syncing cluster" logs for all the clusters assigned to this replica.
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Start syncing cluster" server="https://redacted"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="live state cache invalidated"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://kubernetes.default.svc"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted 3"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="invalidating live state cache"
Logs from the problematic replica
Note: There are no "live state cache invalidated" or "Start syncing cluster" logs after the live state cache invalidation is triggered.
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
| | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="invalidating live state cache"
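The comparison between the two replicas can be automated when checking several controller pods. The following is a minimal sketch, not an Argo CD tool: it assumes the log lines have already been collected per replica, and the function name `invalidation_status` and the sample lists are hypothetical. It only looks for the three message strings that appear in the excerpts above. A missing message is a symptom rather than proof of a hang, since an excerpt cut too early would be flagged as well.

```python
import re

# Message strings taken from the log excerpts in this report.
START = "invalidating live state cache"
DONE = "live state cache invalidated"
RESYNC = "Start syncing cluster"

# Extracts the value of the msg="..." field from a logrus-style line.
MSG_RE = re.compile(r'msg="([^"]*)"')


def invalidation_status(log_lines):
    """Summarize one replica's log excerpt around a cache invalidation.

    The check is order-independent because the excerpts are listed
    newest-first and share the same timestamp.
    """
    messages = set()
    for line in log_lines:
        match = MSG_RE.search(line)
        if match:
            messages.add(match.group(1))
    started = START in messages
    completed = DONE in messages
    resynced = RESYNC in messages
    return {
        "started": started,
        "completed": completed,
        "resynced": resynced,
        # Suspect: invalidation began but never finished or never resynced.
        "suspect": started and not (completed and resynced),
    }


# Hypothetical samples shaped like the excerpts above (newest first).
healthy = [
    'time="2022-09-21T03:27:38Z" level=info msg="Start syncing cluster" server="https://redacted"',
    'time="2022-09-21T03:27:38Z" level=info msg="live state cache invalidated"',
    'time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"',
    'time="2022-09-21T03:27:38Z" level=info msg="invalidating live state cache"',
]
stuck = [
    'time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"',
    'time="2022-09-21T03:27:38Z" level=info msg="invalidating live state cache"',
]

print(invalidation_status(healthy)["suspect"])  # False
print(invalidation_status(stuck)["suspect"])    # True
```

Running this against each replica's logs after an argocd-server pod restart should single out the replica that started the invalidation but never logged its completion.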