Replies: 3 comments 4 replies
-
We have been running ARC on GKE with GHES for about six months now. Recently we enabled Dependabot, and those runs were taking forever because each job pulls directly from a fairly large container image (1.5+ GiB, if I remember correctly), with lots of pulls from ghcr.io stalling out and, I believe, some rate limiting. I migrated those jobs to a RunnerSet, following the ARC documentation you linked, to cache the images, and that eliminated our ghcr.io Dependabot image-pull woes. I haven't seen the kind of behavior you're describing yet. We are running GKE … Also, what storage provisioner are you using? I PoC'd this out with …
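For anyone following along, this is a trimmed sketch of the RunnerSet-based image-caching setup the ARC documentation describes: a PVC per runner mounted at `/var/lib/docker` so pulled layers survive across jobs. Names, the repository, the storage class, and the size here are all illustrative; check the ARC docs for the full manifest.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: dependabot-runners            # illustrative name
spec:
  repository: my-org/my-repo          # placeholder
  dockerdWithinRunnerContainer: true  # dockerd runs in the runner container
  template:
    spec:
      containers:
        - name: runner
          volumeMounts:
            - name: var-lib-docker
              mountPath: /var/lib/docker   # docker's layer/image cache lives here
  volumeClaimTemplates:
    - metadata:
        name: var-lib-docker
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard-rwo     # GKE PD CSI class; adjust to your provisioner
        resources:
          requests:
            storage: 10Gi
```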
-
Thanks for digging out the logs. I didn't see any improvement from moving to a single zone; to be honest, I have more questions than answers. I found that you can get more information about the volume attach/mount process by SSHing onto the node and running …

[three log excerpts elided]

Notice the time difference between the three logs. I've tried following the advice in the link above (fsGroupChangePolicy: "OnRootMismatch"), but it didn't improve performance either.
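As a sketch of where that attach/mount information typically lives on GKE (the exact command was cut off above, so these are common starting points rather than the ones actually used):

```
# On the node (e.g. via `gcloud compute ssh <node-name>`):
# the kubelet's view of volume attach/mount operations
journalctl -u kubelet --no-pager | grep -iE 'attach|mount' | tail -n 50

# From a workstation: the pod's event timeline shows the gap
# between SuccessfulAttachVolume and the first image Pulling event
kubectl describe pod <runner-pod> -n <namespace>
kubectl get events -n <namespace> --sort-by=.lastTimestamp
```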
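For reference, `fsGroupChangePolicy` is set in the pod-level `securityContext` (in a RunnerSet that sits under `spec.template.spec`). A minimal fragment, with the `fsGroup` value as a placeholder; with "OnRootMismatch", the kubelet skips the recursive chown/chmod of the volume when the root directory's ownership already matches, which is the usual fix for slow mounts of volumes with many files:

```yaml
spec:
  template:
    spec:
      securityContext:
        fsGroup: 1000                          # placeholder GID
        fsGroupChangePolicy: "OnRootMismatch"  # only relabel when the volume root mismatches
```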
-
🕒 Discussion Activity Reminder 🕒

This Discussion has been labeled as dormant by an automated system for having no activity in the last 60 days. Please consider one of the following actions:

1️⃣ Close as Out of Date: If the topic is no longer relevant, close the Discussion as …
2️⃣ Provide More Information: Share additional details or context, or let the community know if you've found a solution on your own.
3️⃣ Mark a Reply as Answer: If your question has been answered by a reply, mark the most helpful reply as the solution.

Note: This dormant notification will only apply to Discussions with the …

Thank you for helping bring this Discussion to a resolution! 💬
-
We are deploying ARC on GKE and have followed the guide for Docker layer caching. We've noticed that, usually when the cluster is "cold" (i.e. not many jobs are running), it can take some time after the PV is mounted on the pod before it starts pulling the ARC image (45s in this case):

[pod event timeline elided]

ARC-related logs for this pod show many entries like:

runnerpersistentvolumeclaim Retrying sync until statefulset gets removed

It seems like this loop is affecting the speed at which the pod picks up the job.

We have `terminationGracePeriodSeconds: 600` on the RunnerSet and also `scaleDownDelaySecondsAfterScaleOut: 600` on the HRA; I'm not sure whether those could be affecting it. Has anybody experienced anything similar, or does anyone have ideas for how to mitigate it?
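For context, a sketch of where those two settings live: `terminationGracePeriodSeconds` is a pod-spec field on the RunnerSet template, while `scaleDownDelaySecondsAfterScaleOut` belongs to the HorizontalRunnerAutoscaler. Resource names and replica counts here are illustrative:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: example-runnerset          # illustrative name
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 600   # pod-level grace period
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-hra
spec:
  scaleTargetRef:
    kind: RunnerSet
    name: example-runnerset
  minReplicas: 0                   # placeholder
  maxReplicas: 5                   # placeholder
  scaleDownDelaySecondsAfterScaleOut: 600  # delay before scaling back down
```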