CreateVolume returns success when NFS privilege configuration fails (DSM Error 2370), causing unmountable PVCs #139

@paowuh

Summary

When creating an NFS-based PVC under load, the driver may successfully create the shared folder on DSM but fail to configure the NFS permissions. The error from DSM (Error code:2370) is logged but not propagated to the CSI consumer. As a result, CreateVolume returns success, the PV is bound, but pods cannot mount the volume because the share has no NFS export rules.

This leads to repeated "MountVolume.SetUp failed: ... reason given by server: No such file or directory" errors, which are misleading: the folder exists on DSM but is not exported via NFS.

Environment

  • synology-csi version: v1.2.1
  • DSM version: 7.x
  • Kubernetes version:
  • CSI sidecars:
    • csi-provisioner: v3.0.0
    • csi-attacher: v3.3.0
    • csi-resizer: v1.3.0
    • csi-snapshotter: v8.2.1
    • csi-node-driver-registrar: v2.3.0
  • Storage protocol: NFS (csi.storage.k8s.io/fstype: nfs)
  • Workload: Kasten K10 backup/restore operations + Helm chart deployments creating multiple PVCs concurrently

Reproduction

  1. Configure a StorageClass using protocol nfs against a Synology NAS
  2. Trigger creation of multiple PVCs in rapid succession (e.g. Helm chart with 5 PVCs, or Kasten restore with multiple volumes)
  3. Observe controller logs

Observed behavior

Logs from csi-plugin show the share creation succeeds but the privilege configuration fails:

[ERROR] [service/dsm.go:544] [10.20.58.200] Failed to create Volume: rpc error: code = Internal desc = Failed to create share, err: Share system is temporary busy
[ERROR] [driver/utils.go:126] GRPC error: rpc error: code = Internal desc = Couldn't find any host available to create Volume
...
[ERROR] [service/share_volume.go:208] [10.20.58.200] Failed to load share nfs privilege: DSM Api error. Error code:2370
[INFO] [driver/utils.go:128] GRPC response: {"volume":{"capacity_bytes":...,"volume_context":{"baseDir":"/volume1/k8s-csi-pvc-...","protocol":"nfs",...}}}

The GRPC response is a successful CreateVolume reply, even though the privilege setting failed.

On the NAS side, showmount -e <nas-ip> confirms that some shares are not exported (their NFS Permissions tab in DSM is empty), while their folders exist on /volume1/.
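The export check described above can be automated when many PVCs are involved. The following is a minimal sketch that assumes the usual `showmount -e` text layout (a header line followed by one export path per line); the function name `missingExports`, the host name, and the sample paths are hypothetical and only illustrate the idea.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// missingExports returns the paths in want that do not appear as
// exports in the given `showmount -e` output.
func missingExports(showmountOutput string, want []string) []string {
	exported := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(showmountOutput))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Export lines start with an absolute path; the header line
		// ("Export list for ...") does not, so it is skipped.
		if len(fields) > 0 && strings.HasPrefix(fields[0], "/") {
			exported[fields[0]] = true
		}
	}
	var missing []string
	for _, p := range want {
		if !exported[p] {
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	// Made-up sample output; real output comes from `showmount -e <nas-ip>`.
	out := "Export list for nas.example:\n/volume1/k8s-csi-pvc-aaa 10.0.0.0/24\n"
	want := []string{"/volume1/k8s-csi-pvc-aaa", "/volume1/k8s-csi-pvc-bbb"}
	fmt.Println(missingExports(out, want)) // prints [/volume1/k8s-csi-pvc-bbb]
}
```

The expected paths would come from the `baseDir` values of the bound PVs; any path reported as missing corresponds to a share that exists on DSM without NFS export rules.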

The corresponding pod fails to mount with:

MountVolume.SetUp failed for volume "pvc-XXXX" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs -o nfsvers=4.1 <nas-ip>:/volume1/k8s-csi-pvc-XXXX /var/lib/kubelet/...
Output: mount.nfs: mounting <nas-ip>:/volume1/k8s-csi-pvc-XXXX failed, reason given by server: No such file or directory

Expected behavior

If setSharePrivilege (or equivalent) fails, CreateVolume should:

  1. Return an error to the CSI client so that the external-provisioner can retry
  2. Roll back by deleting the orphaned shared folder, so that retries don't accumulate ghost folders on DSM

Impact

  • Kasten K10 restores fail unpredictably (some PVCs unmountable)
  • Helm chart deployments with multiple PVCs partially fail
  • Manual remediation required: editing each affected share on DSM and adding NFS permissions by hand
  • DSM accumulates ghost shared folders that exist but are not exported

Workaround applied

We added the following args to the csi-provisioner sidecar to reduce concurrency:

args:
  - --worker-threads=1
  - --retry-interval-start=10s
  - --retry-interval-max=300s
  - --timeout=180s

This reduces the frequency of the issue (by serializing CreateVolume calls) but does not eliminate it: Error 2370 still occurs sporadically, and the error-handling bug remains.

Suspected location in code

Based on log line service/share_volume.go:208, the error from setSharePrivilege (or whatever function configures NFS permissions on a newly created share) appears to be logged with [ERROR] but the function returns nil/success to the caller.

A correct fix would either:

  • (a) Return the error to abort CreateVolume, and roll back the share creation
  • (b) Implement a retry loop with backoff specifically for DSM Error 2370 (which appears to be transient, given the accompanying "Share system is temporary busy" message), and only fail after exhausting retries

Happy to provide additional logs or test a proposed fix.
