
Physical restore stuck in "running" forever #2190

@cjesper

Description


Report

Hi,
We're having issues with PBM physical restores getting stuck in "running". This happens across multiple clusters and with multiple backups per cluster (i.e., it is not isolated to specific backups within a cluster).

More about the problem

The restore starts running (the backup-agent container gets stopped, etc.), and then we start seeing operator logs like:
DEBUG	checking if restore is allowed	{"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", "PerconaServerMongoDBRestore": {"name":"restore-new-entity-candidate-5",
"namespace":"psmdb-new-entity-candidate"}, "namespace": "psmdb-new-entity-candidate", "name": "restore-new-entity-candidate-5", "reconcileID": "f15e7345-ab10-4077-b095-f4841a5f0083", "cluster": "new-entity-candidate", "namespace": "psmdb-new-entity-candidate"}
DEBUG	Check restore status	{"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", 
"PerconaServerMongoDBRestore": {"name":"restore-new-entity-candidate-5","namespace":"psmdb-new-entity-candidate"}, "namespace": "psmdb-new-entity-candidate", "name": "restore-new-entity-candidate-5", "reconcileID": "f15e7345-ab10-4077-b095-f4841a5f0083", "command": ["/opt/percona/pbm", "describe-restore", "2026-01-13T08:37:06.41089787Z", 
"--config", "/etc/pbm/pbm_config.yaml", "--out", "json"], "pod": "new-entity-candidate-new-entity-candidate-0"}
DEBUG	PBM restore status	{"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", 
"PerconaServerMongoDBRestore": {"name":"restore-new-entity-candidate-5","namespace":"psmdb-new-entity-candidate"},
 "namespace": "psmdb-new-entity-candidate", "name": "restore-new-entity-candidate-5", "reconcileID": "f15e7345-ab10-4077-b095-f4841a5f0083", "status": {"type":"physical","opid":"","name":"2026-01-13T08:37:06.41089787Z",
"replsets":[{"name":"new-entity-candidate","start_ts":0,"status":"running","size":0,"size_uncompressed":0,"last_transition_ts":1768293448,"first_write_ts":{"T":0,"I":0},"last_write_ts":{"T":0,"I":0},"node":"","conditions":null}],"compression":"","store":{"type":""},"size":0,"size_uncompressed":0,
"mongodb_version":"","fcv":"","start_ts":0,"last_transition_ts":1768293453,"first_write_ts":{"T":0,"I":0},"last_write_ts":{"T":0,"I":0},"hb":{"T":0,"I":0},"status":"running","conditions":null,"n":null,"pbm_version":"","balancer":""}}

The second replica in the set appears to get stuck in an unhealthy state; rs.status() shows:

members: [
    {
      _id: 0,
      name: 'new-entity-candidate-new-entity-candidate-0.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017',
      health: 1,
      state: 1,
      stateStr: 'PRIMARY',
      uptime: 2543,
      optime: { ts: Timestamp({ t: 1768295948, i: 1 }), t: Long('292') },
      optimeDate: ISODate('2026-01-13T09:19:08.000Z'),
      lastAppliedWallTime: ISODate('2026-01-13T09:19:08.043Z'),
      lastDurableWallTime: ISODate('2026-01-13T09:19:08.043Z'),
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      electionTime: Timestamp({ t: 1768293418, i: 1 }),
      electionDate: ISODate('2026-01-13T08:36:58.000Z'),
      configVersion: 13384721,
      configTerm: -1,
      self: true,
      lastHeartbeatMessage: ''
    },
    {
      _id: 1,
      name: 'new-entity-candidate-new-entity-candidate-1.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017',
      health: 0,
      state: 8,
      stateStr: '(not reachable/healthy)',
      uptime: 0,
      optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
      optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
      optimeDate: ISODate('1970-01-01T00:00:00.000Z'),
      optimeDurableDate: ISODate('1970-01-01T00:00:00.000Z'),
      lastAppliedWallTime: ISODate('2026-01-13T08:37:46.912Z'),
      lastDurableWallTime: ISODate('2026-01-13T08:37:46.912Z'),
      lastHeartbeat: ISODate('2026-01-13T09:19:06.549Z'),
      lastHeartbeatRecv: ISODate('2026-01-13T08:37:51.577Z'),
      pingMs: Long('0'),
      lastHeartbeatMessage: 'Error connecting to new-entity-candidate-new-entity-candidate-1.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017 (10.203.64.43:27017) :: caused by :: onInvoke :: caused by :: Connection refused',
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      configVersion: 13384721,
      configTerm: -1
    }
  ],
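For a quick check of just the relevant fields from that members array, something like the following works (a sketch, assuming mongosh is available in the mongod container; add whatever auth flags your deployment requires):

# prints name/state/health per replica set member
kubectl exec -n psmdb-new-entity-candidate new-entity-candidate-new-entity-candidate-0 -c mongod -- \
  mongosh --quiet --eval 'rs.status().members.map(m => ({ name: m.name, stateStr: m.stateStr, health: m.health }))'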

And the container logs show endless uploading of ".hb" files:

2026-01-13T08:38:19.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/index-90--4004084906419493573.wt.gz> to </data/db/index-90--4004084906419493573.wt>
2026-01-13T08:38:19.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/sizeStorer.wt.gz> to </data/db/sizeStorer.wt>
2026-01-13T08:38:19.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/WiredTiger.backup.gz> to </data/db/WiredTiger.backup>
2026-01-13T08:38:19.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/WiredTiger.gz> to </data/db/WiredTiger>
2026-01-13T08:38:19.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/storage.bson.gz.0-114> to </data/db/storage.bson>
2026-01-13T08:38:20.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/journal/WiredTigerLog.0000000029.gz> to </data/db/journal/WiredTigerLog.0000000029>
2026-01-13T08:38:20.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/key.db/WiredTigerLog.0000000029.gz> to </data/db/key.db/WiredTigerLog.0000000029>
2026-01-13T08:38:20.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/key.db/WiredTigerLog.0000000028.gz> to </data/db/key.db/WiredTigerLog.0000000028>
2026-01-13T08:38:20.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] preparing data
2026-01-13T08:38:21.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] oplogTruncateAfterPoint: {1768253519 1}
2026-01-13T08:38:23.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] recovering oplog as standalone
2026-01-13T08:38:27.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] clean-up and reset replicaset config
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmAgents'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmBackups'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmRestores'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmCmd'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmPITRChunks'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmPITR'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmOpLog'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmLog'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmLock'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmLock'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmLockOp'
2026-01-13T08:38:31.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] restore on node succeed
2026-01-13T08:38:31.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] moving to state done
2026-01-13T08:38:31.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] uploading ".pbm.restore/2026-01-13T08:37:06.41089787Z/rs.new-entity-candidate/node.new-entity-candidate-new-entity-candidate-1.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017.done" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2026-01-13T08:38:31.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] waiting for `done` status in rs map[.pbm.restore/2026-01-13T08:37:06.41089787Z/rs.new-entity-candidate/node.new-entity-candidate-new-entity-candidate-0.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017:{} .pbm.restore/2026-01-13T08:37:06.41089787Z/rs.new-entity-candidate/node.new-entity-candidate-new-entity-candidate-1.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017:{}]
2026-01-13T08:39:07.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] uploading ".pbm.restore/2026-01-13T08:37:06.41089787Z/rs.new-entity-candidate/node.new-entity-candidate-new-entity-candidate-1.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017.hb" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2026-01-13T08:39:07.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] uploading ".pbm.restore/2026-01-13T08:37:06.41089787Z/rs.new-entity-candidate/rs.hb" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2026-01-13T08:39:07.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] uploading ".pbm.restore/2026-01-13T08:37:06.41089787Z/cluster.hb" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
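The .hb files are PBM's restore heartbeats: during a physical restore the mongod processes are taken down, so the agents coordinate through marker files (.hb heartbeats, .done markers) on the backup storage rather than through the database. Per the "waiting for `done` status" line, this agent has written its own .done marker and is waiting for the full set. Listing the sync directory shows which markers actually exist (a sketch, assuming the markers land under the bucket and prefix configured in the spec below):

# bucket "mongodb-backups" and prefix "new-entity-candidate" come from the storage spec below
aws s3 ls --recursive s3://mongodb-backups/new-entity-candidate/.pbm.restore/2026-01-13T08:37:06.41089787Z/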

The PerconaServerMongoDBRestore resource in question looks like this:

k describe perconaservermongodbrestores -n psmdb-new-entity-candidate
Name:         restore-new-entity-candidate-5
Namespace:    psmdb-new-entity-candidate
Labels:       <none>
Annotations:  <none>
API Version:  psmdb.percona.com/v1
Kind:         PerconaServerMongoDBRestore
Metadata:
  Creation Timestamp:  2026-01-13T08:34:43Z
  Generation:          1
  Resource Version:    205262868
  UID:                 7490e4af-dab6-4a84-9c3b-dade5136fc06
Spec:
  Backup Name:   cron-new-entity-candi-20260111230100-9pz6g
  Cluster Name:  new-entity-candidate
Status:
  Pbm Name:  2026-01-13T08:37:06.41089787Z
  State:     running
Events:      <none>
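The state can also be tailed while it sits there (plain kubectl, using the resource name from the describe output):

kubectl get perconaservermongodbrestore restore-new-entity-candidate-5 \
  -n psmdb-new-entity-candidate -w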

And the backups:

NAME                                        CLUSTER           STORAGE      DESTINATION                                                        TYPE       SIZE   STATUS   COMPLETED   AGE
cron-annotation-util-20251227010100-slvmr   annotation-util   s3-us-east   s3://mongodb-backups/annotation-util/2025-12-27T01:01:06Z   physical          ready    17d         17d
cron-annotation-util-20251227030100-bbrtr   annotation-util   s3-us-east   s3://mongodb-backups/annotation-util/2025-12-27T03:14:35Z   physical          ready    17d         17d
cron-annotation-util-20251227050100-h9l8t   annotation-util   s3-us-east   s3://mongodb-backups/annotation-util/2025-12-27T05:05:27Z   physical          ready    17d         17d

Full PSMDB resource spec:

kubectl describe perconaservermongodb -n psmdb-new-entity-candidate
Name:         new-entity-candidate
Namespace:    psmdb-new-entity-candidate
Labels:       app.kubernetes.io/instance=new-entity-candidate-app
              argocd.argoproj.io/instance=new-entity-candidate
Annotations:  argocd.argoproj.io/tracking-id: new-entity-candidate:psmdb.percona.com/PerconaServerMongoDB:psmdb-new-entity-candidate/new-entity-candidate
API Version:  psmdb.percona.com/v1
Kind:         PerconaServerMongoDB
Metadata:
  Creation Timestamp:  2025-12-12T13:37:44Z
  Generation:          6
  Resource Version:    205262527
  UID:                 8626393c-300d-43d5-ba74-451ac38f7c68
Spec:
  Backup:
    Container Security Context:
      Allow Privilege Escalation:  false
      Capabilities:
        Drop:
          ALL
      Privileged:       false
      Run As Non Root:  true
      Run As User:      1001
    Enabled:            true
    Image:              perconalab/percona-server-mongodb-operator:main-backup
    Pitr:
      Enabled:  false
    Storages:
      s3-us-east:
        s3:
          Bucket:              mongodb-backups
          Credentials Secret:  pbm-backup-credentials
          Prefix:              new-entity-candidate
          Region:              us-east-1
        Type:                  s3
    Tasks:
      Compression Level:    6
      Compression Type:     gzip
      Enabled:              true
      Keep:                 3
      Name:                 new-entity-candidate-physical
      Schedule:             0 0 * * *
      Storage Name:         s3-us-east
      Type:                 physical
  Cr Version:               1.20.1
  Enable Volume Expansion:  true
  Image:                    percona/percona-server-mongodb:7.0.18-11
  Init Container Security Context:
    Allow Privilege Escalation:  false
    Capabilities:
      Drop:
        ALL
    Run As Group:     1001
    Run As Non Root:  true
    Run As User:      1001
  Replsets:
    Affinity:
      Advanced:
        Node Affinity:
          Required During Scheduling Ignored During Execution:
            Node Selector Terms:
              Match Expressions:
                Key:      test.io/instance-group
                Operator:  In
                Values:
                  mongo-ebs
                  mongo-ebs-xlarge
                  mongo-ebs-4xlarge
        Pod Anti Affinity:
          Required During Scheduling Ignored During Execution:
            Label Selector:
              Match Expressions:
                Key:       app.kubernetes.io/instance
                Operator:  In
                Values:
                  new-entity-candidate
            Topology Key:  kubernetes.io/hostname
    Annotations:
      k8s.grafana.com/job:                 scrape-percona-server-mongodb
      k8s.grafana.com/metrics.portNumber:  9216
      k8s.grafana.com/scrape:              true
    Configuration:                         replication:
  oplogSizeMB: 1000
security:
  clusterAuthMode: keyFile
  keyFile: /etc/mongodb-secrets/mongodb-key
setParameter:
  enableLocalhostAuthBypass: true

    Container Security Context:
      Allow Privilege Escalation:  false
      Capabilities:
        Drop:
          ALL
      Privileged:       false
      Run As Group:     1001
      Run As Non Root:  true
      Run As User:      1001
    Expose:
      Enabled:  false
    Name:       new-entity-candidate
    Pod Security Context:
      Fs Group:      1001
      Run As Group:  1001
      Run As User:   1001
      Supplemental Groups:
        1001
    Resources:
      Limits:
        Cpu:     1500m
        Memory:  2Gi
      Requests:
        Cpu:     1500m
        Memory:  1Gi
    Sidecars:
      Args:
        --mongodb.uri=mongodb://user:[email protected]:27017/admin
        --collect-all
      Image:  percona/mongodb_exporter:0.40
      Name:   rs-mongo-exporter-0
      Security Context:
        Allow Privilege Escalation:  false
        Capabilities:
          Drop:
            ALL
        Privileged:       false
        Run As Non Root:  true
        Run As User:      1001
    Size:                 2
    Split Horizons:
      new-entity-candidate-new-entity-candidate-0:
        External:  mon-dt-new-entity-candidate-0.test.io
      new-entity-candidate-new-entity-candidate-1:
        External:  mon-dt-new-entity-candidate-1.test.io
    Tolerations:
      Effect:    NoSchedule
      Key:       test.io/mongo-ebs
      Operator:  Exists
    Volume Spec:
      Persistent Volume Claim:
        Annotations:
          ebs.test.io/iops:  3000
        Resources:
          Requests:
            Storage:  10Gi
  Secrets:
    Key File:  new-entity-candidate-keyfile
    Ssl:       mongodb-ssl
  Sharding:
    Enabled:  false
  Tls:
    Allow Invalid Certificates:  true
    Mode:                        preferTLS
  Unsafe Flags:
    Replset Size:   true
    Tls:            true
  Update Strategy:  SmartUpdate
  Users:
    Db:    admin
    Name:  user
    Password Secret Ref:
      Key:   password
      Name:  mongo-user-password
    Roles:
      Db:    admin
      Name:  clusterAdmin
      Db:    admin
      Name:  userAdminAnyDatabase
      Db:    admin
      Name:  readWriteAnyDatabase
Status:
  Backup Config Hash:  6f43a10d901f6b6dad1513a07aff782eb4df1157602b153ca647178feccc9665
  Backup Image:        perconalab/percona-server-mongodb-operator:main-backup
  Backup Version:      2.12.0
  Conditions:
    Last Transition Time:  2025-12-12T13:38:05Z
    Status:                False
    Type:                  sharding
    Last Transition Time:  2025-12-12T13:38:05Z
    Status:                True
    Type:                  initializing
    Last Transition Time:  2025-12-12T13:39:10Z
    Message:               new-entity-candidate: ready
    Reason:                RSReady
    Status:                True
    Type:                  ready
    Last Transition Time:  2025-12-27T18:42:06Z
    Message:               failed to get mongo client: ping mongo: connection() error occurred during connection handshake: dial tcp: lookup new-entity-candidate-new-entity-candidate-0.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local on 100.64.0.10:53: no such host
    Reason:                ErrorReconcile
    Status:                True
    Type:                  error
  Host:                    new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local
  Mongo Image:             percona/percona-server-mongodb:7.0.18-11
  Mongo Version:           7.0.18-11
  Observed Generation:     6
  Ready:                   2
  Replsets:
    New - Entity - Candidate:
      Initialized:  true
      Ready:        2
      Size:         2
      Status:       ready
  Size:             2
  State:            initializing
Events:             <none>

Steps to reproduce

  1. Create a PSMDB cluster
  2. Take a physical backup
  3. Start a physical restore (create a PerconaServerMongoDBRestore resource; see the example manifest below)
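
For step 3, a minimal manifest that matches the describe output above (a sketch; clusterName and backupName are the only spec fields set on the stuck restore):

kubectl apply -n psmdb-new-entity-candidate -f - <<'EOF'
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore-new-entity-candidate-5
spec:
  clusterName: new-entity-candidate
  backupName: cron-new-entity-candi-20260111230100-9pz6g
EOF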

Versions

  1. Kubernetes - 1.29.6
  2. Operator - 1.21.1
  3. Database - 7.0.18

Anything else?

No response
