Report
Hi,
We're having issues with PBM physical restores getting stuck in "running". This happens across multiple clusters and with multiple backups per cluster (i.e. it is not isolated to specific backups within a cluster).
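The stuck state is visible directly on the restore object, e.g. (resource name and namespace as in the resources shown below):
kubectl get perconaservermongodbrestores -n psmdb-new-entity-candidate
# restore-new-entity-candidate-5 stays in "running" indefinitely and never reaches a terminal state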
More about the problem
The restore starts running (the backup-agent container gets stopped, etc.), and then we start seeing operator logs like:
DEBUG checking if restore is allowed {"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", "PerconaServerMongoDBRestore": {"name":"restore-new-entity-candidate-5","namespace":"psmdb-new-entity-candidate"}, "namespace": "psmdb-new-entity-candidate", "name": "restore-new-entity-candidate-5", "reconcileID": "f15e7345-ab10-4077-b095-f4841a5f0083", "cluster": "new-entity-candidate", "namespace": "psmdb-new-entity-candidate"}
DEBUG Check restore status {"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", "PerconaServerMongoDBRestore": {"name":"restore-new-entity-candidate-5","namespace":"psmdb-new-entity-candidate"}, "namespace": "psmdb-new-entity-candidate", "name": "restore-new-entity-candidate-5", "reconcileID": "f15e7345-ab10-4077-b095-f4841a5f0083", "command": ["/opt/percona/pbm", "describe-restore", "2026-01-13T08:37:06.41089787Z", "--config", "/etc/pbm/pbm_config.yaml", "--out", "json"], "pod": "new-entity-candidate-new-entity-candidate-0"}
DEBUG PBM restore status {"controller": "psmdbrestore-controller", "controllerGroup": "psmdb.percona.com", "controllerKind": "PerconaServerMongoDBRestore", "PerconaServerMongoDBRestore": {"name":"restore-new-entity-candidate-5","namespace":"psmdb-new-entity-candidate"}, "namespace": "psmdb-new-entity-candidate", "name": "restore-new-entity-candidate-5", "reconcileID": "f15e7345-ab10-4077-b095-f4841a5f0083", "status": {"type":"physical","opid":"","name":"2026-01-13T08:37:06.41089787Z","replsets":[{"name":"new-entity-candidate","start_ts":0,"status":"running","size":0,"size_uncompressed":0,"last_transition_ts":1768293448,"first_write_ts":{"T":0,"I":0},"last_write_ts":{"T":0,"I":0},"node":"","conditions":null}],"compression":"","store":{"type":""},"size":0,"size_uncompressed":0,"mongodb_version":"","fcv":"","start_ts":0,"last_transition_ts":1768293453,"first_write_ts":{"T":0,"I":0},"last_write_ts":{"T":0,"I":0},"hb":{"T":0,"I":0},"status":"running","conditions":null,"n":null,"pbm_version":"","balancer":""}}
It seems like the other replica in the set gets stuck in an unhealthy, unreachable state (rs.status() output from the primary):
members: [
{
_id: 0,
name: 'new-entity-candidate-new-entity-candidate-0.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017',
health: 1,
state: 1,
stateStr: 'PRIMARY',
uptime: 2543,
optime: { ts: Timestamp({ t: 1768295948, i: 1 }), t: Long('292') },
optimeDate: ISODate('2026-01-13T09:19:08.000Z'),
lastAppliedWallTime: ISODate('2026-01-13T09:19:08.043Z'),
lastDurableWallTime: ISODate('2026-01-13T09:19:08.043Z'),
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
electionTime: Timestamp({ t: 1768293418, i: 1 }),
electionDate: ISODate('2026-01-13T08:36:58.000Z'),
configVersion: 13384721,
configTerm: -1,
self: true,
lastHeartbeatMessage: ''
},
{
_id: 1,
name: 'new-entity-candidate-new-entity-candidate-1.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017',
health: 0,
state: 8,
stateStr: '(not reachable/healthy)',
uptime: 0,
optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long('-1') },
optimeDate: ISODate('1970-01-01T00:00:00.000Z'),
optimeDurableDate: ISODate('1970-01-01T00:00:00.000Z'),
lastAppliedWallTime: ISODate('2026-01-13T08:37:46.912Z'),
lastDurableWallTime: ISODate('2026-01-13T08:37:46.912Z'),
lastHeartbeat: ISODate('2026-01-13T09:19:06.549Z'),
lastHeartbeatRecv: ISODate('2026-01-13T08:37:51.577Z'),
pingMs: Long('0'),
lastHeartbeatMessage: 'Error connecting to new-entity-candidate-new-entity-candidate-1.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017 (10.203.64.43:27017) :: caused by :: onInvoke :: caused by :: Connection refused',
syncSourceHost: '',
syncSourceId: -1,
infoMessage: '',
configVersion: 13384721,
configTerm: -1
}
],
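For reference, that rs.status() output can be reproduced with something like the following (the user, password placeholder, and container name are assumptions, not exact values from our setup):
kubectl exec -n psmdb-new-entity-candidate new-entity-candidate-new-entity-candidate-0 -c mongod -- \
  mongosh "mongodb://clusterAdmin:<password>@localhost:27017/admin" --quiet --eval 'rs.status().members'
# member _id 1 keeps reporting state 8 "(not reachable/healthy)" with "Connection refused" heartbeat errors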
And the container logs show an endless loop of ".hb" file uploads:
2026-01-13T08:38:19.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/index-90--4004084906419493573.wt.gz> to </data/db/index-90--4004084906419493573.wt>
2026-01-13T08:38:19.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/sizeStorer.wt.gz> to </data/db/sizeStorer.wt>
2026-01-13T08:38:19.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/WiredTiger.backup.gz> to </data/db/WiredTiger.backup>
2026-01-13T08:38:19.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/WiredTiger.gz> to </data/db/WiredTiger>
2026-01-13T08:38:19.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/storage.bson.gz.0-114> to </data/db/storage.bson>
2026-01-13T08:38:20.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/journal/WiredTigerLog.0000000029.gz> to </data/db/journal/WiredTigerLog.0000000029>
2026-01-13T08:38:20.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/key.db/WiredTigerLog.0000000029.gz> to </data/db/key.db/WiredTigerLog.0000000029>
2026-01-13T08:38:20.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] copy <2026-01-12T21:31:57Z/new-entity-candidate/key.db/WiredTigerLog.0000000028.gz> to </data/db/key.db/WiredTigerLog.0000000028>
2026-01-13T08:38:20.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] preparing data
2026-01-13T08:38:21.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] oplogTruncateAfterPoint: {1768253519 1}
2026-01-13T08:38:23.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] recovering oplog as standalone
2026-01-13T08:38:27.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] clean-up and reset replicaset config
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmAgents'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmBackups'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmRestores'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmCmd'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmPITRChunks'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmPITR'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmOpLog'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmLog'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmLock'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmLock'
2026-01-13T08:38:29.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] dropping 'admin.pbmLockOp'
2026-01-13T08:38:31.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] restore on node succeed
2026-01-13T08:38:31.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] moving to state done
2026-01-13T08:38:31.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] uploading ".pbm.restore/2026-01-13T08:37:06.41089787Z/rs.new-entity-candidate/node.new-entity-candidate-new-entity-candidate-1.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017.done" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2026-01-13T08:38:31.000+0000 I [restore/2026-01-13T08:37:06.41089787Z] waiting for `done` status in rs map[.pbm.restore/2026-01-13T08:37:06.41089787Z/rs.new-entity-candidate/node.new-entity-candidate-new-entity-candidate-0.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017:{} .pbm.restore/2026-01-13T08:37:06.41089787Z/rs.new-entity-candidate/node.new-entity-candidate-new-entity-candidate-1.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017:{}]
2026-01-13T08:39:07.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] uploading ".pbm.restore/2026-01-13T08:37:06.41089787Z/rs.new-entity-candidate/node.new-entity-candidate-new-entity-candidate-1.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local:27017.hb" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2026-01-13T08:39:07.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] uploading ".pbm.restore/2026-01-13T08:37:06.41089787Z/rs.new-entity-candidate/rs.hb" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
2026-01-13T08:39:07.000+0000 D [restore/2026-01-13T08:37:06.41089787Z] uploading ".pbm.restore/2026-01-13T08:37:06.41089787Z/cluster.hb" [size hint: 10 (10.00B); part size: 10485760 (10.00MB)]
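To see which sync markers PBM has actually written for this restore, the restore metadata prefix in the bucket can be listed (bucket and prefix come from the storage config below; assuming the .pbm.restore directory sits under that prefix, matching the paths in the log above):
aws s3 ls --recursive s3://mongodb-backups/new-entity-candidate/.pbm.restore/2026-01-13T08:37:06.41089787Z/
# lists the per-node .done markers and the .hb heartbeat files that keep being re-uploaded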
The PerconaServerMongoDBRestore resource in question looks like this:
kubectl describe perconaservermongodbrestores -n psmdb-new-entity-candidate
Name: restore-new-entity-candidate-5
Namespace: psmdb-new-entity-candidate
Labels: <none>
Annotations: <none>
API Version: psmdb.percona.com/v1
Kind: PerconaServerMongoDBRestore
Metadata:
Creation Timestamp: 2026-01-13T08:34:43Z
Generation: 1
Resource Version: 205262868
UID: 7490e4af-dab6-4a84-9c3b-dade5136fc06
Spec:
Backup Name: cron-new-entity-candi-20260111230100-9pz6g
Cluster Name: new-entity-candidate
Status:
Pbm Name: 2026-01-13T08:37:06.41089787Z
State: running
Events: <none>
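For completeness, the restore was created with a manifest equivalent to the following (reconstructed from the Spec above; only backupName and clusterName are set):
kubectl apply -n psmdb-new-entity-candidate -f - <<'EOF'
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore-new-entity-candidate-5
spec:
  backupName: cron-new-entity-candi-20260111230100-9pz6g
  clusterName: new-entity-candidate
EOF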
And the backups:
NAME CLUSTER STORAGE DESTINATION TYPE SIZE STATUS COMPLETED AGE
cron-annotation-util-20251227010100-slvmr annotation-util s3-us-east s3://mongodb-backups/annotation-util/2025-12-27T01:01:06Z physical ready 17d 17d
cron-annotation-util-20251227030100-bbrtr annotation-util s3-us-east s3://mongodb-backups/annotation-util/2025-12-27T03:14:35Z physical ready 17d 17d
cron-annotation-util-20251227050100-h9l8t annotation-util s3-us-east s3://mongodb-backups/annotation-util/2025-12-27T05:05:27Z physical ready 17d 17d
Full PSMDB resource spec:
kubectl describe perconaservermongodb -n psmdb-new-entity-candidate
Name: new-entity-candidate
Namespace: psmdb-new-entity-candidate
Labels: app.kubernetes.io/instance=new-entity-candidate-app
argocd.argoproj.io/instance=new-entity-candidate
Annotations: argocd.argoproj.io/tracking-id: new-entity-candidate:psmdb.percona.com/PerconaServerMongoDB:psmdb-new-entity-candidate/new-entity-candidate
API Version: psmdb.percona.com/v1
Kind: PerconaServerMongoDB
Metadata:
Creation Timestamp: 2025-12-12T13:37:44Z
Generation: 6
Resource Version: 205262527
UID: 8626393c-300d-43d5-ba74-451ac38f7c68
Spec:
Backup:
Container Security Context:
Allow Privilege Escalation: false
Capabilities:
Drop:
ALL
Privileged: false
Run As Non Root: true
Run As User: 1001
Enabled: true
Image: perconalab/percona-server-mongodb-operator:main-backup
Pitr:
Enabled: false
Storages:
s3-us-east:
s3:
Bucket: mongodb-backups
Credentials Secret: pbm-backup-credentials
Prefix: new-entity-candidate
Region: us-east-1
Type: s3
Tasks:
Compression Level: 6
Compression Type: gzip
Enabled: true
Keep: 3
Name: new-entity-candidate-physical
Schedule: 0 0 * * *
Storage Name: s3-us-east
Type: physical
Cr Version: 1.20.1
Enable Volume Expansion: true
Image: percona/percona-server-mongodb:7.0.18-11
Init Container Security Context:
Allow Privilege Escalation: false
Capabilities:
Drop:
ALL
Run As Group: 1001
Run As Non Root: true
Run As User: 1001
Replsets:
Affinity:
Advanced:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: test.io/instance-group
Operator: In
Values:
mongo-ebs
mongo-ebs-xlarge
mongo-ebs-4xlarge
Pod Anti Affinity:
Required During Scheduling Ignored During Execution:
Label Selector:
Match Expressions:
Key: app.kubernetes.io/instance
Operator: In
Values:
new-entity-candidate
Topology Key: kubernetes.io/hostname
Annotations:
k8s.grafana.com/job: scrape-percona-server-mongodb
k8s.grafana.com/metrics.portNumber: 9216
k8s.grafana.com/scrape: true
Configuration: replication:
oplogSizeMB: 1000
security:
clusterAuthMode: keyFile
keyFile: /etc/mongodb-secrets/mongodb-key
setParameter:
enableLocalhostAuthBypass: true
Container Security Context:
Allow Privilege Escalation: false
Capabilities:
Drop:
ALL
Privileged: false
Run As Group: 1001
Run As Non Root: true
Run As User: 1001
Expose:
Enabled: false
Name: new-entity-candidate
Pod Security Context:
Fs Group: 1001
Run As Group: 1001
Run As User: 1001
Supplemental Groups:
1001
Resources:
Limits:
Cpu: 1500m
Memory: 2Gi
Requests:
Cpu: 1500m
Memory: 1Gi
Sidecars:
Args:
--mongodb.uri=mongodb://user:[email protected]:27017/admin
--collect-all
Image: percona/mongodb_exporter:0.40
Name: rs-mongo-exporter-0
Security Context:
Allow Privilege Escalation: false
Capabilities:
Drop:
ALL
Privileged: false
Run As Non Root: true
Run As User: 1001
Size: 2
Split Horizons:
new-entity-candidate-new-entity-candidate-0:
External: mon-dt-new-entity-candidate-0.test.io
new-entity-candidate-new-entity-candidate-1:
External: mon-dt-new-entity-candidate-1.test.io
Tolerations:
Effect: NoSchedule
Key: test.io/mongo-ebs
Operator: Exists
Volume Spec:
Persistent Volume Claim:
Annotations:
ebs.test.io/iops: 3000
Resources:
Requests:
Storage: 10Gi
Secrets:
Key File: new-entity-candidate-keyfile
Ssl: mongodb-ssl
Sharding:
Enabled: false
Tls:
Allow Invalid Certificates: true
Mode: preferTLS
Unsafe Flags:
Replset Size: true
Tls: true
Update Strategy: SmartUpdate
Users:
Db: admin
Name: user
Password Secret Ref:
Key: password
Name: mongo-user-password
Roles:
Db: admin
Name: clusterAdmin
Db: admin
Name: userAdminAnyDatabase
Db: admin
Name: readWriteAnyDatabase
Status:
Backup Config Hash: 6f43a10d901f6b6dad1513a07aff782eb4df1157602b153ca647178feccc9665
Backup Image: perconalab/percona-server-mongodb-operator:main-backup
Backup Version: 2.12.0
Conditions:
Last Transition Time: 2025-12-12T13:38:05Z
Status: False
Type: sharding
Last Transition Time: 2025-12-12T13:38:05Z
Status: True
Type: initializing
Last Transition Time: 2025-12-12T13:39:10Z
Message: new-entity-candidate: ready
Reason: RSReady
Status: True
Type: ready
Last Transition Time: 2025-12-27T18:42:06Z
Message: failed to get mongo client: ping mongo: connection() error occurred during connection handshake: dial tcp: lookup new-entity-candidate-new-entity-candidate-0.new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local on 100.64.0.10:53: no such host
Reason: ErrorReconcile
Status: True
Type: error
Host: new-entity-candidate-new-entity-candidate.psmdb-new-entity-candidate.svc.cluster.local
Mongo Image: percona/percona-server-mongodb:7.0.18-11
Mongo Version: 7.0.18-11
Observed Generation: 6
Ready: 2
Replsets:
New - Entity - Candidate:
Initialized: true
Ready: 2
Size: 2
Status: ready
Size: 2
State: initializing
Events: <none>
Steps to reproduce
- Create a PSMDB MongoDB cluster
- Take a physical backup
- Start a restore from that backup (create a PerconaServerMongoDBRestore)
Versions
- Kubernetes - 1.29.6
- Operator - 1.21.1
- Database - 7.0.18
Anything else?
No response