Report
Hi,
We've recently, out of the blue, started having issues with new clusters failing to bootstrap. We have not been able to identify exactly when this started happening. It appears that the replica set fails to initialise and/or the admin user never gets created. This is the error I am seeing:
handleReplsetInit: exec add admin user: command terminated with exit code 1 / / MongoServerError: not primary
More about the problem
Race Condition in Replica Set Initialization Causing "not primary" Error
Summary
The Percona Server for MongoDB Operator v1.21.0-1.21.2 fails to initialize sharded replica sets due to a race condition where it attempts to create admin users before the primary election completes. This results in the cluster being stuck in an error state with the message: MongoServerError: not primary.
Note: This issue has already been fixed in commit 65f5c6f8 (K8SPSMDB-1451) but the fix is not yet included in any released version.
Environment
- Operator Version: 1.21.0
- Helm Chart Version: psmdb-operator-1.21.1
- Operator Image: percona/percona-server-mongodb-operator:1.21.0
- Image SHA: sha256:bcbf630f7179ca6399c853bebc8a6bce13619158c723d389b2d9534846eb11b0
- Kubernetes Version: v1.29.6
- MongoDB Version: 7.0.28-15
- Cluster Configuration:
  - Sharding enabled: true
  - Replica set size: 3
  - Config server size: 2
  - Mongos instances: 2
  - Role: shardsvr
Problem Description
When deploying a new sharded MongoDB cluster, the operator successfully runs rs.initiate() but does not wait long enough for MongoDB to complete the primary election process. In the old code (v1.21.0-1.21.2), the operator uses a fixed 5-second sleep after rs.initiate(), which is insufficient for sharded clusters where primary election can take longer.
The operator then attempts to create the admin user while the node is still transitioning to PRIMARY state, resulting in:
MongoServerError: not primary
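For illustration only, a minimal Go sketch of that fragile sequence, assuming a hypothetical execMongosh helper in place of the operator's real pod-exec plumbing (this is not the operator's actual code):

package main

import (
    "context"
    "fmt"
    "time"
)

// execMongosh stands in for "run a mongosh script inside the pod"
// (hypothetical helper; the real operator uses the Kubernetes exec API).
func execMongosh(ctx context.Context, pod, script string) error {
    fmt.Printf("exec in %s: %s\n", pod, script)
    return nil
}

// replsetInitWithFixedSleep illustrates the v1.21.0-1.21.2 behaviour: a fixed
// 5-second sleep is the only thing separating rs.initiate() from the first
// command that requires a writable PRIMARY.
func replsetInitWithFixedSleep(ctx context.Context, pod string) error {
    if err := execMongosh(ctx, pod, "rs.initiate()"); err != nil {
        return err
    }
    time.Sleep(5 * time.Second) // on sharded clusters the election can take longer
    // If the node has not finished transitioning to PRIMARY, this fails
    // with "MongoServerError: not primary".
    return execMongosh(ctx, pod, `db.getSiblingDB("admin").createUser({user: "userAdmin", pwd: "<redacted>", roles: ["userAdminAnyDatabase"]})`)
}

func main() {
    _ = replsetInitWithFixedSleep(context.Background(), "mon-newer-cluster-0")
}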
This leaves the cluster in an indefinite error state where:
- The replica set never completes initialization
- Pods fail liveness probes after ~4 minutes
- Pods restart in a loop
- The cluster status shows error with the message: handleReplsetInit: exec add admin user: command terminated with exit code 1 / / MongoServerError: not primary
Steps to Reproduce
- Deploy the operator
- Create a new PSMDB cluster with sharding enabled:
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: mon
  namespace: psmdb-newer-cluster
spec:
  crVersion: 1.21.0
  sharding:
    enabled: true
    configsvrReplSet:
      size: 2
    mongos:
      size: 2
  replsets:
    - name: newer-cluster
      size: 3
      configuration: |
        operationProfiling:
          mode: slowOp
- Observe the operator logs and cluster status
Expected Behavior
The operator should:
- Run rs.initiate()
- Poll db.hello().isWritablePrimary with exponential backoff until the node becomes PRIMARY (a rough sketch follows this list)
- Only then create the admin user
- Mark the replica set as initialized
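A rough sketch of that waiting step in Go, assuming a hypothetical isWritablePrimary helper that runs db.hello() inside the pod (the actual fix in commit 65f5c6f8 lives in the operator's replset-init code and differs in detail):

package main

import (
    "context"
    "fmt"
    "time"
)

// isWritablePrimary stands in for "exec db.hello() in the pod and read
// isWritablePrimary from the result" (hypothetical helper).
func isWritablePrimary(ctx context.Context, pod string) (bool, error) {
    // ...exec mongosh and parse the reply...
    return true, nil
}

// waitForPrimary polls with capped exponential backoff until the node reports
// itself as a writable primary, or until the context expires.
func waitForPrimary(ctx context.Context, pod string) error {
    backoff := 500 * time.Millisecond
    const maxBackoff = 10 * time.Second
    for {
        ok, err := isWritablePrimary(ctx, pod)
        if err == nil && ok {
            return nil
        }
        select {
        case <-ctx.Done():
            return fmt.Errorf("waiting for %s to become primary: %w", pod, ctx.Err())
        case <-time.After(backoff):
        }
        backoff *= 2
        if backoff > maxBackoff {
            backoff = maxBackoff
        }
    }
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
    defer cancel()
    if err := waitForPrimary(ctx, "mon-newer-cluster-0"); err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println("primary elected; safe to create the admin user")
}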
Actual Behavior
Operator logs show:
2026-02-03T14:46:14.225Z INFO initiating replset {"replset": "newer-cluster", "pod": "mon-newer-cluster-0"}
2026-02-03T14:46:15.103Z INFO replset initialized {"replset": "newer-cluster", "pod": "mon-newer-cluster-0"}
2026-02-03T14:46:20.103Z INFO creating user admin {"replset": "newer-cluster", "pod": "mon-newer-cluster-0", "user": "userAdmin"}
2026-02-03T14:46:20.636Z ERROR failed to reconcile cluster {"replset": "newer-cluster", "error": "handleReplsetInit: exec add admin user: command terminated with exit code 1 / / MongoServerError: not primary\n"}
Cluster status:
status:
  conditions:
    - lastTransitionTime: "2026-02-03T14:18:47Z"
      message: "handleReplsetInit: exec add admin user: command terminated with exit code 1 / / MongoServerError: not primary"
      reason: ErrorReconcile
      status: "True"
      type: error
  replsets:
    newer-cluster:
      ready: 0
      size: 3
      status: initializing
  state: error
MongoDB state:
$ mongosh --eval "db.hello()"
{
isWritablePrimary: false,
secondary: false,
ok: 1
}
$ mongosh --eval "rs.status()"
MongoServerError: no replset config has been received
Pod events:
Warning Unhealthy Liveness probe failed
Normal Killing Container mongod failed liveness probe, will be restarted
MongoDB logs show:
{"c":"QUERY", "msg":"Aggregate command executor error", "error":{"codeName":"NamespaceNotFound","errmsg":"Unable to retrieve storageStats in $collStats stage :: caused by :: Collection [local.oplog.rs] not found."}}
{"c":"-", "msg":"Failed to refresh key cache", "error":"ReadConcernMajorityNotAvailableYet: Read concern majority reads are currently not possible."}The oplog was never created because the replica set initialization never completed.
Additional Information
Full cluster status
Name: mon
Namespace: psmdb-newer-cluster
Labels: app.kubernetes.io/instance=newer-cluster-app
argocd.argoproj.io/instance=newer-cluster
Annotations: argocd.argoproj.io/tracking-id: newer-cluster:psmdb.percona.com/PerconaServerMongoDB:psmdb-newer-cluster/mon
API Version: psmdb.percona.com/v1
Kind: PerconaServerMongoDB
Metadata:
Creation Timestamp: 2026-02-03T14:17:45Z
Generation: 1
Resource Version: 232418146
UID: c47890a5-5aae-4147-8eff-5e2dc25aa5be
Spec:
Backup:
Container Security Context:
Allow Privilege Escalation: false
Capabilities:
Drop:
ALL
Privileged: false
Run As Non Root: true
Run As User: 1001
Enabled: true
Image: perconalab/percona-server-mongodb-operator:main-backup
Pitr:
Enabled: false
Resources:
Limits:
Cpu: 500m
Memory: 1Gi
Starting Deadline Seconds: 1200
Storages:
s3-us-east:
s3:
Bucket: mongodb-k8s-backups
Credentials Secret: pbm-backup-credentials
Prefix: newer-cluster
Region: us-east-1
Type: s3
Tasks:
Compression Level: 6
Compression Type: gzip
Enabled: true
Keep: 3
Name: newer-cluster-physical
Schedule: 0 0 * * *
Storage Name: s3-us-east
Type: physical
Cr Version: 1.20.1
Enable Volume Expansion: true
Image: percona/percona-server-mongodb:7.0.28-15
Init Container Security Context:
Allow Privilege Escalation: false
Capabilities:
Drop:
ALL
Run As Group: 1001
Run As Non Root: true
Run As User: 1001
Replsets:
Affinity:
Advanced:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: test.io/instance-group
Operator: In
Values:
mongo-ebs
mongo-ebs-xlarge
mongo-ebs-4xlarge
Pod Anti Affinity:
Required During Scheduling Ignored During Execution:
Label Selector:
Match Expressions:
Key: app.kubernetes.io/instance
Operator: In
Values:
newer-cluster
Topology Key: kubernetes.io/hostname
Annotations:
k8s.grafana.com/job: scrape-percona-server-mongodb
k8s.grafana.com/metrics.portNumber: 9216
k8s.grafana.com/scrape: true
Configuration: replication:
oplogSizeMB: 1000
security:
clusterAuthMode: keyFile
keyFile: /etc/mongodb-secrets/mongodb-key
setParameter:
enableLocalhostAuthBypass: true
initialSyncSourceReadPreference: secondaryPreferred
Container Security Context:
Allow Privilege Escalation: false
Capabilities:
Drop:
ALL
Privileged: false
Run As Group: 1001
Run As Non Root: true
Run As User: 1001
Liveness Probe:
Failure Threshold: 5
Initial Delay Seconds: 90
Startup Delay Seconds: 300
Timeout Seconds: 30
Name: newer-cluster
Pod Security Context:
Fs Group: 1001
Run As Group: 1001
Run As User: 1001
Supplemental Groups:
1001
Resources:
Limits:
Cpu: 1
Memory: 2Gi
Requests:
Cpu: 500m
Memory: 1Gi
Sidecars:
Args:
--mongodb.uri=mongodb://$(MONGO_USER):$(MONGO_PASSWORD)@127.0.0.1:27017/admin
--collect-all
--compatible-mode
Env:
Name: MONGO_USER
Value: clusterMonitor
Name: MONGO_PASSWORD
Value From:
Secret Key Ref:
Key: MONGODB_CLUSTER_MONITOR_PASSWORD
Name: percona-server-mongodb-users
Image: percona/mongodb_exporter:0.40
Name: rs-mongo-exporter-0
Security Context:
Allow Privilege Escalation: false
Capabilities:
Drop:
ALL
Privileged: false
Run As Non Root: true
Run As User: 1001
Size: 3
Split Horizons:
newer-cluster-newer-cluster-0:
External: mon-dt-newer-cluster-0.test.io
newer-cluster-newer-cluster-1:
External: mon-dt-newer-cluster-1.test.io
newer-cluster-newer-cluster-2:
External: mon-dt-newer-cluster-2.test.io
Tolerations:
Effect: NoSchedule
Key: test.io/mongo-ebs
Operator: Exists
Volume Spec:
Persistent Volume Claim:
Annotations:
ebs.test.io/iops: 3000
Resources:
Requests:
Storage: 20Gi
Secrets:
Key File: newer-cluster-keyfile
Ssl: mongodb-ssl
Sharding:
Configsvr Repl Set:
Affinity:
Advanced:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: test.io/instance-group
Operator: In
Values:
mongo-ebs
mongo-ebs-xlarge
mongo-ebs-4xlarge
Pod Anti Affinity:
Required During Scheduling Ignored During Execution:
Label Selector:
Match Expressions:
Key: app.kubernetes.io/instance
Operator: In
Values:
newer-cluster
Topology Key: kubernetes.io/hostname
Container Security Context:
Allow Privilege Escalation: false
Capabilities:
Drop:
ALL
Privileged: false
Run As Group: 1001
Run As Non Root: true
Run As User: 1001
Pod Security Context:
Fs Group: 1001
Run As Group: 1001
Run As User: 1001
Supplemental Groups:
1001
Size: 2
Tolerations:
Effect: NoSchedule
Key: test.io/mongo-ebs
Operator: Exists
Volume Spec:
Persistent Volume Claim:
Annotations:
ebs.test.io/iops: 3000
Resources:
Requests:
Storage: 10Gi
Enabled: true
Mongos:
Affinity:
Advanced:
Node Affinity:
Required During Scheduling Ignored During Execution:
Node Selector Terms:
Match Expressions:
Key: test.io/instance-group
Operator: In
Values:
mongo-ebs
mongo-ebs-xlarge
mongo-ebs-4xlarge
Pod Anti Affinity:
Required During Scheduling Ignored During Execution:
Label Selector:
Match Expressions:
Key: app.kubernetes.io/instance
Operator: In
Values:
newer-cluster
Topology Key: kubernetes.io/hostname
Container Security Context:
Allow Privilege Escalation: false
Capabilities:
Drop:
ALL
Privileged: false
Run As Group: 1001
Run As Non Root: true
Run As User: 1001
Expose:
Annotations:
external-dns.alpha.kubernetes.io/hostname: mon-dt-newer-cluster.test.io
service.beta.kubernetes.io/aws-load-balancer-internal: true
service.beta.kubernetes.io/aws-load-balancer-security-groups: services-elb.k8s-devtrack-alpha-mongodb.test.io, mongodb-elb.k8s-devtrack-alpha-mongodb.test.io
Load Balancer Class: service.k8s.aws/nlb
Type: LoadBalancer
Labels:
app.kubernetes.io/component: mongos
Attach - Statefulset - Servicename: newer-cluster-cluster-ip
Pod Security Context:
Fs Group: 1001
Run As Group: 1001
Run As User: 1001
Supplemental Groups:
1001
Resources:
Limits:
Cpu: 2
Memory: 2Gi
Requests:
Cpu: 1
Memory: 1500Mi
Size: 2
Tolerations:
Effect: NoSchedule
Key: test.io/mongo-ebs
Operator: Exists
Tls:
Allow Invalid Certificates: true
Mode: preferTLS
Unsafe Flags:
Replset Size: true
Tls: true
Update Strategy: SmartUpdate
Users:
Db: admin
Name: op
Password Secret Ref:
Key: op
Name: mongo-global-passwords
Roles:
Db: admin
Name: clusterAdmin
Db: admin
Name: userAdminAnyDatabase
Db: admin
Name: readWriteAnyDatabase
Db: admin
Name: users_management
Password Secret Ref:
Key: password
Name: users-management-password
Roles:
Db: admin
Name: userAdminAnyDatabase
Status:
Conditions:
Last Transition Time: 2026-02-03T14:17:46Z
Status: True
Type: sharding
Last Transition Time: 2026-02-03T14:17:47Z
Status: True
Type: initializing
Last Transition Time: 2026-02-03T14:18:47Z
Message: handleReplsetInit: exec add admin user: command terminated with exit code 1 / / MongoServerError: not primary
Reason: ErrorReconcile
Status: True
Type: error
Last Transition Time: 2026-02-03T14:32:43Z
Reason: MongosReady
Status: True
Type: ready
Host: k8s-psmdbnew-monmongo-cca3ef6254-910797f09652cf21.elb.us-east-1.amazonaws.com
Message: Error: handleReplsetInit: exec add admin user: command terminated with exit code 1 / / MongoServerError: not primary
Mongos:
Ready: 2
Size: 2
Status: ready
Observed Generation: 1
Ready: 4
Replsets:
Cfg:
Initialized: true
Members:
mon-cfg-0:
Name: mon-cfg-0.mon-cfg.psmdb-newer-cluster.svc.cluster.local:27017
State: 1
State Str: PRIMARY
mon-cfg-1:
Name: mon-cfg-1.mon-cfg.psmdb-newer-cluster.svc.cluster.local:27017
State: 2
State Str: SECONDARY
Ready: 2
Size: 2
Status: ready
Newer - Cluster:
Message: mongod: back-off 5m0s restarting failed container=mongod pod=mon-newer-cluster-0_psmdb-newer-cluster(d9069047-15d6-4e84-aee7-ed44570e1839); mongod: back-off 5m0s restarting failed container=mongod pod=mon-newer-cluster-1_psmdb-newer-cluster(e91de43f-c936-4139-8429-f53c4f20d35e); mongod: back-off 5m0s restarting failed container=mongod pod=mon-newer-cluster-2_psmdb-newer-cluster(95134698-d7ce-46f1-9f28-22c30e1ea472);
Ready: 0
Size: 3
Status: initializing
Size: 7
State: error
Events: <none>
The dump shows the replset status as initializing with 0/3 ready members, and the error message about admin user creation failing.
Pod resource usage
NAME CPU(cores) MEMORY(bytes)
mon-newer-cluster-0 23m 212Mi
mon-newer-cluster-1 18m 198Mi
mon-newer-cluster-2 15m 217Mi
Steps to reproduce
Versions
- Kubernetes - 1.29.6
- Operator - 1.21.1
- Database - 7.0.18-22
Anything else?
No response