
Sharded cluster bootstrapping fails #2228

@cjesper

Description

Report

Hi,
We've recently, out of the blue, started having issues with new clusters failing to bootstrap. We have not been able to identify exactly when this started happening. It appears that the replica set fails to initialise and/or the admin user fails to be created. This is the error I am seeing:
handleReplsetInit: exec add admin user: command terminated with exit code 1 / / MongoServerError: not primary

More about the problem

Race Condition in Replica Set Initialization Causing "not primary" Error

Summary

The Percona Server for MongoDB Operator (v1.21.0 through v1.21.2) fails to initialize sharded replica sets due to a race condition: it attempts to create the admin user before the primary election completes. This leaves the cluster stuck in an error state with the message: MongoServerError: not primary.

Note: This issue has already been fixed in commit 65f5c6f8 (K8SPSMDB-1451) but the fix is not yet included in any released version.

Environment

  • Operator Version: 1.21.0
  • Helm Chart Version: psmdb-operator-1.21.1
  • Operator Image: percona/percona-server-mongodb-operator:1.21.0
  • Image SHA: sha256:bcbf630f7179ca6399c853bebc8a6bce13619158c723d389b2d9534846eb11b0
  • Kubernetes Version: v1.29.6
  • MongoDB Version: 7.0.28-15
  • Cluster Configuration:
    • Sharding enabled: true
    • Replica set size: 3
    • Config server size: 2
    • Mongos instances: 2
    • Role: shardsvr

Problem Description

When deploying a new sharded MongoDB cluster, the operator successfully runs rs.initiate() but does not wait long enough for MongoDB to complete the primary election. In the affected releases (v1.21.0 through v1.21.2), the operator uses a fixed 5-second sleep after rs.initiate(), which is insufficient for sharded clusters, where primary election can take longer.
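
For illustration, the affected sequence is roughly equivalent to the sketch below. This is a hedged reconstruction using the MongoDB Go driver with made-up names; the actual operator execs commands inside the pod rather than calling the driver directly.

package main

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// handleReplsetInitRacy mirrors the affected flow: initiate the replica set,
// sleep a fixed five seconds, then create the admin user with no check that
// a primary has actually been elected.
func handleReplsetInitRacy(ctx context.Context, client *mongo.Client) error {
	// Equivalent of rs.initiate() with a default config.
	if err := client.Database("admin").
		RunCommand(ctx, bson.D{{Key: "replSetInitiate", Value: bson.D{}}}).Err(); err != nil {
		return err
	}

	// Fixed sleep: often too short for sharded clusters, where the
	// election can outlast it.
	time.Sleep(5 * time.Second)

	// createUser must run on a writable primary; if the election is still
	// in progress this fails with "MongoServerError: not primary".
	return client.Database("admin").RunCommand(ctx, bson.D{
		{Key: "createUser", Value: "userAdmin"},
		{Key: "pwd", Value: "change-me"},
		{Key: "roles", Value: bson.A{
			bson.D{{Key: "role", Value: "userAdminAnyDatabase"}, {Key: "db", Value: "admin"}},
		}},
	}).Err()
}

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().
		ApplyURI("mongodb://localhost:27017/?directConnection=true"))
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(ctx)

	if err := handleReplsetInitRacy(ctx, client); err != nil {
		panic(err) // e.g. (NotWritablePrimary) not primary
	}
}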

The operator then attempts to create the admin user while the node is still transitioning to PRIMARY state, resulting in:

MongoServerError: not primary

This leaves the cluster in an indefinite error state where:

  • The replica set never completes initialization
  • Pods fail liveness probes after ~4 minutes
  • Pods restart in a loop
  • The cluster status shows error with message: handleReplsetInit: exec add admin user: command terminated with exit code 1 / / MongoServerError: not primary

Steps to Reproduce

  1. Deploy the operator
  2. Create a new PSMDB cluster with sharding enabled:
    apiVersion: psmdb.percona.com/v1
    kind: PerconaServerMongoDB
    metadata:
      name: mon
      namespace: psmdb-newer-cluster
    spec:
      crVersion: 1.21.0
      sharding:
        enabled: true
        configsvrReplSet:
          size: 2
        mongos:
          size: 2
      replsets:
        - name: newer-cluster
          size: 3
          configuration: |
            operationProfiling:
              mode: slowOp
  3. Observe the operator logs and cluster status

Expected Behavior

The operator should:

  1. Run rs.initiate()
  2. Poll db.hello().isWritablePrimary with exponential backoff until the node becomes PRIMARY (see the sketch after this list)
  3. Only then create the admin user
  4. Mark the replica set as initialized
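
A minimal sketch of that polling step, assuming the MongoDB Go driver and a direct connection to the pod (hypothetical names; the actual fix in commit 65f5c6f8 may differ in detail):

package main

import (
	"context"
	"fmt"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// waitForPrimary polls the hello command until the node reports
// isWritablePrimary: true, doubling the delay between attempts up to a cap.
func waitForPrimary(ctx context.Context, client *mongo.Client, maxWait time.Duration) error {
	deadline := time.Now().Add(maxWait)
	backoff := 500 * time.Millisecond

	for time.Now().Before(deadline) {
		var res struct {
			IsWritablePrimary bool `bson:"isWritablePrimary"`
		}
		err := client.Database("admin").
			RunCommand(ctx, bson.D{{Key: "hello", Value: 1}}).
			Decode(&res)
		if err == nil && res.IsWritablePrimary {
			return nil // election finished; safe to create the admin user
		}

		time.Sleep(backoff)
		if backoff < 8*time.Second {
			backoff *= 2 // exponential backoff, capped at 8s
		}
	}
	return fmt.Errorf("node did not become primary within %s", maxWait)
}

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().
		ApplyURI("mongodb://localhost:27017/?directConnection=true"))
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(ctx)

	if err := waitForPrimary(ctx, client, 2*time.Minute); err != nil {
		panic(err)
	}
	fmt.Println("primary elected; proceeding to create the admin user")
}

Capping the per-attempt delay keeps initialization fast when elections finish quickly, while the overall deadline surfaces a genuinely stuck replica set as an error instead of blocking the reconcile loop forever.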

Actual Behavior

Operator logs show a fixed five-second gap between "replset initialized" and the admin user attempt:

2026-02-03T14:46:14.225Z  INFO  initiating replset  {"replset": "newer-cluster", "pod": "mon-newer-cluster-0"}
2026-02-03T14:46:15.103Z  INFO  replset initialized  {"replset": "newer-cluster", "pod": "mon-newer-cluster-0"}
2026-02-03T14:46:20.103Z  INFO  creating user admin  {"replset": "newer-cluster", "pod": "mon-newer-cluster-0", "user": "userAdmin"}
2026-02-03T14:46:20.636Z  ERROR failed to reconcile cluster  {"replset": "newer-cluster", "error": "handleReplsetInit: exec add admin user: command terminated with exit code 1 /  / MongoServerError: not primary\n"}

Cluster status:

status:
  conditions:
    - lastTransitionTime: "2026-02-03T14:18:47Z"
      message: "handleReplsetInit: exec add admin user: command terminated with exit code 1 /  / MongoServerError: not primary"
      reason: ErrorReconcile
      status: "True"
      type: error
  replsets:
    newer-cluster:
      ready: 0
      size: 3
      status: initializing
  state: error

MongoDB state:

$ mongosh --eval "db.hello()"
{
  isWritablePrimary: false,
  secondary: false,
  ok: 1
}

$ mongosh --eval "rs.status()"
MongoServerError: no replset config has been received

Pod events:

Warning  Unhealthy  Liveness probe failed
Normal   Killing    Container mongod failed liveness probe, will be restarted

MongoDB logs show:

{"c":"QUERY", "msg":"Aggregate command executor error", "error":{"codeName":"NamespaceNotFound","errmsg":"Unable to retrieve storageStats in $collStats stage :: caused by :: Collection [local.oplog.rs] not found."}}
{"c":"-", "msg":"Failed to refresh key cache", "error":"ReadConcernMajorityNotAvailableYet: Read concern majority reads are currently not possible."}

The oplog was never created because the replica set initialization never completed.
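
To confirm the oplog is indeed absent, one can list the local database's collections. Below is a hedged sketch with the Go driver (not part of the original report); the mongosh equivalent would be db.getSiblingDB("local").getCollectionNames().

package main

import (
	"context"
	"fmt"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client, err := mongo.Connect(ctx, options.Client().
		ApplyURI("mongodb://localhost:27017/?directConnection=true"))
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(ctx)

	// A healthy replica set member has local.oplog.rs; on a node whose
	// rs.initiate() never completed, the collection is missing.
	names, err := client.Database("local").ListCollectionNames(ctx, bson.D{})
	if err != nil {
		panic(err)
	}
	found := false
	for _, n := range names {
		if n == "oplog.rs" {
			found = true
			break
		}
	}
	fmt.Printf("local collections: %v\noplog.rs present: %v\n", names, found)
}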

Additional Information

Full cluster status
Name:         mon
Namespace:    psmdb-newer-cluster
Labels:       app.kubernetes.io/instance=newer-cluster-app
              argocd.argoproj.io/instance=newer-cluster
Annotations:  argocd.argoproj.io/tracking-id: newer-cluster:psmdb.percona.com/PerconaServerMongoDB:psmdb-newer-cluster/mon
API Version:  psmdb.percona.com/v1
Kind:         PerconaServerMongoDB
Metadata:
  Creation Timestamp:  2026-02-03T14:17:45Z
  Generation:          1
  Resource Version:    232418146
  UID:                 c47890a5-5aae-4147-8eff-5e2dc25aa5be
Spec:
  Backup:
    Container Security Context:
      Allow Privilege Escalation:  false
      Capabilities:
        Drop:
          ALL
      Privileged:       false
      Run As Non Root:  true
      Run As User:      1001
    Enabled:            true
    Image:              perconalab/percona-server-mongodb-operator:main-backup
    Pitr:
      Enabled:  false
    Resources:
      Limits:
        Cpu:                    500m
        Memory:                 1Gi
    Starting Deadline Seconds:  1200
    Storages:
      s3-us-east:
        s3:
          Bucket:              mongodb-k8s-backups
          Credentials Secret:  pbm-backup-credentials
          Prefix:              newer-cluster
          Region:              us-east-1
        Type:                  s3
    Tasks:
      Compression Level:    6
      Compression Type:     gzip
      Enabled:              true
      Keep:                 3
      Name:                 newer-cluster-physical
      Schedule:             0 0 * * *
      Storage Name:         s3-us-east
      Type:                 physical
  Cr Version:               1.20.1
  Enable Volume Expansion:  true
  Image:                    percona/percona-server-mongodb:7.0.28-15
  Init Container Security Context:
    Allow Privilege Escalation:  false
    Capabilities:
      Drop:
        ALL
    Run As Group:     1001
    Run As Non Root:  true
    Run As User:      1001
  Replsets:
    Affinity:
      Advanced:
        Node Affinity:
          Required During Scheduling Ignored During Execution:
            Node Selector Terms:
              Match Expressions:
                Key:       test.io/instance-group
                Operator:  In
                Values:
                  mongo-ebs
                  mongo-ebs-xlarge
                  mongo-ebs-4xlarge
        Pod Anti Affinity:
          Required During Scheduling Ignored During Execution:
            Label Selector:
              Match Expressions:
                Key:       app.kubernetes.io/instance
                Operator:  In
                Values:
                  newer-cluster
            Topology Key:  kubernetes.io/hostname
    Annotations:
      k8s.grafana.com/job:                 scrape-percona-server-mongodb
      k8s.grafana.com/metrics.portNumber:  9216
      k8s.grafana.com/scrape:              true
    Configuration:
      replication:
        oplogSizeMB: 1000
      security:
        clusterAuthMode: keyFile
        keyFile: /etc/mongodb-secrets/mongodb-key
      setParameter:
        enableLocalhostAuthBypass: true
        initialSyncSourceReadPreference: secondaryPreferred

    Container Security Context:
      Allow Privilege Escalation:  false
      Capabilities:
        Drop:
          ALL
      Privileged:       false
      Run As Group:     1001
      Run As Non Root:  true
      Run As User:      1001
    Liveness Probe:
      Failure Threshold:      5
      Initial Delay Seconds:  90
      Startup Delay Seconds:  300
      Timeout Seconds:        30
    Name:                     newer-cluster
    Pod Security Context:
      Fs Group:      1001
      Run As Group:  1001
      Run As User:   1001
      Supplemental Groups:
        1001
    Resources:
      Limits:
        Cpu:     1
        Memory:  2Gi
      Requests:
        Cpu:     500m
        Memory:  1Gi
    Sidecars:
      Args:
        --mongodb.uri=mongodb://$(MONGO_USER):$(MONGO_PASSWORD)@127.0.0.1:27017/admin
        --collect-all
        --compatible-mode
      Env:
        Name:   MONGO_USER
        Value:  clusterMonitor
        Name:   MONGO_PASSWORD
        Value From:
          Secret Key Ref:
            Key:   MONGODB_CLUSTER_MONITOR_PASSWORD
            Name:  percona-server-mongodb-users
      Image:       percona/mongodb_exporter:0.40
      Name:        rs-mongo-exporter-0
      Security Context:
        Allow Privilege Escalation:  false
        Capabilities:
          Drop:
            ALL
        Privileged:       false
        Run As Non Root:  true
        Run As User:      1001
    Size:                 3
    Split Horizons:
      newer-cluster-newer-cluster-0:
        External:  mon-dt-newer-cluster-0.test.io
      newer-cluster-newer-cluster-1:
        External:  mon-dt-newer-cluster-1.test.io
      newer-cluster-newer-cluster-2:
        External:  mon-dt-newer-cluster-2.test.io
    Tolerations:
      Effect:    NoSchedule
      Key:       test.io/mongo-ebs
      Operator:  Exists
    Volume Spec:
      Persistent Volume Claim:
        Annotations:
          ebs.test.io/iops:  3000
        Resources:
          Requests:
            Storage:  20Gi
  Secrets:
    Key File:  newer-cluster-keyfile
    Ssl:       mongodb-ssl
  Sharding:
    Configsvr Repl Set:
      Affinity:
        Advanced:
          Node Affinity:
            Required During Scheduling Ignored During Execution:
              Node Selector Terms:
                Match Expressions:
                  Key:      test.io/instance-group
                  Operator:  In
                  Values:
                    mongo-ebs
                    mongo-ebs-xlarge
                    mongo-ebs-4xlarge
          Pod Anti Affinity:
            Required During Scheduling Ignored During Execution:
              Label Selector:
                Match Expressions:
                  Key:       app.kubernetes.io/instance
                  Operator:  In
                  Values:
                    newer-cluster
              Topology Key:  kubernetes.io/hostname
      Container Security Context:
        Allow Privilege Escalation:  false
        Capabilities:
          Drop:
            ALL
        Privileged:       false
        Run As Group:     1001
        Run As Non Root:  true
        Run As User:      1001
      Pod Security Context:
        Fs Group:      1001
        Run As Group:  1001
        Run As User:   1001
        Supplemental Groups:
          1001
      Size:  2
      Tolerations:
        Effect:    NoSchedule
        Key:      test.io/mongo-ebs
        Operator:  Exists
      Volume Spec:
        Persistent Volume Claim:
          Annotations:
            ebs.test.io/iops:  3000
          Resources:
            Requests:
              Storage:  10Gi
    Enabled:            true
    Mongos:
      Affinity:
        Advanced:
          Node Affinity:
            Required During Scheduling Ignored During Execution:
              Node Selector Terms:
                Match Expressions:
                  Key:      test.io/instance-group
                  Operator:  In
                  Values:
                    mongo-ebs
                    mongo-ebs-xlarge
                    mongo-ebs-4xlarge
          Pod Anti Affinity:
            Required During Scheduling Ignored During Execution:
              Label Selector:
                Match Expressions:
                  Key:       app.kubernetes.io/instance
                  Operator:  In
                  Values:
                    newer-cluster
              Topology Key:  kubernetes.io/hostname
      Container Security Context:
        Allow Privilege Escalation:  false
        Capabilities:
          Drop:
            ALL
        Privileged:       false
        Run As Group:     1001
        Run As Non Root:  true
        Run As User:      1001
      Expose:
        Annotations:
          external-dns.alpha.kubernetes.io/hostname:                     mon-dt-newer-cluster.test.io
          service.beta.kubernetes.io/aws-load-balancer-internal:         true
          service.beta.kubernetes.io/aws-load-balancer-security-groups:  services-elb.k8s-devtrack-alpha-mongodb.test.io, mongodb-elb.k8s-devtrack-alpha-mongodb.test.io
        Load Balancer Class:                                             service.k8s.aws/nlb
        Type:                                                            LoadBalancer
      Labels:
        app.kubernetes.io/component:         mongos
        Attach - Statefulset - Servicename:  newer-cluster-cluster-ip
      Pod Security Context:
        Fs Group:      1001
        Run As Group:  1001
        Run As User:   1001
        Supplemental Groups:
          1001
      Resources:
        Limits:
          Cpu:     2
          Memory:  2Gi
        Requests:
          Cpu:     1
          Memory:  1500Mi
      Size:        2
      Tolerations:
        Effect:    NoSchedule
        Key:      test.io/mongo-ebs
        Operator:  Exists
  Tls:
    Allow Invalid Certificates:  true
    Mode:                        preferTLS
  Unsafe Flags:
    Replset Size:   true
    Tls:            true
  Update Strategy:  SmartUpdate
  Users:
    Db:    admin
    Name:  op
    Password Secret Ref:
      Key:   op
      Name:  mongo-global-passwords
    Roles:
      Db:    admin
      Name:  clusterAdmin
      Db:    admin
      Name:  userAdminAnyDatabase
      Db:    admin
      Name:  readWriteAnyDatabase
    Db:      admin
    Name:    users_management
    Password Secret Ref:
      Key:   password
      Name:  users-management-password
    Roles:
      Db:    admin
      Name:  userAdminAnyDatabase
Status:
  Conditions:
    Last Transition Time:  2026-02-03T14:17:46Z
    Status:                True
    Type:                  sharding
    Last Transition Time:  2026-02-03T14:17:47Z
    Status:                True
    Type:                  initializing
    Last Transition Time:  2026-02-03T14:18:47Z
    Message:               handleReplsetInit: exec add admin user: command terminated with exit code 1 /  / MongoServerError: not primary

    Reason:                ErrorReconcile
    Status:                True
    Type:                  error
    Last Transition Time:  2026-02-03T14:32:43Z
    Reason:                MongosReady
    Status:                True
    Type:                  ready
  Host:                    k8s-psmdbnew-monmongo-cca3ef6254-910797f09652cf21.elb.us-east-1.amazonaws.com
  Message:                 Error: handleReplsetInit: exec add admin user: command terminated with exit code 1 /  / MongoServerError: not primary

  Mongos:
    Ready:              2
    Size:               2
    Status:             ready
  Observed Generation:  1
  Ready:                4
  Replsets:
    Cfg:
      Initialized:  true
      Members:
        mon-cfg-0:
          Name:       mon-cfg-0.mon-cfg.psmdb-newer-cluster.svc.cluster.local:27017
          State:      1
          State Str:  PRIMARY
        mon-cfg-1:
          Name:       mon-cfg-1.mon-cfg.psmdb-newer-cluster.svc.cluster.local:27017
          State:      2
          State Str:  SECONDARY
      Ready:          2
      Size:           2
      Status:         ready
    Newer - Cluster:
      Message:  mongod: back-off 5m0s restarting failed container=mongod pod=mon-newer-cluster-0_psmdb-newer-cluster(d9069047-15d6-4e84-aee7-ed44570e1839); mongod: back-off 5m0s restarting failed container=mongod pod=mon-newer-cluster-1_psmdb-newer-cluster(e91de43f-c936-4139-8429-f53c4f20d35e); mongod: back-off 5m0s restarting failed container=mongod pod=mon-newer-cluster-2_psmdb-newer-cluster(95134698-d7ce-46f1-9f28-22c30e1ea472); 
      Ready:    0
      Size:     3
      Status:   initializing
  Size:         7
  State:        error
Events:         <none>

This shows the replset status as initializing with 0/3 members ready, and an error message about the admin user creation failing.

Pod resource usage
NAME                  CPU(cores)   MEMORY(bytes)   
mon-newer-cluster-0   23m          212Mi           
mon-newer-cluster-1   18m          198Mi           
mon-newer-cluster-2   15m          217Mi

Steps to reproduce

Versions

  1. Kubernetes - 1.29.6
  2. Operator - 1.21.1
  3. Database - 7.0.18-22

Anything else?

No response
