Skip to content

Cluster initialization hangs with retry_join #16486

@timotheencl

Description

@timotheencl

Describe the bug
I set up a 3 nodes Vault cluster (only servers) with the integrated raft storage backend. With retry_join stanza in configuration.

Since version 1.11.1, the cluster doesn't initialize properly. Only the first node is initialized, the other nodes hangs because they havent find an available leader early, and they do not retry infinitely (Only 2 times), waiting for the first node to be initialized.

To Reproduce
Steps to reproduce the behavior:

  1. Run a vault agents on 3 VMs
  2. Run vault init on the first node
  3. Run vault unseal on the first node
  4. Wait for the other nodes to be initialized (it fails here, only the first node is initialized)

Expected behavior
At the step 3, all three nodes of the cluster should be initialized, and nodes 2 and 3 could be unsealed next.
In orther words, nodes 2 and 3 should always retry to join the cluster for the initialization process. Not only 2 times.

Possible workaround
After the first node is initilized, restarting vault agents on nodes 2 and 3 will succesfully terminate the cluster initialization. This is because we are forcing the retry join with a vault agent restart. And at this time, the first node is initialized and unsealed, so it's a leader available to init other nodes.

Environment:

  • Vault Server Version (retrieve with vault status): 1.11.1
  • Vault CLI Version (retrieve with vault version): 1.11.1
  • Server Operating System/Architecture: Ubuntu 22.04

Additional context
I've tested with vault v1.11.0 and I do not experiment this issue. It's seems that the last security patch has introduced this issue.

Metadata

Metadata

Assignees

Labels

bugUsed to indicate a potential bugstorage/raft

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions