Update Ra from 2.15.0 to 2.16.0 #314

Merged
dumbbell merged 4 commits into main from update-ra-to-2.16.0 on Jan 28, 2025

dumbbell (Collaborator) commented Jan 27, 2025

Release notes:
https://github.com/rabbitmq/ra/releases/tag/v2.16.0

One important change is rabbitmq/ra#493: it delays the machine version upgrade until all members are up-to-date. This ensures that a Khepri cluster member doesn't become stuck because it is running an older version compared to the cluster leader.

V2: Adapt cluster_SUITE to a new behavior in Ra.

V3: Require Erlang/OTP 26 because Ra requires it now.

V4: Handle {error, normal} like {error, shutdown}: instead of emitting a {normal, ...} exit signal, the Ra server returns it as a regular {error, normal} return value.

It also means that wait_for_leader/2 and wait_for_cluster_change_permitted/2 can return {error, noproc}. In do_reset/4, if wait_for_cluster_change_permitted/2 returns {error, noproc}, we keep the same behavior as with the {normal, ...} exit signal.
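
As an illustration of the V4 change, here is a minimal sketch of how a caller might fold `{error, normal}` into the same clause as `{error, shutdown}`. The module name `exit_reason_sketch`, the function `classify/1` and the `server_stopped` atom are hypothetical and are not part of Khepri or Ra; only the two error tuples come from the description above.

```erlang
-module(exit_reason_sketch).
-export([classify/1]).

%% Hypothetical helper, not part of Khepri or Ra.
%% Maps the result of a synchronous call to a coarse outcome so that
%% callers have a single clause for "the server went away".
-spec classify({ok, term()} | {error, term()}) ->
    {ok, term()} | server_stopped | {error, term()}.
classify({ok, _} = Ok) ->
    Ok;
classify({error, Reason}) when Reason =:= normal; Reason =:= shutdown ->
    %% A `normal` exit now arrives as a return value, so it can share
    %% a clause with `shutdown` instead of needing a `catch`.
    server_stopped;
classify({error, _} = Error) ->
    %% Any other error is passed through untouched.
    Error.
```

The point of the sketch is only that both reasons end up on the same code path; what that path does (retry, give up, report `noproc`) is up to the caller.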

@dumbbell dumbbell added this to the v0.17.0 milestone Jan 27, 2025
@dumbbell dumbbell self-assigned this Jan 27, 2025
codecov (Bot) commented Jan 27, 2025

Codecov Report

Attention: Patch coverage is 81.81818% with 2 lines in your changes missing coverage. Please review.

Project coverage is 89.16%. Comparing base (a9cb351) to head (99ebba6).
Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/khepri_cluster.erl | 81.81% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #314      +/-   ##
==========================================
- Coverage   89.48%   89.16%   -0.32%     
==========================================
  Files          22       22              
  Lines        3290     3295       +5     
==========================================
- Hits         2944     2938       -6     
- Misses        346      357      +11     

| Flag | Coverage Δ | |
|---|---|---|
| erlang-25 | ? | |
| erlang-26 | 89.13% <81.81%> (+0.19%) | ⬆️ |
| erlang-27 | 89.01% <54.54%> (-0.29%) | ⬇️ |
| os-ubuntu-latest | 89.13% <81.81%> (-0.35%) | ⬇️ |
| os-windows-latest | 89.07% <54.54%> (-0.29%) | ⬇️ |


@dumbbell dumbbell force-pushed the update-ra-to-2.16.0 branch 4 times, most recently from a79fcc8 to a7f87a1 Compare January 28, 2025 11:05
[Why]
We are about to update Ra to version 2.16.0 and it requires Erlang/OTP
26. Therefore, Khepri will require the same version as a consequence.
Let's make it explicit.
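
One possible way to make such a requirement explicit, assuming the project is built with rebar3, is the `minimum_otp_vsn` setting in `rebar.config`. This fragment is an illustration only; it is not taken from the commit itself.

```erlang
%% rebar.config (fragment)
%% Assumption: the project is built with rebar3, which refuses to
%% build when the running Erlang/OTP is older than this version.
{minimum_otp_vsn, "26.0"}.
```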
@dumbbell dumbbell force-pushed the update-ra-to-2.16.0 branch from a048047 to ee36376 Compare January 28, 2025 16:00
[Why]
With Ra 2.16.0, an inconsistency with return values was fixed: it can
now return `{error, normal}` if the Ra server exits with reason `normal`
while a synchronous call is performed. Before, Ra would throw an
exception with the `{normal, ...}` reason.

[How]
The new return value is treated like `{error, shutdown}`.

This uncovers a small issue with `wait_for_cluster_change_permitted/2`.
This function relies on `wait_for_leader/2`. The retry loop in
`wait_for_leader/2` is fine but for the purpose of
`wait_for_cluster_change_permitted/2`, it waits until timeout if the Ra
server is not running instead of returning immediately. This affects
`do_join_locked/4` which fails as a consequence.

A new option was introduced to stop the retry in `wait_for_leader/2` if
the Ra server is not running. This better suits the need of
`wait_for_cluster_change_permitted/2` and the cluster formation code.
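
The idea of a retry loop that can optionally stop as soon as the server is found not to be running could be sketched as follows. This is not the code from `khepri_cluster.erl`: the module `retry_sketch`, the function `wait_until/3`, the boolean flag, the `RETRY_SLEEP_MS` macro and its value are all assumptions made for the example.

```erlang
-module(retry_sketch).
-export([wait_until/3]).

%% Hypothetical sketch; names, the option and the interval are
%% illustrative only.
-define(RETRY_SLEEP_MS, 200).

%% Calls Probe() until it returns {ok, _} or the time budget is spent.
%% With StopOnNoproc =:= true, {error, noproc} is returned immediately
%% instead of being retried until the timeout.
-spec wait_until(fun(() -> {ok, term()} | {error, term()}),
                 non_neg_integer(), boolean()) ->
    {ok, term()} | {error, term()}.
wait_until(Probe, TimeoutMs, StopOnNoproc) ->
    case Probe() of
        {ok, _} = Ok ->
            Ok;
        {error, noproc} = Error when StopOnNoproc ->
            %% The server is not running: the caller asked not to wait.
            Error;
        {error, _} when TimeoutMs >= ?RETRY_SLEEP_MS ->
            timer:sleep(?RETRY_SLEEP_MS),
            wait_until(Probe, TimeoutMs - ?RETRY_SLEEP_MS, StopOnNoproc);
        {error, _} ->
            %% Not enough budget left for another attempt.
            {error, timeout}
    end.
```

A caller that only wants to know whether a cluster change is currently possible would pass `true`, while a caller that expects the server to start shortly would pass `false` and keep waiting.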
... instead of hard-coding a 200 ms sleep.

[Why]
The same value is used in other retry loops in the same module. It
should have been used there as well.
@dumbbell dumbbell force-pushed the update-ra-to-2.16.0 branch from 7927180 to 99ebba6 Compare January 28, 2025 16:52
@dumbbell dumbbell marked this pull request as ready for review January 28, 2025 16:59
@dumbbell dumbbell merged commit 5bd4853 into main Jan 28, 2025
@dumbbell dumbbell deleted the update-ra-to-2.16.0 branch January 28, 2025 16:59
dumbbell added a commit that referenced this pull request Nov 6, 2025
[Why]
Ra can return `{error, normal}` when a Ra server exits. We already
handle this return value, thanks to pull requests #314 and #335.
However, it can happen in `process_sync_command/3`: this was reported
once in CI. Thus it is rare, but still possible.

[How]
Like in previous patches, we treat `{error, normal}` like `{error,
noproc}`.