Skip to content

Commit 79b14b1

Browse files
authored
Merge pull request #400 from xcp-ng/xha-doc
Add XHA architecture (shared SRs like NFS + XOSTOR)
2 parents 393dfb0 + a3e18a3 commit 79b14b1

File tree

4 files changed

+52
-0
lines changed

4 files changed

+52
-0
lines changed

assets/img/xha-shared-sr.png

70.4 KB
Loading
55.3 KB
Loading

assets/img/xha-xostor-sr.png

96.3 KB
Loading

docs/management/ha.md

Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -243,3 +243,55 @@ Immediatly after fencing, **Minion 1** will be booted on the other host.
243243
#### Pull the network cable
244244

245245
Finally, the worst case: keep the storage operational, but "cut" the (management) network interface. Same procedure: unplug the cable physically and wait... Because **lab1** cannot contact any other host in the pool (in this case, **lab2**), it starts the fencing procedure. The result is exactly the same as the previous test. It's gone for the pool master, displayed as **Halted** until we re-plug the cable.
246+
247+
## 🤓 Architecture
248+
249+
### General
250+
251+
The diagram below shows how HA is managed on a pool.
252+
253+
![](../../assets/img/xha-shared-sr.png)
254+
255+
As you can see, a `XHA daemon` is running on each host and two main paths are used: one for the network, another for storage.
256+
For HA to operate properly, two communication paths are used: one over the network and another reserved for storage.
257+
There are two paths because this is the solution chosen to distinguish a network problem between hosts and a storage issue.
258+
In both cases, data is constantly transmitted through these paths to ensure proper HA operation:
259+
- UDP packets are exchanged over the network management interface so that each server can indicate it is alive.
260+
- Disk data is written to and read by the hosts through a volume called `ha-statefile`. In this example, the volume resides on an NFS SR, which is shared and accessible by the entire pool. It’s a standard SR containing VHD files used by VMs.
261+
The only difference is that `ha-statefile` is a raw volume in which data is written directly.
262+
263+
Regarding the structure of this SR heartbeat volume:
264+
265+
![](../../assets/img/xha-statefile-structure.png)
266+
267+
- As the picture shows, this volume contains a single entry for each host, where each host writes to its own dedicated area AND can also read the state of other hosts. In other words, each host writes a “heartbeat” value indicating that it’s alive at a given time, which is verified by the whole pool.
268+
269+
- There is no write lock; each host can write at any time. It's why there is one entry for each host.
270+
271+
### XOSTOR
272+
273+
For DRBD/LINSTOR experts, and with the general architecture explanation, you can understand what happens when we replace the NFS hearbeat volume by a DRBD device.
274+
275+
We must change our architecture because — basically — a DRBD volume can only be opened in one place at a time. We cannot easily write in each volume at the same time, because we would have to open or close the heartbeat volume several times per second. Or, we would have to set up a mechanism in the xha daemon so that each one writes in turn. Since this is complex to set up, we chose another approach.
276+
277+
![](../../assets/img/xha-xostor-sr.png)
278+
279+
To support the fact that only one DRBD volume can be PRIMARY, and to avoid making significant changes to the xha/XHAPI modules, we had to be a bit creative. Instead of writing to or reading directly from the heartbeat volume on all hosts, we use an `NBD HTTP server` daemon. It's a process that listens through an NBD device, which is seen as the heartbeat volume by the XHA daemon.
280+
281+
As a result, each read and write request from the XHA daemon goes through an NBD device. It is then transmitted via the HTTP protocol to another daemon called `HTTP disk server`, which is responsible for writing to the DRBD heartbeat volume.
282+
This whole new path acts as a proxy, hiding direct access to the actual heartbeat volume of the SR.
283+
284+
Now, what happens if the host running the active `HTTP disk server` daemon crashes?
285+
In that case, the heartbeat volume loses its PRIMARY status on that host. Among the surviving hosts, another server daemon will then attempt to become PRIMARY in its place.
286+
Then, the one that successfully becomes PRIMARY will start receiving the heartbeat requests from the `NBD HTTP server` daemons.
287+
288+
For those curious about the LINSTOR/DRBD options used behind the heartbeat volume, these are the ones used by LINBIT in their LINSTOR guide for creating a Highly Available LINSTOR cluster:
289+
290+
```
291+
'DrbdOptions/auto-quorum': 'disabled',
292+
'DrbdOptions/Resource/auto-promote': 'no',
293+
'DrbdOptions/Resource/on-no-data-accessible': 'io-error',
294+
'DrbdOptions/Resource/on-no-quorum': 'io-error',
295+
'DrbdOptions/Resource/on-suspended-primary-outdated': 'force-secondary',
296+
'DrbdOptions/Resource/quorum': 'majority'
297+
```

0 commit comments

Comments
 (0)