71 changes: 71 additions & 0 deletions docs/CAPI-mgmt/01-prerequisites.md
@@ -0,0 +1,71 @@
# Prerequisites

The requirements for deploying a CAPI management cluster for Magnum are described in greater detail
[here](https://stackhpc-kayobe-config.readthedocs.io/en/stackhpc-2025.1/configuration/magnum-capi.html#deployment-prerequisites);
a brief summary is given below.

Additionally, general instructions on how to deploy the CAPI management cluster can be found at the
following [link](https://stackhpc-kayobe-config.readthedocs.io/en/stackhpc-2025.1/configuration/magnum-capi.html).

## OpenStack cloud

This guide does not cover the requirements of the OpenStack cloud on which the cluster will run;
a baseline understanding of OpenStack is assumed.

### Networking

The Cluster API architecture relies on a CAPI management cluster in order to run Kubernetes operators
which directly interact with the cloud APIs. In the OpenStack case, the [Cluster API Provider OpenStack (CAPO)](https://github.com/kubernetes-sigs/cluster-api-provider-openstack) is used.

This management cluster has two main requirements in order to operate:


<!-- markdownlint-disable MD007 -->
<!-- prettier-ignore-start -->

- Firstly, it must be capable of reaching the public OpenStack APIs.
- Secondly, the management cluster must be reachable from the control
plane nodes on which the Magnum containers are running.

- This is so that the Magnum conductor(s) may reach the management
cluster’s API server address listed in the `kubeconfig`.

<!-- prettier-ignore-end -->
<!-- markdownlint-enable MD007 -->

### OpenStack project quotas

For a production-ready, highly-available (HA) deployment with a seed node, 3 control plane nodes and
3 worker nodes, your project should have sufficient quota available for:


- 1 x network, 1 x subnet, 1 x router
- 1 x seed node (4 vCPU, 8 GB)
- 3 x control plane nodes (4 vCPU, 8 GB) + 1 x extra when undergoing a rolling upgrade
- 3 x worker nodes (8 vCPU, 16 GB) + 1 x extra when undergoing a rolling upgrade

Further resources are suggested in the [Azimuth prerequisites](../configuration/01-prerequisites.md#openstack-project-quotas),
but these are optional.

As with any of the configuration here, tailor these values to whatever best suits your needs
and use cases.

<!-- prettier-ignore-start -->
!!! tip
    It is recommended to use a separate OpenStack project for each concrete environment being deployed, for example separate staging and production CAPI management clusters, particularly for high-availability (HA) deployments.
<!-- prettier-ignore-end -->

## Application Credential

You should create an
[Application Credential](https://docs.openstack.org/keystone/latest/user/application_credentials.html)
for the project and save the resulting `clouds.yaml` as `./environments/<name>/clouds.yaml`.
Application credentials should be encrypted with `git-crypt`, especially if they are to be pushed
to a git repository; the [secrets documentation](../repository/secrets.md#managing-secrets)
provides instructions and further information on this.
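
For reference, a `clouds.yaml` containing an application credential typically looks something like
the sketch below; the endpoint, IDs and cloud name are placeholders, and you should use the file
generated for you by Horizon or the OpenStack CLI.

```yaml title="environments/<name>/clouds.yaml"
clouds:
  openstack:                                        # the cloud name may differ on your site
    auth:
      auth_url: https://keystone.example.com:5000   # placeholder Keystone endpoint
      application_credential_id: "<application credential id>"
      application_credential_secret: "<application credential secret>"
    region_name: RegionOne
    interface: public
    identity_api_version: 3
    auth_type: v3applicationcredential
```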

<!-- prettier-ignore-start -->
!!! warning
Each concrete environment should have a separate application credential.
<!-- prettier-ignore-end -->
146 changes: 146 additions & 0 deletions docs/CAPI-mgmt/02-kubernetes-config.md
@@ -0,0 +1,146 @@
# Kubernetes configuration

The concepts in this section apply to the Cluster API management cluster, and not to the
tenant clusters; configuration for tenant clusters is set via Magnum cluster labels.


The variables used to configure HA deployments are the same as those for Azimuth, so only
surface-level detail is covered below. For further details, see the
[default values](https://github.com/azimuth-cloud/ansible-collection-azimuth-ops/blob/main/roles/azimuth_capi_operator/defaults/main.yml).

## Images

The clusters deployed by the Magnum CAPI Helm driver will require
an Ubuntu Kubernetes image and a Magnum cluster template.

The way these user-facing images are managed differs from that of
[Azimuth](../configuration/03-kubernetes-config.md#images): instead, the images
and Magnum cluster templates are managed by tools found in the openstack-config
[repository](https://github.com/stackhpc/openstack-config#magnum-cluster-templates).

<!-- prettier-ignore-start -->
!!! note
The way in which these Magnum templates and images are managed, as explained above,
is under review.

<!-- prettier-ignore-end -->

## Multiple external networks

In cases where multiple external networks are available, you must define which one the HA cluster
should use:

```yaml title="environments/my-site/inventory/group_vars/all/variables.yml"
# The ID of the external network to use
capi_cluster_external_network_id: "<network id>"
```

<!-- prettier-ignore-start -->
!!! note
This does **not** currently respect the "portal-external" tag.
<!-- prettier-ignore-end -->

## Volume-backed instances

It is possible to use volume-backed instances if flavors with sufficiently large root disks are
not available on the target cloud.

<!-- prettier-ignore-start -->
!!! danger "etcd and spinning disks"
The configuration options in this section should be used subject to the advice in the prerequisites.
See [prerequisites](../configuration/01-prerequisites.md#cinder-volumes-and-kubernetes) about using
Cinder volumes with Kubernetes.

!!! warning "ceph spinning disks"
It is advised to make sure that the root disk **isnt** a spinning disk being provided ceph, rather
than the default local disk. These disks will be too slow to be able to provide a stable and
satisfactory user experience; please read [here](../configuration/01-prerequisites.md#cinder-volumes-and-kubernetes)
for more detail.

!!! tip "etcd on a separate block device"
    If you only have a limited amount of SSD or local disk available, consider placing etcd on a
    separate block device to make the best use of the limited capacity.
    See [etcd configuration](#etcd-configuration) for details.
<!-- prettier-ignore-end -->

The following variables can be used to configure Kubernetes clusters to use volume-backed instances
(i.e. using a Cinder volume as the root disk):

```yaml title="environments/my-site/inventory/group_vars/all/variables.yml"
#### For the HA cluster ####

# The size of the root volumes for Kubernetes nodes
capi_cluster_root_volume_size: 100
# The volume type to use for root volumes for Kubernetes nodes
capi_cluster_root_volume_type: nvme
```

<!-- prettier-ignore-start -->
!!! tip
The available volume types can be listed using the OpenStack CLI:
```sh
openstack volume type list
```
<!-- prettier-ignore-end -->

## Etcd configuration

As discussed [here](../configuration/01-prerequisites.md#cinder-volumes-and-kubernetes),
`etcd` is extremely sensitive to write latency. It is therefore possible to place `etcd`
on a separate block device whose volume type can differ from that of the root disk,
allowing efficient use of limited SSD-backed storage.
More detail on this can be found [here](../configuration/03-kubernetes-config.md#etcd-configuration).

<!-- prettier-ignore-start -->
!!! tip "Use local disk for etcd whenever possible"
    Using local disk whenever possible minimises the write latency for etcd and also eliminates network instability as a cause of latency problems.
<!-- prettier-ignore-end -->

The following variables are used to configure the etcd block device for an HA cluster:

```yaml title="environments/my-site/inventory/group_vars/all/variables.yml"
# Specifies the size of the etcd block device in GB
# This is typically between 2GB and 10GB - Amazon recommends 8GB for EKS
# Defaults to 0, meaning etcd stays on the root device
capi_cluster_etcd_blockdevice_size: 8

# The type of block device that will be used for etcd
# Specify "Volume" (the default) to use a Cinder volume
# Specify "Local" to use local disk (the flavor must support ephemeral disk)
capi_cluster_etcd_blockdevice_type: Volume

# The Cinder volume type to use for the etcd block device
# Only used if "Volume" is specified as block device type
# If not given, the default volume type for the cloud will be used
capi_cluster_etcd_blockdevice_volume_type: nvme

# The Cinder availability zone to use for the etcd block device
# Only used if "Volume" is specified as block device type
# Defaults to "nova"
capi_cluster_etcd_blockdevice_volume_az: nova
```

## Load-balancer provider

If the target cloud uses [OVN networking](https://wiki.openstack.org/wiki/Neutron/ML2), and the
[OVN Octavia provider](https://docs.openstack.org/ovn-octavia-provider/latest/admin/driver.html)
is enabled, then Kubernetes clusters should be configured to use the OVN provider for
any load-balancers that are created:

```yaml title="environments/my-site/inventory/group_vars/all/variables.yml"
openstack_loadbalancer_provider: ovn
```

<!-- prettier-ignore-start -->
!!! tip
You can see the available load-balancer providers using the OpenStack CLI:
```sh
openstack loadbalancer provider list
```
<!-- prettier-ignore-end -->

## Availability zones

By default, it is assumed that there is only a single
[availability zone (AZ)](https://docs.openstack.org/nova/latest/admin/availability-zones.html)
called `nova`. If this is not the case for your target cloud, use the `capi_cluster_*` variables described [here](../configuration/03-kubernetes-config.md#availability-zones).
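
Purely as an illustrative sketch (the variable names below are hypothetical placeholders; take the
actual `capi_cluster_*` availability zone variables from the linked Azimuth documentation), such
configuration typically ends up looking something like this:

```yaml title="environments/my-site/inventory/group_vars/all/variables.yml"
# Hypothetical placeholders only - consult the linked Azimuth documentation for
# the real capi_cluster_* availability zone variables supported by your release.
# Availability zones to schedule control plane nodes across
capi_cluster_control_plane_failure_domains: [az1, az2, az3]
# Availability zone to schedule worker nodes into
capi_cluster_worker_failure_domain: az1
```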
65 changes: 65 additions & 0 deletions docs/CAPI-mgmt/03-monitoring.md
@@ -0,0 +1,65 @@
# Monitoring and alerting

Just like standard Azimuth installations, CAPI management clusters are deployed with a
monitoring and alerting stack, including [Prometheus](https://prometheus.io/) for metric collection
and [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/) for alert generation
based on those metrics.


In addition to the monitoring services above, the log aggregation services
[Loki](https://grafana.com/oss/loki/) and [Promtail](https://grafana.com/docs/loki/latest/clients/promtail/)
are deployed as part of the stack. Further components of the deployed monitoring stack are covered in Azimuth's
[monitoring documentation](../configuration/14-monitoring.md#monitoring-and-alerting).

## Accessing web interfaces

The monitoring and alerting web dashboards are currently exposed using the port-forwarding
[script](https://github.com/azimuth-cloud/azimuth-config/blob/devel/bin/port-forward) provided in azimuth-config.
For example, running `./bin/port-forward grafana 1234` makes the Grafana web interface available at `http://localhost:1234`. The following services are exposed:

- `grafana` for the Grafana dashboards
- `prometheus` for the Prometheus web interface
- `alertmanager` for the Alertmanager web interface

## Persistence and retention

HA deployments configure Prometheus, Alertmanager and Loki to use persistent volumes so that
metrics, alert state (e.g. silences) and logs persist across pod restarts and cluster upgrades.

Because monitoring data and logs can consume a vast amount of storage, it is important to consider
how much storage should be dedicated to them (the volume size) and how long the data should be kept
before it is discarded (the retention period).
The variables controlling these for Alertmanager, Prometheus and Loki are shown below alongside
their default values:

```yaml title="environments/my-site/inventory/group_vars/all/variables.yml"
# Alertmanager retention and volume size
capi_cluster_addons_monitoring_alertmanager_retention: 168h
capi_cluster_addons_monitoring_alertmanager_volume_size: 10Gi

# Prometheus retention and volume size
capi_cluster_addons_monitoring_prometheus_retention: 90d
capi_cluster_addons_monitoring_prometheus_volume_size: 10Gi

# Loki retention and volume size
capi_cluster_addons_monitoring_loki_retention: 744h
capi_cluster_addons_monitoring_loki_volume_size: 10Gi
```

<!-- prettier-ignore-start -->
!!! danger
Volumes can only be **increased** in size. Any attempt to reduce the size of a volume will be rejected.
<!-- prettier-ignore-end -->

## Slack alerts

If your organisation uses [Slack](https://slack.com/), it is possible to configure Alertmanager to send
condition-based alerts to a Slack channel using [Incoming Webhooks](https://api.slack.com/messaging/webhooks).

<!-- prettier-ignore-start -->
!!! danger
The webhook URL should be kept secret. If you want to keep it in Git - which is recommended - then it must be encrypted.
See [secrets](../repository/secrets.md).
<!-- prettier-ignore-end -->

The instructions on how to enable Slack alerts can be found in Azimuth's own Slack alerts
[documentation](../configuration/14-monitoring.md#slack-alerts).
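
Under the hood, this amounts to adding a Slack receiver to the Alertmanager configuration. The
sketch below shows the general shape of such a receiver for illustration only; azimuth-config
manages the Alertmanager configuration for you via the variables described in the linked
documentation, so there is no need to write this by hand.

```yaml
# Illustrative Alertmanager configuration fragment - managed for you by
# azimuth-config, shown here only to explain what the variables produce.
route:
  receiver: slack-alerts
receivers:
  - name: slack-alerts
    slack_configs:
      - api_url: https://hooks.slack.com/services/...   # the secret webhook URL
        channel: "#capi-mgmt-alerts"                     # hypothetical channel name
        send_resolved: true
```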
26 changes: 26 additions & 0 deletions docs/CAPI-mgmt/04-disaster-recovery.md
@@ -0,0 +1,26 @@
# Disaster Recovery

CAPI management clusters can be configured to use [Velero](https://velero.io) as a disaster
recovery solution. Velero provides the ability to back up Kubernetes API resources to an object
store and has a plugin-based system to enable snapshotting of a cluster's persistent volumes.

<!-- prettier-ignore-start -->
!!! warning
Backup and restore is only available for production-grade HA installations of clusters.
<!-- prettier-ignore-end -->

The playbooks install Velero on the HA management cluster and the Velero command-line tool on the seed node.
Once configured with the appropriate credentials, the installation process will create a
[Schedule](https://velero.io/docs/latest/api-types/schedule/) on the HA cluster, which triggers a daily
backup at midnight and removes backups that are more than 1 week old.
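
For illustration only, a Velero [Schedule](https://velero.io/docs/latest/api-types/schedule/)
implementing a daily midnight backup with one-week retention looks roughly like the sketch below;
the playbooks create the actual resource for you, and its name and exact fields may differ.

```yaml
# Illustrative sketch of the kind of Schedule the playbooks create -
# there is no need to apply this manually.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: default            # hypothetical name
  namespace: velero
spec:
  schedule: "0 0 * * *"    # cron expression: daily at midnight
  template:
    ttl: 168h0m0s          # keep each backup for 1 week
```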

<!-- prettier-ignore-start -->
!!! note
- The [AWS Velero plugin](https://github.com/vmware-tanzu/velero-plugin-for-aws) is used for S3 support.
    - The [CSI plugin](https://github.com/vmware-tanzu/velero-plugin-for-csi) is used for volume snapshots.
- The CSI plugin uses Kubernetes generic support for [Volume Snapshots](https://kubernetes.io/docs/concepts/storage/volume-snapshots/).
- This is implemented for OpenStack by the [Cinder CSI plugin](https://github.com/kubernetes/cloud-provider-openstack).
<!-- prettier-ignore-end -->

Information on how to configure and use disaster recovery can be found
[here](../configuration/15-disaster-recovery.md#configuration).
15 changes: 15 additions & 0 deletions docs/CAPI-mgmt/index.md
@@ -0,0 +1,15 @@
# Configuring Standalone CAPI Management Clusters

In recent years, the Kubernetes
[Cluster API project](https://cluster-api.sigs.k8s.io/) has been widely adopted as
the main driver for managing OpenStack infrastructure for container orchestration engines (COEs) such as Magnum.
This is the same Cluster API (CAPI) used by Azimuth, so the configuration and operation of the two have a lot in common.
This document therefore outlines how to use
[azimuth-config](https://github.com/azimuth-cloud/azimuth-config) to deploy an Azimuth-free,
standalone CAPI management cluster, using Magnum as the chosen COE.

<!-- prettier-ignore-start -->
!!! note
It is assumed that you have already followed the steps in setting up a configuration repository, and so have an environment for your site that is ready to be configured.
See [Setting up a configuration repository](../repository/index.md).
<!-- prettier-ignore-end -->
2 changes: 1 addition & 1 deletion docs/configuration/01-prerequisites.md
@@ -82,7 +82,7 @@ A standard high-availability (HA) deployment with a seed node, 3 control plane n
- 500GB Cinder storage
- 3 x floating IPs
- One for accessing the seed node
- One fo the ingress controller for accessing HTTP services
- One for the ingress controller for accessing HTTP services
- One for the Zenith SSHD server

<!-- prettier-ignore-start -->
6 changes: 6 additions & 0 deletions mkdocs.yml
@@ -53,6 +53,12 @@ nav:
- debugging/caas.md
- Developing:
- developing/index.md
- CAPI management cluster:
- CAPI-mgmt/index.md
- CAPI-mgmt/01-prerequisites.md
- CAPI-mgmt/02-kubernetes-config.md
- CAPI-mgmt/03-monitoring.md
- CAPI-mgmt/04-disaster-recovery.md

theme:
name: material