# Add documentation for CAPI Management Cluster #264

**Open** — MaxBed4d wants to merge 14 commits into `devel` from `magnum-docs`.

## Commits

- `cc4c026` Add document chapters
- `da2e660` Add content for first two sections
- `6a9a583` Add remainder of outline/skeleton
- `36fcd55` Fix typo & first draft of new docs
- `881ce59` Add Cluster config section
- `d704441` Amendments & improvements
- `d3ff022` Merge branch 'devel' into magnum-docs
- `8bab73a` Preqs amendments
- `d346a35` k8s conf amendments
- `0ccb5e3` Linting fix
- `ad81ff2` Amend docs
- `26f119e` Apply Scott's suggestions
- `e64590f` Amend docs
- `f173df3` Make suggested changes
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Prerequisites

Although described in greater detail
[here](https://stackhpc-kayobe-config.readthedocs.io/en/stackhpc-2025.1/configuration/magnum-capi.html#deployment-prerequisites),
a brief summary of the requirements for deploying a CAPI management cluster for Magnum
is given below.

Additionally, general instructions for deploying the CAPI management cluster can be found at the
following [link](https://stackhpc-kayobe-config.readthedocs.io/en/stackhpc-2025.1/configuration/magnum-capi.html).
## OpenStack cloud

This guide does not cover requirements for the underlying OpenStack cloud on which this
cluster runs; a baseline understanding of OpenStack is assumed.
### Networking

The Cluster API architecture relies on a CAPI management cluster to run the Kubernetes operators
that interact directly with the cloud APIs. In the OpenStack case, the [Cluster API Provider OpenStack (CAPO)](https://github.com/kubernetes-sigs/cluster-api-provider-openstack) is used.

This management cluster has two main requirements in order to operate:

<!-- markdownlint-disable MD007 -->
<!-- prettier-ignore-start -->

- Firstly, it must be capable of reaching the public OpenStack APIs.
- Secondly, the management cluster must be reachable from the control
  plane nodes on which the Magnum containers are running.
    - This is so that the Magnum conductor(s) may reach the management
      cluster's API server address listed in the `kubeconfig`.

<!-- prettier-ignore-end -->
<!-- markdownlint-enable MD007 -->
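The two reachability requirements above can be smoke-tested with a short shell sketch. Both endpoint URLs below are placeholders (assumptions): substitute your cloud's Keystone endpoint and the API server address from the management cluster's `kubeconfig`, and run each check from the relevant host.

```shell
# Hedged sketch: smoke-test both reachability requirements.
# Both URLs are placeholders - substitute your own endpoints.
OS_AUTH_URL="${OS_AUTH_URL:-https://keystone.example.com:5000}"   # assumed Keystone endpoint
CAPI_API_SERVER="${CAPI_API_SERVER:-https://192.0.2.10:6443}"     # assumed address from kubeconfig

check() {
    # -k skips TLS verification, since the API server cert is typically self-signed
    if curl -fsk --connect-timeout 5 "$2" >/dev/null 2>&1; then
        echo "$1: reachable"
    else
        echo "$1: NOT reachable"
    fi
}

check "OpenStack APIs (run from the management cluster)" "$OS_AUTH_URL"
check "CAPI API server (run from the Magnum conductor host)" "$CAPI_API_SERVER/version"
```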
### OpenStack project quotas

For a production-ready, highly available (HA) deployment with a seed node, 3 control plane nodes and
3 worker nodes, your project should have sufficient quota available for:

- 1 x network, 1 x subnet, 1 x router
- 1 x seed node (4 vCPU, 8 GB)
- 3 x control plane nodes (4 vCPU, 8 GB) + 1 x extra when undergoing a rolling upgrade
- 3 x worker nodes (8 vCPU, 16 GB) + 1 x extra when undergoing a rolling upgrade

There are further suggested resources, as per the [following](../configuration/01-prerequisites.md#openstack-project-quotas),
but these are optional.

However, as with any of the configuration here, tailor these values to whatever
best suits your needs and use cases!

<!-- prettier-ignore-start -->
!!! tip
    It is recommended to have a separate OpenStack project for each concrete environment being deployed (for example, staging and production CAPI management clusters), particularly for high-availability (HA) deployments.
<!-- prettier-ignore-end -->
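A quick sketch of the quota arithmetic implied by the node counts above, including the extra node of headroom needed during a rolling upgrade:

```shell
# Tally the minimum vCPU / RAM (GB) quota for the HA layout described above.
SEED_VCPU=4;  SEED_RAM=8
CP_VCPU=4;    CP_RAM=8;   CP_COUNT=$((3 + 1))   # 3 control plane + 1 upgrade headroom
WK_VCPU=8;    WK_RAM=16;  WK_COUNT=$((3 + 1))   # 3 workers + 1 upgrade headroom

TOTAL_VCPU=$(( SEED_VCPU + CP_VCPU * CP_COUNT + WK_VCPU * WK_COUNT ))
TOTAL_RAM=$((  SEED_RAM  + CP_RAM  * CP_COUNT + WK_RAM  * WK_COUNT ))

echo "Minimum quota: ${TOTAL_VCPU} vCPUs, ${TOTAL_RAM} GB RAM"
# → Minimum quota: 52 vCPUs, 104 GB RAM
```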
## Application Credential

You should create an
[Application Credential](https://docs.openstack.org/keystone/latest/user/application_credentials.html)
for the project and save the resulting `clouds.yaml` as `./environments/<name>/clouds.yaml`.
These application credentials should be encrypted using `git-crypt`, especially if
they are to be pushed to a Git repository; these [docs](../repository/secrets.md#managing-secrets)
provide instructions and further information.

<!-- prettier-ignore-start -->
!!! warning
    Each concrete environment should have a separate application credential.
<!-- prettier-ignore-end -->
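For reference, a sketch of the expected `clouds.yaml` layout; the cloud name (`openstack`) and the endpoint URL are assumptions, so use the values issued alongside your credential:

```yaml
# Hedged sketch of ./environments/<name>/clouds.yaml - placeholder values only
clouds:
  openstack:
    auth_type: v3applicationcredential
    auth:
      auth_url: https://keystone.example.com:5000   # assumed Keystone endpoint
      application_credential_id: "<credential id>"
      application_credential_secret: "<credential secret>"
```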
# Kubernetes configuration

The concepts in this section apply to the Cluster API management cluster, not
the tenant clusters; configuration concerning tenant clusters is set via Magnum cluster labels.

The variables used to configure HA deployments are the same as those for Azimuth, so
only a surface level of detail is covered below. For further details, see the
[default values](https://github.com/azimuth-cloud/ansible-collection-azimuth-ops/blob/main/roles/azimuth_capi_operator/defaults/main.yml).
## Images

The clusters deployed by the Magnum CAPI Helm driver require
an Ubuntu Kubernetes image and a Magnum cluster template.

The way these user-facing images are managed differs from
[Azimuth](../configuration/03-kubernetes-config.md#images); instead, the images
and Magnum cluster templates are managed by tools found in the openstack-config
[repository](https://github.com/stackhpc/openstack-config#magnum-cluster-templates).

<!-- prettier-ignore-start -->
!!! note
    The way in which these Magnum templates and images are managed, as explained above,
    is under review.
<!-- prettier-ignore-end -->
## Multiple external networks

In cases where multiple external networks are available, you must define which one the HA cluster
should use:

```yaml title="environments/my-site/inventory/group_vars/all/variables.yml"
# The ID of the external network to use
capi_cluster_external_network_id: "<network id>"
```

<!-- prettier-ignore-start -->
!!! note
    This does **not** currently respect the "portal-external" tag.
<!-- prettier-ignore-end -->
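A sketch for finding the network ID to put in the variable above; it assumes the OpenStack CLI is installed and a `clouds.yaml` is configured, and falls back to a notice otherwise:

```shell
# List external networks (ID and name) to find the value for
# capi_cluster_external_network_id.
if command -v openstack >/dev/null 2>&1; then
    # --external restricts the listing to external networks only
    openstack network list --external -f value -c ID -c Name
else
    echo "openstack CLI not found - install python-openstackclient first"
fi
```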
|
|
||
| ## Volume-backed instances | ||
|
|
||
| It is possible to use volume-backed instances if flavors predefined with large root disks are | ||
| not available on the target cloud. | ||
|
|
||
| <!-- prettier-ignore-start --> | ||
| !!! danger "etcd and spinning disks" | ||
| The configuration options in this section should be used subject to the advice in the prerequisites. | ||
| See [prerequisites](../configuration/01-prerequisites.md#cinder-volumes-and-kubernetes) about using | ||
| Cinder volumes with Kubernetes. | ||
MaxBed4d marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| !!! warning "ceph spinning disks" | ||
| It is advised to make sure that the root disk **isnt** a spinning disk being provided ceph, rather | ||
| than the default local disk. These disks will be too slow to be able to provide a stable and | ||
| satisfactory user experience; please read [here](../configuration/01-prerequisites.md#cinder-volumes-and-kubernetes) | ||
| for more detail. | ||
|
|
||
| !!! tip "etcd on a separate block device" | ||
| If you only have a limited amount of SSD or local disk, available, consider placing etcd on a | ||
| separate block device. | ||
| See [etcd block device](#etcd-configuration) to make best use of limited capacity. | ||
| <!-- prettier-ignore-end --> | ||
|
|
||
| The following variables can be used to configure Kubernetes clusters to use volume-backed instances | ||
| (i.e. using a Cinder volume as the root disk): | ||
|
|
||
```yaml title="environments/my-site/inventory/group_vars/all/variables.yml"
#### For the HA cluster ####

# The size of the root volumes for Kubernetes nodes
capi_cluster_root_volume_size: 100
# The volume type to use for root volumes for Kubernetes nodes
capi_cluster_root_volume_type: nvme
```

<!-- prettier-ignore-start -->
!!! tip
    The available volume types can be listed using the OpenStack CLI:
    ```sh
    openstack volume type list
    ```
<!-- prettier-ignore-end -->
## Etcd configuration

As discussed [here](../configuration/01-prerequisites.md#cinder-volumes-and-kubernetes),
`etcd` is extremely sensitive to write latency. As such, it is possible
to place `etcd` on a separate block device whose volume
type can differ from that of the root disk, allowing efficient use of SSD-backed storage.
More detail on this can be found [here](../configuration/03-kubernetes-config.md#etcd-configuration).

<!-- prettier-ignore-start -->
!!! tip "Use local disk for etcd whenever possible"
    Using local disk where possible minimises write latency for etcd and also eliminates network instability as a cause of latency problems.
<!-- prettier-ignore-end -->
The following variables are used to configure the etcd block device for an HA cluster:

```yaml title="environments/my-site/inventory/group_vars/all/variables.yml"
# Specifies the size of the etcd block device in GB
# This is typically between 2GB and 10GB - Amazon recommends 8GB for EKS
# Defaults to 0, meaning etcd stays on the root device
capi_cluster_etcd_blockdevice_size: 8

# The type of block device that will be used for etcd
# Specify "Volume" (the default) to use a Cinder volume
# Specify "Local" to use local disk (the flavor must support ephemeral disk)
capi_cluster_etcd_blockdevice_type: Volume

# The Cinder volume type to use for the etcd block device
# Only used if "Volume" is specified as the block device type
# If not given, the default volume type for the cloud will be used
capi_cluster_etcd_blockdevice_volume_type: nvme

# The Cinder availability zone to use for the etcd block device
# Only used if "Volume" is specified as the block device type
# Defaults to "nova"
capi_cluster_etcd_blockdevice_volume_az: nova
```
## Load-balancer provider

If the target cloud uses [OVN networking](https://wiki.openstack.org/wiki/Neutron/ML2), and the
[OVN Octavia provider](https://docs.openstack.org/ovn-octavia-provider/latest/admin/driver.html)
is enabled, then Kubernetes clusters should be configured to use the OVN provider for
any load balancers that are created:

```yaml title="environments/my-site/inventory/group_vars/all/variables.yml"
openstack_loadbalancer_provider: ovn
```

<!-- prettier-ignore-start -->
!!! tip
    You can see the available load-balancer providers using the OpenStack CLI:
    ```sh
    openstack loadbalancer provider list
    ```
<!-- prettier-ignore-end -->
## Availability zones

By default, it is assumed that there is only a single
[availability zone (AZ)](https://docs.openstack.org/nova/latest/admin/availability-zones.html)
called `nova`. If this is not the case for your target cloud, use the `capi_cluster_*` variables described [here](../configuration/03-kubernetes-config.md#availability-zones).
# Monitoring and alerting

Just like standard Azimuth installations, CAPI management clusters are deployed with a
monitoring and alerting stack, including [Prometheus](https://prometheus.io/) for metric collection
and [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/) for alert generation
based on those metrics.

Apart from the aforementioned monitoring services, the log aggregation services
[Loki](https://grafana.com/oss/loki/) and [Promtail](https://grafana.com/docs/loki/latest/clients/promtail/)
are also deployed as part of the stack. Further components of the deployed monitoring stack are covered in Azimuth's
[monitoring documentation](../configuration/14-monitoring.md#monitoring-and-alerting).
## Accessing web interfaces

The monitoring and alerting web dashboards are exposed via the port-forwarding
[script](https://github.com/azimuth-cloud/azimuth-config/blob/devel/bin/port-forward) provided in azimuth-config.
For example, running `./bin/port-forward grafana 1234` makes the Grafana web interface available at `http://localhost:1234`. The following services are exposed:

- `grafana` for the Grafana dashboards
- `prometheus` for the Prometheus web interface
- `alertmanager` for the Alertmanager web interface
## Persistence and retention

HA deployments configure Prometheus, Alertmanager and Loki to use persistent volumes so
that metrics, alert state (e.g. silences) and logs persist across pod restarts and cluster upgrades.

Because monitoring data and logs can consume a vast amount of storage, it is important to consider
how much storage will be dedicated to storing them (volume size) and
how long the data should be kept before it is discarded (retention period).
The variables controlling these for Alertmanager, Prometheus and Loki are shown below alongside
their default values:

```yaml title="environments/my-site/inventory/group_vars/all/variables.yml"
# Alertmanager retention and volume size
capi_cluster_addons_monitoring_alertmanager_retention: 168h
capi_cluster_addons_monitoring_alertmanager_volume_size: 10Gi

# Prometheus retention and volume size
capi_cluster_addons_monitoring_prometheus_retention: 90d
capi_cluster_addons_monitoring_prometheus_volume_size: 10Gi

# Loki retention and volume size
capi_cluster_addons_monitoring_loki_retention: 744h
capi_cluster_addons_monitoring_loki_volume_size: 10Gi
```

<!-- prettier-ignore-start -->
!!! danger
    Volumes can only be **increased** in size. Any attempt to reduce the size of a volume will be rejected.
<!-- prettier-ignore-end -->
||
| ## Slack alerts | ||
|
|
||
| If your organisation uses [Slack](https://slack.com/), it is possible to configure Alertmanager to send | ||
| condition-based alerts to a Slack channel using [Incoming Webhooks](https://api.slack.com/messaging/webhooks). | ||
|
|
||
| <!-- prettier-ignore-start --> | ||
| !!! danger | ||
| The webhook URL should be kept secret. If you want to keep it in Git - which is recommended - then it must be encrypted. | ||
| See [secrets](../repository/secrets.md). | ||
| <!-- prettier-ignore-end --> | ||
|
|
||
| The instructions on how to enable Slack alerts can be found in Azimuth's own Slack alerts | ||
| [documentation](../configuration/14-monitoring.md#slack-alerts). |
# Disaster Recovery

CAPI management clusters can be configured to use [Velero](https://velero.io) as a disaster
recovery solution. Velero provides the ability to back up Kubernetes API resources to an object
store and has a plugin-based system to enable snapshotting of a cluster's persistent volumes.

<!-- prettier-ignore-start -->
!!! warning
    Backup and restore is only available for production-grade HA cluster installations.
<!-- prettier-ignore-end -->
The playbooks install Velero on the HA management cluster and the Velero command-line tool on the seed node.
Once configured with the appropriate credentials, the installation process will create a
[Schedule](https://velero.io/docs/latest/api-types/schedule/) on the HA cluster, which triggers a daily
backup at midnight and cleans up backups that are more than one week old.
<!-- prettier-ignore-start -->
!!! note
    - The [AWS Velero plugin](https://github.com/vmware-tanzu/velero-plugin-for-aws) is used for S3 support.
    - The [CSI plugin](https://github.com/vmware-tanzu/velero-plugin-for-csi) is used for volume snapshots.
    - The CSI plugin uses Kubernetes' generic support for [Volume Snapshots](https://kubernetes.io/docs/concepts/storage/volume-snapshots/).
    - This is implemented for OpenStack by the [Cinder CSI plugin](https://github.com/kubernetes/cloud-provider-openstack).
<!-- prettier-ignore-end -->

Information on how to configure and use disaster recovery can be found
[here](../configuration/15-disaster-recovery.md#configuration).
# Configuring Standalone CAPI Management Clusters

In recent years, the Kubernetes
[Cluster API project](https://cluster-api.sigs.k8s.io/) has been widely adopted as
the main driver for managing OpenStack infrastructure for container orchestration engines (COEs), such as Magnum.
This is the same Cluster API (CAPI) used by Azimuth, so their configuration and operations have a lot in common.
Therefore, this document outlines how to use
[azimuth-config](https://github.com/azimuth-cloud/azimuth-config) to deploy an Azimuth-free,
standalone CAPI management cluster, using Magnum as the chosen COE.

<!-- prettier-ignore-start -->
!!! note
    It is assumed that you have already followed the steps in setting up a configuration repository, and so have an environment for your site that is ready to be configured.
    See [Setting up a configuration repository](../repository/index.md).
<!-- prettier-ignore-end -->
There was an error while loading. Please reload this page.