Commit 2dc4385

Add the role-group as a node attribute and document cluster scaling (#63)

* docs: Describe scaling of OpenSearch clusters
* feat: Add the role-group as a node attribute
* docs: Refine the text with GPT-4o mini
* docs: Do not announce future support of smart and auto scaling
* chore: Fix the changelog

1 parent 9e4b511 commit 2dc4385

7 files changed

+285
-2
lines changed

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -4,6 +4,12 @@ All notable changes to this project will be documented in this file.

 ## [Unreleased]

+### Added
+
+- Add the role group as a node attribute ([#63]).
+
+[#63]: https://github.com/stackabletech/opensearch-operator/pull/63
+
 ## [25.11.0] - 2025-11-07

 ## [25.11.0-rc1] - 2025-11-06

Lines changed: 258 additions & 0 deletions
@@ -0,0 +1,258 @@
= Scaling OpenSearch clusters
:description: OpenSearch clusters can be scaled after provisioning but manual steps are required.

OpenSearch clusters can be scaled after provisioning.
CPU and memory settings can be easily adjusted, as detailed in xref:opensearch:usage-guide/storage-resource-configuration.adoc#_resource_requests[Resource Requests].
However, when changing the number of nodes or resizing volumes, the following considerations must be kept in mind.

Horizontal scaling, which involves adjusting the replica count of role groups, can be easily accomplished for non-data nodes by modifying the OpenSearchCluster specification.
Additionally, the number of data nodes can be increased.
However, reducing the number of data nodes requires manual intervention.
If a pod that manages data is simply shut down, its data becomes inaccessible.
Therefore, it is necessary to manually drain the data from the nodes before removing them.

Vertical scaling, which refers to changing the volume size of nodes, is not supported by the operator.
Whether the size of a volume can be changed depends on its CSI driver.
OpenSearch allows for multiple data paths within a single data node, but adding volumes to additional data paths typically does not resolve low disk space issues, as the data is not automatically rebalanced across multiple data paths.

[NOTE]
====
The OpenSearch operator is currently in the early stages of development.
Smart scaling (adapting resources without data loss) and auto scaling (scaling the cluster based on load) are not supported.
====

== Manually scaling

As noted above, removing data nodes and resizing volumes are not handled by the operator.
A workaround is to add a new role group, move the data to it, and remove the old role group, as presented here.

For example, the following OpenSearchCluster has been deployed with three cluster-manager nodes and five small data nodes:

[source,yaml]
----
spec:
  nodes:
    roleGroups:
      cluster-manager:
        config:
          nodeRoles:
            - cluster_manager
        replicas: 3
      data-small:
        config:
          nodeRoles:
            - data
            - ingest
            - remote_cluster_client
          resources:
            storage:
              data:
                capacity: 10Gi
        replicas: 5
----

You have decided that three large data nodes would be more suitable than five small ones.
To implement this change, you can replace the role group `data-small` with your preferred option.

First, add the new role group `data-large` with three replicas, each having a capacity of 100Gi per node:

[source,yaml]
----
spec:
  nodes:
    roleGroups:
      cluster-manager:
        config:
          nodeRoles:
            - cluster_manager
        replicas: 3
      data-small:
        config:
          nodeRoles:
            - data
            - ingest
            - remote_cluster_client
          resources:
            storage:
              data:
                capacity: 10Gi
        replicas: 5
      data-large:
        config:
          nodeRoles:
            - data
            - ingest
            - remote_cluster_client
          resources:
            storage:
              data:
                capacity: 100Gi
        replicas: 3
----

The data must now be transferred from `data-small` to `data-large`.
By using the cluster setting `cluster.routing.allocation.exclude`, you can exclude nodes from shard allocation.
If rebalancing has not been disabled, existing data will automatically move from the specified nodes to the allowed ones, in this case from `data-small` to `data-large`.

[TIP]
====
The OpenSearch operator assigns a role group attribute to each OpenSearch node, making it easier to reference all nodes associated with a specific role group.
====

The following REST call excludes the `data-small` role group from shard allocation:

[source,http]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation.exclude": {
          "role-group": "data-small"
        }
      }
    }
  }
}
----
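
If this call is scripted, the request body for setting the exclusion and for clearing it later differs only in the attribute value.
The following Rust sketch builds both bodies.
The function name `allocation_exclude_body` is hypothetical and not part of the operator, and the code assumes that role-group names contain no characters that need JSON escaping.

```rust
/// Builds the body for `PUT _cluster/settings` that excludes all nodes whose
/// `role-group` attribute equals the given name from shard allocation.
/// With `None`, the value is `null`, which removes the exclusion again.
///
/// Assumption: the role-group name needs no JSON escaping.
fn allocation_exclude_body(role_group: Option<&str>) -> String {
    let value = match role_group {
        Some(name) => format!("\"{name}\""),
        None => "null".to_owned(),
    };

    // Same nesting as in the REST call above, written without whitespace.
    let mut body = String::from(
        r#"{"persistent":{"cluster":{"routing":{"allocation.exclude":{"role-group":"#,
    );
    body.push_str(&value);
    body.push_str("}}}}}");
    body
}
```

Calling `allocation_exclude_body(Some("data-small"))` yields the body for the exclusion, and `allocation_exclude_body(None)` yields the body that resets it.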

You must wait until all data has been transferred from `data-small` to `data-large`.
You can request the current shard allocation at the `_cat/shards` endpoint, for example:

[source,http]
----
GET _cat/shards?v
index shard prirep state      docs  store ip          node
logs  0     r      STARTED    14074 6.9mb 10.244.0.60 opensearch-nodes-data-large-2
logs  0     p      RELOCATING 14074 8.5mb 10.244.0.52 opensearch-nodes-data-small-4
                                          -> 10.244.0.59 NFjQBBmWSm-pijXcxrXnvQ opensearch-nodes-data-large-1
...

GET _cat/shards?v
index shard prirep state      docs  store ip          node
logs  0     r      STARTED    14074 6.9mb 10.244.0.60 opensearch-nodes-data-large-2
logs  0     p      STARTED    14074 6.9mb 10.244.0.59 opensearch-nodes-data-large-1
...
----
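
The check on the `_cat/shards` output can also be automated.
The following Rust sketch filters the response text for lines that still mention a node of the old role group.
It assumes that node names have the form `<prefix><ordinal>`, as in the sample output above (for example `opensearch-nodes-data-small-4`), and the function names are hypothetical.

```rust
/// Returns true if `column` is a node name of the form `<node_prefix><ordinal>`,
/// e.g. `opensearch-nodes-data-small-4` for the prefix `opensearch-nodes-data-small-`.
fn is_node_of(column: &str, node_prefix: &str) -> bool {
    column.strip_prefix(node_prefix).map_or(false, |ordinal| {
        !ordinal.is_empty() && ordinal.chars().all(|c| c.is_ascii_digit())
    })
}

/// Returns all lines of a `_cat/shards` response that mention a node with the
/// given name prefix in any column, whether the shard is started or relocating.
fn shards_on_role_group<'a>(cat_shards: &'a str, node_prefix: &str) -> Vec<&'a str> {
    cat_shards
        .lines()
        .filter(|line| {
            line.split_whitespace()
                .any(|column| is_node_of(column, node_prefix))
        })
        .collect()
}
```

An empty result for the prefix `opensearch-nodes-data-small-` indicates that no shard is listed on the old role group anymore.
Requiring a numeric ordinal after the prefix avoids matching a role group whose name merely starts with `data-small`.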

Statistics, particularly the document count, can be retrieved from the `_nodes/role-group:data-small/stats` endpoint, for example:

[source,http]
----
GET _nodes/role-group:data-small/stats/indices/docs
{
  "_nodes": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "cluster_name": "opensearch",
  "nodes": {
    "wjaeQJUXQX6eNWYUeiScgQ": {
      "timestamp": 1761992580239,
      "name": "opensearch-nodes-data-small-4",
      "transport_address": "10.244.0.52:9300",
      "host": "10.244.0.52",
      "ip": "10.244.0.52:9300",
      "roles": [
        "data",
        "ingest",
        "remote_cluster_client"
      ],
      "attributes": {
        "role-group": "data-small",
        "shard_indexing_pressure_enabled": "true"
      },
      "indices": {
        "docs": {
          "count": 14686,
          "deleted": 0
        }
      }
    },
    ...
  }
}

GET _nodes/role-group:data-small/stats/indices/docs
{
  "_nodes": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "cluster_name": "opensearch",
  "nodes": {
    "wjaeQJUXQX6eNWYUeiScgQ": {
      "timestamp": 1761992817422,
      "name": "opensearch-nodes-data-small-4",
      "transport_address": "10.244.0.52:9300",
      "host": "10.244.0.52",
      "ip": "10.244.0.52:9300",
      "roles": [
        "data",
        "ingest",
        "remote_cluster_client"
      ],
      "attributes": {
        "role-group": "data-small",
        "shard_indexing_pressure_enabled": "true"
      },
      "indices": {
        "docs": {
          "count": 0,
          "deleted": 0
        }
      }
    },
    ...
  }
}

----

Once all shards have been transferred, the `data-small` role group can be removed from the OpenSearchCluster specification:

[source,yaml]
----
spec:
  nodes:
    roleGroups:
      cluster-manager:
        config:
          nodeRoles:
            - cluster_manager
        replicas: 3
      data-large:
        config:
          nodeRoles:
            - data
            - ingest
            - remote_cluster_client
          resources:
            storage:
              data:
                capacity: 100Gi
        replicas: 3
----

Finally, the shard exclusion should be removed from the cluster settings:

[source,http]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation.exclude": {
          "role-group": null
        }
      }
    }
  }
}
----

If your OpenSearch clients connected to the cluster exclusively through the cluster-manager nodes, the switch from one data role group to another should have been seamless for them.

docs/modules/opensearch/pages/usage-guide/storage-resource-configuration.adoc

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ nodes:
       config:
         resources:
           storage:
-            logDirs:
+            data:
               capacity: 50Gi
 ----

docs/modules/opensearch/partials/nav.adoc

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@
 ** xref:opensearch:usage-guide/monitoring.adoc[]
 ** xref:opensearch:usage-guide/logging.adoc[]
 ** xref:opensearch:usage-guide/opensearch-dashboards.adoc[]
+** xref:opensearch:usage-guide/scaling.adoc[]
 ** xref:opensearch:usage-guide/operations/index.adoc[]
 *** xref:opensearch:usage-guide/operations/cluster-operations.adoc[]
 *** xref:opensearch:usage-guide/operations/pod-placement.adoc[]

rust/operator-binary/src/controller/build/node_config.rs

Lines changed: 16 additions & 1 deletion
@@ -10,7 +10,7 @@ use crate::{
     controller::OpenSearchRoleGroupConfig,
     crd::v1alpha1,
     framework::{
-        ServiceName,
+        RoleGroupName, ServiceName,
         builder::pod::container::{EnvVarName, EnvVarSet},
         role_group_utils,
     },
@@ -41,6 +41,10 @@ pub const CONFIG_OPTION_INITIAL_CLUSTER_MANAGER_NODES: &str =
 /// Type: string
 pub const CONFIG_OPTION_NETWORK_HOST: &str = "network.host";

+/// The custom node attribute "role-group"
+/// Type: string
+pub const CONFIG_OPTION_NODE_ATTR_ROLE_GROUP: &str = "node.attr.role-group";
+
 /// A descriptive name for the node.
 /// Type: string
 pub const CONFIG_OPTION_NODE_NAME: &str = "node.name";
@@ -61,6 +65,7 @@ pub const CONFIG_OPTION_PLUGINS_SECURITY_SSL_HTTP_ENABLED: &str =
 /// Configuration of an OpenSearch node based on the cluster and role-group configuration
 pub struct NodeConfig {
     cluster: ValidatedCluster,
+    role_group_name: RoleGroupName,
     role_group_config: OpenSearchRoleGroupConfig,
     discovery_service_name: ServiceName,
 }
@@ -70,11 +75,13 @@ pub struct NodeConfig {
 impl NodeConfig {
     pub fn new(
         cluster: ValidatedCluster,
+        role_group_name: RoleGroupName,
         role_group_config: OpenSearchRoleGroupConfig,
         discovery_service_name: ServiceName,
     ) -> Self {
         Self {
             cluster,
+            role_group_name,
             role_group_config,
             discovery_service_name,
         }
@@ -111,6 +118,10 @@ impl NodeConfig {
             CONFIG_OPTION_PLUGINS_SECURITY_NODES_DN.to_owned(),
             json!(["CN=generated certificate for pod".to_owned()]),
         );
+        config.insert(
+            CONFIG_OPTION_NODE_ATTR_ROLE_GROUP.to_owned(),
+            json!(self.role_group_name),
+        );

         for (setting, value) in self
             .role_group_config
@@ -311,6 +322,8 @@ mod tests {
         let image: ProductImage = serde_json::from_str(r#"{"productVersion": "3.1.0"}"#)
             .expect("should be a valid ProductImage");

+        let role_group_name = RoleGroupName::from_str_unsafe("data");
+
         let role_group_config = OpenSearchRoleGroupConfig {
             replicas: test_config.replicas,
             config: ValidatedOpenSearchConfig {
@@ -374,6 +387,7 @@ mod tests {

         NodeConfig::new(
             cluster,
+            role_group_name,
             role_group_config,
             ServiceName::from_str_unsafe("my-opensearch-cluster-manager"),
         )
@@ -391,6 +405,7 @@ mod tests {
                 "cluster.name: \"my-opensearch-cluster\"\n",
                 "discovery.type: \"zen\"\n",
                 "network.host: \"0.0.0.0\"\n",
+                "node.attr.role-group: \"data\"\n",
                 "plugins.security.nodes_dn: [\"CN=generated certificate for pod\"]\n",
                 "test: \"value\""
             )
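
The effect of this change can be illustrated in isolation: the role-group name is inserted under the key `node.attr.role-group` and ends up as one more line in the rendered node configuration.
The following sketch is a simplified stand-in, not the operator's code: it uses a plain `BTreeMap` of strings instead of `NodeConfig`, `RoleGroupName`, and `serde_json`, and the function `render_config` is hypothetical.

```rust
use std::collections::BTreeMap;

/// Key of the custom node attribute, as defined in the diff above.
const CONFIG_OPTION_NODE_ATTR_ROLE_GROUP: &str = "node.attr.role-group";

/// Hypothetical, simplified rendering: string values only,
/// one `key: "value"` line per entry, sorted by key.
fn render_config(role_group_name: &str) -> String {
    let mut config = BTreeMap::new();
    config.insert("network.host".to_owned(), "0.0.0.0".to_owned());
    // The line added by this commit: expose the role group as a node attribute.
    config.insert(
        CONFIG_OPTION_NODE_ATTR_ROLE_GROUP.to_owned(),
        role_group_name.to_owned(),
    );

    config
        .iter()
        .map(|(key, value)| format!("{key}: \"{value}\""))
        .collect::<Vec<_>>()
        .join("\n")
}
```

For the role group `data`, the rendered text contains the line `node.attr.role-group: "data"`, which matches the expectation added to the unit test in the diff.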

rust/operator-binary/src/controller/build/role_group_builder.rs

Lines changed: 1 addition & 0 deletions
@@ -101,6 +101,7 @@ impl<'a> RoleGroupBuilder<'a> {
             cluster: cluster.clone(),
             node_config: NodeConfig::new(
                 cluster.clone(),
+                role_group_name.clone(),
                 role_group_config.clone(),
                 discovery_service_name,
             ),

tests/templates/kuttl/smoke/10-assert.yaml.j2

Lines changed: 2 additions & 0 deletions
@@ -650,6 +650,7 @@ data:
     cluster.routing.allocation.disk.threshold_enabled: "false"
     discovery.type: "zen"
     network.host: "0.0.0.0"
+    node.attr.role-group: "cluster-manager"
     node.store.allow_mmap: "false"
     plugins.security.allow_default_init_securityindex: "true"
     plugins.security.nodes_dn: ["CN=generated certificate for pod"]
@@ -685,6 +686,7 @@ data:
     cluster.routing.allocation.disk.threshold_enabled: "false"
     discovery.type: "zen"
     network.host: "0.0.0.0"
+    node.attr.role-group: "data"
     node.store.allow_mmap: "false"
     plugins.security.allow_default_init_securityindex: "true"
     plugins.security.nodes_dn: ["CN=generated certificate for pod"]
