Commit 4dd718b

[receiver/kubeletstats] Add k8s.container.cpu.node.utilization metric (#32295)
**Description:**

At the moment we calculate `k8s.container.cpu_limit_utilization` as [a ratio of the container's limits](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/documentation.md#k8scontainercpu_limit_utilization) at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/867d6700c31446172e6998e602c55fbf7351831f/receiver/kubeletstatsreceiver/internal/kubelet/cpu.go#L30.

Similarly, we can calculate the CPU utilization as a ratio of the whole node's allocatable CPU by dividing by the node's total number of cores. We can retrieve this information from the Node's `Status.Capacity`, for example:

```console
$ k get nodes kind-control-plane -ojsonpath='{.status.capacity}'
{"cpu":"8","ephemeral-storage":"485961008Ki","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"32564732Ki","pods":"110"}
```

## Performance concerns

In order to get the Node's capacity we need to call the k8s API to get the Node object. Something to consider here is the performance impact of this extra API call. We can always keep this metric disabled by default and clearly state in the docs that it comes with an extra API call to get the Node of the Pods.

The good thing is that the `kubeletstats` receiver targets only one node, so I believe it's a safe assumption to only fetch the current node, because all the observed Pods belong to that single local node. Correct me if I miss anything here.

In addition, instead of performing the API call explicitly on every single scrape, we can use an informer and leverage its cache. I can change this patch in that direction if we agree on it. Would love to hear others' opinions on this.

## Todos

✅ 1) Apply this change behind a feature gate, as indicated at #27885 (comment)
✅ 2) Use an informer instead of direct API calls.

**Link to tracking Issue:** ref: #27885

**Testing:**

I experimented with this approach and the results look correct. To verify this I deployed a stress Pod on my machine to consume a target CPU of 4 cores:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-stress
spec:
  containers:
  - name: cpu-stress
    image: polinux/stress
    command: ["stress"]
    args: ["-c", "4"]
```

The collected `container.cpu.utilization` for that Pod's container was then `0.5`, as expected, given that my machine/node comes with 8 cores in total:

![cpu-stress](https://github.com/open-telemetry/opentelemetry-collector-contrib/assets/11754898/3abe4a0d-6c99-4b4e-a704-da5789dde01b)

A unit test is also included.

**Documentation:**

Added: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32295/files#diff-8ad3b506fb1132c961e8da99b677abd31f0108e3f9ed6999dd96ad3297b51e08

---------

Signed-off-by: ChrsMark <[email protected]>
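As a rough illustration of the lookup described above, the sketch below uses client-go to read the node's CPU capacity from `Status.Capacity` and divide a container's usage by it. It is only a minimal standalone example: the in-cluster clientset wiring, node name, and usage value are assumptions for illustration, while the actual patch goes through the receiver's `k8s_api_config` and an informer.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// nodeCPUCapacity reads the node's CPU capacity (in cores) from Status.Capacity.
func nodeCPUCapacity(ctx context.Context, client kubernetes.Interface, nodeName string) (float64, error) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	// Status.Capacity["cpu"] is a resource.Quantity, e.g. "8" on an 8-core node.
	return node.Status.Capacity.Cpu().AsApproximateFloat64(), nil
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes the collector runs in-cluster
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	capacity, err := nodeCPUCapacity(context.Background(), client, "kind-control-plane") // node name assumed
	if err != nil {
		panic(err)
	}

	// A container using 4 full cores (UsageNanoCores / 1e9) on an 8-core node -> 0.5.
	usageCores := 4.0
	fmt.Printf("k8s.container.cpu.node.utilization = %.2f\n", usageCores/capacity)
}
```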
1 parent e13b1a3 commit 4dd718b

25 files changed: +702 −38 lines changed
Lines changed: 27 additions & 0 deletions

```yaml
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: 'enhancement'

# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
component: kubeletstatsreceiver

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Add k8s.container.cpu.node.utilization metric

# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
issues: [27885]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext:

# If your change doesn't affect end users or the exported elements of any package,
# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
# Optional: The change log or logs in which this entry should be included.
# e.g. '[user]' or '[user, api]'
# Include 'user' if the change is relevant to end users.
# Include 'api' if there is a change to a library API.
# Default: '[user]'
change_logs: []
```

receiver/kubeletstatsreceiver/README.md

Lines changed: 30 additions & 0 deletions

````diff
@@ -218,6 +218,36 @@ receivers:
     - pod
 ```
 
+### Collect k8s.container.cpu.node.utilization as a ratio of the node's total capacity
+
+In order to calculate the `k8s.container.cpu.node.utilization` metric, the node's capacity
+must be retrieved from the k8s API. For this, the `k8s_api_config` needs to be set.
+In addition, the node name must be identified properly. The `K8S_NODE_NAME` env var can be set using the
+downward API inside the collector pod spec as follows:
+
+```yaml
+env:
+  - name: K8S_NODE_NAME
+    valueFrom:
+      fieldRef:
+        fieldPath: spec.nodeName
+```
+
+Then set the `node` value to `${env:K8S_NODE_NAME}` in the receiver's configuration:
+
+```yaml
+receivers:
+  kubeletstats:
+    collection_interval: 10s
+    auth_type: 'serviceAccount'
+    endpoint: '${env:K8S_NODE_NAME}:10250'
+    node: '${env:K8S_NODE_NAME}'
+    k8s_api_config:
+      auth_type: serviceAccount
+    metrics:
+      k8s.container.cpu.node.utilization:
+        enabled: true
+```
+
 ### Optional parameters
 
 The following parameters can also be specified:
````
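Note that with `k8s_api_config` enabled, the collector's service account must also be allowed to read Node objects. A minimal RBAC sketch is shown below; the role, binding, service account, and namespace names are assumptions for illustration and are not part of this PR:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-node-reader   # assumed name
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-node-reader   # assumed name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-node-reader
subjects:
  - kind: ServiceAccount
    name: otel-collector              # assumed service account
    namespace: default                # assumed namespace
```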
receiver/kubeletstatsreceiver/config.go

Lines changed: 20 additions & 0 deletions

```diff
@@ -40,6 +40,19 @@ type Config struct {
 	// Configuration of the Kubernetes API client.
 	K8sAPIConfig *k8sconfig.APIConfig `mapstructure:"k8s_api_config"`
 
+	// NodeName is the node name to limit the discovery of nodes.
+	// For example, node name can be set using the downward API inside the collector
+	// pod spec as follows:
+	//
+	//   env:
+	//     - name: K8S_NODE_NAME
+	//       valueFrom:
+	//         fieldRef:
+	//           fieldPath: spec.nodeName
+	//
+	// Then set this value to ${env:K8S_NODE_NAME} in the configuration.
+	NodeName string `mapstructure:"node"`
+
 	// MetricsBuilderConfig allows customizing scraped metrics/attributes representation.
 	metadata.MetricsBuilderConfig `mapstructure:",squash"`
 }
@@ -105,3 +118,10 @@ func (cfg *Config) Unmarshal(componentParser *confmap.Conf) error {
 
 	return nil
 }
+
+func (cfg *Config) Validate() error {
+	if cfg.Metrics.K8sContainerCPUNodeUtilization.Enabled && cfg.NodeName == "" {
+		return errors.New("for k8s.container.cpu.node.utilization node setting is required. Check the readme on how to set the required setting")
+	}
+	return nil
+}
```
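To see the new `Validate` rule in action: enabling the metric without setting `node` now fails config validation. The snippet below is a hypothetical example, not one of the testdata configs in this PR:

```yaml
receivers:
  kubeletstats:
    auth_type: 'serviceAccount'
    endpoint: '${env:K8S_NODE_NAME}:10250'
    k8s_api_config:
      auth_type: serviceAccount
    # node is intentionally omitted, so Validate() returns:
    # "for k8s.container.cpu.node.utilization node setting is required. Check the readme on how to set the required setting"
    metrics:
      k8s.container.cpu.node.utilization:
        enabled: true
```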

receiver/kubeletstatsreceiver/config_test.go

Lines changed: 39 additions & 5 deletions

```diff
@@ -31,9 +31,9 @@ func TestLoadConfig(t *testing.T) {
 	duration := 10 * time.Second
 
 	tests := []struct {
-		id          component.ID
-		expected    component.Config
-		expectedErr error
+		id                    component.ID
+		expected              component.Config
+		expectedValidationErr string
 	}{
 		{
 			id: component.NewIDWithName(metadata.Type, "default"),
@@ -173,6 +173,34 @@ func TestLoadConfig(t *testing.T) {
 				MetricsBuilderConfig: metadata.DefaultMetricsBuilderConfig(),
 			},
 		},
+		{
+			id: component.NewIDWithName(metadata.Type, "container_cpu_node_utilization"),
+			expected: &Config{
+				ControllerConfig: scraperhelper.ControllerConfig{
+					CollectionInterval: duration,
+					InitialDelay:       time.Second,
+				},
+				ClientConfig: kube.ClientConfig{
+					APIConfig: k8sconfig.APIConfig{
+						AuthType: "tls",
+					},
+				},
+				MetricGroupsToCollect: []kubelet.MetricGroup{
+					kubelet.ContainerMetricGroup,
+					kubelet.PodMetricGroup,
+					kubelet.NodeMetricGroup,
+				},
+				MetricsBuilderConfig: metadata.MetricsBuilderConfig{
+					Metrics: metadata.MetricsConfig{
+						K8sContainerCPUNodeUtilization: metadata.MetricConfig{
+							Enabled: true,
+						},
+					},
+					ResourceAttributes: metadata.DefaultResourceAttributesConfig(),
+				},
+			},
+			expectedValidationErr: "for k8s.container.cpu.node.utilization node setting is required. Check the readme on how to set the required setting",
+		},
 	}
 
 	for _, tt := range tests {
@@ -184,8 +212,14 @@ func TestLoadConfig(t *testing.T) {
 			require.NoError(t, err)
 			require.NoError(t, component.UnmarshalConfig(sub, cfg))
 
-			assert.NoError(t, component.ValidateConfig(cfg))
-			assert.Equal(t, tt.expected, cfg)
+			err = component.ValidateConfig(cfg)
+			if tt.expectedValidationErr != "" {
+				assert.EqualError(t, err, tt.expectedValidationErr)
+			} else {
+				assert.NoError(t, err)
+				assert.Equal(t, tt.expected, cfg)
+			}
+
 		})
 	}
 }
```

receiver/kubeletstatsreceiver/documentation.md

Lines changed: 8 additions & 0 deletions

```diff
@@ -402,6 +402,14 @@ The time since the container started
 | ---- | ----------- | ---------- | ----------------------- | --------- |
 | s | Sum | Int | Cumulative | true |
 
+### k8s.container.cpu.node.utilization
+
+Container cpu utilization as a ratio of the node's capacity
+
+| Unit | Metric Type | Value Type |
+| ---- | ----------- | ---------- |
+| 1 | Gauge | Double |
+
 ### k8s.container.cpu_limit_utilization
 
 Container cpu utilization as a ratio of the container's limits
```

receiver/kubeletstatsreceiver/factory.go

Lines changed: 1 addition & 1 deletion

```diff
@@ -68,7 +68,7 @@ func createMetricsReceiver(
 		return nil, err
 	}
 
-	scrp, err := newKubletScraper(rest, set, rOptions, cfg.MetricsBuilderConfig)
+	scrp, err := newKubletScraper(rest, set, rOptions, cfg.MetricsBuilderConfig, cfg.NodeName)
 	if err != nil {
 		return nil, err
 	}
```

receiver/kubeletstatsreceiver/internal/kubelet/accumulator.go

Lines changed: 3 additions & 3 deletions

```diff
@@ -56,7 +56,7 @@ func (a *metricDataAccumulator) nodeStats(s stats.NodeStats) {
 
 	currentTime := pcommon.NewTimestampFromTime(a.time)
 	addUptimeMetric(a.mbs.NodeMetricsBuilder, metadata.NodeUptimeMetrics.Uptime, s.StartTime, currentTime)
-	addCPUMetrics(a.mbs.NodeMetricsBuilder, metadata.NodeCPUMetrics, s.CPU, currentTime, resources{})
+	addCPUMetrics(a.mbs.NodeMetricsBuilder, metadata.NodeCPUMetrics, s.CPU, currentTime, resources{}, 0)
 	addMemoryMetrics(a.mbs.NodeMetricsBuilder, metadata.NodeMemoryMetrics, s.Memory, currentTime, resources{})
 	addFilesystemMetrics(a.mbs.NodeMetricsBuilder, metadata.NodeFilesystemMetrics, s.Fs, currentTime)
 	addNetworkMetrics(a.mbs.NodeMetricsBuilder, metadata.NodeNetworkMetrics, s.Network, currentTime)
@@ -76,7 +76,7 @@ func (a *metricDataAccumulator) podStats(s stats.PodStats) {
 
 	currentTime := pcommon.NewTimestampFromTime(a.time)
 	addUptimeMetric(a.mbs.PodMetricsBuilder, metadata.PodUptimeMetrics.Uptime, s.StartTime, currentTime)
-	addCPUMetrics(a.mbs.PodMetricsBuilder, metadata.PodCPUMetrics, s.CPU, currentTime, a.metadata.podResources[s.PodRef.UID])
+	addCPUMetrics(a.mbs.PodMetricsBuilder, metadata.PodCPUMetrics, s.CPU, currentTime, a.metadata.podResources[s.PodRef.UID], 0)
 	addMemoryMetrics(a.mbs.PodMetricsBuilder, metadata.PodMemoryMetrics, s.Memory, currentTime, a.metadata.podResources[s.PodRef.UID])
 	addFilesystemMetrics(a.mbs.PodMetricsBuilder, metadata.PodFilesystemMetrics, s.EphemeralStorage, currentTime)
 	addNetworkMetrics(a.mbs.PodMetricsBuilder, metadata.PodNetworkMetrics, s.Network, currentTime)
@@ -110,7 +110,7 @@ func (a *metricDataAccumulator) containerStats(sPod stats.PodStats, s stats.Cont
 	currentTime := pcommon.NewTimestampFromTime(a.time)
 	resourceKey := sPod.PodRef.UID + s.Name
 	addUptimeMetric(a.mbs.ContainerMetricsBuilder, metadata.ContainerUptimeMetrics.Uptime, s.StartTime, currentTime)
-	addCPUMetrics(a.mbs.ContainerMetricsBuilder, metadata.ContainerCPUMetrics, s.CPU, currentTime, a.metadata.containerResources[resourceKey])
+	addCPUMetrics(a.mbs.ContainerMetricsBuilder, metadata.ContainerCPUMetrics, s.CPU, currentTime, a.metadata.containerResources[resourceKey], a.metadata.cpuNodeLimit)
 	addMemoryMetrics(a.mbs.ContainerMetricsBuilder, metadata.ContainerMemoryMetrics, s.Memory, currentTime, a.metadata.containerResources[resourceKey])
 	addFilesystemMetrics(a.mbs.ContainerMetricsBuilder, metadata.ContainerFilesystemMetrics, s.Rootfs, currentTime)
 
```

receiver/kubeletstatsreceiver/internal/kubelet/accumulator_test.go

Lines changed: 4 additions & 4 deletions

```diff
@@ -53,7 +53,7 @@ func TestMetadataErrorCases(t *testing.T) {
 					},
 				},
 			},
-		}, nil),
+		}, NodeLimits{}, nil),
 		testScenario: func(acc metricDataAccumulator) {
 			now := metav1.Now()
 			podStats := stats.PodStats{
@@ -79,7 +79,7 @@ func TestMetadataErrorCases(t *testing.T) {
 		metricGroupsToCollect: map[MetricGroup]bool{
 			VolumeMetricGroup: true,
 		},
-		metadata: NewMetadata([]MetadataLabel{MetadataLabelVolumeType}, nil, nil),
+		metadata: NewMetadata([]MetadataLabel{MetadataLabelVolumeType}, nil, NodeLimits{}, nil),
 		testScenario: func(acc metricDataAccumulator) {
 			podStats := stats.PodStats{
 				PodRef: stats.PodReference{
@@ -121,7 +121,7 @@ func TestMetadataErrorCases(t *testing.T) {
 					},
 				},
 			},
-		}, nil),
+		}, NodeLimits{}, nil),
 		testScenario: func(acc metricDataAccumulator) {
 			podStats := stats.PodStats{
 				PodRef: stats.PodReference{
@@ -165,7 +165,7 @@ func TestMetadataErrorCases(t *testing.T) {
 					},
 				},
 			},
-		}, nil),
+		}, NodeLimits{}, nil),
 		detailedPVCLabelsSetterOverride: func(*metadata.ResourceBuilder, string, string, string) error {
 			// Mock failure cases.
 			return errors.New("")
```

receiver/kubeletstatsreceiver/internal/kubelet/cpu.go

Lines changed: 18 additions & 3 deletions

```diff
@@ -10,22 +10,37 @@ import (
 	"github.com/open-telemetry/opentelemetry-collector-contrib/receiver/kubeletstatsreceiver/internal/metadata"
 )
 
-func addCPUMetrics(mb *metadata.MetricsBuilder, cpuMetrics metadata.CPUMetrics, s *stats.CPUStats, currentTime pcommon.Timestamp, r resources) {
+func addCPUMetrics(
+	mb *metadata.MetricsBuilder,
+	cpuMetrics metadata.CPUMetrics,
+	s *stats.CPUStats,
+	currentTime pcommon.Timestamp,
+	r resources,
+	nodeCPULimit float64) {
 	if s == nil {
 		return
 	}
-	addCPUUsageMetric(mb, cpuMetrics, s, currentTime, r)
+	addCPUUsageMetric(mb, cpuMetrics, s, currentTime, r, nodeCPULimit)
 	addCPUTimeMetric(mb, cpuMetrics.Time, s, currentTime)
 }
 
-func addCPUUsageMetric(mb *metadata.MetricsBuilder, cpuMetrics metadata.CPUMetrics, s *stats.CPUStats, currentTime pcommon.Timestamp, r resources) {
+func addCPUUsageMetric(
+	mb *metadata.MetricsBuilder,
+	cpuMetrics metadata.CPUMetrics,
+	s *stats.CPUStats,
+	currentTime pcommon.Timestamp,
+	r resources,
+	nodeCPULimit float64) {
 	if s.UsageNanoCores == nil {
 		return
 	}
 	value := float64(*s.UsageNanoCores) / 1_000_000_000
 	cpuMetrics.Utilization(mb, currentTime, value)
 	cpuMetrics.Usage(mb, currentTime, value)
 
+	if nodeCPULimit > 0 {
+		cpuMetrics.NodeUtilization(mb, currentTime, value/nodeCPULimit)
+	}
 	if r.cpuLimit > 0 {
 		cpuMetrics.LimitUtilization(mb, currentTime, value/r.cpuLimit)
 	}
```
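To make the arithmetic concrete, here is a small worked example of the new branch, using hypothetical numbers that mirror the stress test from the PR description and assuming the node CPU limit is passed down in cores (to match the converted usage value):

```go
package main

import "fmt"

func main() {
	// Hypothetical numbers: a container burning ~4 cores on an 8-core node.
	usageNanoCores := uint64(4_000_000_000)          // s.UsageNanoCores reported by the kubelet
	value := float64(usageNanoCores) / 1_000_000_000 // 4.0 cores, same conversion as addCPUUsageMetric
	nodeCPULimit := 8.0                              // node CPU passed down by the scraper, assumed to be in cores

	fmt.Println(value / nodeCPULimit) // 0.5 -> recorded as k8s.container.cpu.node.utilization
}
```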

receiver/kubeletstatsreceiver/internal/kubelet/metadata.go

Lines changed: 8 additions & 1 deletion

```diff
@@ -52,6 +52,7 @@ type Metadata struct {
 	DetailedPVCResourceSetter func(rb *metadata.ResourceBuilder, volCacheID, volumeClaim, namespace string) error
 	podResources              map[string]resources
 	containerResources        map[string]resources
+	cpuNodeLimit              float64
 }
 
 type resources struct {
@@ -61,6 +62,11 @@ type resources struct {
 	memoryLimit int64
 }
 
+type NodeLimits struct {
+	Name              string
+	CPUNanoCoresLimit float64
+}
+
 func getContainerResources(r *v1.ResourceRequirements) resources {
 	if r == nil {
 		return resources{}
@@ -74,14 +80,15 @@ func getContainerResources(r *v1.ResourceRequirements) resources {
 	}
 }
 
-func NewMetadata(labels []MetadataLabel, podsMetadata *v1.PodList,
+func NewMetadata(labels []MetadataLabel, podsMetadata *v1.PodList, nodeResourceLimits NodeLimits,
 	detailedPVCResourceSetter func(rb *metadata.ResourceBuilder, volCacheID, volumeClaim, namespace string) error) Metadata {
 	m := Metadata{
 		Labels:                    getLabelsMap(labels),
 		PodsMetadata:              podsMetadata,
 		DetailedPVCResourceSetter: detailedPVCResourceSetter,
 		podResources:              make(map[string]resources),
 		containerResources:        make(map[string]resources),
+		cpuNodeLimit:              nodeResourceLimits.CPUNanoCoresLimit,
 	}
 
 	if podsMetadata != nil {
```
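For context, the call site for the new argument lives in the scraper, which is not shown in this section. A hedged sketch of how a populated `NodeLimits` would be threaded into `NewMetadata` from inside the `kubelet` package (the node name, CPU value, and helper name are illustrative assumptions):

```go
// newMetadataWithNodeLimits is a hypothetical helper showing how the new
// NodeLimits argument is passed to NewMetadata; the real scraper populates
// NodeLimits from the Node informer's cache.
func newMetadataWithNodeLimits() Metadata {
	nodeLimits := NodeLimits{
		Name:              "kind-control-plane", // assumed node name
		CPUNanoCoresLimit: 8,                    // node CPU capacity read from Status.Capacity
	}
	// nil PodList and nil PVC setter, mirroring several of the accumulator tests above.
	return NewMetadata([]MetadataLabel{MetadataLabelContainerID}, nil, nodeLimits, nil)
}
```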
