refactor: enable metrics manifests by AvineshTripathi · Pull Request #79 · kubernetes-sigs/node-readiness-controller

AvineshTripathi · 2026-01-11T15:42:14Z

Enabled Prometheus and Cert-Manager in the Kustomize configuration to support secure metrics scraping and certificate management.

How to test locally:

1. Create a kind cluster (or use any existing cluster).
2. Install Prometheus (I used the https://github.com/prometheus-operator/kube-prometheus quickstart).
3. Install cert-manager and its CRDs (kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.19.2/cert-manager.yaml).
4. Run make deploy ENABLE_METRICS=true ENABLE_TLS=true.
5. Expose Grafana (kubectl port-forward svc/grafana -n monitoring 3000).
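
The steps above can be sketched as a small script. This is only an illustration under assumptions: the `run` helper and the `DRY_RUN` switch are hypothetical additions (not part of the PR), and the Prometheus install is left as a manual step because it follows the kube-prometheus quickstart rather than a single command. By default the script only prints what it would run.

```shell
#!/usr/bin/env bash
# Sketch of the local test steps from the PR description.
# Assumption: DRY_RUN defaults to 1, so commands are printed, not executed.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
CERT_MANAGER_URL="https://github.com/cert-manager/cert-manager/releases/download/v1.19.2/cert-manager.yaml"

# run prints the command in dry-run mode and executes it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

local_test_steps() {
  run kind create cluster
  # Prometheus: install manually via the kube-prometheus quickstart (not scripted here).
  run kubectl apply -f "$CERT_MANAGER_URL"
  run make deploy ENABLE_METRICS=true ENABLE_TLS=true
  # port-forward blocks until interrupted, so it stays last.
  run kubectl port-forward svc/grafana -n monitoring 3000
}

local_test_steps
```

Running it with `DRY_RUN=0` executes the commands against the current kubectl context.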

netlify · 2026-01-11T15:42:20Z

✅ Deploy Preview for node-readiness-controller canceled.

Name	Link
🔨 Latest commit	`28d1d94`
🔍 Latest deploy log	https://app.netlify.com/projects/node-readiness-controller/deploys/69797f0d7bfa820008a2672f

ajaysundark · 2026-01-14T08:29:46Z

 rules:
- nonResourceURLs:
-  - "/metrics"
+- apiGroups: [""]


Why does the metrics-reader RBAC need access to these resources?

If we do not grant these permissions, Prometheus is not able to configure pods as targets. Here are the logs from Prometheus; something similar happens for other resources:

time=2026-01-11T11:53:19.320Z level=ERROR source=reflector.go:205 msg="Failed to watch" component=k8s_client_runtime logger=UnhandledError err="failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"nrr-system\"" reflector=pkg/mod/k8s.io/client-go@v0.34.1/tools/cache/reflector.go:290 type=*v1.Pod

I see, this seems to be required for Prometheus to discover the controller and for creating the ServiceMonitor. Did you verify that nonResourceURLs: ["/metrics"] is no longer needed and that this role is sufficient for a scraper without it?

Yes, it works without that because Prometheus is hitting the controller pod's endpoint and not the kube-apiserver.

I find the existing nonResourceURLs: ["/metrics"] to be the case in most other projects.

@AvineshTripathi, have you seen examples of projects giving metrics-reader such wide permissions?

File - config/rbac/metrics_reader_role.yaml
(basically the one this ongoing thread is about)

Context here: #79 (comment). Basically, this RBAC is already provided by the kube-prometheus stack, so we should clean it out of the controller code since it's not needed.

got it, thanks for the context, @Priyankasaggu11929. This file comes from the default scaffolding of kubebuilder - https://github.com/kubernetes-sigs/kubebuilder/blob/ccb3bd37277f9e6db2cdca3a6b553994275db42e/pkg/plugins/common/kustomize/v2/scaffolds/internal/templates/config/rbac/metrics_reader_role.go#L41.

You are setting up authn/authz with controller-runtime's WithAuthenticationAndAuthorization feature. This seems to require this ClusterRole with the nonResourceURLs: /metrics permission. ref

more details here: http://book.kubebuilder.io/reference/metrics
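
One way to check which permissions a scraper identity actually holds is to ask the API server with `kubectl auth can-i` while impersonating it. A minimal sketch, assuming the `prometheus-k8s` ServiceAccount in the `monitoring` namespace from the log line above; `can_i_cmd` is a hypothetical helper that only prints the commands so they can be reviewed before running:

```shell
#!/usr/bin/env bash
set -euo pipefail

# can_i_cmd NAMESPACE SERVICEACCOUNT ARGS... prints a `kubectl auth can-i`
# command that impersonates the given ServiceAccount.
can_i_cmd() {
  local ns="$1" sa="$2"
  shift 2
  printf 'kubectl auth can-i %s --as=system:serviceaccount:%s:%s\n' "$*" "$ns" "$sa"
}

# Non-resource URL check: may this identity GET /metrics?
can_i_cmd monitoring prometheus-k8s get /metrics

# Resource check: the permission that was missing in the Prometheus error log.
can_i_cmd monitoring prometheus-k8s list pods --namespace nrr-system
```

Piping the output to `sh` runs the checks, which requires permission to impersonate the ServiceAccount. As far as I understand controller-runtime's filter, the first check mirrors the authorization it delegates to the API server for each scrape, which would explain why the rule can matter even though Prometheus talks to the controller pod directly.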

> You are setting up authn/authz with controller-runtime's WithAuthenticationAndAuthorization feature.

We already have that here:

node-readiness-controller/config/rbac/metrics_auth_role.yaml

Line 9 in f579d82

- tokenreviews

With respect to nonResourceURLs, I am still confused. My understanding is that this endpoint is an API server endpoint and not a controller endpoint.

kube-prometheus-stack takes care of it

I agree with @Priyankasaggu11929 here. Service discovery is provided by the Prometheus installation. For example, https://github.com/prometheus-community/helm-charts/blob/main/charts%2Fkube-prometheus-stack%2Ftemplates%2Fprometheus%2Fclusterrole.yaml#L13 sets up these permissions.

We should provide RBAC only for our controller's requirements.

ajaysundark

This looks great, thanks @AvineshTripathi!

I think your suggestion on adding a new overlay that explicitly enables cert-manager and prometheus will help us move faster and safer for adding new functionality.

Wdyt about moving cert-manager and prometheus patches to config/overlays/secure-monitoring or similar?

ajaysundark · 2026-01-15T07:58:32Z

+  namespace: nrr-system
+subjects:
+- kind: ServiceAccount
+  name: prometheus-k8s


This hardcodes the namespace / SA name. Is this standard? I think we could document this as well.

Looks like the best practice is to not provide the role-binding - https://book.kubebuilder.io/reference/metrics#granting-permissions-to-access-metrics.

I think we should not include this either, as we don't know the service account or namespace, and by providing a default role-binding we could also risk accidentally binding to the wrong or restricted environments. Moreover, the consumer may not even be using Prometheus (they could be using Datadog or other custom agents).

AvineshTripathi · 2026-01-24T04:20:36Z

> Putting prometheus/cert-manager into their own directories with respective kustomization files is LGTM to me as well (I saw other projects doing the same).
> Q: I don't see us adding any grafana dashboards right now in the PR, do you think we need to add grafana right now?

We can, but I feel the controller is doing its job by exporting the metrics. How users choose to visualize them (whether via Grafana or another tool) should be up to them.

> Also, Makefile changes, I think, we should make a new makefile target something like deploy-with-monitoring (and another one to clean it as well) and leave the existing deploy target as it is, to keep it simple.

I totally hear you on keeping the deploy target simple. The reason I consolidated the logic into build-manifests-temp was to avoid duplicating the ENABLE_METRICS=true/ENABLE_TLS=true logic needed for build-installer.

I’m happy to split these out into a specific deploy-with-monitoring target if you prefer, but I’d love to find a middle ground that doesn't leave us maintaining the same logic in two places. Do you have a specific pattern in mind for reusing those flags across different targets?

Priyankasaggu11929 · 2026-01-24T08:17:50Z

> We can, but I feel the controller is doing its job by exporting the metrics. How users choose to visualize them (whether via Grafana or another tool) should be up to them.

Correct. And sorry I didn't catch your decision - should we or should we not deal with grafana from controller side?

I was suggesting that we should not, to be clear. Unless we provide pre-made grafana specific dashboards.

> I totally hear you on keeping the deploy target simple. The reason I consolidated the logic into build-manifests-temp was to avoid duplicating the ENABLE_METRICS=true/ENABLE_TLS=true logic needed for build-installer.
>
> I’m happy to split these out into a specific deploy-with-monitoring target if you prefer, but I’d love to find a middle ground that doesn't leave us maintaining the same logic in two places. Do you have a specific pattern in mind for reusing those flags across different targets?

How about we just add a new deploy-with-monitoring target that in turn calls the existing deploy target, only with ENABLE_METRICS=true? (And then a similar one for cleaning it up?)

I'm still not sure what we should do about the ENABLE_TLS option. (Maybe another target? I prefer that we give an explicit option, but I leave ENABLE_TLS to you. Maybe we adopt a pattern from other projects?)
In fact, I'm thinking of another question: should we make the deploy target default to TLS enabled?

AvineshTripathi · 2026-01-24T09:25:43Z

> Correct. And sorry I didn't catch your decision - should we or should we not deal with grafana from controller side? I was suggesting that we should not, to be clear. Unless we provide pre-made grafana specific dashboards.

Yes, it looks like we are both in favor of not keeping it as part of the config.

> How about we just add a new deploy-with-monitoring target which in turn calls existing deploy target only but with ENABLE_METRICS=true? (and then similar for cleaning it up?)

To do that, we would have to make a recursive call, something like this:

deploy-with-monitoring:
	$(MAKE) deploy ENABLE_METRICS=true

I prefer using flags over multiple targets for deployment. It keeps the Makefile clean and avoids redundancy. Managing a separate target for every configuration variant (P&C) adds unnecessary overhead, whereas flags allow us to scale the command dynamically. But I may be biased here, and I am open to discussing cases where this approach would become a problem.

Priyankasaggu11929 · 2026-01-24T10:14:59Z

+  namespace: monitoring
+roleRef:
+  kind: ClusterRole
+  name: metrics-reader


Based on https://github.com/kubernetes-sigs/node-readiness-controller/pull/79/changes#r2723899802

ClusterRole reference will need an update here.

Do you think disabling it here (https://github.com/kubernetes-sigs/node-readiness-controller/pull/79/changes#diff-3fd064f2f1027a954b035a6d5885b96cc5b483f04889d14fa840ace5e9c29c55R20) would be better for now, just to keep a reference in the codebase to revisit if needed?

responded to above comment here - https://github.com/kubernetes-sigs/node-readiness-controller/pull/79/changes#r2724047539

Also -

Just for my understanding (I don't know what's the best way to structure Kustomization files for each component) - would it be better to move this file under config/prometheus/ just to keep all prometheus manifests in one place?

Will do once we decide whether we are keeping it or removing it.

Replied to this in the wrong discussion: could we not provide the role-binding? It risks users accidentally deploying the role into the wrong environment or namespace, as this file makes assumptions about the consumer.

Could we rather have this documented under 'how to install with secure metrics'?
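
If the binding is documented rather than shipped, the instructions could boil down to a single `kubectl create clusterrolebinding` call that each consumer fills in for their own scraper. A sketch under assumptions: the ClusterRole is named `metrics-reader` as in the diff above (a Kustomize name prefix may change it), the binding name, namespace, and ServiceAccount passed below are placeholders, and `bind_metrics_reader_cmd` is a hypothetical helper that only prints the command:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: the ClusterRole name after Kustomize has been applied.
CLUSTER_ROLE="${CLUSTER_ROLE:-metrics-reader}"

# bind_metrics_reader_cmd BINDING NAMESPACE SERVICEACCOUNT prints the command
# that grants the given scraper ServiceAccount the metrics-reader ClusterRole.
bind_metrics_reader_cmd() {
  local binding="$1" ns="$2" sa="$3"
  printf 'kubectl create clusterrolebinding %s --clusterrole=%s --serviceaccount=%s:%s\n' \
    "$binding" "$CLUSTER_ROLE" "$ns" "$sa"
}

# Placeholder values for a kube-prometheus install; other agents would differ.
bind_metrics_reader_cmd nrc-metrics-reader monitoring prometheus-k8s
```

Keeping the namespace and ServiceAccount as arguments leaves the choice with whoever installs the controller, which is the concern raised in this thread.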

Priyankasaggu11929 · 2026-01-24T11:01:22Z

> I prefer using flags over multiple targets for deployment. It keeps the Makefile clean and avoids redundancy.

My suggestion was more along the lines of providing quick make targets for at least the most common deployment option(s) we provide.

(And of course, I 100% agree that as our various config knobs grow, it should become the responsibility of our docs, not Makefile targets, to explain all those possible PnC(s).)

But right now, since we are providing Kustomization files to deploy the controller with a TLS-enabled metrics endpoint (which is good practice for prod deployments), I think having ready-to-use make targets (to deploy and clean) will help.

> in order to do that, we will have to make a recursive call something like this

We can avoid the recursive call
(ref: make target specific variables - https://www.gnu.org/software/make/manual/html_node/Target_002dspecific.html)
There's an example here - https://stackoverflow.com/a/26383350

So, in our case, that would become something like following (this much feels clean to me):

# ========================
# Deploy targets
# ========================

.PHONY: deploy
deploy: build-manifests-temp ## Deploy controller to the K8s cluster.
	$(KUBECTL) apply -f $(BUILD_DIR)/manifests.yaml

.PHONY: deploy-with-metrics
# Target-specific variables take one assignment per line; the value runs to the end of the line.
deploy-with-metrics: ENABLE_METRICS=true
deploy-with-metrics: deploy ## Deploy controller with metrics enabled.

.PHONY: deploy-with-metrics-and-tls
deploy-with-metrics-and-tls: ENABLE_METRICS=true
deploy-with-metrics-and-tls: ENABLE_TLS=true
deploy-with-metrics-and-tls: deploy ## Deploy controller with metrics and TLS enabled.


# ========================
# Undeploy targets
# ========================

.PHONY: undeploy
undeploy: build-manifests-temp ## Undeploy controller from the K8s cluster. 
	$(KUBECTL) delete --ignore-not-found=$(ignore-not-found) -f $(BUILD_DIR)/manifests.yaml

.PHONY: undeploy-with-metrics
undeploy-with-metrics: ENABLE_METRICS=true
undeploy-with-metrics: undeploy ## Undeploy controller with metrics enabled.

.PHONY: undeploy-with-metrics-and-tls
undeploy-with-metrics-and-tls: ENABLE_METRICS=true
undeploy-with-metrics-and-tls: ENABLE_TLS=true
undeploy-with-metrics-and-tls: undeploy ## Undeploy controller with metrics and TLS enabled.
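
To see how target-specific variables reach a prerequisite without a recursive `$(MAKE)` call, the pattern can be tried in a throwaway Makefile. This is a sketch, not the project's Makefile: the `deploy` recipe here only echoes the flags, the temporary directory is an assumption of the demo, and each target-specific assignment sits on its own line because GNU make reads everything after `=` as the value.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway directory for the demo Makefile; removed when the script exits.
demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT

# printf expands \t into the tab character that make requires before recipes.
{
  printf 'ENABLE_METRICS ?= false\nENABLE_TLS ?= false\n\n'
  printf '.PHONY: deploy deploy-with-metrics-and-tls\n'
  printf 'deploy:\n\t@echo "metrics=$(ENABLE_METRICS) tls=$(ENABLE_TLS)"\n\n'
  printf 'deploy-with-metrics-and-tls: ENABLE_METRICS=true\n'
  printf 'deploy-with-metrics-and-tls: ENABLE_TLS=true\n'
  printf 'deploy-with-metrics-and-tls: deploy\n'
} > "$demo_dir/Makefile"

# Plain deploy sees the defaults.
make -s -f "$demo_dir/Makefile" deploy
# The wrapper target passes its variables down to the deploy prerequisite.
make -s -f "$demo_dir/Makefile" deploy-with-metrics-and-tls
```

The first call prints `metrics=false tls=false` and the second prints `metrics=true tls=true`, since a target-specific value is also in effect for that target's prerequisites.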

I'm LGTM on the changes (after the few pending cleanup comments around the RBAC files).

These Makefile changes are not blocking from my side, so feel free to merge without them.

AvineshTripathi · 2026-01-24T12:33:57Z

@Priyankasaggu11929 I have implemented the Makefile changes (I had tried this before, but I was doing something terribly wrong; your refs helped me). PTAL, and thank you!

Priyankasaggu11929 · 2026-01-25T08:35:24Z

Thanks @AvineshTripathi.

LGTM on the Makefile changes.

ajaysundark · 2026-01-26T06:21:58Z

+  namespace: nrr-system
+subjects:
+- kind: ServiceAccount
+  name: prometheus-k8s


Looks like the best practice is to not provide the role-binding - https://book.kubebuilder.io/reference/metrics#granting-permissions-to-access-metrics.

I think we should not include this either, as we don't know the service account or namespace, and by providing a default role-binding we could also risk accidentally binding to the wrong or restricted environments. Moreover, the consumer may not even be using Prometheus (they could be using Datadog or other custom agents).

ajaysundark · 2026-01-28T06:53:30Z

@@ -1,3 +1,3 @@
 # Prometheus Monitor Service (Metrics)
 apiVersion: monitoring.coreos.com/v1
 kind: ServiceMonitor


IMO, this could also be excluded, as it adds a dependency on the Prometheus Operator. I expect the node-readiness-controller to be one of the first pieces a cluster admin installs during cluster bootstrap, and adding dependencies would make for a bad UX.

@ajaysundark this is also part of the default kubebuilder layout

ajaysundark · 2026-01-28T13:57:19Z

/lgtm
/approve

k8s-ci-robot · 2026-01-28T13:57:29Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ajaysundark, AvineshTripathi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [ajaysundark]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ajaysundark · 2026-01-29T13:52:03Z

/unhold

refactor: enable metrics manifests

3d5fe4d

Enabled Prometheus and Cert-Manager in the Kustomize configuration to support secure metrics scraping and certificate management. Signed-off-by: AvineshTripathi <avineshtripathi1@gmail.com>

k8s-ci-robot requested a review from dchen1107 January 11, 2026 15:42

k8s-ci-robot requested a review from tallclair January 11, 2026 15:42

AvineshTripathi marked this pull request as draft January 11, 2026 15:42

ajaysundark requested review from Karthik-K-N and sreeram-venkitesh and removed request for dchen1107 and tallclair January 12, 2026 07:23

Merge branch 'main' into feat/controller-metrics

50a3a97

k8s-ci-robot removed the needs-rebase label Jan 13, 2026

AvineshTripathi marked this pull request as ready for review January 13, 2026 05:58

k8s-ci-robot removed the do-not-merge/work-in-progress label Jan 13, 2026

k8s-ci-robot requested review from ajaysundark and haircommander January 13, 2026 05:58

ajaysundark reviewed Jan 14, 2026

ajaysundark reviewed Jan 15, 2026

Comment thread cmd/main.go

ajaysundark mentioned this pull request Jan 15, 2026

feat: Add detailed Prometheus metrics #6

Closed

k8s-ci-robot added size/XL and removed size/L labels Jan 17, 2026

feat(config): refactor metrics to use Kustomize components and dynami…

4b24a17

…c build directory Signed-off-by: AvineshTripathi <avineshtripathi1@gmail.com>

AvineshTripathi force-pushed the feat/controller-metrics branch from e388c40 to 4b24a17 Compare January 17, 2026 13:40

ajaysundark reviewed Jan 19, 2026

Comment thread Makefile

ajaysundark reviewed Jan 19, 2026

Comment thread Makefile Outdated

ajaysundark reviewed Jan 19, 2026

Comment thread Makefile Outdated

k8s-ci-robot added do-not-merge/hold, lgtm, and approved labels Jan 23, 2026

Priyankasaggu11929 reviewed Jan 24, 2026

feat(Makefile): add helper targets for metrics

a272040

Signed-off-by: AvineshTripathi <avineshtripathi1@gmail.com>

k8s-ci-robot removed the lgtm label Jan 24, 2026

ajaysundark requested changes Jan 26, 2026

k8s-ci-robot removed the approved label Jan 26, 2026

ajaysundark reviewed Jan 26, 2026

Comment thread config/prometheus/kustomization.yaml Outdated

feat(config): remove clusterscope bindings from the config

28d1d94

Signed-off-by: AvineshTripathi <avineshtripathi1@gmail.com>

k8s-ci-robot added size/L and removed size/XL labels Jan 28, 2026

ajaysundark reviewed Jan 28, 2026

Comment thread config/certmanager/certificate.yaml

ajaysundark reviewed Jan 28, 2026

k8s-ci-robot added the lgtm label Jan 28, 2026

k8s-ci-robot added the approved label Jan 28, 2026

k8s-ci-robot removed the do-not-merge/hold label Jan 29, 2026

k8s-ci-robot merged commit 13a3729 into kubernetes-sigs:main Jan 29, 2026
9 checks passed

This was referenced Feb 18, 2026

REQUEST: New membership for @AvineshTripathi kubernetes/org#6145

Closed

Release v0.2.0 #136

Closed

OneUpWallStreet mentioned this pull request Mar 2, 2026

doc: installing node-readiness-controller with metrics support #94

Closed
