refactor: enable metrics manifests #79

Merged
k8s-ci-robot merged 10 commits into kubernetes-sigs:main from AvineshTripathi:feat/controller-metrics on Jan 29, 2026

Contributor

@AvineshTripathi AvineshTripathi commented Jan 11, 2026

Enabled Prometheus and Cert-Manager in the Kustomize configuration to support secure metrics scraping and certificate management.

How to test locally:

  1. Create a kind cluster (or use any existing cluster).
  2. Install Prometheus (I used the https://github.com/prometheus-operator/kube-prometheus quickstart).
  3. Install cert-manager and its CRDs (kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.19.2/cert-manager.yaml).
  4. Run make deploy ENABLE_METRICS=true ENABLE_TLS=true.
  5. Expose Grafana (kubectl port-forward svc/grafana -n monitoring 3000).
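
The five steps can be strung together as a script. This is only a sketch: the `KUBE_PROMETHEUS_DIR` and `NRC_DIR` paths are assumptions about where the kube-prometheus checkout and this repository live, the `kubectl wait` on cert-manager is an extra step not listed above, and by default the script only prints the commands (set `DRY_RUN=0` to run them against a real cluster).

```shell
#!/bin/sh
# Sketch of the local test steps. Prints commands unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
# Assumed locations (hypothetical defaults): a kube-prometheus checkout and this repo.
KUBE_PROMETHEUS_DIR="${KUBE_PROMETHEUS_DIR:-./kube-prometheus}"
NRC_DIR="${NRC_DIR:-.}"
CERT_MANAGER_URL="https://github.com/cert-manager/cert-manager/releases/download/v1.19.2/cert-manager.yaml"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@" || { echo "step failed: $*" >&2; exit 1; }
  fi
}

local_test_steps() {
  # 1. Throwaway cluster.
  run kind create cluster
  # 2. kube-prometheus quickstart, applied from a local checkout.
  run kubectl apply --server-side -f "$KUBE_PROMETHEUS_DIR/manifests/setup"
  run kubectl wait --for=condition=Established --all CustomResourceDefinition --namespace=monitoring
  run kubectl apply -f "$KUBE_PROMETHEUS_DIR/manifests/"
  # 3. cert-manager, then wait for its deployments (the wait is an addition).
  run kubectl apply -f "$CERT_MANAGER_URL"
  run kubectl wait --for=condition=Available deployment --all -n cert-manager --timeout=180s
  # 4. Deploy the controller with metrics and TLS enabled.
  run make -C "$NRC_DIR" deploy ENABLE_METRICS=true ENABLE_TLS=true
  # 5. Expose Grafana on localhost:3000 (blocks while forwarding).
  run kubectl port-forward svc/grafana -n monitoring 3000
}

local_test_steps
```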

Enabled Prometheus and Cert-Manager in the Kustomize configuration to support secure metrics scraping and certificate management.

Signed-off-by: AvineshTripathi <avineshtripathi1@gmail.com>

netlify Bot commented Jan 11, 2026

Deploy Preview for node-readiness-controller canceled.

🔨 Latest commit: 28d1d94
🔍 Latest deploy log: https://app.netlify.com/projects/node-readiness-controller/deploys/69797f0d7bfa820008a2672f

@AvineshTripathi marked this pull request as draft January 11, 2026 15:42
@k8s-ci-robot added the needs-rebase, do-not-merge/work-in-progress, size/L, and cncf-cla: yes labels Jan 11, 2026
@ajaysundark requested review from Karthik-K-N and sreeram-venkitesh and removed request for dchen1107 and tallclair January 12, 2026 07:23
@k8s-ci-robot removed the needs-rebase label Jan 13, 2026
@AvineshTripathi marked this pull request as ready for review January 13, 2026 05:58
@k8s-ci-robot removed the do-not-merge/work-in-progress label Jan 13, 2026
Comment thread config/rbac/metrics_reader_role.yaml Outdated
rules:
- nonResourceURLs:
- "/metrics"
- apiGroups: [""]
Contributor

Why does the metrics-reader RBAC need access to these resources?

Contributor Author

If we do not grant these permissions, Prometheus cannot configure the pods as targets. Here are the logs from Prometheus; something similar happens for the other resources:

time=2026-01-11T11:53:19.320Z level=ERROR source=reflector.go:205 msg="Failed to watch" component=k8s_client_runtime logger=UnhandledError err="failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"nrr-system\"" reflector=pkg/mod/k8s.io/client-go@v0.34.1/tools/cache/reflector.go:290 type=*v1.Pod
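
The forbidden error in this log can be checked without reading Prometheus logs, using an impersonated `kubectl auth can-i`. A minimal sketch: the service account and namespace come from the log line, while the `services` and `endpoints` entries are an assumption about which other resources Prometheus service discovery reads. By default it only prints the commands (set `DRY_RUN=0` to query a real cluster).

```shell
#!/bin/sh
# Sketch: ask the API server whether the scraper may list discovery resources.
DRY_RUN="${DRY_RUN:-1}"
SCRAPER="system:serviceaccount:monitoring:prometheus-k8s"  # from the log line
TARGET_NS="nrr-system"                                     # from the log line

can_i() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ kubectl auth can-i $* --as=$SCRAPER"
  else
    # Prints yes/no; a "no" exits non-zero, which is not an error here.
    kubectl auth can-i "$@" --as="$SCRAPER" || true
  fi
}

# pods is confirmed by the log; services and endpoints are assumed.
for resource in pods services endpoints; do
  can_i list "$resource" --namespace "$TARGET_NS"
done
```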

Contributor

I see, this seems to be required for Prometheus to discover the controller and to create the ServiceMonitor. Did you verify that nonResourceURLs: ["/metrics"] is no longer needed and that this role is sufficient for a scraper without it?

Contributor Author

Yes, it works without that because Prometheus is hitting the controller pod's API and not the kube-apiserver.

Member

I find the existing nonResourceURLs: ["/metrics"] rule to be what most other projects use.

@AvineshTripathi, have you seen examples of projects giving metrics-reader such wide permissions?

Member

File: config/rbac/metrics_reader_role.yaml
(basically the one this ongoing thread is about)

Context here: #79 (comment). Basically, this RBAC is already provided by the kube-prometheus stack, so we should clean it up on the controller side since it's not needed.

Contributor

Contributor

You are setting up authn/authz with controller-runtime's WithAuthenticationAndAuthorization feature. This seems to require this ClusterRole with the nonResourceURLs: /metrics permission (ref).

More details here: http://book.kubebuilder.io/reference/metrics
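
As I read the kubebuilder metrics reference, the authn/authz filter on the controller's metrics endpoint delegates to the API server (a TokenReview for the bearer token, then a SubjectAccessReview for the request path), which is why an RBAC rule on the `/metrics` non-resource URL matters even though the scrape goes to the controller pod. A sketch of the role and a quick check follows; the role name mirrors the `metrics-reader` name in this PR, the scraper identity is an assumption taken from the kube-prometheus defaults, and the check stays "no" until some binding grants the role.

```shell
#!/bin/sh
# Sketch: the scaffold-style metrics-reader ClusterRole plus an access check.
# Prints the manifest and command unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
SCRAPER="system:serviceaccount:monitoring:prometheus-k8s"  # assumed scraper identity

metrics_reader_role() {
  cat <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metrics-reader
rules:
- nonResourceURLs:
  - "/metrics"
  verbs:
  - get
EOF
}

if [ "$DRY_RUN" = "1" ]; then
  metrics_reader_role
  echo "+ kubectl auth can-i get /metrics --as=$SCRAPER"
else
  metrics_reader_role | kubectl apply -f -
  # Answers "no" until the role is bound to the scraper's service account.
  kubectl auth can-i get /metrics --as="$SCRAPER" || true
fi
```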

Contributor Author

> You are setting up authn/authz with controller-runtime's WithAuthenticationAndAuthorization feature.

We already have that here.

With regard to nonResourceURLs, I am still confused. My understanding is that this endpoint is an API server endpoint and not a controller endpoint.

Contributor

> kube-prometheus-stack takes care of it

I agree with @Priyankasaggu11929 here. Service discovery permissions are provided by the Prometheus installation. For example, https://github.com/prometheus-community/helm-charts/blob/main/charts%2Fkube-prometheus-stack%2Ftemplates%2Fprometheus%2Fclusterrole.yaml#L13 sets up these permissions.

We should provide RBAC only for our controller's own requirements.

Contributor

@ajaysundark left a comment

This looks great, thanks @AvineshTripathi!

I think your suggestion on adding a new overlay that explicitly enables cert-manager and prometheus will help us move faster and safer for adding new functionality.

Wdyt about moving cert-manager and prometheus patches to config/overlays/secure-monitoring or similar?

Comment thread internal/metrics/metrics.go Outdated
Comment thread config/default/kustomization.yaml Outdated
Comment thread config/default/kustomization.yaml Outdated
namespace: nrr-system
subjects:
- kind: ServiceAccount
name: prometheus-k8s
Contributor

This hardcodes the namespace / SA name. Is this standard? I think we could document this as well.

Contributor

Looks like the best practice is to not provide the role-binding: https://book.kubebuilder.io/reference/metrics#granting-permissions-to-access-metrics.

I think we should not include this either, as we don't know the service account or namespace, and by providing a default role-binding we also risk accidentally binding in the wrong or restricted environments. Moreover, the consumer may not even be using Prometheus (they could be using Datadog or other custom agents).

Comment thread cmd/main.go
@k8s-ci-robot added the size/XL label and removed the size/L label Jan 17, 2026
…c build directory

Signed-off-by: AvineshTripathi <avineshtripathi1@gmail.com>
@AvineshTripathi AvineshTripathi force-pushed the feat/controller-metrics branch from e388c40 to 4b24a17 Compare January 17, 2026 13:40
Comment thread Makefile
Comment thread Makefile Outdated
Comment thread Makefile Outdated
@k8s-ci-robot added the do-not-merge/hold, lgtm, and approved labels Jan 23, 2026
@AvineshTripathi
Contributor Author

> Putting prometheus/cert-manager into their own directories with respective kustomization files is LGTM to me as well (I saw other projects doing the same).
> Q: I don't see us adding any Grafana dashboards right now in the PR, do you think we need to add Grafana right now?

We can, but I feel the controller is doing its job by exporting the metrics. How users choose to visualize them (whether via Grafana or another tool) should be up to them.

> Also, for the Makefile changes, I think we should add a new Makefile target, something like deploy-with-monitoring (and another one to clean it up as well), and leave the existing deploy target as it is, to keep it simple.

I totally hear you on keeping the deploy target simple. The reason I consolidated the logic into build-manifests-temp was to avoid duplicating the ENABLE_METRICS=true/ENABLE_TLS=true logic needed for build-installer.

I’m happy to split these out into a specific deploy-with-monitoring target if you prefer, but I’d love to find a middle ground that doesn't leave us maintaining the same logic in two places. Do you have a specific pattern in mind for reusing those flags across different targets?

@Priyankasaggu11929
Member

Priyankasaggu11929 commented Jan 24, 2026

> We can, but I feel the controller is doing its job by exporting the metrics. How users choose to visualize them (whether via Grafana or another tool) should be up to them.

Correct. And sorry, I didn't catch your decision: should we or should we not deal with Grafana from the controller side?

To be clear, I was suggesting that we should not, unless we provide pre-made Grafana-specific dashboards.

> I totally hear you on keeping the deploy target simple. The reason I consolidated the logic into build-manifests-temp was to avoid duplicating the ENABLE_METRICS=true/ENABLE_TLS=true logic needed for build-installer.
>
> I’m happy to split these out into a specific deploy-with-monitoring target if you prefer, but I’d love to find a middle ground that doesn't leave us maintaining the same logic in two places. Do you have a specific pattern in mind for reusing those flags across different targets?

How about we just add a new deploy-with-monitoring target which in turn calls the existing deploy target, but with ENABLE_METRICS=true? (And then something similar for cleaning it up?)

I'm still not sure what we should do about the ENABLE_TLS option. (Maybe another target? I prefer that we give an explicit option, but I leave ENABLE_TLS to you. Maybe we adopt a pattern from other projects?)
In fact, another question comes to mind: should we make the deploy target default to TLS enabled?

@AvineshTripathi
Contributor Author

> Correct. And sorry I didn't catch your decision - should we or should we not deal with grafana from controller side? I was suggesting that we should not, to be clear. Unless we provide pre-made grafana specific dashboards.

Yes, looks like we are both in favor of not keeping it as part of the config.

> How about we just add a new deploy-with-monitoring target which in turn calls existing deploy target only but with ENABLE_METRICS=true? (and then similar for cleaning it up?)

To do that, we would have to make a recursive call, something like this:

deploy-with-monitoring:
	$(MAKE) deploy ENABLE_METRICS=true

I prefer using flags over multiple targets for deployment. It keeps the Makefile clean and avoids redundancy. Managing a separate target for every configuration variant (permutations and combinations) adds unnecessary overhead, whereas flags let us scale the command dynamically. But I may be biased here; I'm open to discussing cases where this approach would become a problem.

namespace: monitoring
roleRef:
kind: ClusterRole
name: metrics-reader
Member

Contributor Author

Do you think disabling it here (https://github.com/kubernetes-sigs/node-readiness-controller/pull/79/changes#diff-3fd064f2f1027a954b035a6d5885b96cc5b483f04889d14fa840ace5e9c29c55R20) would be better for now, just to keep a reference in the codebase that we can revisit if needed?

Member

responded to above comment here - https://github.com/kubernetes-sigs/node-readiness-controller/pull/79/changes#r2724047539


Also -

Just for my understanding (I don't know what's the best way to structure Kustomization files for each component) - would it be better to move this file under config/prometheus/ just to keep all prometheus manifests in one place?

Contributor Author

Will do once we decide whether we are keeping it or removing it.

Contributor

I replied to this in the wrong discussion: could we not provide the role-binding? It risks a user accidentally deploying the role into the wrong environment or namespace, as this file makes assumptions about the consumer.

Contributor

@ajaysundark Jan 26, 2026

Could we instead document this under 'how to install with secure metrics'?
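
A documented alternative to shipping the binding could look like the sketch below, modeled on the `kubectl create clusterrolebinding` command in the kubebuilder metrics reference. The binding name is hypothetical, the `metrics-reader` role name may carry a kustomize name prefix in a real install, and the namespace and service account are supplied by the user rather than assumed. It only prints the command unless `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch: let the user bind metrics-reader to their own scraper identity.
DRY_RUN="${DRY_RUN:-1}"

bind_metrics_reader() {
  # $1 = namespace of the scraper, $2 = its service account name
  scraper_ns="$1"
  scraper_sa="$2"
  # Binding name is a hypothetical convention; role name may be prefixed.
  set -- kubectl create clusterrolebinding "metrics-reader-${scraper_ns}-${scraper_sa}" \
    --clusterrole=metrics-reader \
    --serviceaccount="${scraper_ns}:${scraper_sa}"
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Example for a kube-prometheus install; other agents would pass their own values.
bind_metrics_reader monitoring prometheus-k8s
```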

@Priyankasaggu11929
Member

> I prefer using flags over multiple targets for deployment. It keeps the Makefile clean and avoids redundancy.

My suggestion was more along the lines of providing quick make targets for at least the most common deployment option(s) we provide.

(And of course, I 100% agree that as our various config knobs grow, it should become the responsibility of our docs, not Makefile targets, to explain all the possible permutations and combinations.)

But right now, since we are providing Kustomization files to deploy the controller with a TLS-enabled metrics endpoint (which is good practice for prod deployments), I think having ready-to-use make targets (to deploy and clean up) will help.

> in order to do that, we will have to make a recursive call something like this

We can avoid the recursive call
(ref: make target-specific variables - https://www.gnu.org/software/make/manual/html_node/Target_002dspecific.html)
There's an example here - https://stackoverflow.com/a/26383350

So, in our case, that would become something like the following (this much feels clean to me):

# ========================
# Deploy targets
# ========================

.PHONY: deploy
deploy: build-manifests-temp ## Deploy controller to the K8s cluster.
	$(KUBECTL) apply -f $(BUILD_DIR)/manifests.yaml

# Target-specific variables: one assignment per line, help text on the rule line.
.PHONY: deploy-with-metrics
deploy-with-metrics: ENABLE_METRICS=true
deploy-with-metrics: deploy ## Deploy controller with metrics enabled.

.PHONY: deploy-with-metrics-and-tls
deploy-with-metrics-and-tls: ENABLE_METRICS=true
deploy-with-metrics-and-tls: ENABLE_TLS=true
deploy-with-metrics-and-tls: deploy ## Deploy controller with metrics and TLS enabled.


# ========================
# Undeploy targets
# ========================

.PHONY: undeploy
undeploy: build-manifests-temp ## Undeploy controller from the K8s cluster. 
	$(KUBECTL) delete --ignore-not-found=$(ignore-not-found) -f $(BUILD_DIR)/manifests.yaml

.PHONY: undeploy-with-metrics
undeploy-with-metrics: ENABLE_METRICS=true
undeploy-with-metrics: undeploy ## Undeploy controller with metrics enabled.

.PHONY: undeploy-with-metrics-and-tls
undeploy-with-metrics-and-tls: ENABLE_METRICS=true
undeploy-with-metrics-and-tls: ENABLE_TLS=true
undeploy-with-metrics-and-tls: undeploy ## Undeploy controller with metrics and TLS enabled.

The changes LGTM (once the few pending cleanup comments around the RBAC files are addressed).

These Makefile changes are not blocking from my side, so feel free to merge without them.
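
The target-specific variable pattern can be tried in isolation with a throwaway Makefile. This is a sketch for GNU make, not the project's Makefile: the `deploy` recipe here only echoes the flags, and each target-specific assignment sits on its own line because make treats everything after the `=` on such a line as the value.

```shell
#!/bin/sh
# Sketch: demonstrate GNU make target-specific variables with a temporary Makefile.
# printf is used so the recipe line gets the literal tab that make requires.
demo_dir="$(mktemp -d)"
{
  printf '%s\n' \
    'ENABLE_METRICS ?= false' \
    'ENABLE_TLS ?= false' \
    '' \
    '.PHONY: deploy' \
    'deploy:'
  printf '\t@echo "metrics=$(ENABLE_METRICS) tls=$(ENABLE_TLS)"\n'
  printf '%s\n' \
    '' \
    '.PHONY: deploy-with-metrics-and-tls' \
    'deploy-with-metrics-and-tls: ENABLE_METRICS=true' \
    'deploy-with-metrics-and-tls: ENABLE_TLS=true' \
    'deploy-with-metrics-and-tls: deploy'
} > "$demo_dir/Makefile"

demo_make() {
  make --no-print-directory -C "$demo_dir" "$@"
}

if command -v make >/dev/null 2>&1; then
  demo_make deploy                        # flags keep their defaults
  demo_make deploy-with-metrics-and-tls   # flags are inherited by the deploy prerequisite
else
  echo "make not found; skipping demo" >&2
fi
```

One caveat worth checking against the real Makefile: target-specific values only apply while recipes for that target and its prerequisites run, so a top-level `ifeq ($(ENABLE_METRICS),true)` block, which is evaluated when the Makefile is parsed, would not see them.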

Signed-off-by: AvineshTripathi <avineshtripathi1@gmail.com>
@k8s-ci-robot removed the lgtm label Jan 24, 2026
@AvineshTripathi
Contributor Author

AvineshTripathi commented Jan 24, 2026

@Priyankasaggu11929 I have implemented the Makefile changes (I had tried this before, but I was doing something terribly wrong; your refs helped me). PTAL, and thank you!

@Priyankasaggu11929
Member

Thanks @AvineshTripathi.

LGTM on the Makefile changes.

@k8s-ci-robot removed the approved label Jan 26, 2026
Comment thread config/prometheus/kustomization.yaml Outdated
Signed-off-by: AvineshTripathi <avineshtripathi1@gmail.com>
@k8s-ci-robot added the size/L label and removed the size/XL label Jan 28, 2026
Comment thread config/certmanager/certificate.yaml
@@ -1,3 +1,3 @@
# Prometheus Monitor Service (Metrics)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
Contributor

@ajaysundark Jan 28, 2026

IMO, this could also be excluded, as it adds a dependency on the Prometheus Operator. I expect the node-readiness-controller to be one of the first pieces a cluster admin installs during cluster bootstrap, and adding dependencies would make for a bad UX.
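
One way to soften that dependency is to probe for the ServiceMonitor CRD before choosing how to deploy. A rough sketch, assuming that `ENABLE_METRICS=true` is the flag that pulls in the ServiceMonitor manifest; the helper names are hypothetical, and the script only prints the chosen command.

```shell
#!/bin/sh
# Sketch: only enable the metrics manifests when the Prometheus Operator CRD exists.
has_servicemonitor_crd() {
  # servicemonitors.monitoring.coreos.com is the CRD installed by the Prometheus Operator.
  kubectl get crd servicemonitors.monitoring.coreos.com >/dev/null 2>&1
}

deploy_command() {
  if has_servicemonitor_crd; then
    echo "make deploy ENABLE_METRICS=true"
  else
    echo "make deploy"
  fi
}

# Prints the command that would be appropriate for the current cluster.
deploy_command
```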

Contributor Author

@ajaysundark this is also part of the default kubebuilder layout

@ajaysundark
Contributor

/lgtm
/approve

@k8s-ci-robot added the lgtm label Jan 28, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ajaysundark, AvineshTripathi


@k8s-ci-robot added the approved label Jan 28, 2026
@ajaysundark
Contributor

/unhold

@k8s-ci-robot removed the do-not-merge/hold label Jan 29, 2026
@k8s-ci-robot k8s-ci-robot merged commit 13a3729 into kubernetes-sigs:main Jan 29, 2026
9 checks passed