refactor: enable metrics manifests #79
Conversation
Enabled Prometheus and Cert-Manager in the Kustomize configuration to support secure metrics scraping and certificate management. Signed-off-by: AvineshTripathi <avineshtripathi1@gmail.com>
```yaml
rules:
- nonResourceURLs:
  - "/metrics"
- apiGroups: [""]
```
why does metrics-reader RBAC need these resource access?
If we do not give these permissions, Prometheus is not able to configure pods as targets. Here are the logs from Prometheus; something similar happens for other resources:
time=2026-01-11T11:53:19.320Z level=ERROR source=reflector.go:205 msg="Failed to watch" component=k8s_client_runtime logger=UnhandledError err="failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list resource \"pods\" in API group \"\" in the namespace \"nrr-system\"" reflector=pkg/mod/k8s.io/client-go@v0.34.1/tools/cache/reflector.go:290 type=*v1.Pod
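For reference, a sketch of the discovery permissions Prometheus needs to list pods as targets — resource names here mirror the kube-prometheus-stack ClusterRole linked later in this thread, not this repo's actual manifest:

```yaml
# Illustrative only: mirrors the kube-prometheus-stack ClusterRole shape.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups: [""]
  resources: ["pods", "services", "endpoints"]
  verbs: ["get", "list", "watch"]
```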
I see, this seems to be required for Prometheus to discover the controller and create the ServiceMonitor. Did you verify that nonResourceURLs: ["/metrics"] is no longer needed and this role is sufficient for a scraper without it?
Yes, it works without that, because Prometheus is hitting the controller pod API and not the kube API server.
I find the existing nonResourceURLs: ["/metrics"] to be the case in most other projects.
@AvineshTripathi, have you seen examples of projects granting metrics-reader these wide permissions?
File - config/rbac/metrics_reader_role.yaml
(basically the one this ongoing thread is about)
context here - #79 (comment) - basically this RBAC is already provided by the kube-prometheus stack, so we should clean it from the controller code since it's not needed.
got it, thanks for the context, @Priyankasaggu11929. This file comes from the default scaffolding of kubebuilder - https://github.com/kubernetes-sigs/kubebuilder/blob/ccb3bd37277f9e6db2cdca3a6b553994275db42e/pkg/plugins/common/kustomize/v2/scaffolds/internal/templates/config/rbac/metrics_reader_role.go#L41.
You are setting up authn/authz with controller-runtime's WithAuthenticationAndAuthorization feature. This seems to require this ClusterRole with nonResourceURLs: /metrics permissions. ref
more details here: http://book.kubebuilder.io/reference/metrics
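For context, the default metrics-reader ClusterRole that kubebuilder scaffolds (the file this thread is about) is roughly the following — reproduced from memory of the scaffold, so treat the exact metadata as approximate:

```yaml
# Approximate shape of kubebuilder's scaffolded metrics_reader_role.yaml.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metrics-reader
rules:
- nonResourceURLs:
  - "/metrics"
  verbs:
  - get
```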
You are setting up authn/authz with controller-runtime's WithAuthenticationAndAuthorization feature.
we already have that here
wrt nonResourceURLs I am still confused. My understanding of that endpoint is that it is an API-server endpoint and not a controller endpoint.
kube-prometheus-stack takes care of it.
I agree with @Priyankasaggu11929 here. Service discovery is provided by the Prometheus installation. For example, https://github.com/prometheus-community/helm-charts/blob/main/charts%2Fkube-prometheus-stack%2Ftemplates%2Fprometheus%2Fclusterrole.yaml#L13 sets up these permissions.
We should provide only the RBAC our controller requires.
ajaysundark left a comment
This looks great, thanks @AvineshTripathi!
I think your suggestion of adding a new overlay that explicitly enables cert-manager and prometheus will help us move faster and more safely when adding new functionality.
Wdyt about moving cert-manager and prometheus patches to config/overlays/secure-monitoring or similar?
```yaml
namespace: nrr-system
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
```
This hardcodes the namespace / SA name. Is this standard? I think we could document this as well.
Looks like the best practice is to not provide the role-binding - https://book.kubebuilder.io/reference/metrics#granting-permissions-to-access-metrics.
I think we should prefer not to include this either, as we don't know the service account or namespace, and by providing a default role-binding we could also risk accidentally binding to the wrong / restricted environments. Moreover, the consumer may not even be using Prometheus (they could be using Datadog or other custom agents as well).
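Under that approach, the binding would be left to the consumer, along the lines of the kubebuilder book's guidance. A sketch of what a consumer would apply themselves — `<scraper-sa>` and `<scraper-ns>` are placeholders for their environment, not defaults we ship:

```yaml
# Sketch of a consumer-created binding; the placeholders below must be
# replaced with the scraper's actual ServiceAccount and namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metrics-reader-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: metrics-reader
subjects:
- kind: ServiceAccount
  name: <scraper-sa>
  namespace: <scraper-ns>
```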
…c build directory Signed-off-by: AvineshTripathi <avineshtripathi1@gmail.com>
force-pushed from e388c40 to 4b24a17
We can, but I feel the controller is doing its job by exporting the metrics. How users choose to visualize them (whether via Grafana or another tool) should be up to them.
I totally hear you on keeping the deploy target simple. The reason I consolidated the logic into … I'm happy to split these out into a specific ….
Correct. And sorry I didn't catch your decision - should we or should we not deal with Grafana from the controller side? I was suggesting that we should not, to be clear, unless we provide pre-made Grafana-specific dashboards.
How about we just add a new …? I'm still not sure what we should do for ….
yes, looks like we both are in favor of not keeping it as part of the config
In order to do that, we will have to make a recursive call, something like this: …
I prefer using flags over multiple targets for deployment. It keeps the Makefile clean and avoids redundancy. Managing a separate target for every configuration variant (P&C) adds unnecessary overhead, whereas flags allow us to scale the command dynamically. But I may be biased here; open to discussing cases where this approach would become a problem.
```yaml
namespace: monitoring
roleRef:
  kind: ClusterRole
  name: metrics-reader
```
ClusterRole reference will need an update here.
do you think disabling it here (https://github.com/kubernetes-sigs/node-readiness-controller/pull/79/changes#diff-3fd064f2f1027a954b035a6d5885b96cc5b483f04889d14fa840ace5e9c29c55R20) would be better for now, just to keep some reference in the codebase to revisit if needed?
responded to above comment here - https://github.com/kubernetes-sigs/node-readiness-controller/pull/79/changes#r2724047539
Also -
Just for my understanding (I don't know what's the best way to structure Kustomization files for each component) - would it be better to move this file under config/prometheus/ just to keep all prometheus manifests in one place?
will do once we decide whether we are keeping or removing it
replied to this in the wrong discussion - could we not provide the role-binding? This risks the user accidentally deploying the role into a wrong environment / namespace, as this file makes assumptions about the consumer.
could we rather have this documented on 'how to install with secure metrics'?
My suggestion was more along the lines of providing quick make targets for at least the most common deployment option(s) we provide. (And of course, I 100% agree that as our various config knobs grow, it should become the responsibility of our docs to explain all those possible PnCs, not Makefile targets.) But right now, since we are providing Kustomization files to deploy the controller with a TLS-enabled metrics endpoint (which is good practice for a prod deployment), I think having ready-to-use make targets (to deploy and clean) will help.
We can avoid the recursive call. So, in our case, that would become something like the following (this much feels clean to me): …
I'm LGTM on the changes (after the few pending cleanup comments around the RBAC files). These Makefile changes are not blocking from my side, so feel free to merge without them.
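A minimal sketch of the flag-based approach under discussion. The `ENABLE_METRICS` / `ENABLE_TLS` variable names come from the local-testing command in the PR description; the overlay paths and `$(KUSTOMIZE)` / `$(KUBECTL)` variables are assumptions, not the repo's actual layout:

```make
# Illustrative sketch only: flag names taken from the PR description's
# test command; overlay paths are assumed, not verified against the repo.
ENABLE_METRICS ?= false
ENABLE_TLS ?= false

# Pick the kustomize directory from the flags instead of recursing into
# `make` with a different target.
KUSTOMIZE_DIR := config/default
ifeq ($(ENABLE_METRICS)$(ENABLE_TLS),truetrue)
  KUSTOMIZE_DIR := config/overlays/secure-monitoring
endif

.PHONY: deploy
deploy: manifests kustomize
	$(KUSTOMIZE) build $(KUSTOMIZE_DIR) | $(KUBECTL) apply -f -
```

Invoked as in the PR description: `make deploy ENABLE_METRICS=true ENABLE_TLS=true`.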
Signed-off-by: AvineshTripathi <avineshtripathi1@gmail.com>
@Priyankasaggu11929 I have implemented the changes related to the Makefile (I tried it, but I was doing something terribly wrong; your refs helped me). PTAL, and thank you!
Thanks @AvineshTripathi. LGTM on the Makefile changes.
Signed-off-by: AvineshTripathi <avineshtripathi1@gmail.com>
```diff
@@ -1,3 +1,3 @@
 # Prometheus Monitor Service (Metrics)
 apiVersion: monitoring.coreos.com/v1
 kind: ServiceMonitor
```
IMO, this could also be excluded, as it adds a dependency on the Prometheus operator. I expect the node-readiness-controller to be one of the initial pieces the cluster admin installs during cluster bootstrap, and adding dependencies will make for a bad UX.
@ajaysundark this is also part of the default kubebuilder layout
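For context, the kubebuilder-scaffolded ServiceMonitor under discussion is roughly the following — the port, namespace, and selector values shown are the scaffold's defaults from memory, not verified against this repo:

```yaml
# Approximate shape of the default kubebuilder ServiceMonitor scaffold;
# port/selector/namespace values are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: controller-manager-metrics-monitor
  namespace: system
spec:
  endpoints:
  - path: /metrics
    port: https
    scheme: https
  selector:
    matchLabels:
      control-plane: controller-manager
```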
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ajaysundark, AvineshTripathi
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files: …
Approvers can indicate their approval by writing …
/unhold
Enabled Prometheus and Cert-Manager in the Kustomize configuration to support secure metrics scraping and certificate management.
how to test locally:

```shell
make deploy ENABLE_METRICS=true ENABLE_TLS=true
```