Merged
4 changes: 3 additions & 1 deletion internal/runtime/client/client.go
```diff
@@ -518,7 +518,9 @@ func httpCall(ctx context.Context, request, response runtime.Object, opts *httpC
 	}

 	// Use client-go's transport.TLSConfigureFor to ensure good defaults for tls
-	client := http.DefaultClient
+	client := &http.Client{}
+	defer client.CloseIdleConnections()
```
Comment on lines +521 to +522

@fabriziopandini (Member), Nov 27, 2025

@jimmidyson With this change we move away from a single, long-lived http.Client and start using many short-lived http.Clients, which, as far as I understand, means giving up all the optimisations that exist inside the Go HTTP client (and the defer client.CloseIdleConnections() makes this very explicit).

I need some time to dig more into the details, but I was wondering what your take on this is?

@jimmidyson (Member, Author)

Honestly, I'm not too worried about it because, from what I can see, this client is not that heavily used.

However, I can take a look and try to provide a custom transport that dynamically configures the TLS config, rather than creating a new client each time.

@jimmidyson (Member, Author), Nov 27, 2025

@fabriziopandini Pushed #13064 as an alternative implementation that fixes the same issue without creating a new http.Client every call, while avoiding reconfiguring the global http.DefaultClient.

Member

> from what I can see this client is not that heavily used

This client is used by all the runtime extension calls; it is true that we did a few optimisations to reduce the number of calls, but it is definitely used in a few controllers now.

> However I can take a look and try to provide a custom transport instead that dynamically configures the TLSConfig instead of creating a new client each time.

Thanks. I will try to discuss with the other maintainers what the best solution is here.

However, I also think that we should split the discussion into two parts:

  • what to do with the change that was already backported (we are past rc.0 and we should not take any risk)
  • what to do on main

Member

If #13064 is applied on top of all branches, it will be exactly the same as before performance-wise, but without the TLS config race.

And yes, I'd find it highly unlikely that a new client for each call would kick any bee hives.

@jimmidyson (Member, Author)

We need to fix the race one way or another. I can understand being rush-averse at this stage of an RC, but we need to fix it.

Member

@jimmidyson BTW, I think something missing in the OP is an explanation of the issue at extension discovery, i.e. what happens at scale due to the TLS config race.

@jimmidyson (Member, Author)

When this occurs, the ExtensionConfigs become unreconciled, and ClusterClasses that reference them therefore become unreconcilable. Sorry, I don't have resource histories showing this, but the log I shared in the OP would trigger it.

We're hitting this with just 2 ExtensionConfigs, and luckily for us it has so far recovered pretty quickly on requeue; thanks to jitter etc. this does not materially affect cluster deployments. However, at scale, with more ExtensionConfigs, this race will be hit more often and could lead to worse outcomes.

I'll update the new PR with this detail as well as the original description from this PR.


```go
	tlsConfig, err := transport.TLSConfigFor(&transport.Config{
		TLS: transport.TLSConfig{
			CertFile: opts.certFile,
```