-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
I was hunting down a performance issue that we are seeing when Knative is reconciling all revisions, for example due to the controller being restarted or global syncs.
The problem happens in the checkAndUpdateDeployment. The code there uses MakeDeployment like in the create case to setup the deployment. It will then check if there are any changes compared to the previous version by running this code:
// If the spec we want is the spec we have, then we're good.
if equality.Semantic.DeepEqual(have.Spec, deployment.Spec) {
return have, nil
}This check will never return true. The reasons are the following:
- Defaulting. Knative sets up the deployment and does not set fields at various places which Kubernetes will default to some value. Those differences are always there. Examples are: the
Protocolfield in ports which gets set toTCP, the TimeoutSeconds/PeriodSeconds/FailureThreshold in the queue proxy probe, maxSurge in the rolling update strategy which gets set to 25, the RevisionHistoryLimit which becomes 10, the APIVersion in an env.valueFrom.FieldRef which gets set tov1, the restart policy that gets set toAlways... - Mutations by others. Istio for example mutates in the labels
service.istio.io/canonical-nameandservice.istio.io/canonical-revisionin the pod template.
Why is this a problem? Assuming there are just 500 revisions in the system. If every revision reconcile is causing the one Update call, then - with a QPS of 40 (Default 5 multiplied by controller count 8), then only these update calls will need 12.5 seconds. If you have rather 3000 revisions, we're talking about 1m15s. And it's not just about the time. It's also the unnecessary hammering on the API server.
How can this be fixed? I currently have two ideas:
-
Pass the previous version along into
MakeDeploymentand there we will need to follow two strategies:a) is to set defaults where applicable. If you define a TCP probe, just specify that. Kubernetes would unlikely change its default, but specifying the explicit value also does not hurt.
b) Use the previous version to set fields that Knative does not care and Kubernetes could change, such as the maxSurge or revision limit.A sample of that can be found in SaschaSchwarze0@e3fe7f5. There is also code added in
checkAndUpdateDeploymentthat dumps the differences which one can separately apply to see the problem. This part is also explained below in the reproduction steps.But, this approach has two disadvantages:
a) it will require a regular verification and extension assuming Kubernetes eventually adds new fields with defaulting
b) it will be a large effort to capture every possible field that others could mutate in, it will never be able to handle situations where others mutate changes on top of Knative -
Pass a target object into
MakeDeployment. This would be an empty object for the create case and the current version for an update. All the existing functions that build the deployment details, pod spec, its container details, would need a rewrite to handle an empty object, or update an object.This will imo be cleaner, but its complexity and amount of changes grew beyond what I wanted to try out in code I am not familiar with so that I have no spike code to show here.
This will also not solve situations where webhooks mutate things that Knative sets.
I am happy to discuss this. You can reach me in Knative slack (@sascha). Also happy to help with the implementation.
--
One more thing I'd like to mention: creating an object from scratch and then applying it to the cluster for both the create and update case could be a pattern. I only focused on the revision controller to configure the deployment. I would not be surprised if that pattern (which I consider broken because of above explanations) is used elsewhere. Whether that is a pain or not depends on how many objects of that kind would be present and how often all objects get reconciled.
What version of Knative?
Current
Expected Behavior
Knative only updates Kubernetes deployments when there is a need to do this.
Actual Behavior
Every reconcile of a revision triggers a deployment update.
Steps to Reproduce the Problem
Add the following code before https://github.com/knative/serving/blob/v0.33.0/pkg/reconciler/revision/cruds.go#L93:
debugDiffs, err := kmp.SafeDiff(have.Spec, desiredDeployment.Spec)
if err != nil {
return nil, err
}
logger.Infof("Updating deployment %s/%s with diff\n", deployment.Namespace, deployment.Name, debugDiffs)Then let Knative running while you have revisions in the system. You can observe the unnecessary updates in the container logs of the controller.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status