Description
/kind bug
What steps did you take and what happened:
- Run an experiment using the grid search algorithm. `parallelTrialCount` was set to 5, `maxTrialCount` to 8, and `maxFailedTrialCount` to 2.
- We ran this experiment multiple times, and sometimes the suggestion count returned by the suggestion controller is less than the requested count (which is the `maxTrialCount` value). This causes the trial to get stuck in the `Running` state (in this case it was 8, while the requested count was 10). As a result, the Katib component in the pipeline never finishes, which in turn fails the pipeline.
This is only seen with the grid search algorithm whereas the random search works fine.
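One way a grid search can return fewer suggestions than requested is sketched below. This is only an illustration under an assumption, not Katib's implementation and not a confirmed root cause: it assumes a finite grid that is handed out in order, so once the grid runs out the response is shorter than the request. The function `grid_suggestions` and the search space are hypothetical.

```python
from itertools import islice, product


def grid_suggestions(space, already_returned, request_count):
    """Hypothetical helper: return up to request_count unseen grid points."""
    # Enumerate every combination of the (finite) parameter lists.
    grid = product(*space.values())
    # Skip the points that were handed out in earlier requests.
    remaining = islice(grid, already_returned, None)
    # Take at most request_count points; fewer if the grid is exhausted.
    return [dict(zip(space, point)) for point in islice(remaining, request_count)]


# Hypothetical search space with 2 * 3 = 6 combinations in total.
space = {"lr": [0.01, 0.1], "layers": [1, 2, 3]}

first = grid_suggestions(space, already_returned=0, request_count=5)
second = grid_suggestions(space, already_returned=5, request_count=3)

# The second request asks for 3 points, but only 1 is left in the grid.
print(len(first), len(second))
```

A random search samples from the space instead of enumerating it, so it would not run out of points in this way, which would be consistent with the behaviour only showing up for grid search.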
The following logs are seen from the Katib controller:
`{"level":"error","ts":1616487038.0000834,"logger":"suggestion-controller","msg":"Reconcile Suggestion error","Suggestion":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","error":"The response contains unexpected trials","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubeflow/katib/pkg/controller.v1beta1/suggestion.(*ReconcileSuggestion).Reconcile\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/suggestion/suggestion_controller.go:175\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"error","ts":1616487038.0001156,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"suggestion-controller","request":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","error":"The response contains unexpected trials","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"info","ts":1616487043.898402,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}
{"level":"info","ts":1616487043.8985455,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}
{"level":"info","ts":1616487043.8985615,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}
{"level":"info","ts":1616487043.902022,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}
{"level":"info","ts":1616487043.9020426,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}
{"level":"info","ts":1616487043.9020486,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}
{"level":"info","ts":1616487043.9060924,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}
{"level":"info","ts":1616487043.90611,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}
{"level":"info","ts":1616487043.9061158,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}
{"level":"info","ts":1616487043.9130266,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}
{"level":"info","ts":1616487043.9130516,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}
{"level":"info","ts":1616487043.9130602,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}
`
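The numbers in the `Statistics` and `GetOrCreateSuggestion` log lines above are consistent with the following arithmetic. This is a reading of the logged values under the assumption that the fields relate as shown; the variable names are illustrative and are not taken from the Katib source.

```python
# Values from the experiment spec and from the "Statistics" log lines.
max_trial_count = 8
parallel_trial_count = 5
completed_count = 5
active_count = 0

# Assumed relation: run as many trials as still fit under maxTrialCount,
# capped by parallelTrialCount.
required_active_count = min(parallel_trial_count, max_trial_count - completed_count)

# Trials still to be added to reach the required number of active trials.
add_count = required_active_count - active_count

# Assumed relation: total suggestions requested so far.
suggestion_requests = completed_count + active_count + add_count

print(required_active_count, add_count, suggestion_requests)
```

Under these assumptions the values come out as 3, 3, and 8, matching `requiredActiveCount`, `addCount`, and `Suggestion Requests` in the logs. The experiment controller keeps asking for 8 suggestions while the suggestion controller fails to reconcile, which fits the repeated log entries.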
What did you expect to happen:
Expected the experiment to complete with the full number of requested trials.
Anything else you would like to add:
Similar to issue: #1168
Environment:
- Kubeflow version (`kfctl version`): 1.2 (v1beta1)
- Kubernetes version (use `kubectl version`): 1.17