Suggestion and Experiment stuck at Running when suggestionCount < requests #1494

@midhun1998

/kind bug

What steps did you take and what happened:

  1. Run an experiment using the grid search algorithm.
  2. Set parallelTrialCount to 5, maxTrialCount to 8, and maxFailedTrialCount to 2.
  3. Run this experiment multiple times. Sometimes the suggestion count returned by the suggestion controller is less than the number requested (the maxTrialCount value), which leaves the trial stuck in the Running state; in this case the suggestion count was 8 and the requested count was 10. As a result, the pipeline's Katib component never finishes, which in turn fails the pipeline.

This is only seen with the grid search algorithm; random search works fine.

The following logs are seen from the Katib controller:

`{"level":"error","ts":1616487038.0000834,"logger":"suggestion-controller","msg":"Reconcile Suggestion error","Suggestion":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","error":"The response contains unexpected trials","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubeflow/katib/pkg/controller.v1beta1/suggestion.(*ReconcileSuggestion).Reconcile\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/suggestion/suggestion_controller.go:175\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

{"level":"error","ts":1616487038.0001156,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"suggestion-controller","request":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","error":"The response contains unexpected trials","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

{"level":"info","ts":1616487043.898402,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}

{"level":"info","ts":1616487043.8985455,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}

{"level":"info","ts":1616487043.8985615,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}

{"level":"info","ts":1616487043.902022,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}

{"level":"info","ts":1616487043.9020426,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}

{"level":"info","ts":1616487043.9020486,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}

{"level":"info","ts":1616487043.9060924,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}

{"level":"info","ts":1616487043.90611,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}

{"level":"info","ts":1616487043.9061158,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}

{"level":"info","ts":1616487043.9130266,"logger":"experiment-controller","msg":"Statistics","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}

{"level":"info","ts":1616487043.9130516,"logger":"experiment-controller","msg":"Reconcile Suggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","addCount":3}

{"level":"info","ts":1616487043.9130602,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"katib-d7c64e16-54fe-47ab-b744-121d62c49b40","name":"test-katib-d7c64e16-54fe-47ab-b744-121d62c49b40","Suggestion Requests":8}
`
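To make the repeating log pattern easier to read, here is a minimal Python sketch that recomputes the logged counters from the experiment settings. The formulas are an assumption inferred from the logged values (fill the parallel slots, capped by the trials still allowed under maxTrialCount), not taken from the Katib source, and `summarize` is a hypothetical helper name. The log lines are copied from the output above and trimmed to the fields used.

```python
import json

# Three log lines copied from the report above, trimmed to the fields used here.
LOG_LINES = [
    '{"logger":"experiment-controller","msg":"Statistics",'
    '"requiredActiveCount":3,"parallelCount":5,"activeCount":0,"completedCount":5}',
    '{"logger":"experiment-controller","msg":"Reconcile Suggestion","addCount":3}',
    '{"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Suggestion Requests":8}',
]

MAX_TRIAL_COUNT = 8  # maxTrialCount from the reproduction steps


def summarize(lines, max_trial_count):
    """Recompute the logged counters (hypothetical helper, assumed formulas)."""
    fields = {}
    for line in lines:
        fields.update(json.loads(line))

    # Assumption: run as many trials in parallel as allowed, but never more
    # than the number of trials still permitted by maxTrialCount.
    required = min(fields["parallelCount"], max_trial_count - fields["completedCount"])
    # Assumption: only the gap between required and currently active trials is added.
    add_count = required - fields["activeCount"]
    # Assumption: the suggestion request is cumulative over the whole experiment.
    requests = fields["completedCount"] + fields["activeCount"] + add_count

    matches_log = (
        required == fields["requiredActiveCount"]
        and add_count == fields["addCount"]
        and requests == fields["Suggestion Requests"]
    )
    return {
        "requiredActiveCount": required,
        "addCount": add_count,
        "requests": requests,
        "matches_log": matches_log,
    }


if __name__ == "__main__":
    print(summarize(LOG_LINES, MAX_TRIAL_COUNT))
```

Under these assumptions the experiment controller's counters agree with each other: 5 trials completed, 3 more are needed, and the cumulative request is 8. The same three lines repeat on every reconcile with activeCount still 0, while the suggestion controller reports "The response contains unexpected trials", so the 3 additional trials are never created.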

What did you expect to happen:
The experiment should complete with all of the requested trials.

Anything else you would like to add:
Similar to issue: #1168

Environment:

  • Kubeflow version (kfctl version): 1.2 (v1beta1)
  • Kubernetes version: (use kubectl version): 1.17
