Conversation

@stbenjam (Member) commented Aug 7, 2019

fixes #724

@stbenjam added the CI check this PR with CI label Aug 7, 2019
@metal3ci commented Aug 7, 2019

Build ABORTED, see build http://10.8.144.11:8080/job/dev-tools/1004/

@metal3ci commented Aug 7, 2019

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1005/

@stbenjam (Member, Author) commented Aug 7, 2019

Looks like we're hitting openshift/baremetal-runtimecfg#15

@stbenjam changed the title from "Bump release to 4.2.0-0.ci-2019-08-07-085324-kni.0" to "Bump release to 4.2.0-0.ci-2019-08-07-190500-kni.0" Aug 7, 2019
@metal3ci commented Aug 7, 2019

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1009/

@metal3ci commented Aug 8, 2019

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1011/

@stbenjam changed the title from "Bump release to 4.2.0-0.ci-2019-08-07-190500-kni.0" to "Bump release to 4.2.0-0.ci-2019-08-08-094456-kni.0" Aug 8, 2019
@stbenjam (Member, Author) commented Aug 8, 2019

Looks like some of the Ironic stuff was moved into the MAO: https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/baremetal_pod.go

machine-api fails to come up during install now:

[root@dell-r730-021 dev-scripts]# oc get clusteroperator machine-api -o=jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.message}{"\n"}{end}'
Progressing False
Available True
Degraded True Failed when progressing towards operator: 4.2.0-0.ci-2019-08-08-094456-kni.0 because Deployment.apps "metal3" is invalid: [spec.template.spec.containers[1].volumeMounts[0].name: Not found: "metal3-shared", spec.template.spec.containers[2].volumeMounts[0].name: Not found: "metal3-shared", spec.template.spec.containers[3].volumeMounts[0].name: Not found: "metal3-shared", spec.template.spec.containers[4].volumeMounts[0].name: Not found: "metal3-shared", spec.template.spec.containers[5].volumeMounts[0].name: Not found: "metal3-shared", spec.template.spec.containers[6].volumeMounts[0].name: Not found: "metal3-shared", spec.template.spec.containers[7].volumeMounts[0].name: Not found: "metal3-shared", spec.template.spec.initContainers[0].volumeMounts[0].name: Not found: "metal3-shared", spec.template.spec.initContainers[1].volumeMounts[0].name: Not found: "metal3-shared"]
Upgradeable True

@sadasu mentioned we need to create the metal3-shared volume in dev-scripts. I'm not sure exactly how to do that; I tried this, but it didn't work:


[root@dell-r730-021 dev-scripts]# cat assets/templates/91_metal3-shared-volume.yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: metal3-shared
  labels:
    type: local
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: "/shared"

[root@dell-r730-021 dev-scripts]# oc get pv
NAME            CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
metal3-shared   10Gi       RWX            Retain           Available                                   6m43s
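
For reference, the PV shows up as Available, but the validation error above is about the pod spec itself: the containers' volumeMounts reference a volume named "metal3-shared" that isn't declared in the Deployment's pod template, and a cluster-scoped PersistentVolume can't satisfy that. A minimal sketch of the missing piece, assuming the metal3 Deployment object exists to patch (the MAO would likely reconcile this away, so the real fix belongs in baremetal_pod.go):

# add the pod-level volume the volumeMounts refer to; the default
# strategic merge patch preserves any existing volumes in the template
oc patch deployment metal3 -n openshift-machine-api \
  -p '{"spec":{"template":{"spec":{"volumes":[{"name":"metal3-shared","emptyDir":{}}]}}}}'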

@stbenjam (Member, Author) commented Aug 8, 2019

As a rule we shouldn't be adding more tech debt to dev-scripts; can the MAO create this volume? If it's too late due to the 4.2 freeze, I need: (1) to know what changes dev-scripts needs to make this work, (2) an issue tracking the volume creation in the appropriate place (MAO? installer?), and (3) an issue against dev-scripts tracking its removal.

This is currently blocking us going to a new release.

@sadasu (Contributor) commented Aug 8, 2019

I am currently looking at how this shared volume can be created in MAO. I agree that this should not be tech debt against dev-scripts.

@stbenjam (Member, Author) commented Aug 8, 2019

Thanks! openshift/machine-api-operator#373 should fix this, I'll build a new KNI release when that's merged and in a CI build.

@stbenjam changed the title from "Bump release to 4.2.0-0.ci-2019-08-08-094456-kni.0" to "Bump release to 4.2.0-0.ci-2019-08-09-143854-kni.0" Aug 9, 2019
@metal3ci commented Aug 9, 2019

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1025/

@e-minguez (Contributor)

The 24h bug (https://bugzilla.redhat.com/show_bug.cgi?id=1736800) seems to be fixed in 4.2.0-0.nightly-2019-08-09-000333, so hopefully we can use that (or later).

@stbenjam (Member, Author)

Note: this is currently blocked by openshift/machine-api-operator#374. We need that to land before the install will complete successfully.

@hardys commented Aug 13, 2019

I re-tested with registry.svc.ci.openshift.org/ocp/release:4.1.0-0.ci-2019-08-13-013332 which I think should contain the MAO fixes above, but it seems we still need a way to create the missing configmap for the MAO baremetal-operator integration:

$ oc describe pod --namespace openshift-machine-api metal3-7557d6d788-6bqfr | grep Failed
  Warning  Failed     69m (x8 over 70m)    kubelet, master-1  Error: configmaps "metal3-config" not found

I know @imain and @sadasu have been looking into the final solution there (ref openshift/installer#2149), but to make this work now I think we'll need to create that configmap in dev-scripts?
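
A minimal stop-gap sketch for dev-scripts, assuming we simply pre-create the ConfigMap the metal3 pod references (the actual data keys come from whatever 08_deploy_bmo.sh generates today and are deliberately not spelled out here):

# pre-create the ConfigMap in the namespace where the metal3 pod looks it up
oc apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: metal3-config
  namespace: openshift-machine-api
# data: fill in the provisioning settings from 08_deploy_bmo.sh
EOF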

@hardys commented Aug 13, 2019

Or rather, figure out how we can apply the configmap from 08_deploy_bmo.sh during 06_create_cluster.sh, so that the deploy doesn't time out on the MAO.
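
Roughly, that ordering change could look like the following in 06_create_cluster.sh (the paths and file names here are assumptions, not the actual dev-scripts layout):

# run the installer in the background, then apply the configmap as soon as
# the openshift-machine-api namespace exists, so the MAO doesn't time out
openshift-install create cluster --dir ocp &
installer_pid=$!
export KUBECONFIG=ocp/auth/kubeconfig
until oc get namespace openshift-machine-api >/dev/null 2>&1; do sleep 10; done
oc apply -f assets/templates/metal3-config.yaml
wait $installer_pid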

@stbenjam (Member, Author)

We can maybe drop it in the assets directory for now

@hardys commented Aug 13, 2019

We can maybe drop it in the assets directory for now

Yeah, good point. The other issue is that we don't have a way to kill the dnsmasq on the host, which is also done in the 08 script (this will be resolved when we move to bootstrap-hosted Ironic).

In practice that's probably not a problem, since we don't create the worker CRs inside the installer (yet), and if we merge #715, workers can be explicitly handled as a post-deploy task.

Part of this script depends on a working baremetal-operator, which we
don't have right now.  Disable that step until we fix the
machine-api-operator deployment of the baremetal-operator and its
dependencies.
@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1049/

@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1050/

@russellb (Member)

I expect us to merge this without fully fixing the baremetal-operator pod, so I've opened an issue to track the remaining work to fix that: #739

@russellb (Member)

I see in this latest CI run that our release isn't being used ...

Aug 15 19:39:52 | level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-15-190342"
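
A quick way to cross-check which payload is actually in play, assuming dev-scripts injects our release through the installer's standard override variable:

# the installer's standard knob for overriding the release payload
echo "${OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE:-<unset>}"
# and the version the cluster itself is converging on
oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'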

@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1051/

I forgot to wrap this addition in a condition so that it's only checked when the install fails.
@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1052/

When running under the machine-api-operator, the Deployment is just
called "metal3" and not "metal3-baremetal-operator".  Fix this check to
look for the metal3 pod using the updated name.
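
A sketch of what the renamed check might look like (not the exact dev-scripts code):

# wait on the MAO-managed "metal3" Deployment instead of the old
# "metal3-baremetal-operator" name
oc wait --for=condition=Available deployment/metal3 \
  -n openshift-machine-api --timeout=600s
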
@metal3ci

Build SUCCESS, see build http://10.8.144.11:8080/job/dev-tools/1053/

@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1054/

@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1055/

@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1056/

@metal3ci

Build ABORTED, see build http://10.8.144.11:8080/job/dev-tools/1057/

@metal3ci

Build ABORTED, see build http://10.8.144.11:8080/job/dev-tools/1058/

@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1059/

@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1060/

@metal3ci

Build FAILURE, see build http://10.8.144.11:8080/job/dev-tools/1061/

@metal3ci

Build SUCCESS, see build http://10.8.144.11:8080/job/dev-tools/1062/

@russellb changed the title from "Bump release to 4.2.0-0.ci-2019-08-09-143854-kni.0" to "Drop the use of custom KNI release images." Aug 16, 2019
@russellb merged commit 0786591 into openshift-metal3:master Aug 16, 2019
@stbenjam deleted the bump branch August 16, 2019

Labels

CI check this PR with CI


Development

Successfully merging this pull request may close these issues.

Bump release version to 4.2.0-0.ci-2019-08-06-190740 (or later)

6 participants