Closed
130 changes: 130 additions & 0 deletions enhancements/rhcos/rhcos-fail-live.md
@@ -0,0 +1,130 @@
---

title: rhcos-fail-live
authors:
- "Ben Howard <@darkmuggle>"
- "Colin Walters <@cgwalters>"
reviewers:
- "@ashcrow"
- "@miabbott"
- "@jlebon"
approvers:
- "@ashcrow"
- "@crawford"
creation-date: 2020-03-24
last-updated: 2020-03-24
status: implementable
---

# RHCOS Ignition Fail to Live


## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

OpenShift installations are started by booting a RHEL CoreOS system using Ignition.
The Ignition configuration is served via the
[machine-config-server](https://github.com/openshift/machine-config-operator/blob/master/docs/MachineConfigServer.md),
which is a network endpoint. Hence, OpenShift installations require networking in
the initial ramdisk. However, not all platforms support DHCP, so static IP
addressing must be supported as well.

## Motivation

In some circumstances, configuring the network might entail trying to catch the
bootloader prompt. In particular, on VMware we do not currently offer a
programmatic way to configure the network in the initramfs.
(This will likely be fixed independently.)

When DHCP is not available, users must configure networking using kernel arguments.
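
As an illustration of what such kernel arguments look like, the sketch below assembles a dracut-style `ip=` argument (the form documented in `dracut.cmdline(7)`) plus an optional `nameserver=` argument. The helper name `static_ip_kargs`, the addresses, the hostname, and the interface name `ens192` are all hypothetical values chosen for the example; they do not come from this proposal.

```python
import ipaddress


def static_ip_kargs(address, gateway, netmask, hostname, interface, nameserver=None):
    """Return dracut-style kernel arguments for one static IPv4 interface.

    Hypothetical helper for illustration only. The argument layout is
    ip=<client-IP>:[<peer>]:<gateway-IP>:<netmask>:<hostname>:<interface>:none
    """
    # Reject malformed IPv4 values early; IPv4Address raises ValueError on bad input.
    for value in (address, gateway, netmask):
        ipaddress.IPv4Address(value)

    # The peer field is left empty, hence the double colon after the address.
    args = [f"ip={address}::{gateway}:{netmask}:{hostname}:{interface}:none"]

    if nameserver is not None:
        ipaddress.IPv4Address(nameserver)
        args.append(f"nameserver={nameserver}")
    return " ".join(args)


if __name__ == "__main__":
    # Example values only (documentation address range, made-up interface name).
    print(static_ip_kargs("192.0.2.10", "192.0.2.1", "255.255.255.0",
                          "node0", "ens192", nameserver="192.0.2.53"))
```

Note that the interface name has to be known in advance to build the argument, which is exactly the difficulty this section describes.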

Further, crafting a static configuration requires knowledge of device interface
names, which is often most easily acquired by inspecting a live system.

Currently, performing machine-specific configuration can entail customizing the
Ignition config generated by the OpenShift installer per machine.

### Goals

When Ignition fails for any reason today, given access to an interactive
console, one can press `Enter` to start a shell in the initial ramdisk. This
enhancement calls for extending that to support an auto-login to the *real* root.

* Provide a user-friendly recovery console for Ignition failures.
* Provide a user-friendly method to configure the network.
* Remove the need for catching a GRUB console.
* Discard all state upon leaving the environment.
* Provide disk-based UPI installation on unsupported platforms using the "metal" target.

### Non-goals

All other tasks.

## Proposal

Rather than failing to the `emergency.target` upon an Ignition Failure.

> Review comment (Member): Remove `.`

Specific platforms (Qemu and Metal) will have platform-specific configurations

> Review comment (Member): I think this is continuing from previous fragment.

for "failing to live" which will provide the means for user recovery from the
error state.

Upon entering the "live failure mode", the user will be logged in automatically
to a terminal with all the tools of the boot image. When the user exits the mode,
a whitelist of settings will be encoded as an additional Ignition payload;
this payload will be superseded by any payload provided by the user.
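
The capture-and-supersede behavior described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: the proposal does not specify which settings are permitted or how the payloads are merged, so the permitted paths, the helper names, and the "user entry wins per file path" rule are all hypothetical. The payload layout follows the general shape of an Ignition spec 3 config (`ignition.version`, `storage.files`).

```python
from urllib.parse import quote

# Hypothetical set of files that may be carried over from the live environment.
PERMITTED_PATHS = {
    "/etc/hostname",
    "/etc/NetworkManager/system-connections/static.nmconnection",
}


def capture_permitted(live_files):
    """Keep only the permitted files from a {path: text} snapshot."""
    return {path: text for path, text in live_files.items() if path in PERMITTED_PATHS}


def to_payload(files):
    """Encode {path: text} as a minimal Ignition-style config dictionary."""
    return {
        "ignition": {"version": "3.0.0"},
        "storage": {
            "files": [
                {"path": path, "mode": 0o600,
                 "contents": {"source": "data:," + quote(text)}}
                for path, text in sorted(files.items())
            ]
        },
    }


def merge_payloads(generated, user):
    """Combine two payloads; a user entry replaces a generated entry with the same path."""
    by_path = {entry["path"]: entry for entry in generated["storage"]["files"]}
    for entry in user.get("storage", {}).get("files", []):
        by_path[entry["path"]] = entry  # assumed rule: the user's entry wins
    return {
        "ignition": user.get("ignition", generated["ignition"]),
        "storage": {"files": [by_path[path] for path in sorted(by_path)]},
    }


if __name__ == "__main__":
    live = {"/etc/hostname": "node0\n", "/root/.bash_history": "ls\n"}
    generated = to_payload(capture_permitted(live))
    user = to_payload({"/etc/hostname": "custom\n"})
    print(merge_payloads(generated, user))
```

In this sketch everything outside the permitted set is dropped when the mode is left, which matches the goal of discarding all other state.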

Only platforms whose configuration implies that a user has interacted with the
boot (e.g., by setting the kernel command-line argument `ignition.url=<...>`),
and where user input can reasonably be expected, will support this mode.

### Risks and Mitigations

This behavior change is aimed at providing users with the ability to use the
tooling in the target system to recover.

## Design Details

### Test Plan

**Note:** *Section not required until targeted at a release.*

#### Examples


### Upgrade / Downgrade Strategy


### Version Skew Strategy

N/A

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Drawbacks

The idea is to find the best form of an argument why this enhancement should _not_ be implemented.

## Alternatives

The CoreOS Team has considered multiple ideas, including:

*Adding error handlers to Ignition*: this was rejected as it would require a
significant re-architecture. Ignition was designed to treat all errors as fatal.

*Attempting to discern the fault causing the error* (such as no networking) and emitting
a totem. This idea would allow an external actor to change how things boot.
The idea is similar to the proposed solution but it would require more logic in Dracut.

*Using the CoreOS Installer to drive network information* was considered and rejected
since it is specific to "metal" images. With this solution, it's conceivable to use
UPI installations on unsupported platforms such as Azure-like installations.

> Review comment (Member): s/Azure/Hyper-V/

Further, it requires another utility and information that may not be known (such as interface naming).