Unattended updates using OSTree¶
This document expects some basic knowledge about OSTree and how the auto-sig uses it. If you need a refresher about that, see the our OSTree docs.
The basic ostree update operation is atomic in the sense that it is either fully applies, or not at all on the next reboot. The result of the upgrade will never be partial. However, an upgrade can still fail in other ways, like failing to boot, or booting but not working properly. When this happens on a regular computer it is easy to interactively use the boot menu to boot the old version and then roll back.
However, in the automotive use-case, like typical embedded systems, there is no user than can interactively fix things up like this. Instead we need a system that can detect such failures and automatically fall back to the old boot. This is referred to as “Unattended updates”.
Basic mechanisms: watchdog, boot-once¶
The basic mechanism for unattended updates is an external watchdog. The idea is that before applying the new update we tell some external device to start a timer, and then we switch to the new update. If the update succeeds in some predefined time, we tell the watchdog to stop, and the update is considered successful. If the boot succeeds, but we detect that something is broken we can automatically roll back and reboot into the old version.
However, if the boot hangs, the watchdog will not see a stop command. When the time runs out the watchdog will then reset the CPU, triggering a forced reboot.
In order to use the above mechanism we need the system to support some kind of “boot-once” support. This is a where you configure the system to boot into the new version, but unless that boot is successful, the following reboot will boot the original version.
Implementation in QEmu¶
The basic mechanisms described above are implemented differently dependent on the exact hardware, so it is hard to give general documentation on how to do this. Instead we chose to make an example based on a virtual machine in qemu. This way it doesn’t need any particular hardware, and anyone can try it.
Watchdog in QEmu¶
QEmu actually supports some emulated hardware watchdogs, but
unfortunately those all reset the watchdog on system reboot, so it is
not possible to use them for unattended updates. Instead we use the
runvm script in this repo, as it has a simple external watchdog
--watchdog on the command line (and
--verbose if you
want to see messages from the watchdog). This will create a device
/dev/virtio-ports/watchdog.0, in the VM. If you write “START” into
it the watchdog will start a 30 sec timeout, and if you write “STOP”
into it it will stop any outstanding timeout. If the timeout runs
out the script will connect to the qemu monitor and tell it to reset
There is some code in the
rpms/autosig-watchdog to use the watchdog.
watchdog-stop commands, as well as
some systemd service files to integrate with the systems a described
boot-once in grub2¶
The OSTree images uses grub2 to boot the system, this uses the Boot Loader Spec (BLS) files to describe the possible boot targets, and supports a boot counter mechanism to do the fallback. After an update, ostree creates BLS files for the new and the old target, where the new one is first (default boot) and the old is second.
Each time grub boots it loads the
grubenv file, and this can store
key/value state between boots. In particular, it supports the
boot_success keys. If
boot_counter is set it
gets decremented (and saved back to
grubenv) each boot. If
boot_counter reaches 0 we consider the boot failed, and we change the
default to the second BLS entry, thus falling back to the old system.
Health check system integration¶
To combine the above during an update we use greenboot which hooks into OSTree and systemd adding various forms of health checks.
Using greenboot, an regular update would look like this:
- rpm-ostree upgrade prepares (stages) an update, this writes all the basic OS in place for the next boot, but doesn’t merge the system /etc into the new deployment, or configure grub to boot it.
- rpm-ostree triggers
ostree-finalize-staged.service, which will complete the update at the end of this reboot.
- This triggers
boot_counter, enabling boot-once and health checks for the new boot.
- The system is rebooted,
- Before reaching the
greenboot-healthcheck.serviceis run, which runs various checks on the system and detects if it is OK (green) or failed (red).
- In case the system is red, some info is logged and the system
is rebooted. This will trigger the
boot_countermechanism, and falling back to the old ostree deployment. In the next boot the
greenboot-rpm-ostree-grub2-check-fallback.serviceservice will detect this and will make the old default permanent (roll back).
- In case the system is green, the
greenboot-grub2-set-success.servicewill remove the
boot_counterkey and set
boot_success=1in grubenv. This makes further reboots use the new version.
The watchdog service files mentioned above integrate with this setup
in two ways. First of all, the
triggers before the
ostree-finalize-staged.service completes the
migration (at reboot) and starts the watchdog.
watchdog-ostree-stop.service triggers after
boot-complete.target (i.e. after a successful green boot) and stops
upgrade-demo image demonstrates how this can work.
First we build the basic demo, and create a repo for the update.
$ make cs9-qemu-upgrade-demo-ostree.x86_64.qcow2 OSTREE_REPO=upgrade-demo-repo
Then we build the update, and this one includes the
extra rpm which makes the boot slower than the 30 sec of the watchdog.
$ make cs9-qemu-upgrade-demo-ostree.x86_64.repo OSTREE_REPO=upgrade-demo-repo DEFINES='extra_rpms=["autosig-sample-slow-startup"] distro_version="9.1"'
To make it easier to see what version is running we also set a newer version (9.1) for the update.
Then run the image like so:
$ ./runvm --verbose --watchdog --publish-dir=upgrade-demo-repo cs9-qemu-upgrade-demo-ostree.x86_64.qcow2
After login we can check the state of the system:
# rpm-ostree status State: idle Deployments: ● auto-sig:cs9/x86_64/qemu-upgrade-demo Version: 9 (2022-04-05T10:26:04Z) Commit: da5f056764585acb7b618ac826f2555b8ef0cfac7ab783a7e48b4140814dc342 # cat /boot/grub2/grubenv # GRUB Environment Block boot_success=1 ...
Then trigger an update and check out the new state:
# rpm-ostree upgrade Staging deployment... done Added: autosig-sample-slow-startup-0.1-1.el9.x86_64 Run "systemctl reboot" to start a reboot # rpm-ostree status State: idle Deployments: auto-sig:cs9/x86_64/qemu-upgrade-demo Version: 9.1 (2022-04-05T10:28:59Z) Commit: b4eeec5715eb8b18fae89e95e2ac295279e23b84675bb38281c03bc52543db9e Diff: 1 added ● auto-sig:cs9/x86_64/qemu-upgrade-demo Version: 9 (2022-04-05T10:26:04Z) Commit: da5f056764585acb7b618ac826f2555b8ef0cfac7ab783a7e48b4140814dc342 # cat /boot/grub2/grubenv # GRUB Environment Block boot_success=0 boot_counter=1 ...
reboot, and notice the
Starting watchdog for 30 sec output from runvm.
If you manage to log in before the watchdog you can get the state:
# rpm-ostree status State: idle Deployments: ● auto-sig:cs9/x86_64/qemu-upgrade-demo Version: 9.1 (2022-04-05T10:28:59Z) Commit: b4eeec5715eb8b18fae89e95e2ac295279e23b84675bb38281c03bc52543db9e auto-sig:cs9/x86_64/qemu-upgrade-demo Version: 9 (2022-04-05T10:26:04Z) Commit: da5f056764585acb7b618ac826f2555b8ef0cfac7ab783a7e48b4140814dc342 # cat /boot/grub2/grubenv # GRUB Environment Block boot_success=0 boot_counter=0 ...
However, the slow-start service is slower than the watchdog, so after
a short time you should see
Triggering watchdog from runvm, and the
At the end of the next boot you will see
Stopped watchdog from runvm
as the fallback succeeds, and if you look in the logs you will see lines like:
greenboot-rpm-ostree-grub2-check-fallback: FALLBACK BOOT DETECTED! Default rpm-ostree deployment has been rolled back. Reached target Boot Completion Check. Starting Mark boot as successful in grubenv... Starting greenboot Success Scripts Runner... greenboot: Boot Status is GREEN - Health Check SUCCESS Starting Stop watchdog after update on successful boot... Finished greenboot Success Scripts Runner. watchdog-ostree-stop.service: Deactivated successfully. Finished Stop watchdog after update on successful boot. Finished Mark boot as successful in grubenv.
And if you check the status, we’re back at the original version:
# rpm-ostree status State: idle Deployments: ● auto-sig:cs9/x86_64/qemu-upgrade-demo Version: 9 (2022-04-05T10:26:04Z) Commit: da5f056764585acb7b618ac826f2555b8ef0cfac7ab783a7e48b4140814dc342 auto-sig:cs9/x86_64/qemu-upgrade-demo Version: 9.1 (2022-04-05T10:28:59Z) Commit: b4eeec5715eb8b18fae89e95e2ac295279e23b84675bb38281c03bc52543db9e # cat /boot/grub2/grubenv # GRUB Environment Block boot_success=1 ...