[coreboot] Improving normal/fallback mechanism (Was: Bricked Lenovo T60)

Fri Jun 21 23:03:41 CEST 2013

On Fri, 21 Jun 2013 09:26:11 -0700
ron minnich <rminnich at gmail.com> wrote:

> the question about fallback is 'when do I tell the machine that the
> normal boot succeeded'? At LANL, we learned the best place: as LATE in
> the boot process as possible, long after LInux is up. You want to be
> sure, if you set 'booted ok', that it is LINUX that booted ok, not
> just coreboot. That's a key piece.
> 
> Because if coreboot sets 'booted ok', and then the node doesn't boot,
> that's not doing you much good, is it? We learned that the hard way.
> 
> So: only linux boot scripts get to set 'normal booted ok', and that
> should be the last thing you do
> and
> many things get to clear 'normal booted ok', including linux, the
> payload, and coreboot itself.

The issue is that the coreboot implementation doesn't currently work
like that.

An implementation of what you described would work like this:
At very early boot, coreboot would record the value of a
booted_ok nvram setting in a variable, and then reset the
booted_ok nvram parameter to false.
If the recorded value is true, it would then boot the image selected
by the boot_option nvram setting (Normal, for instance).
Then the OS would boot, and the last boot script to run would do
something like nvramtool -w booted_ok=true

When something goes wrong, coreboot would record the value of
booted_ok and find that the previous boot was ok, then it would set
that value to false. Since something goes wrong, the boot wouldn't
complete, so booted_ok would never be set back to true.
The user would then shut down the computer and power it on again.
Coreboot would look at the booted_ok value, find that there was a
problem the last time the computer booted, and because of that would
run the images that have the fallback prefix in cbfs.
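
To make the two cases above concrete, here is a minimal sketch of the
proposed flow as a small Python model. It is not coreboot code: the
nvram is a plain dict, the function names are made up for illustration,
and lower-casing the boot_option value to get the cbfs prefix is an
assumption.

```python
# Hypothetical model of the proposed booted_ok flow (not coreboot code).
# The nvram is modelled as a dict; all function names are illustrative.

def early_boot(nvram):
    """Return the cbfs prefix that would be booted under the proposal."""
    # Record the flag left by the previous boot.
    previous_boot_ok = nvram.get("booted_ok", False)
    # Always reset the flag; only the OS is allowed to set it again.
    nvram["booted_ok"] = False
    if previous_boot_ok:
        # Honour boot_option (assumed values: "Normal" or "Fallback").
        return nvram.get("boot_option", "Fallback").lower()
    # The previous boot never reached the end of the OS boot scripts.
    return "fallback"


def os_boot_finished(nvram):
    """Stands in for the last boot script: nvramtool -w booted_ok=true"""
    nvram["booted_ok"] = True


def simulate(nvram, os_boots):
    """Run one power cycle; os_boots says whether the OS comes up fully."""
    prefix = early_boot(nvram)
    if os_boots:
        os_boot_finished(nvram)
    return prefix
```

Starting from booted_ok=true and boot_option=Normal, a cycle where the
OS fails still runs the normal image but leaves booted_ok at false, so
the next power-on picks the fallback image. Once a boot completes, the
flag is set again and the following power-on returns to normal.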

Here would be the advantages and disadvantages of that approach:
----------------------------------------------------------------
-> It would require cooperation from the OS/Distribution, so for
instance the user would be expected to put a systemd unit
in /etc/systemd/system, and we would probably also have to create a
sysvinit script for that. We could probably add support for clearing
the booted_ok flag in SeaBIOS for the cases where adding an init script
is not an option, and make a Kconfig option in coreboot to select that
SeaBIOS option. I guess that adding an nvram option for telling SeaBIOS
to clear that value is probably a bad idea, because the change would
affect too many boards (see below).

-> The good thing is that it's way more reliable than the current
approach, which seems to mark the boot as good in ramstage. From IRC:
<kmalkki> if I remember correctly normal boot is marked good before
entering payload
The current approach doesn't guarantee that the user can boot into a
working system: the payload could fail, or the OS could fail to boot
because of a wrong memory layout, for instance.

-> The user would be able to test new changes really easily:
* For instance, with a laptop, the user wouldn't even need to
  disassemble it if the reflashing goes wrong.
* Even for people who are used to reflashing their laptop, it would be
  a huge benefit:
  * Assuming that there is already a working coreboot image for
    the laptop, they wouldn't need to use an external reflashing tool,
    which means faster testing.
  * That also means that they could develop for that laptop in more
    situations (for instance on a train or on a plane, where it
    would be complicated to reflash the laptop externally). If someone
    ports CONFIG_ELOG and/or CONFIG_CHROMEOS_RAMOOPS to all
    laptops, the developer would also get the logs in the case where
    the system didn't boot.
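
As a rough idea of what the OS side mentioned in the first point could
run as its very last boot step (from a systemd unit or a sysvinit
script), here is a small sketch. It only wraps the nvramtool -w call
quoted earlier. booted_ok is the parameter proposed in this mail, so it
would have to exist in the board's CMOS layout, and the true/false
value names are an assumption.

```python
import subprocess


def mark_boot_ok_command(value=True):
    """Build the nvramtool invocation that writes the booted_ok flag."""
    # booted_ok is the hypothetical parameter from this proposal; the
    # "true"/"false" value names are assumed, not an existing layout.
    return ["nvramtool", "-w", "booted_ok=" + ("true" if value else "false")]


def mark_boot_ok(run=subprocess.run):
    """Write booted_ok=true; raises if nvramtool reports an error."""
    # 'run' is injectable so the function can be tried without nvramtool.
    return run(mark_boot_ok_command(), check=True)
```

The script would need to run as root and as late as possible, so that
the flag is only set once the rest of the boot has succeeded.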

Implementation:
---------------
So what would be the best approach for adding support for that?
-> Many boards can use the nvram, and some use it by default.
Should a new cmos layout option be added? Or should we re-use
the last_boot option? Would renaming options or values be a
problem if the layout doesn't change? What about changes of value
types (Fallback/Normal -> false/true)?
Basically, what happens if you have a board that has a CMOS layout,
and you flash an image with a new and different CMOS layout and reboot?
There are 2 cases: either the board has a cmos.default (very few boards
have that), or the board lacks it (which is the case for the majority
of the boards).

-> Should a new Kconfig option and fallback/normal mechanism be added?
Since we have 2 implementations already, that could be a good idea,
and it would be safer that way.

Denis.