amazon/xen... any way at all to pass a message/signal/semaphore/morse-code to the boot loader?

Wed Apr 19 12:41:50 UTC 2017

Sorry to jump in on this a bit late, but I don't understand the statement "no way I found in EC2 to easily switch the partition to boot from". If you mean passing flags to the bootloader as you would on a physical on-prem box, you might be right. If you mean booting from a different partition of the same EBS volume, you're probably right too. Either way, I'd urge you to think in terms of whole disks when dealing with EC2 and EBS.

Leif suggested a good solution, where you keep a snapshot of a recovery drive. That is very straightforward to automate. You could write a bit of Python that takes an instance ID, stops the instance if it's running, moves the existing sda to sdb, attaches a volume created from the recovery snapshot as sda, tags the instance as "recovery mode", starts the instance, and posts a notification to an SNS topic. The SNS notification could generate an email to the right people (basically "this instance is now in recovery mode, you should go have a look"). That Python code could run in Lambda (i.e., serverless) and could be exposed via API Gateway. Or you could invoke it manually. You could use a variety of access control/authorisation mechanisms to ensure that only the right people could invoke that API, and you could use AWS IAM to ensure that the Lambda function can manipulate only the right set of instances.
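
To make that concrete, here is a rough sketch of the volume swap using boto3. Treat it as a starting point under assumptions, not a tested tool: the function name, the "recovery-mode" tag key, the /dev/sdf secondary device and the notification wording are all mine, and there is no error handling. It assumes you pass in boto3 EC2 and SNS clients and that the recovery snapshot is one you maintain yourself.

```python
def enter_recovery_mode(ec2, sns, instance_id, recovery_snapshot_id, topic_arn,
                        secondary_device="/dev/sdf"):
    """Boot an instance from a recovery volume, keeping its old root disk attached.

    `ec2` and `sns` are boto3 clients, e.g. boto3.client("ec2").
    The tag key and the secondary device name are illustrative assumptions.
    """
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    inst = reservations[0]["Instances"][0]
    root_device = inst["RootDeviceName"]
    zone = inst["Placement"]["AvailabilityZone"]
    broken_volume = next(
        m["Ebs"]["VolumeId"]
        for m in inst["BlockDeviceMappings"]
        if m["DeviceName"] == root_device
    )

    # The root volume cannot be detached while the instance is running.
    if inst["State"]["Name"] != "stopped":
        ec2.stop_instances(InstanceIds=[instance_id])
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.detach_volume(VolumeId=broken_volume, InstanceId=instance_id)

    # Build a fresh recovery volume from the snapshot, in the instance's own zone.
    recovery_volume = ec2.create_volume(
        SnapshotId=recovery_snapshot_id, AvailabilityZone=zone
    )["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[broken_volume, recovery_volume])

    # Recovery image becomes the boot disk; the broken disk rides along as a data disk.
    ec2.attach_volume(VolumeId=recovery_volume, InstanceId=instance_id, Device=root_device)
    ec2.attach_volume(VolumeId=broken_volume, InstanceId=instance_id, Device=secondary_device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[recovery_volume, broken_volume])

    ec2.create_tags(Resources=[instance_id],
                    Tags=[{"Key": "recovery-mode", "Value": "true"}])
    ec2.start_instances(InstanceIds=[instance_id])
    sns.publish(
        TopicArn=topic_arn,
        Subject="Instance in recovery mode",
        Message=f"{instance_id} is now in recovery mode; you should go have a look.",
    )
    return {"broken_volume": broken_volume, "recovery_volume": recovery_volume}
```

Passing the clients in as arguments keeps the function easy to call from Lambda or from a laptop, and easy to exercise against stub clients before pointing it at real instances.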

In that same Python code you can kick off snapshots of the drive being recovered, too, so that you have a backup of it to fall back on if you screw up while attempting recovery. That Python code can add the instance to a quarantine security group that, for example, only allows SSH from your management instances/network while the system is in recovery mode. Basically you'd record all the attributes of the instance just before you go into recovery. Then you go into recovery. And when you're happy with it, you could have some Python that restores the security groups, detaches the recovery EBS volume, removes the recovery tag, restarts the instance from the recovered EBS volume, and optionally deletes the pre-recovery snapshot.
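
The record/quarantine/restore steps described above could look something like the following. Again this is only a sketch under assumptions: the function names, the shape of the saved state dictionary and the "recovery-mode" tag key are invented for illustration, the instance is assumed to be in a VPC (so its security groups can be changed in place), and you would need to persist the returned state somewhere yourself. Call the first function before any volumes are swapped, so that it records the original root volume.

```python
def quarantine(ec2, instance_id, quarantine_group_id):
    """Back up the root volume, record the instance's attributes, then lock it down.

    `ec2` is a boto3 EC2 client. Returns a dict the caller must keep in order
    to undo the quarantine later (the dict layout is an illustrative assumption).
    """
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    inst = reservations[0]["Instances"][0]
    root_device = inst["RootDeviceName"]
    root_volume = next(
        m["Ebs"]["VolumeId"]
        for m in inst["BlockDeviceMappings"]
        if m["DeviceName"] == root_device
    )
    # Safety net in case the recovery attempt makes things worse.
    snapshot = ec2.create_snapshot(
        VolumeId=root_volume,
        Description=f"pre-recovery backup of {instance_id}",
    )
    state = {
        "instance_id": instance_id,
        "root_device": root_device,
        "root_volume": root_volume,
        "security_groups": [g["GroupId"] for g in inst["SecurityGroups"]],
        "backup_snapshot": snapshot["SnapshotId"],
    }
    # Replace all security groups with the single quarantine group.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_group_id])
    return state


def leave_recovery_mode(ec2, state, recovery_volume, delete_backup=False,
                        tag_key="recovery-mode"):
    """Put the repaired root volume back and restore the recorded attributes."""
    instance_id = state["instance_id"]
    volumes = [recovery_volume, state["root_volume"]]

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    for volume in volumes:
        ec2.detach_volume(VolumeId=volume, InstanceId=instance_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=volumes)

    ec2.attach_volume(VolumeId=state["root_volume"], InstanceId=instance_id,
                      Device=state["root_device"])
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[state["root_volume"]])

    ec2.modify_instance_attribute(InstanceId=instance_id,
                                  Groups=state["security_groups"])
    ec2.delete_tags(Resources=[instance_id], Tags=[{"Key": tag_key}])
    ec2.start_instances(InstanceIds=[instance_id])
    if delete_backup:
        ec2.delete_snapshot(SnapshotId=state["backup_snapshot"])
```

The recovery volume is left detached rather than deleted, on the theory that you may want to look at it (or reuse it) afterwards; deleting it is one more call if you prefer.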

The upshot is that you could expose a web page or a REST API to your "customer" allowing them to kick off this "recovery mode" process. Basically they go to a web page and say "I screwed up my instance." All this stuff kicks off and you get an email (or maybe THEY can log in to the recovery mode and try to fix it themselves). Frankly, you could make the whole thing self-service. You could control access to that API or web page in any way that suits you and your customers.
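
For the self-service front end, a Lambda function behind an API Gateway proxy integration might be as small as the sketch below. The request format (a JSON body with an "instance_id" field), the response wording and the `start_recovery` callable are all assumptions of mine; the callable stands in for whatever code actually performs the recovery steps. Authorisation is deliberately left to API Gateway and IAM rather than handled here.

```python
import json
import re

# EC2 instance IDs are "i-" followed by 8 (older) or 17 (newer) hex digits.
INSTANCE_ID = re.compile(r"^i-(?:[0-9a-f]{8}|[0-9a-f]{17})$")


def _response(status, payload):
    """Build a response in the shape the API Gateway proxy integration expects."""
    return {"statusCode": status, "body": json.dumps(payload)}


def make_handler(start_recovery):
    """Wrap a recovery function (hypothetical, supplied by you) as a Lambda handler."""
    def handler(event, context):
        try:
            body = json.loads(event.get("body") or "{}")
        except json.JSONDecodeError:
            return _response(400, {"error": "request body must be JSON"})

        instance_id = body.get("instance_id") if isinstance(body, dict) else None
        if not isinstance(instance_id, str) or not INSTANCE_ID.match(instance_id):
            return _response(400, {"error": "missing or malformed instance_id"})

        start_recovery(instance_id)
        return _response(202, {"instance_id": instance_id,
                               "status": "entering recovery mode"})
    return handler
```

You would then set the Lambda entry point to something like `handler = make_handler(my_recovery_function)`. Validating the instance ID up front is cheap, but the real guard on which instances can be touched should still be the IAM policy attached to the function.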

I think that's the more native AWS way to handle the recovery mode scenario. Of course I may have misunderstood some nuance in the use case. I suspect the platform gives you a bunch of native tools that can help you, but they don't work at all the way passing flags to a boot loader would work.

Paco

On 16/04/2017, 14:43, "owner-freebsd-cloud at freebsd.org on behalf of Julian Elischer" <owner-freebsd-cloud at freebsd.org on behalf of julian at freebsd.org> wrote:

    On 13/4/17 2:07 am, Jeremiah Lott via freebsd-cloud wrote:
    > On Wed, Apr 12, 2017 at 1:30 AM, Leif Pedersen <bilbo at hobbiton.org> wrote:
    >
    >> I keep an extra EBS volume handy that has a simple recovery image. If I get
    >> into trouble, I change the normal boot disk to sdb, and attach my
    >> recovery volume as sda1. Essentially, the extra volume is my "recovery
    >> partition". To make it cheaper, keep only a snapshot of it.
    >>
    > I tried for a while to get some sort of bootloader-based recovery plan in
    > place for our cloud-based systems, like what was originally asked for. We
    > already have a primary and a backup partition in our boot disk, but there
    > was no way I found in EC2 to easily switch the partition to boot from. In
    > the end, I gave up on passing information to the bootloader and used
    > something like the above with multiple images. I actually wrote a script at
    > one point using the aws CLI that you could run from any FreeBSD VM in the
    > same availability zone. It detached the original boot volume from the
    > "broken" instance; attached it as a secondary disk to the recovery image,
    > changed the boot partition, detached it from the recovery image, then
    > re-attached it to the original image. It took a while to run, but required
    > little user input. We kind of kept that as "good enough" for the rare case
    > that an instance became un-bootable and we cared to recover it rather than
    > replace it. I'm not sure we actually ever used it on a customer system. It
    > was used more during development when you are more likely to break stuff
    > (and want to recover coredumps, etc. so you can fix the broken code).

    Thanks for your comments. It appears that you have the same issues that we do.
    Andriy Gapon has been doing some work where nextboot information is saved onto the drive,
    and it knows how many times it has booted, which may be good enough for us. Basically a 'drops to recovery mode after N failures' would be enough for me.

    Is there any chance you can make your "recovery" system available
    (especially if you can give source for the aws CLI stuff)? I think having that as an example
    and starting point might be a good start to making something truly useful.
    It may even be worth adding it to the regular FreeBSD AMI so that any FreeBSD EC2 system could be used to recover other systems.

    In our system there is a single zpool with two ZFS datasets and we use 
    the "bootfs" parameter to select the new image, but it can be 
    overridden from the boot menu, except of course on AWS due to the lack 
    of console.

    >
    > If you go down the route of implementing EC2 network driver(s) in the
    > bootloader, then you could read the instance metadata via http and use a
    > tag to control the boot behavior. However, a bootloader driver, even a very
    > simplistic one, for xn0 (and potentially for both ixv and ena, if you
    > support EC2 Enhanced Networking) was more work than we wanted to undertake
    > for this.
    >
    >    Jeremiah Lott
    >    Avere Systems

Amazon Web Services UK Limited. Registered in England and Wales with registration number 08650665 and which has its registered office at 60 Holborn Viaduct, London EC1A 2FD, United Kingdom.