From: Rob Wing <rob.fx907@gmail.com>
Date: Mon, 17 Jul 2023 09:08:27 -0800
Subject: Re: Warm and Live Migration Implementation for bhyve
To: Elena Mihailescu <elenamihailescu22@gmail.com>
Cc: Corvin Köhne <corvink@freebsd.org>, freebsd-virtualization@freebsd.org, Mihai Carabas, Matthew Grooms
I'm curious why the stream send bits are rolled into bhyve as opposed to
using netcat/ssh to do the network transfer? Sort of how one would do a
zfs send/recv between hosts.

On Monday, July 17, 2023, Elena Mihailescu wrote:
> Hi Corvin,
>
> On Mon, 3 Jul 2023 at 09:35, Corvin Köhne wrote:
> >
> > On Tue, 2023-06-27 at 16:35 +0300, Elena Mihailescu wrote:
> > > Hi Corvin,
> > >
> > > Thank you for the questions! I'll respond to them inline.
> > >
> > > On Mon, 26 Jun 2023 at 10:16, Corvin Köhne wrote:
> > > >
> > > > Hi Elena,
> > > >
> > > > thanks for posting this proposal here.
> > > >
> > > > Some open questions from my side:
> > > >
> > > > 1. How is the data sent to the target? Does the host send a
> > > > complete dump that the target parses? Or does the target request
> > > > the data piece by piece and the host sends each piece as a
> > > > response?
> > > >
> > > It's not a single dump of the guest's state; it's transmitted in
> > > steps. However, some parts may be migrated as one chunk (e.g., the
> > > emulated devices' state is transmitted as the buffer generated by
> > > the snapshot functions).
> > >
> >
> > How does the receiver know which chunk relates to which device? It
> > would be nice if you could start bhyve on the receiver side without
> > parameters, e.g. `bhyve --receive=127.0.0.1:1234`. For that, the
> > protocol has to carry some information about the device
> > configuration.
> >
>
> Regarding your first question, we send a chunk of data (a buffer) with
> the state and restore it on the destination in the same order it was
> saved; this relies on save/restore. We currently do not support
> migrating between different versions of suspend/resume or of the
> migration code.
>
> It would be nice to have something like `bhyve --receive=127.0.0.1:1234`,
> but I don't think it is possible at this point, mainly for two reasons:
> - the guest image must be shared (e.g., via NFS) between the source
> and destination hosts. If the mount points differ between the two,
> opening the disk at the destination will fail (also, we must assume
> the user gave an absolute path, since a relative one won't work)
> - if the VM uses a network adapter, we must specify the tap interface
> on the destination host (e.g., if the VM uses `tap0` on the source
> host, `tap0` may not exist on the destination host or may be in use
> by other VMs)
>
> > > I'll try to describe the protocol we have implemented for
> > > migration; it may partially answer the second and third questions.
> > >
> > > The destination host waits for the source host to connect (through
> > > a socket). After that, the source sends its system specifications
> > > (hw_machine, hw_model, hw_pagesize). If the source and destination
> > > hosts have identical hardware configurations, the migration can
> > > take place.
> > >
> > > Then, if we have live migration, we migrate the memory in rounds
> > > (i.e., we get a list of the pages that have the dirty bit set, send
> > > it to the destination so it knows which pages will be received,
> > > then send the pages through the socket; this process is repeated
> > > until the last round).
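> > >
> > > As a rough sketch of one such round (illustrative only:
> > > send_all() and vm_get_dirty_bitmap() are made-up names, not the
> > > actual bhyve interface; vm_map_gpa() is the usual libvmmapi
> > > mapping call):
> > >
> > > #include <sys/types.h>
> > > #include <sys/socket.h>
> > > #include <machine/param.h>   /* PAGE_SIZE */
> > > #include <stdint.h>
> > > #include <vmmapi.h>          /* struct vmctx, vm_map_gpa() */
> > >
> > > /* Hypothetical kernel interface: one dirty bit per guest page. */
> > > int vm_get_dirty_bitmap(struct vmctx *ctx, uint8_t *bitmap,
> > >     size_t npages);
> > >
> > > /* Loop until the whole buffer has been written to the socket. */
> > > static int
> > > send_all(int sock, const void *buf, size_t len)
> > > {
> > >         const uint8_t *p = buf;
> > >
> > >         while (len > 0) {
> > >                 ssize_t n = send(sock, p, len, 0);
> > >
> > >                 if (n <= 0)
> > >                         return (-1);
> > >                 p += n;
> > >                 len -= n;
> > >         }
> > >         return (0);
> > > }
> > >
> > > /*
> > >  * One pre-copy round: fetch the dirty-page bitmap, announce it to
> > >  * the destination so it knows which pages to expect, then stream
> > >  * the dirty pages in bitmap order.
> > >  */
> > > static int
> > > migrate_round(int sock, struct vmctx *ctx, uint8_t *bitmap,
> > >     size_t npages)
> > > {
> > >         if (vm_get_dirty_bitmap(ctx, bitmap, npages) != 0)
> > >                 return (-1);
> > >         if (send_all(sock, bitmap, (npages + 7) / 8) != 0)
> > >                 return (-1);
> > >         for (size_t i = 0; i < npages; i++) {
> > >                 if ((bitmap[i / 8] & (1 << (i % 8))) == 0)
> > >                         continue;
> > >                 void *page = vm_map_gpa(ctx, i * PAGE_SIZE,
> > >                     PAGE_SIZE);
> > >
> > >                 if (page == NULL ||
> > >                     send_all(sock, page, PAGE_SIZE) != 0)
> > >                         return (-1);
> > >         }
> > >         return (0);
> > > }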
> > >
> > > Next, we stop the guest's vcpus and send the remaining memory (for
> > > live migration) or the guest's memory starting at vmctx->baseaddr
> > > (for warm migration). Then, based on the suspend/resume feature, we
> > > get the state of the virtualized devices (the ones in kernel space)
> > > and send this buffer to the destination. We repeat this for the
> > > emulated devices as well (the ones in userspace).
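> > >
> > > As a sketch of that step (the framing here is an assumption for
> > > illustration; in our code the buffers come from the existing
> > > snapshot callbacks and the receiver relies on the save order), each
> > > state buffer could be sent as a length-prefixed record, reusing the
> > > send_all() helper from the sketch above:
> > >
> > > #include <sys/endian.h>   /* htobe64() */
> > >
> > > /*
> > >  * Send one device-state buffer as a length-prefixed record so the
> > >  * receiver knows how many bytes to pass to the matching restore
> > >  * function. The length travels in a fixed byte order.
> > >  */
> > > static int
> > > send_state_chunk(int sock, const void *buf, uint64_t len)
> > > {
> > >         uint64_t wire_len = htobe64(len);
> > >
> > >         if (send_all(sock, &wire_len, sizeof(wire_len)) != 0)
> > >                 return (-1);
> > >         return (send_all(sock, buf, (size_t)len));
> > > }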
> > >
> > > On the receiving host, we get the memory pages and place them at
> > > their corresponding positions in the guest's memory, use the
> > > restore functions to rebuild the state of the devices, and start
> > > the guest's execution.
> > >
> > > Excluding the guest's memory transfer, the rest is based on the
> > > suspend/resume feature. We snapshot the guest's state, but instead
> > > of saving the data locally, we send it over the network to the
> > > destination. On the destination host, we start a new virtual
> > > machine, but instead of reading the state from disk (i.e., the
> > > snapshot files), we receive it over the network from the source
> > > host.
> > >
> > > If the destination can properly resume the guest's activity, it
> > > sends an "OK" to the source host so it can destroy/remove the
> > > guest on its end.
> > >
> > > Both warm and live migration are based on "cold migration". Cold
> > > migration means we suspend the guest on the source host and restore
> > > the guest on the destination host from the snapshot files. Warm
> > > migration does the same over a socket, while live migration changes
> > > the way the memory is migrated.
> > >
> > > > 2. What happens if we add a new data section?
> > > >
> > > What are you referring to by a new data section? Is this question
> > > related to the third one? If so, see my answer below.
> > >
> > > > 3. What happens if the bhyve version differs on host and target
> > > > machine?
> > >
> > > The two hosts must be identical for migration; that's why we check
> > > the specifications of the two migration hosts against each other.
> > > They are expected to run the same version of bhyve and FreeBSD. We
> > > will add an additional check to the specification exchange to
> > > verify that both hosts run the same FreeBSD build.
> > >
> > > As long as changes in the virtual memory subsystem don't affect
> > > bhyve (and how the virtual machine sees/uses the memory), the
> > > migration constraints should only be those of suspend/resume. The
> > > state of the virtual devices is handled by the snapshot system, so
> > > if it is able to accommodate changes in the data structures, the
> > > migration process will not be affected.
> > >
> > > Thank you,
> > > Elena
> > >
> > > >
> > > > --
> > > > Kind regards,
> > > > Corvin
> > > >
> > > > On Fri, 2023-06-23 at 13:00 +0300, Elena Mihailescu wrote:
> > > > > Hello,
> > > > >
> > > > > This mail presents the migration feature we have implemented
> > > > > for bhyve. Any feedback from the community is much appreciated.
> > > > >
> > > > > We have opened a stack of reviews on Phabricator
> > > > > (https://reviews.freebsd.org/D34717) that is meant to split the
> > > > > code into smaller parts so it can be reviewed more easily. A
> > > > > brief history of the implementation can be found at the bottom
> > > > > of this email.
> > > > >
> > > > > The migration mechanism we propose needs two main components in
> > > > > order to move a virtual machine from one host to another:
> > > > > 1. the guest's state (vCPUs, emulated and virtualized devices)
> > > > > 2. the guest's memory
> > > > >
> > > > > For the first part, we rely on the suspend/resume feature. We
> > > > > call the same functions as suspend/resume, but instead of
> > > > > saving the data in files, we send it over the network.
> > > > >
> > > > > The most time-consuming aspect of migration is transmitting the
> > > > > guest memory. The UPB team has implemented two options to
> > > > > accomplish this:
> > > > > 1. Warm Migration: the guest's execution is suspended on the
> > > > > source host while the memory is sent to the destination host.
> > > > > This method is less complex but may cause extended downtime.
> > > > > 2. Live Migration: the guest continues to execute on the source
> > > > > host while the memory is transmitted to the destination host.
> > > > > This method is more complex but offers reduced downtime.
> > > > >
> > > > > The proposed live migration procedure (pre-copy live migration)
> > > > > migrates the memory in rounds:
> > > > > 1. In the initial round, we migrate all the guest memory (all
> > > > > pages that are allocated).
> > > > > 2. In the subsequent rounds, we migrate only the pages that
> > > > > were modified since the previous round started.
> > > > > 3. In the final round, we suspend the guest, then migrate the
> > > > > remaining pages modified since the previous round together with
> > > > > the guest's internal state (vCPUs, emulated and virtualized
> > > > > devices).
> > > > >
> > > > > To detect the pages that were modified between rounds, we
> > > > > propose an additional dirty bit (a virtualization dirty bit)
> > > > > for each memory page. This bit would be set every time the
> > > > > page's regular dirty bit is set; however, it is reset only when
> > > > > the page is migrated. A small sketch of this bookkeeping
> > > > > follows.
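> > > > >
> > > > > Illustrative only (the flat bitmap and the function names are
> > > > > invented; the real tracking lives in the virtual memory
> > > > > subsystem):
> > > > >
> > > > > #include <stdbool.h>
> > > > > #include <stddef.h>
> > > > > #include <stdint.h>
> > > > >
> > > > > /* Called whenever the regular dirty bit is set for page pfn:
> > > > >  * the migration bit shadows it. */
> > > > > static void
> > > > > note_page_dirty(uint8_t *migrate_bits, size_t pfn)
> > > > > {
> > > > >         migrate_bits[pfn / 8] |= 1 << (pfn % 8);
> > > > > }
> > > > >
> > > > > /* Cleared only here, i.e., once the page has been migrated. */
> > > > > static bool
> > > > > test_and_clear_migrate_dirty(uint8_t *migrate_bits, size_t pfn)
> > > > > {
> > > > >         bool was_dirty =
> > > > >             (migrate_bits[pfn / 8] & (1 << (pfn % 8))) != 0;
> > > > >
> > > > >         migrate_bits[pfn / 8] &= ~(1 << (pfn % 8));
> > > > >         return (was_dirty);
> > > > > }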
> > > > >
> > > > > The proposed implementation is split into two parts:
> > > > > 1. The first one, warm migration, is just a wrapper around the
> > > > > suspend/resume feature which, instead of saving the suspended
> > > > > state on disk, sends it over the network to the destination.
> > > > > 2. The second part, live migration, uses the layer previously
> > > > > presented, but sends the guest's memory in rounds, as described
> > > > > above.
> > > > >
> > > > > The migration process works as follows:
> > > > > 1. we identify:
> > > > >  - VM_NAME - the name of the virtual machine to be migrated
> > > > >  - SRC_IP - the IP address of the source host
> > > > >  - DST_IP - the IP address of the destination host
> > > > >  - DST_PORT - the port we want to use for migration (default
> > > > > is 24983)
> > > > > 2. we start a virtual machine on the destination host that will
> > > > > wait for a migration. Here, we must specify SRC_IP (and the
> > > > > port we want to open for migration, default 24983),
> > > > > e.g.: bhyve ... -R SRC_IP:24983 guest_vm_dst
> > > > > 3. using bhyvectl on the source host, we start the migration
> > > > > process,
> > > > > e.g.: bhyvectl --migrate=DST_IP:24983 --vm=guest_vm
> > > > >
> > > > > A full tutorial on this can be found here:
> > > > > https://github.com/FreeBSD-UPB/freebsd-src/wiki/Virtual-Machine-Migration-using-bhyve
> > > > >
> > > > > For sending the migration request to a virtual machine, we use
> > > > > the same thread/socket that is used for suspend. For receiving
> > > > > a migration request, we used an approach similar to the resume
> > > > > process.
> > > > >
> > > > > As some of you may remember seeing similar emails from us on
> > > > > the freebsd-virtualization list, I'll present a brief history
> > > > > of this project:
> > > > > The first part of the project was the suspend/resume
> > > > > implementation, which landed in bhyve in 2020 under the
> > > > > BHYVE_SNAPSHOT guard (https://reviews.freebsd.org/D19495).
> > > > > After that, we focused on two tracks:
> > > > > 1. adding various suspend/resume features (multiple device
> > > > > support - https://reviews.freebsd.org/D26387, CAPSICUM support
> > > > > - https://reviews.freebsd.org/D30471, and a uniform file
> > > > > format - during the bhyve bi-weekly calls, we concluded that
> > > > > JSON was the most suitable format at that time -
> > > > > https://reviews.freebsd.org/D29262) so we can remove the
> > > > > #ifdef BHYVE_SNAPSHOT guard.
> > > > > 2. implementing the migration feature for bhyve. Since this
> > > > > relies on save/restore but does not modify its behaviour, we
> > > > > considered that the two tracks could proceed in parallel.
> > > > > We have given various presentations to the FreeBSD community on
> > > > > these topics: AsiaBSDCon2018, AsiaBSDCon2019, BSDCan2019,
> > > > > BSDCan2020, AsiaBSDCon2023.
> > > > >
> > > > > The first patches for warm and live migration were opened in
> > > > > 2021: https://reviews.freebsd.org/D28270 and
> > > > > https://reviews.freebsd.org/D30954. However, the general
> > > > > feedback was that the patches were too big to review and
> > > > > should be split into smaller chunks (this was also true for
> > > > > some of the suspend/resume improvements). Thus, we split them
> > > > > into smaller parts. Also, as things changed in bhyve (e.g.,
> > > > > Capsicum support for suspend/resume was added this year), we
> > > > > rebased and updated our reviews.
> > > > >
> > > > > Thank you,
> > > > > Elena
> > > > >
> > > >
> >
> > --
> > Kind regards,
> > Corvin
>
> Thanks,
> Elena