From: Corvin Köhne
To: Elena Mihailescu
Cc: freebsd-virtualization@freebsd.org, Mihai Carabas, Matthew Grooms
Subject: Re: Warm and Live Migration Implementation for bhyve
Date: Mon, 03 Jul 2023 09:34:57 +0200
List-Archive: https://lists.freebsd.org/archives/freebsd-virtualization

On Tue, 2023-06-27 at
16:35 +0300, Elena Mihailescu wrote:
> Hi Corvin,
>
> Thank you for the questions! I'll respond to them inline.
>
> On Mon, 26 Jun 2023 at 10:16, Corvin Köhne wrote:
> >
> > Hi Elena,
> >
> > thanks for posting this proposal here.
> >
> > Some open questions from my side:
> >
> > 1. How is the data sent to the target? Does the host send a complete
> > dump and the target parses it? Or does the target request data one by
> > one and the host sends it as a response?
> >
> It's not a dump of the guest's state, it's transmitted in steps.
> However, some parts may be migrated as a chunk (e.g., the emulated
> devices' state is transmitted as the buffer generated from the
> snapshot functions).
>

How does the receiver know which chunk relates to which device? It
would be nice if you could start bhyve on the receiver side without
parameters, e.g. `bhyve --receive=127.0.0.1:1234`. Therefore, the
protocol has to carry some information about the device configuration.

> I'll try to describe a bit the protocol we have implemented for
> migration, maybe it can partially answer the second and third
> questions.
>
> The destination host waits for the source host to connect (through a
> socket).
> After that, the source sends its system specifications (hw_machine,
> hw_model, hw_pagesize). If the source and destination hosts have
> identical hardware configurations, the migration can take place.
>
> Then, if we have live migration, we migrate the memory in rounds
> (i.e., we get a list of the pages that have the dirty bit set, send it
> to the destination to know what pages will be received, then send the
> pages through the socket; this process is repeated until the last
> round).
>
> Next, we stop the guest's vcpus, send the remaining memory (for live
> migration) or the guest's memory from vmctx->baseaddr for warm
> migration.
> Then, based on the suspend/resume feature, we get the state
> of the virtualized devices (the ones from the kernel space) and send
> this buffer to the destination. We repeat this for the emulated
> devices as well (the ones from the userspace).
>
> On the receiver host, we get the memory pages and place them at their
> corresponding positions in the guest's memory, use the restore
> functions for the state of the devices and start the guest's execution.
>
> Excluding the guest's memory transfer, the rest is based on the
> suspend/resume feature. We snapshot the guest's state, but instead of
> saving the data locally, we send it via network to the destination. On
> the destination host, we start a new virtual machine, but instead of
> reading/getting the state from the disk (i.e., the snapshot files) we
> get this state via the network from the source host.
>
> If the destination can properly resume the guest activity, it will
> send an "OK" to the source host so it can destroy/remove the guest
> from its end.
>
> Both warm and live migration are based on "cold migration". Cold
> migration means we suspend the guest on the source host, and restore
> the guest on the destination host from the snapshot files. Warm
> migration only does this using a socket, while live migration changes
> the way the memory is migrated.
>
> > 2. What happens if we add a new data section?
> >
> What are you referring to with a new data section? Is this question
> related to the third one? If so, see my answer below.
>
> > 3. What happens if the bhyve version differs on host and target
> > machine?
>
> The two hosts must be identical for migration, that's why we have the
> part where we check the specifications between the two migration
> hosts. They are expected to have the same version of bhyve and
> FreeBSD. We will add an additional check in the check specs part to
> see if we have the same FreeBSD build.
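
To make the check described above concrete, here is a minimal sketch of
how such a comparison could look. It is only an illustration and not the
code from the review stack: `HostSpecs`, `specs_mismatches` and
`specs_compatible` are hypothetical names, and the `osrelease` field is
an assumed stand-in for the planned "same FreeBSD build" check. Only
hw_machine, hw_model and hw_pagesize are taken from the description.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class HostSpecs:
    """Values the source sends before any guest data is transferred."""
    hw_machine: str   # machine architecture, e.g. "amd64"
    hw_model: str     # CPU model string
    hw_pagesize: int  # page size in bytes
    osrelease: str    # hypothetical stand-in for the planned build check


def specs_mismatches(src: HostSpecs, dst: HostSpecs) -> List[str]:
    """Return the names of all fields that differ between the two hosts."""
    return [f.name for f in fields(HostSpecs)
            if getattr(src, f.name) != getattr(dst, f.name)]


def specs_compatible(src: HostSpecs, dst: HostSpecs) -> bool:
    """Migration may proceed only if every field is identical."""
    return not specs_mismatches(src, dst)
```

Reporting the list of differing fields, rather than a bare yes/no, would
let the destination tell the user why a migration was refused.
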
>
> As long as the changes in the virtual memory subsystem won't affect
> bhyve (and how the virtual machine sees/uses the memory), the
> migration constraints should only be related to suspend/resume. The
> state of the virtual devices is handled by the snapshot system, so if
> it is able to accommodate changes in the data structures, the
> migration process will not be affected.
>
> Thank you,
> Elena
>
> >
> >
> > --
> > Kind regards,
> > Corvin
> >
> > On Fri, 2023-06-23 at 13:00 +0300, Elena Mihailescu wrote:
> > > Hello,
> > >
> > > This mail presents the migration feature we have implemented for
> > > bhyve. Any feedback from the community is much appreciated.
> > >
> > > We have opened a stack of reviews on Phabricator
> > > (https://reviews.freebsd.org/D34717) that is meant to split the code
> > > in smaller parts so it can be more easily reviewed. A brief history of
> > > the implementation can be found at the bottom of this email.
> > >
> > > The migration mechanism we propose needs two main components in order
> > > to move a virtual machine from one host to another:
> > > 1. the guest's state (vCPUs, emulated and virtualized devices)
> > > 2. the guest's memory
> > >
> > > For the first part, we rely on the suspend/resume feature. We call the
> > > same functions as the ones used by suspend/resume, but instead of
> > > saving the data in files, we send it via the network.
> > >
> > > The most time consuming aspect of migration is transmitting guest
> > > memory. The UPB team has implemented two options to accomplish this:
> > > 1. Warm Migration: The guest execution is suspended on the source host
> > > while the memory is sent to the destination host. This method is less
> > > complex but may cause extended downtime.
> > > 2. Live Migration: The guest continues to execute on the source host
> > > while the memory is transmitted to the destination host.
> > > This method is more complex but offers reduced downtime.
> > >
> > > The proposed live migration procedure (pre-copy live migration)
> > > migrates the memory in rounds:
> > > 1. In the initial round, we migrate all the guest memory (all pages
> > > that are allocated)
> > > 2. In the subsequent rounds, we migrate only the pages that were
> > > modified since the previous round started
> > > 3. In the final round, we suspend the guest, migrate the remaining
> > > pages that were modified since the previous round and the guest's
> > > internal state (vCPU, emulated and virtualized devices).
> > >
> > > To detect the pages that were modified between rounds, we propose an
> > > additional dirty bit (virtualization dirty bit) for each memory page.
> > > This bit would be set every time the page's dirty bit is set. However,
> > > this virtualization dirty bit is reset only when the page is migrated.
> > >
> > > The proposed implementation is split in two parts:
> > > 1. The first one, the warm migration, is just a wrapper on the
> > > suspend/resume feature which, instead of saving the suspended state on
> > > disk, sends it via the network to the destination
> > > 2. The second part, the live migration, uses the layer previously
> > > presented, but sends the guest's memory in rounds, as described above.
> > >
> > > The migration process works as follows:
> > > 1. we identify:
> > >  - VM_NAME - the name of the virtual machine which will be migrated
> > >  - SRC_IP - the IP address of the source host
> > >  - DST_IP - the IP address of the destination host
> > >  - DST_PORT - the port we want to use for migration (default is 24983)
> > > 2. we start a virtual machine on the destination host that will wait
> > > for a migration. Here, we must specify SRC_IP (and the port we want to
> > > open for migration, default is 24983).
> > > e.g.: bhyve ... -R SRC_IP:24983 guest_vm_dst
> > > 3. using bhyvectl on the source host, we start the migration process.
> > > e.g.: bhyvectl --migrate=DST_IP:24983 --vm=guest_vm
> > >
> > > A full tutorial on this can be found here:
> > > https://github.com/FreeBSD-UPB/freebsd-src/wiki/Virtual-Machine-Migration-using-bhyve
> > >
> > > For sending the migration request to a virtual machine, we use the
> > > same thread/socket that is used for suspend.
> > > For receiving a migration request, we used a similar approach to the
> > > resume process.
> > >
> > > As some of you may remember seeing similar emails from our part on the
> > > freebsd-virtualization list, I'll present a brief history of this
> > > project:
> > > The first part of the project was the suspend/resume implementation
> > > which landed in bhyve in 2020, under the BHYVE_SNAPSHOT guard
> > > (https://reviews.freebsd.org/D19495).
> > > After that, we focused on two tracks:
> > > 1. adding various suspend/resume features (multiple device support -
> > > https://reviews.freebsd.org/D26387, CAPSICUM support -
> > > https://reviews.freebsd.org/D30471, having a uniform file format -
> > > during the bhyve bi-weekly calls at that time, we concluded that the
> > > JSON format was the most suitable -
> > > https://reviews.freebsd.org/D29262) so we can remove the #ifdef
> > > BHYVE_SNAPSHOT guard.
> > > 2. implementing the migration feature for bhyve. Since this one relies
> > > on the save/restore, but does not modify its behaviour, we considered
> > > we can go in parallel with both tracks.
> > > We had various presentations in the FreeBSD Community on these topics:
> > > AsiaBSDCon2018, AsiaBSDCon2019, BSDCan2019, BSDCan2020,
> > > AsiaBSDCon2023.
> > >
> > > The first patches for warm and live migration were opened in 2021:
> > > https://reviews.freebsd.org/D28270,
> > > https://reviews.freebsd.org/D30954. However, the general feedback on
> > > these was that the patches are too big to be reviewed, so we should
> > > split them into smaller chunks (this was also true for some of the
> > > suspend/resume improvements). Thus, we split them into smaller parts.
> > > Also, as things changed in bhyve (i.e., capsicum support for
> > > suspend/resume was added this year), we rebased and updated our
> > > reviews.
> > >
> > > Thank you,
> > > Elena
> > >
> >

--
Kind regards,
Corvin
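
For readers following the thread, the pre-copy rounds described in the
quoted proposal can be modelled with a small toy simulation. This is a
sketch under stated assumptions, not the bhyve implementation: the
`Guest` class, `migrate_precopy`, and the `max_rounds` and `threshold`
parameters are all hypothetical. In particular, the proposal does not
say how the final round is chosen, so the stop condition below (few
dirty pages left, or a round limit reached) is an assumption.

```python
from typing import Callable, Dict, List, Set


class Guest:
    """Toy guest: page contents plus a per-page migration dirty bit."""

    def __init__(self, pages: Dict[int, bytes]):
        self.pages = dict(pages)
        self.dirty: Set[int] = set(pages)  # first round sends every page
        self.running = True

    def write(self, pfn: int, data: bytes) -> None:
        self.pages[pfn] = data
        self.dirty.add(pfn)                # set whenever the page is modified

    def collect_dirty(self) -> Set[int]:
        """Return the dirty pages and clear their bits before sending."""
        dirty, self.dirty = self.dirty, set()
        return dirty


def migrate_precopy(guest: Guest,
                    announce: Callable[[List[int]], None],
                    send_page: Callable[[int, bytes], None],
                    run_guest: Callable[[Guest], None],
                    max_rounds: int = 5,
                    threshold: int = 1) -> int:
    """Send memory in rounds and return the number of rounds used."""
    rounds = 0
    while True:
        rounds += 1
        # Assumed stop condition: round limit hit or few pages left.
        last = rounds >= max_rounds or (
            rounds > 1 and len(guest.dirty) <= threshold)
        if last:
            guest.running = False          # final round: guest is stopped
        batch = sorted(guest.collect_dirty())
        announce(batch)                    # tell the destination what follows
        for pfn in batch:
            send_page(pfn, guest.pages[pfn])
        if last:
            return rounds
        run_guest(guest)                   # models guest writes between rounds
```

In a test harness, `send_page` can simply be the `__setitem__` method of
a dictionary standing in for the destination's memory, and `run_guest`
a function that replays a scripted list of writes.
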