From: Elena Mihailescu <elenamihailescu22@gmail.com>
Date: Mon, 17 Jul 2023 18:43:54 +0200
Subject: Re: Warm and Live Migration Implementation for bhyve
To: Corvin Köhne
Cc: freebsd-virtualization@freebsd.org, Mihai Carabas, Matthew Grooms
List-Archive: https://lists.freebsd.org/archives/freebsd-virtualization

Hi Corvin,

On Mon, 3 Jul 2023 at 09:35, Corvin Köhne wrote:
>
> On Tue, 2023-06-27 at 16:35 +0300, Elena Mihailescu wrote:
> > Hi Corvin,
> >
> > Thank you for the questions! I'll respond to them inline.
> >
> > On Mon, 26 Jun 2023 at 10:16, Corvin Köhne wrote:
> > >
> > > Hi Elena,
> > >
> > > thanks for posting this proposal here.
> > >
> > > Some open questions from my side:
> > >
> > > 1. How is the data sent to the target? Does the host send a
> > > complete dump and the target parses it? Or does the target request
> > > data one by one and the host sends it as a response?
> > >
> > It's not a dump of the guest's state; it's transmitted in steps.
> > However, some parts may be migrated as a chunk (e.g., the emulated
> > devices' state is transmitted as the buffer generated by the
> > snapshot functions).
> >
>
> How does the receiver know which chunk relates to which device? It
> would be nice if you could start bhyve on the receiver side without
> parameters, e.g. `bhyve --receive=127.0.0.1:1234`. Therefore, the
> protocol has to carry some information about the device configuration.
Regarding your first question, we send a chunk of data (a buffer) with
the state: we restore the data in the same order we saved it. It relies
on save/restore. We currently do not support migrating between
different versions of suspend/resume or of the migration code.

It would be nice to have something like `bhyve --receive=127.0.0.1:1234`,
but I don't think it is possible at this point, mainly for two reasons:
- the guest image must be shared (e.g., via NFS) between the source and
destination hosts. If the mount points differ between the two, opening
the disk at the destination will fail (we must also assume the user
used an absolute path, since a relative one won't work)
- if the VM uses a network adapter, we must specify the tap interface
on the destination host (e.g., if the VM uses `tap0` on the source
host, `tap0` may not exist on the destination host, or may be in use by
other VMs).

> > I'll try to describe the protocol we have implemented for
> > migration; maybe it can partially answer the second and third
> > questions.
> >
> > The destination host waits for the source host to connect (through
> > a socket). After that, the source sends its system specifications
> > (hw_machine, hw_model, hw_pagesize). If the source and destination
> > hosts have identical hardware configurations, the migration can
> > take place.
> >
> > Then, for live migration, we migrate the memory in rounds (i.e., we
> > get a list of the pages that have the dirty bit set, send it to the
> > destination so it knows which pages will be received, then send the
> > pages through the socket; this process is repeated until the last
> > round).
> >
> > Next, we stop the guest's vcpus and send the remaining memory (for
> > live migration) or the guest's memory from vmctx->baseaddr (for
> > warm migration).
> > Then, based on the suspend/resume feature, we get the state of the
> > virtualized devices (the ones from kernel space) and send this
> > buffer to the destination. We repeat this for the emulated devices
> > as well (the ones from userspace).
> >
> > On the receiving host, we get the memory pages and place them at
> > their corresponding positions in the guest's memory, use the
> > restore functions for the state of the devices, and start the
> > guest's execution.
> >
> > Excluding the guest's memory transfer, the rest is based on the
> > suspend/resume feature. We snapshot the guest's state, but instead
> > of saving the data locally, we send it over the network to the
> > destination. On the destination host, we start a new virtual
> > machine, but instead of reading the state from disk (i.e., the
> > snapshot files), we get it over the network from the source host.
> >
> > If the destination can properly resume the guest's activity, it
> > sends an "OK" to the source host so the source can destroy/remove
> > the guest on its end.
> >
> > Both warm and live migration are based on "cold migration". Cold
> > migration means we suspend the guest on the source host and restore
> > it on the destination host from the snapshot files. Warm migration
> > does the same thing over a socket, while live migration changes the
> > way the memory is migrated.
> >
> > > 2. What happens if we add a new data section?
> > >
> > What are you referring to with a new data section? Is this question
> > related to the third one? If so, see my answer below.
> >
> > > 3. What happens if the bhyve version differs on host and target
> > > machine?
> >
> > The two hosts must be identical for migration; that's why we check
> > the specifications of the two hosts before migrating. They are
> > expected to have the same version of bhyve and FreeBSD. We will add
> > an additional check in the check-specs part to verify that both
> > hosts run the same FreeBSD build.
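[Editor's sketch] The specification check described above could look
roughly like the following: the source packs hw_machine, hw_model, and
hw_pagesize into a fixed-layout block, writes it to the migration
socket, and the destination compares it against its own values before
accepting the migration. The struct and function names here are
hypothetical, not bhyve's actual wire format.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical handshake block: the source host fills this in (e.g.
 * from the hw.machine, hw.model, and hw.pagesize sysctls) and writes
 * it to the migration socket before anything else.
 */
struct migration_specs {
	char machine[32];	/* e.g. "amd64" */
	char model[64];		/* CPU model string */
	unsigned long pagesize;	/* page size in bytes */
};

/*
 * The destination accepts the migration only when the received specs
 * match its own exactly, mirroring the "identical hosts" rule above.
 */
static bool
specs_match(const struct migration_specs *local,
    const struct migration_specs *remote)
{
	return (strcmp(local->machine, remote->machine) == 0 &&
	    strcmp(local->model, remote->model) == 0 &&
	    local->pagesize == remote->pagesize);
}
```

On a mismatch the destination would close the socket and report the
incompatibility instead of proceeding with the memory transfer.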
> > As long as the changes in the virtual memory subsystem won't affect
> > bhyve (and how the virtual machine sees/uses the memory), the
> > migration constraints should only be related to suspend/resume. The
> > state of the virtual devices is handled by the snapshot system, so
> > if it is able to accommodate changes in the data structures, the
> > migration process will not be affected.
> >
> > Thank you,
> > Elena
> >
> > > --
> > > Kind regards,
> > > Corvin
> > >
> > > On Fri, 2023-06-23 at 13:00 +0300, Elena Mihailescu wrote:
> > > > Hello,
> > > >
> > > > This mail presents the migration feature we have implemented
> > > > for bhyve. Any feedback from the community is much appreciated.
> > > >
> > > > We have opened a stack of reviews on Phabricator
> > > > (https://reviews.freebsd.org/D34717) that is meant to split the
> > > > code into smaller parts so it can be reviewed more easily. A
> > > > brief history of the implementation can be found at the bottom
> > > > of this email.
> > > >
> > > > The migration mechanism we propose needs two main components in
> > > > order to move a virtual machine from one host to another:
> > > > 1. the guest's state (vCPUs, emulated and virtualized devices)
> > > > 2. the guest's memory
> > > >
> > > > For the first part, we rely on the suspend/resume feature. We
> > > > call the same functions as the ones used by suspend/resume, but
> > > > instead of saving the data in files, we send it over the
> > > > network.
> > > >
> > > > The most time-consuming aspect of migration is transmitting the
> > > > guest's memory. The UPB team has implemented two options to
> > > > accomplish this:
> > > > 1. Warm Migration: The guest's execution is suspended on the
> > > > source host while the memory is sent to the destination host.
> > > > This method is less complex but may cause extended downtime.
> > > > 2. Live Migration: The guest continues to execute on the source
> > > > host while the memory is transmitted to the destination host.
> > > > This method is more complex but offers reduced downtime.
> > > >
> > > > The proposed live migration procedure (pre-copy live migration)
> > > > migrates the memory in rounds:
> > > > 1. In the initial round, we migrate all the guest's memory (all
> > > > pages that are allocated).
> > > > 2. In the subsequent rounds, we migrate only the pages that
> > > > were modified since the previous round started.
> > > > 3. In the final round, we suspend the guest and migrate the
> > > > pages that were modified since the previous round, together
> > > > with the guest's internal state (vCPUs, emulated and
> > > > virtualized devices).
> > > >
> > > > To detect the pages that were modified between rounds, we
> > > > propose an additional dirty bit (a virtualization dirty bit)
> > > > for each memory page. This bit is set every time the page's
> > > > dirty bit is set; however, the virtualization dirty bit is
> > > > reset only when the page is migrated.
> > > >
> > > > The proposed implementation is split into two parts:
> > > > 1. The first one, warm migration, is just a wrapper over the
> > > > suspend/resume feature which, instead of saving the suspended
> > > > state on disk, sends it over the network to the destination.
> > > > 2. The second part, live migration, uses the layer presented
> > > > above, but sends the guest's memory in rounds, as described.
> > > >
> > > > The migration process works as follows:
> > > > 1. we identify:
> > > > - VM_NAME - the name of the virtual machine to be migrated
> > > > - SRC_IP - the IP address of the source host
> > > > - DST_IP - the IP address of the destination host
> > > > - DST_PORT - the port we want to use for migration (default is
> > > > 24983)
> > > > 2. we start a virtual machine on the destination host that will
> > > > wait for a migration. Here, we must specify SRC_IP (and the
> > > > port we want to open for migration, default is 24983),
> > > > e.g.: bhyve ... -R SRC_IP:24983 guest_vm_dst
> > > > 3. using bhyvectl on the source host, we start the migration
> > > > process,
> > > > e.g.: bhyvectl --migrate=DST_IP:24983 --vm=guest_vm
> > > >
> > > > A full tutorial on this can be found here:
> > > > https://github.com/FreeBSD-UPB/freebsd-src/wiki/Virtual-Machine-Migration-using-bhyve
> > > >
> > > > For sending the migration request to a virtual machine, we use
> > > > the same thread/socket that is used for suspend. For receiving
> > > > a migration request, we used an approach similar to the resume
> > > > process.
> > > >
> > > > As some of you may remember seeing similar emails from us on
> > > > the freebsd-virtualization list, I'll present a brief history
> > > > of this project:
> > > > The first part of the project was the suspend/resume
> > > > implementation, which landed in bhyve in 2020 under the
> > > > BHYVE_SNAPSHOT guard (https://reviews.freebsd.org/D19495).
> > > > After that, we focused on two tracks:
> > > > 1. adding various suspend/resume features (multiple device
> > > > support - https://reviews.freebsd.org/D26387, CAPSICUM support
> > > > - https://reviews.freebsd.org/D30471, and a uniform file format
> > > > - during the bhyve bi-weekly calls, we concluded that JSON was
> > > > the most suitable format at the time -
> > > > https://reviews.freebsd.org/D29262) so we can remove the #ifdef
> > > > BHYVE_SNAPSHOT guard.
> > > > 2. implementing the migration feature for bhyve. Since this one
> > > > relies on save/restore but does not modify its behaviour, we
> > > > considered we could pursue both tracks in parallel.
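[Editor's sketch] The pre-copy rounds and the per-page virtualization
dirty bit described above can be modeled in miniature as follows. This
is a toy with a handful of pages and hypothetical names; in bhyve the
bit would live in the page-tracking machinery, and "sending" a page
would write its contents to the migration socket.

```c
#include <stdbool.h>
#include <stddef.h>

#define GUEST_PAGES 8	/* toy guest: 8 pages */

/*
 * Hypothetical per-page "virtualization dirty bit": set whenever the
 * page's dirty bit is set, cleared only when the page is migrated.
 */
static bool vm_dirty[GUEST_PAGES];

/* Called from the (sketched) write-tracking path: mark a page dirty. */
static void
page_dirtied(size_t pfn)
{
	vm_dirty[pfn] = true;
}

/*
 * One pre-copy round: visit every page whose virtualization dirty bit
 * is set and clear the bit as the page goes out (a real implementation
 * would write the page contents to the socket here). Returns the
 * number of pages sent; the sender repeats rounds until this count is
 * small enough, then suspends the guest for the final round.
 */
static size_t
migrate_round(void)
{
	size_t sent = 0;

	for (size_t pfn = 0; pfn < GUEST_PAGES; pfn++) {
		if (vm_dirty[pfn]) {
			vm_dirty[pfn] = false;
			sent++;
		}
	}
	return (sent);
}
```

The first round marks every allocated page dirty and therefore sends
all of them; later rounds only resend pages the guest touched while the
previous round was in flight, which is what bounds the final downtime.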
> > > > We had various presentations in the FreeBSD community on these
> > > > topics: AsiaBSDCon2018, AsiaBSDCon2019, BSDCan2019, BSDCan2020,
> > > > AsiaBSDCon2023.
> > > >
> > > > The first patches for warm and live migration were opened in
> > > > 2021: https://reviews.freebsd.org/D28270 and
> > > > https://reviews.freebsd.org/D30954. However, the general
> > > > feedback on these was that the patches were too big to review,
> > > > so we should split them into smaller chunks (this was also true
> > > > for some of the suspend/resume improvements). Thus, we split
> > > > them into smaller parts. Also, as things changed in bhyve
> > > > (e.g., Capsicum support for suspend/resume was added this
> > > > year), we rebased and updated our reviews.
> > > >
> > > > Thank you,
> > > > Elena
>
> --
> Kind regards,
> Corvin

Thanks,
Elena