From: Rob Norris
To: virtualization@freebsd.org
Date: Sun, 5 May 2024 11:56:20 +1000
Subject: Direct Linux loading for bhyve
List-Archive: https://lists.freebsd.org/archives/freebsd-virtualization

Hi all,

Last year I did some work on adding support to bhyve to load a Linux
kernel directly, without needing to create a disk image or configure a
bootloader. I showed a few people at the Dev Summit in Taipei in March,
and the concept was generally well received, so I'm writing this email
to describe where I'm at and where I want to take it, and to seek
comments, ideas and guidance on how to proceed.

The initial motivation was to be able to do the equivalent of QEMU's
-kernel, -append and -initrd options with bhyve, to boot a Linux kernel
directly. (For me, the goal is to port my kernel dev tool "quiz"[1] to
FreeBSD, though that is only tangentially related.)

To do this I added a "loader" class to bhyve, and then wrote a loader
that implements the Linux x86 boot protocol.
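For anyone who hasn't looked at the boot protocol before, here is a
rough sketch of the first thing a loader like this has to do: check that
the file really is a bzImage and find where the protected-mode kernel
starts. This is not the code from the prototype. The field offsets come
from the kernel's x86 boot protocol document; the names
linux_image_probe() and struct linux_image_info are made up for this
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Setup header offsets, per the Linux x86 boot protocol document. */
#define LINUX_SETUP_SECTS_OFF 0x1f1 /* u8: setup size in 512-byte sectors */
#define LINUX_BOOT_FLAG_OFF   0x1fe /* u16: must be 0xAA55 */
#define LINUX_HEADER_OFF      0x202 /* 4 bytes: the magic "HdrS" */
#define LINUX_VERSION_OFF     0x206 /* u16: protocol version, e.g. 0x020f */

/* Hypothetical result type, for illustration only. */
struct linux_image_info {
    uint16_t version;        /* boot protocol version */
    size_t   payload_offset; /* file offset of the protected-mode kernel */
};

/* Read a little-endian 16-bit value without assuming alignment. */
static uint16_t
read_le16(const uint8_t *p)
{
    return ((uint16_t)(p[0] | (p[1] << 8)));
}

/*
 * Hypothetical helper: return true and fill *out if img looks like a
 * bzImage with a valid setup header, false otherwise.
 */
static bool
linux_image_probe(const uint8_t *img, size_t len,
    struct linux_image_info *out)
{
    uint8_t setup_sects;
    size_t off;

    assert(img != NULL && out != NULL);

    if (len < LINUX_VERSION_OFF + 2)
        return (false);
    if (read_le16(img + LINUX_BOOT_FLAG_OFF) != 0xaa55)
        return (false);
    if (memcmp(img + LINUX_HEADER_OFF, "HdrS", 4) != 0)
        return (false);

    /* The protocol defines a setup_sects of 0 to mean 4. */
    setup_sects = img[LINUX_SETUP_SECTS_OFF];
    if (setup_sects == 0)
        setup_sects = 4;

    /* One boot sector plus setup_sects sectors precede the kernel. */
    off = ((size_t)setup_sects + 1) * 512;
    if (off > len)
        return (false);

    out->version = read_le16(img + LINUX_VERSION_OFF);
    out->payload_offset = off;
    return (true);
}
```

A real loader would go on to copy the payload into guest memory and fill
in the boot parameters ("zero page"); this sketch stops at validation.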
Some links:

* Prototype:
    https://github.com/robn/freebsd-src/tree/bhyve-loader-linux/usr.sbin/bhyve
* Demo run using the kernel and initrd from a Debian installer iso:
    https://asciinema.org/a/FuXehcd5MkWb7LE15s1VT2ugK

I'll describe how it's put together here.

loader.h and loader.c define a trivial struct loader. Each loader module
defines one and adds it to loader_set.

    struct loader {
        const char *l_name;
        /* Called after guest memory is created. */
        int (*l_setup_memory)(struct vmctx *ctx);
        /* Called once the boot CPU is created. */
        int (*l_setup_boot_cpu)(struct vmctx *ctx, struct vcpu *vcpu);
    };

    static const struct loader loader_linux = {
        .l_name = "linux",
        .l_setup_memory = loader_linux_setup_memory,
        .l_setup_boot_cpu = loader_linux_setup_boot_cpu,
    };
    LOADER_SET(loader_linux);

It's pretty straightforward: after memory is created, l_setup_memory()
is called to load whatever is wanted into it. Then, once the boot CPU is
created, l_setup_boot_cpu() is called to set initial registers and
insert anything needed to hook up the final memory map, device state or
whatever else. It's not so different to the existing bootrom support
(indeed, an early version was just setting it up as an alternate
bootrom).

The details are in amd64/loader_linux.c. For a second opinion, I wrote a
loader_multiboot2.c[2], though it's not finished and not working
properly. I suspect it's not very far away but in any case, it does show
the shape.

Apart from the normal matters of style, documentation, testing, and
other "productionising" tasks, there are at least two things to address:

* I need a way to create a VM that will be destroyed when bhyve exits,
  including if it is killed. KVM does it by binding the VM to a file
  descriptor; when the descriptor is closed, the VM is destroyed. That's
  a fairly common pattern in Linux for managing lifetimes of kernel
  resources from userspace. I'm new enough to FreeBSD to not know what
  the common pattern for that kind of thing is (pointers appreciated!).
  Regardless, this feels like it's mostly a case of plumbing.

* I don't have dedicated command line options yet. '-o loader.name=foo'
  is used to select the loader, and any other options the loader has to
  sort out by itself (e.g. the Linux loader knows 'loader.kernel',
  'loader.initrd' and 'loader.cmdline'). Maybe we don't _need_ dedicated
  command line options; I don't know how to decide that.

And then there's a bunch of observations around possible future areas
for improvement in both bhyve and libvmm. These aren't showstoppers for
some form of loader support landing, but they're definitely places
things can be better:

* The memory layout story doesn't seem very flexible. The Linux loader
  needs to stake out multiple different regions, but there's not really
  any help to know what's already claimed, or claim it in turn. I just
  have to pick some spots that probably aren't going to be used by any
  device mapping or similar. It's not that hard, but it seems like some
  kind of allocator concept might be useful. Not unlike the e820
  allocator perhaps, but they're not "physical" regions as such (that
  is, shouldn't be exposed to the OS in the e820 table).

* Semi-related, something that would help a lot is some ability to map a
  region of host memory directly into the guest address space. Then
  instead of copying things in, we can just mmap() and give the
  pointer to the hypervisor directly. This isn't so important for a
  loader, but as far as I can tell it's a requirement if QEMU itself
  were ever to use bhyve/libvmm for acceleration, as it wants to set up
  its memory layout directly and then just hand it to the hypervisor
  (getting QEMU running is another side project I have on the go, so
  I've thought about this a little bit).

* It does seem like this loader concept has some overlap with the
  bootrom support, and maybe bootrom should be just another kind of
  loader. But, maybe not, since a bootrom is a real device, not just
  stuff in memory.

* It's really, really hard to set up the register state properly. This
  might just be a reflection of the complexity of the problem,
  especially since I'm trying to set things up so the CPU starts in
  64-bit long mode, and there are very few examples of that out there
  (even QEMU installs a tiny bootrom to have the guest do the
  transitions before bouncing into Linux). Regardless, it seems like
  helpers to assist with building the GDT, or setting segment shadow
  registers, or control registers, etc, would make this sort of thing a
  lot easier (incidentally, there is some help for this within libvmm,
  but only for setting up a FreeBSD guest, which seems out of place).

I think that's everything for now. I'm very interested in any thoughts,
opinions, guidance or complaints people have. I also hang around in
#bhyve on IRC and on the FreeBSD Discord, and I'll be at BSDCan later
this month if you want to chat to me about it.

Cheers,
Rob.

1. https://despairlabs.com/blog/posts/2024-03-04-quiz-rapid-openzfs-development/
2. https://github.com/robn/freebsd-src/blob/bhyve-loader-multiboot2/usr.sbin/bhyve/amd64/loader_multiboot2.c