From: Emil Tsalapatis <emil@etsalapatis.com>
Date: Tue, 16 Jul 2024 16:15:24 -0400
Subject: Re: Is anyone working on VirtFS (FUSE over VirtIO)
To: David Chisnall
Cc: Warner Losh, Alan Somers, FreeBSD Hackers
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
Hi,

On Mon, Jul 15, 2024 at 3:47 AM David Chisnall wrote:

> Hi,
>
> This looks great! Are there infrastructure problems with supporting DAX,
> or is it 'just work'? I had hoped that the extensions to the buffer cache
> that allow ARC to own pages that are delegated to the buffer cache would
> be sufficient.

After going over the Linux code, I think adding direct mapping doesn't require any changes outside of the FUSE and virtio code. Direct mapping mainly requires code to manage the virtiofs device's memory region in the driver. This is a memory region shared between guest and host, with which the driver backs FUSE inodes. The driver then includes an allocator used to map parts of an inode into the region.

It should be possible to pass host-guest shared pages to ARC, with the caveat that the virtiofs driver must be able to reclaim them at any time. Does the code currently allow this?
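The region management sketched in the paragraphs above (a shared window carved into fixed-size chunks, an allocator that maps inode extents into free chunks, and reclaim of the coldest chunk when none are free) might look roughly like this. All names here (`dax_region`, `dax_chunk_alloc`, and so on) are illustrative assumptions, not the actual FreeBSD or Linux driver API:

```c
/*
 * Illustrative sketch only: the virtiofs shared memory region (exposed to
 * the guest via a PCI BAR) divided into fixed-size chunks.  An allocator
 * maps FUSE inode extents into free chunks and, when the region is full,
 * reclaims the coldest chunk.  None of these names come from a real driver.
 */
#include <stdint.h>

#define DAX_CHUNK_SIZE  (2UL * 1024 * 1024)     /* Linux DAX uses 2 MiB windows */
#define DAX_NCHUNKS     8

struct dax_chunk {
    uint64_t ino;       /* backing inode number; 0 = free */
    uint64_t last_use;  /* logical timestamp, for coldness */
};

struct dax_region {
    struct dax_chunk chunks[DAX_NCHUNKS];
    uint64_t clock;
};

/* Mark a chunk as recently used, e.g. on a fault through its mapping. */
static void
dax_touch(struct dax_region *r, int idx)
{
    r->chunks[idx].last_use = ++r->clock;
}

/*
 * Map an extent of `ino` into the region.  Prefer a free chunk; if the
 * region is full, reclaim the coldest one.  A real driver would also have
 * to unmap the victim from its previous inode and invalidate any shared
 * pages it handed out (e.g. to ARC) before reusing it.
 */
static int
dax_chunk_alloc(struct dax_region *r, uint64_t ino)
{
    int victim = 0;

    for (int i = 0; i < DAX_NCHUNKS; i++) {
        if (r->chunks[i].ino == 0) {
            victim = i;
            break;
        }
        if (r->chunks[i].last_use < r->chunks[victim].last_use)
            victim = i;
    }
    r->chunks[victim].ino = ino;
    dax_touch(r, victim);
    return (victim);
}

/* Byte offset of a chunk inside the shared region. */
static uint64_t
dax_chunk_offset(int idx)
{
    return ((uint64_t)idx * DAX_CHUNK_SIZE);
}
```

The LRU policy is a placeholder; the point is only that the allocator, not a generic page cache, owns the region and decides which mappings to sacrifice when it runs dry.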
Virtiofs needs this because it maps region pages to inodes, and must reuse cold region pages during an allocation if none are free. Basically, the region is a separate pool of device pages that is managed directly by virtiofs.

> If I understand the protocol correctly, the DAX mode is the same as the
> direct mmap mode in FUSE (not sure if FreeBSD's kernel fuse bits support
> this?).

Yeah, virtiofs DAX seems to be similar to FUSE direct mmap, but with the FUSE inodes backed by the shared region instead. I don't think FreeBSD has direct mmap, but I may be wrong there.

Emil

> David
>
> On 14 Jul 2024, at 15:07, Emil Tsalapatis wrote:
>
> Hi David, Warner,
>
> I'm glad you find this approach interesting! I've been meaning to update
> the virtio-dbg patch for a while, but unfortunately haven't found the
> time in the month since I uploaded it. I'll update it soon to address
> the reviews and split the userspace device emulation code out of the
> patch to make reviewing easier (thanks, Alan, for the suggestion). If
> you have any questions or feedback, please let me know.
>
> WRT virtiofs itself, I've been working on it too, but I haven't found
> the time to clean it up and upload it. I have a messy but working
> implementation here. The changes to FUSE itself are indeed minimal,
> because it is enough to redirect the messages into a virtiofs device
> instead of sending them to a local FUSE device. The virtiofs device and
> the FUSE device are both simple bidirectional queues. I'm not sure how
> to deal with directly mapping files between host and guest just yet,
> because the Linux driver uses its DAX interface for that, but it should
> be possible.
>
> Emil
>
> On Sun, Jul 14, 2024 at 3:11 AM David Chisnall wrote:
>
>> Wow, that looks incredibly useful. Not needing bhyve/qemu (nested, if
>> your main development machine is a VM) to test virtio drivers would be
>> a huge productivity win.
>>
>> David
>>
>> On 13 Jul 2024, at 23:06, Warner Losh wrote:
>>
>> Hey David,
>>
>> You might want to check out https://reviews.freebsd.org/D45370, which
>> has the testing framework as well as hints at other work that's been
>> done for virtiofs by Emil Tsalapatis. It looks quite interesting.
>> Anything he's done that's at odds with what I've said just shows where
>> my analysis was flawed :) This looks quite promising, but I've not had
>> the time to look at it in detail yet.
>>
>> Warner
>>
>> On Sat, Jul 13, 2024 at 2:44 AM David Chisnall wrote:
>>
>>> On 31 Dec 2023, at 16:19, Warner Losh wrote:
>>>
>>> Yea. The FUSE protocol is going to be the challenge here. For this to
>>> be useful, the VirtioFS support on FreeBSD needs to be 100% in the
>>> kernel, since you can't have userland in the loop. This isn't so
>>> terrible, though, since our VFS interface provides a natural breaking
>>> point for converting the requests into FUSE requests. The trouble, I
>>> fear, is that a mismatch between FreeBSD's VFS abstraction layer and
>>> Linux's will cause issues (many years ago, the weakness of FreeBSD's
>>> VFS caused problems for a company doing caching, though things have
>>> no doubt improved since those days). Second, there's a KVM tie-in for
>>> the direct-mapped pages shared between the VM and the hypervisor. I'm
>>> not sure how that works on the client (FreeBSD) side (though the
>>> description also says it's mapped via a PCI BAR, so maybe the guest
>>> OS doesn't care).
>>>
>>> From what I can tell from a little bit of looking at the code, our
>>> FUSE implementation has a fairly cleanly abstracted layer (in
>>> fuse_ipc.c) for handling the message queue. For VirtioFS, it would
>>> 'just' be necessary to factor out the bits there that do uio into
>>> something that talks to a VirtIO ring.
>>> I don't know what the VFS limitations are, but since the protocol for
>>> VirtioFS is the kernel <-> userspace protocol for FUSE, it seems that
>>> any functionality that works with FUSE filesystems in userspace would
>>> work with VirtioFS filesystems.
>>>
>>> The shared buffer cache bits are nice, but they are optional, so they
>>> could be done in a later version once the basic functionality works.
>>>
>>> David
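The factoring David suggests in the quoted thread, where the queue-handling layer in fuse_ipc.c hands a finished FUSE message to a pluggable transport (uio to /dev/fuse today, a VirtIO ring for virtiofs), could be sketched with a small ops table. Everything below is a hypothetical illustration, not FreeBSD's actual fuse_ipc interface:

```c
/*
 * Sketch of a pluggable FUSE transport: the same in-kernel FUSE request
 * can be handed either to the userspace /dev/fuse path or to a virtio
 * ring, depending on which backend is plugged in.  Names are invented
 * for illustration.
 */
#include <stddef.h>
#include <string.h>

struct fuse_msg {
    const void *buf;    /* serialized FUSE request */
    size_t      len;
};

/* Each transport implements one send hook plus private state. */
struct fuse_transport {
    int (*send)(void *softc, const struct fuse_msg *msg);
    void *softc;
};

/* Dispatch helper a shared fuse_ipc-style layer would call. */
static int
fuse_msg_send(struct fuse_transport *tp, const struct fuse_msg *msg)
{
    return (tp->send(tp->softc, msg));
}

/* Toy "virtio" backend: copy the message into a ring-buffer slot. */
struct toy_vq {
    char   slot[256];
    size_t used;
};

static int
toy_vq_send(void *softc, const struct fuse_msg *msg)
{
    struct toy_vq *vq = softc;

    if (msg->len > sizeof(vq->slot))
        return (-1);
    memcpy(vq->slot, msg->buf, msg->len);
    vq->used = msg->len;
    return (0);
}
```

With this shape, the existing uio copy-out becomes just another `send` implementation, which matches the observation above that the FUSE device and the virtiofs device are both simple bidirectional queues.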