Discussion:
Is anyone working on VirtFS (FUSE over VirtIO)
David Chisnall
2023-12-31 12:46:06 UTC
Hi,

For running FreeBSD containers on macOS, I’m using dfr’s update of the 9pfs client code. This seems to work fine but Podman is in the process of moving from using QEMU to using Apple’s native hypervisor frameworks. These don’t provide 9pfs servers and instead provide a native VirtFS server (macOS now ships with a native VirtFS client, as does Linux).

I believe the component bits for at least a functional implementation already exist (FUSE and a VirtIO transport), though I’m not sure about the parts for sharing buffer cache pages with the host. Is anyone working on connecting these together?

David



Alan Somers
2023-12-31 14:35:37 UTC
Post by David Chisnall
Hi,
For running FreeBSD containers on macOS, I’m using dfr’s update of the 9pfs client code. This seems to work fine but Podman is in the process of moving from using QEMU to using Apple’s native hypervisor frameworks. These don’t provide 9pfs servers and instead provide a native VirtFS server (macOS now ships with a native VirtFS client, as does Linux).
I believe the component bits for at least a functional implementation already exist (FUSE and a VirtIO transport), though I’m not sure about the parts for sharing buffer cache pages with the host. Is anyone working on connecting these together?
David
Nobody that I know of. And while I understand the FUSE stuff well,
I'm shakier on VirtIO and the buffer cache. Do you think that this is
something that a GSoC student could accomplish?


David Chisnall
2023-12-31 15:24:50 UTC
Post by Alan Somers
Post by David Chisnall
Hi,
For running FreeBSD containers on macOS, I’m using dfr’s update of the 9pfs client code. This seems to work fine but Podman is in the process of moving from using QEMU to using Apple’s native hypervisor frameworks. These don’t provide 9pfs servers and instead provide a native VirtFS server (macOS now ships with a native VirtFS client, as does Linux).
I believe the component bits for at least a functional implementation already exist (FUSE and a VirtIO transport), though I’m not sure about the parts for sharing buffer cache pages with the host. Is anyone working on connecting these together?
David
Nobody that I know of. And while I understand the FUSE stuff well,
I'm shakier on VirtIO and the buffer cache. Do you think that this is
something that a GSoC student could accomplish?
I’m not familiar enough with either part of the kernel to know. A competent student with two mentors, each familiar with one of the parts, might manage it; but this is of increasing strategic importance. The newer cloud container-hosting platforms are moving to lightweight VMs with VirtFS because it lets them get the same sharing of container image contents between hosts, but with full kernel isolation. It would be easy to plug FreeBSD in as an alternative to Linux with this support.

The VirtFS protocol is less well documented than I’d like, but it appears to primarily be a different transport for FUSE messages and so may be quite easy to add if the FUSE code is sufficiently abstracted.

David
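For reference, the FUSE messages in question are framed by a pair of fixed headers from the stable FUSE kernel ABI (fuse_kernel.h on FreeBSD, fuse.h on Linux); virtio-fs carries exactly the same frames, just in virtqueue buffers instead of reads and writes on /dev/fuse. A minimal illustration of the shared framing (layout per the upstream ABI, not code lifted from either implementation):

#include <stdint.h>

/* Every FUSE request begins with this header, followed by an
 * opcode-specific argument struct (fuse_open_in, fuse_read_in, ...). */
struct fuse_in_header {
    uint32_t len;      /* total length of the request, header included */
    uint32_t opcode;   /* FUSE_LOOKUP, FUSE_OPEN, FUSE_READ, ... */
    uint64_t unique;   /* ties the eventual reply back to this request */
    uint64_t nodeid;   /* inode the operation applies to */
    uint32_t uid;
    uint32_t gid;
    uint32_t pid;
    uint32_t padding;
};

/* Every reply begins with this header, followed by the reply body. */
struct fuse_out_header {
    uint32_t len;
    int32_t  error;    /* 0 on success, otherwise a negated errno */
    uint64_t unique;   /* copied from the request */
};

/*
 * /dev/fuse transport: the daemon read(2)s request frames and write(2)s
 * reply frames.  virtio-fs transport: the same frames go onto a virtqueue,
 * request portion in device-readable descriptors, reply portion in
 * device-writable ones.  The filesystem semantics are untouched.
 */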



David Chisnall
2024-07-13 08:44:40 UTC
Post by Warner Losh
Yea. The FUSE protocol is going to be the challenge here. For this to be useful, the VirtioFS support on the FreeBSD side needs to be 100% in the kernel, since you can't have userland in the loop. This isn't so terrible, though, since our VFS interface provides a natural breaking point for converting the requests into FUSE requests. The trouble, I fear, is that a mismatch between FreeBSD's VFS abstraction layer and Linux's will cause issues (many years ago, the weakness of FreeBSD VFS caused problems for a company doing caching, though things have no doubt improved since those days). Second, there's a KVM tie-in for the direct-mapped pages between the VM and the hypervisor. I'm not sure how that works on the client (FreeBSD) side (though the description also says it's mapped via a PCI BAR, so maybe the VM OS doesn't care).
From what I can tell from a little bit of looking at the code, our FUSE implementation has a fairly cleanly abstracted layer (in fuse_ipc.c) for handling the message queue. For VirtioFS, it would 'just' be necessary to factor out the bits here that do uio into something that talked to a VirtIO ring. I don’t know what the VFS limitations are, but since the protocol for VirtioFS is the kernel <-> userspace protocol for FUSE, it seems that any functionality that works with FUSE filesystems in userspace would work with VirtioFS filesystems.

The shared buffer cache bits are nice, but are optional, so could be done in a later version once the basic functionality worked.

David
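To make the "factor out the uio bits" idea a little more concrete, here is one possible shape for such a seam. This is only a sketch: fuse_transport, fto_send and the rest are hypothetical names, not symbols that exist in fuse_ipc.c today.

/* Hypothetical transport seam -- none of these names exist in the tree. */
struct fuse_ticket;                 /* one outstanding request/reply pair */

struct fuse_transport {
    /* Hand a completed request (fuse_in_header + payload) to the server. */
    int  (*fto_send)(void *softc, struct fuse_ticket *tick);
    /* Invoked by the backend once the matching reply has arrived. */
    void (*fto_complete)(struct fuse_ticket *tick, int error);
    void  *fto_softc;
};

/* fuse_ipc would call through the seam instead of doing uio directly. */
static int
fticket_deliver(struct fuse_transport *ftp, struct fuse_ticket *tick)
{
    return (ftp->fto_send(ftp->fto_softc, tick));
}

/*
 * Backend A (today): fto_send queues the ticket so the daemon's next
 * read(2) on /dev/fuse picks it up via uiomove().
 * Backend B (virtio-fs): fto_send puts the request buffers (device-
 * readable) and the reply buffers (device-writable) on the request
 * virtqueue and kicks the device; the interrupt handler calls
 * fto_complete when the host hands the descriptor chain back.
 */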
Warner Losh
2024-07-13 22:06:10 UTC
Hey David,

You might want to check out https://reviews.freebsd.org/D45370 which has
the testing framework as well as hints at other work that's been done for
virtiofs by Emil Tsalapatis. It looks quite interesting. Anything he's done
that's at odds with what I've said just shows where my analysis was flawed
:) This looks quite promising, but I've not had the time to look at it in
detail yet.

Warner
Post by David Chisnall
Post by Warner Losh
Yea. The FUSE protocol is going to be the challenge here. For this to be useful, the VirtioFS support on the FreeBSD side needs to be 100% in the kernel, since you can't have userland in the loop. This isn't so terrible, though, since our VFS interface provides a natural breaking point for converting the requests into FUSE requests. The trouble, I fear, is that a mismatch between FreeBSD's VFS abstraction layer and Linux's will cause issues (many years ago, the weakness of FreeBSD VFS caused problems for a company doing caching, though things have no doubt improved since those days). Second, there's a KVM tie-in for the direct-mapped pages between the VM and the hypervisor. I'm not sure how that works on the client (FreeBSD) side (though the description also says it's mapped via a PCI BAR, so maybe the VM OS doesn't care).
From what I can tell from a little bit of looking at the code, our FUSE
implementation has a fairly cleanly abstracted layer (in fuse_ipc.c) for
handling the message queue. For VirtioFS, it would 'just' be necessary to
factor out the bits here that do uio into something that talked to a VirtIO
ring. I don’t know what the VFS limitations are, but since the protocol
for VirtioFS is the kernel <-> userspace protocol for FUSE, it seems that
any functionality that works with FUSE filesystems in userspace would work
with VirtioFS filesystems.
The shared buffer cache bits are nice, but are optional, so could be done
in a later version once the basic functionality worked.
David
David Chisnall
2024-07-14 07:11:26 UTC
Wow, that looks incredibly useful. Not needing bhyve / qemu (nested, if your main development environment is a VM) to test virtio drivers would be a huge productivity win.

David
Post by Warner Losh
Hey David,
You might want to check out https://reviews.freebsd.org/D45370 which has the testing framework as well as hints at other work that's been done for virtiofs by Emil Tsalapatis. It looks quite interesting. Anything he's done that's at odds with what I've said just shows where my analysis was flawed :) This looks quite promising, but I've not had the time to look at it in detail yet.
Warner
Emil Tsalapatis
2024-07-14 14:02:48 UTC
Hi David, Warner,

I'm glad you find this approach interesting! I've been meaning to
update the virtio-dbg patch for a while but unfortunately haven't found the
time in the last month since I uploaded it... I'll update it soon to
address the reviews and split the userspace device emulation code out of the patch to make reviewing easier (thanks Alan for the suggestion). If
you have any questions or feedback please let me know.

WRT virtiofs itself, I've been working on it too but I haven't found the
time to clean it up and upload it. I have a messy but working
implementation here
<https://github.com/etsal/freebsd-src/tree/virtiofs-head>. The changes to
FUSE itself are indeed minimal because it is enough to redirect the
messages into a virtiofs device instead of sending them to a local FUSE
device. The virtiofs device and the FUSE device are both simple
bidirectional queues. Not sure how to deal with directly mapping files between host and guest just yet, because the Linux driver uses the DAX interface for that, but it should be possible.

Emil
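As a rough illustration of that redirection (not code from the branch above): the enqueue side could look something like the following. The vtfs_* names and the request layout are invented for the example, and the sglist/virtqueue calls are FreeBSD's existing virtio KPIs as I remember them, so treat the exact signatures as an assumption rather than a reference.

/* Illustrative sketch only.  Assumes the fuse_in_header/fuse_out_header
 * definitions from fs/fuse/fuse_kernel.h. */
#include <sys/param.h>
#include <sys/sglist.h>
#include <dev/virtio/virtio.h>
#include <dev/virtio/virtqueue.h>

struct vtfs_request {                   /* hypothetical per-request state */
    struct fuse_in_header   ihdr;       /* same framing as /dev/fuse uses */
    void                   *in_payload;
    size_t                  in_len;
    struct fuse_out_header  ohdr;
    void                   *out_payload;
    size_t                  out_len;
};

static int
vtfs_enqueue(struct virtqueue *vq, struct sglist *sg, struct vtfs_request *req)
{
    int error;

    /* Error handling for the appends elided for brevity. */
    sglist_reset(sg);
    /* Device-readable: request header plus arguments. */
    sglist_append(sg, &req->ihdr, sizeof(req->ihdr));
    sglist_append(sg, req->in_payload, req->in_len);
    /* Device-writable: room for the reply header plus reply body. */
    sglist_append(sg, &req->ohdr, sizeof(req->ohdr));
    sglist_append(sg, req->out_payload, req->out_len);

    /* Two readable segments followed by two writable segments. */
    error = virtqueue_enqueue(vq, req, sg, 2, 2);
    if (error == 0)
        virtqueue_notify(vq);
    return (error);
}

On the completion side the interrupt handler would pull finished requests back off the queue (virtqueue_dequeue) and feed the reply into the same completion path the /dev/fuse transport uses.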
Post by David Chisnall
Wow, that looks incredibly useful. Not needing bhyve / qemu (nested, if
your main development environment is a VM) to test virtio drivers would be a huge
productivity win.
David
David Chisnall
2024-07-15 07:47:33 UTC
Hi,

This looks great! Are there infrastructure problems with supporting the DAX or is it 'just work'? I had hoped that the extensions to the buffer cache that allow ARC to own pages that are delegated to the buffer cache would be sufficient.

If I understand the protocol correctly, the DAX mode is the same as the direct mmap mode in FUSE (not sure if FreeBSD's kernel fuse bits support this?).

David
Emil Tsalapatis
2024-07-16 20:15:24 UTC
Hi,
Post by David Chisnall
Hi,
This looks great! Are there infrastructure problems with supporting the
DAX or is it ‘just work’? I had hoped that the extensions to the buffer
cache that allow ARC to own pages that are delegated to the buffer cache
would be sufficient.
After going over the Linux code, I think adding direct mapping doesn't
require any changes outside of FUSE and virtio code. Direct mapping mainly
requires code to manage the virtiofs device's memory region in the driver.
This is a shared memory region between guest and host with which the driver
backs FUSE inodes. The driver then includes an allocator used to map parts
of an inode into the region.

It should be possible to pass host-guest shared pages to ARC, with the
caveat that the virtiofs driver should be able to reclaim them at any time.
Does the code currently allow this? Virtiofs needs this because it maps
region pages to inodes, and must reuse cold region pages during an
allocation if there aren't any available. Basically, the region is a
separate pool of device pages that's managed directly by virtiofs.
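For what it's worth, the request that would back such a region slot is FUSE_SETUPMAPPING (with FUSE_REMOVEMAPPING to undo it), which asks the host to map a file extent at a chosen offset inside the shared window. A sketch of the bookkeeping, with the struct laid out as in the upstream FUSE ABI and every vtfs_dax_* name invented for illustration:

#include <stdint.h>
#include <stddef.h>

/* Argument block for FUSE_SETUPMAPPING, as in the upstream FUSE ABI. */
struct fuse_setupmapping_in {
    uint64_t fh;        /* open file handle on the host side */
    uint64_t foffset;   /* offset into the file */
    uint64_t len;       /* length of the mapping */
    uint64_t flags;     /* read/write mapping flags */
    uint64_t moffset;   /* offset into the shared memory window */
};

/* Hypothetical driver-side bookkeeping: carve the window into fixed-size
 * slots, each backing at most one (inode, offset) extent at a time. */
#define VTFS_DAX_SLOT_SIZE  (2 * 1024 * 1024)

struct vtfs_dax_slot {
    uint64_t nodeid;    /* 0 means the slot is free */
    uint64_t foffset;
    uint64_t last_use;  /* for picking a cold slot to reclaim */
};

/* Fill in the mapping request for one slot.  A real driver would first
 * find a free slot, or evict the coldest one (sending FUSE_REMOVEMAPPING
 * for its old contents) when none are available. */
static void
vtfs_dax_fill_setupmapping(struct fuse_setupmapping_in *fsmi, uint64_t fh,
    uint64_t foffset, size_t slot)
{
    fsmi->fh = fh;
    fsmi->foffset = foffset;
    fsmi->len = VTFS_DAX_SLOT_SIZE;
    fsmi->flags = 0;                /* read-only mapping in this sketch */
    fsmi->moffset = (uint64_t)slot * VTFS_DAX_SLOT_SIZE;
}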

Post by David Chisnall
If I understand the protocol correctly, the DAX mode is the same as the direct mmap mode in FUSE (not sure if FreeBSD's kernel fuse bits support this?).
Yeah, virtiofs DAX seems like it's similar to FUSE direct mmap, but with
FUSE inodes being backed by the shared region instead. I don't think
FreeBSD has direct mmap but I may be wrong there.

Emil
David Chisnall
2024-07-17 08:31:26 UTC
Post by Emil Tsalapatis
After going over the Linux code, I think adding direct mapping doesn't require any changes outside of FUSE and virtio code. Direct mapping mainly requires code to manage the virtiofs device's memory region in the driver. This is a shared memory region between guest and host with which the driver backs FUSE inodes. The driver then includes an allocator used to map parts of an inode into the region.
That’s how I understood the spec too.
Post by Emil Tsalapatis
It should be possible to pass host-guest shared pages to ARC, with the caveat that the virtiofs driver should be able to reclaim them at any time. Does the code currently allow this? Virtiofs needs this because it maps region pages to inodes, and must reuse cold region pages during an allocation if there aren't any available. Basically, the region is a separate pool of device pages that's managed directly by virtiofs.
I am not overly familiar with the buffer cache code, but I believe the code that was added to support ARC had similar requirements. The first ZFS port had pages in ARC and then exactly the same data in the buffer cache. The buffer cache was extended with a notion of pages that it didn’t own so that it could just use the pages in ARC directly.

I don’t remember if there’s existing support for ARC to remove those pages from the buffer cache. They are both kernel pages, so it would be possible to treat removing them from ARC as an accounting operation. There is, I believe, support for the pager to remove arbitrary pages, so it might be simple to add a new kind of pager for these pages (one which just tells the host to flush them).
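To sketch the "new kind of pager" idea in the loosest possible terms: the hooks would let the buffer cache or ARC hand region pages back, with the flush turned into a message to the host. Everything below is hypothetical and deliberately avoids claiming the real vm_pager KPI:

#include <stdint.h>
#include <stddef.h>

/* Entirely hypothetical hook table -- not FreeBSD's struct pagerops. */
struct vtfs_dax_extent {
    uint64_t nodeid;    /* FUSE inode the pages belong to */
    uint64_t moffset;   /* where the pages sit in the shared window */
    size_t   len;
};

struct vtfs_dax_pager_ops {
    /* Writeback hook: ask the host to flush this file range before the
     * guest stops referencing the mapped pages. */
    int (*flush)(struct vtfs_dax_extent *ext);
    /* Reclaim hook: the pages are being taken back (an accounting-only
     * operation on the guest side); tell the host to drop the mapping,
     * e.g. with a FUSE_REMOVEMAPPING request. */
    int (*release)(struct vtfs_dax_extent *ext);
};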
If I understand the protocol correctly, the DAX mode is the same as the direct mmap mode in FUSE (not sure if FreeBSD's kernel fuse bits support this?).
Post by Emil Tsalapatis
Yeah, virtiofs DAX seems like it's similar to FUSE direct mmap, but with FUSE inodes being backed by the shared region instead. I don't think FreeBSD has direct mmap but I may be wrong there.
It would be a nice feature to have if not!

David

Warner Losh
2023-12-31 16:19:23 UTC
Top posting: I think you mean VirtioFS, not VirtFS. The latter is the 9p
thing that dfr is doing, the former is FUSE over VirtIO. I'll assume you
mean that.
Post by David Chisnall
Post by Alan Somers
Post by David Chisnall
Hi,
For running FreeBSD containers on macOS, I’m using dfr’s update of the 9pfs client code. This seems to work fine but Podman is in the process of moving from using QEMU to using Apple’s native hypervisor frameworks. These don’t provide 9pfs servers and instead provide a native VirtFS server (macOS now ships with a native VirtFS client, as does Linux).
I believe the component bits for at least a functional implementation already exist (FUSE and a VirtIO transport), though I’m not sure about the parts for sharing buffer cache pages with the host. Is anyone working on connecting these together?
David
Nobody that I know of. And while I understand the FUSE stuff well,
I'm shakier on VirtIO and the buffer cache. Do you think that this is
something that a GSoC student could accomplish?
I’m not familiar enough with either part of the kernel to know. A competent student with two mentors, each familiar with one of the parts, might manage it; but this is of increasing strategic importance. The newer cloud container-hosting platforms are moving to lightweight VMs with VirtFS because it lets them get the same sharing of container image contents between hosts, but with full kernel isolation. It would be easy to plug FreeBSD in as an alternative to Linux with this support.
We shouldn't pin our hopes on GSoC for this. If it is important, it needs
to be funded.
Post by David Chisnall
The VirtFS protocol is less well documented than I’d like, but it appears
to primarily be a different transport for FUSE messages and so may be quite
easy to add if the FUSE code is sufficiently abstracted.
Yea. The FUSE protocol is going to be the challenge here. For this to be useful, the VirtioFS support on the FreeBSD side needs to be 100% in the kernel, since you can't have userland in the loop. This isn't so terrible, though, since our VFS interface provides a natural breaking point for converting the requests into FUSE requests. The trouble, I fear, is that a mismatch between FreeBSD's VFS abstraction layer and Linux's will cause issues (many years ago, the weakness of FreeBSD VFS caused problems for a company doing caching, though things have no doubt improved since those days). Second, there's a KVM tie-in for the direct-mapped pages between the VM and the hypervisor. I'm not sure how that works on the client (FreeBSD) side (though the description also says it's mapped via a PCI BAR, so maybe the VM OS doesn't care).

Now, saying that it's a challenge shouldn't be taken as discouragement. I think it's going to take advice from a lot of different people to be successful. It sounds like a fun project, but I'm already over-subscribed to fun projects for $WORK. I cast no doubt on its importance.

Warner