Understanding the rationale behind dropping of "block devices"
Konstantin Belousov
kostikbel at gmail.com
Wed Jan 18 11:46:33 UTC 2017
On Tue, Jan 17, 2017 at 05:15:38PM +0530, Aijaz Baig wrote:
> First of all, a very big thank you to Konstantin for such a detailed reply!
> It has been a
> pleasure to go through. My replies are inline.
>
> On Mon, Jan 16, 2017 at 4:30 PM, Konstantin Belousov <kostikbel at gmail.com>
> wrote:
>
> > This is not true.
> >
> > We do have a buffer cache of the blocks read through the device (special)
> > vnode. This is how, typically, the metadata for filesystems which are
> > clients of the buffer cache is handled, i.e. UFS, msdosfs, cd9660, etc.
> > It is up to the filesystem to not create aliased cached copies of the
> > blocks both in the device vnode buffer list and in the filesystem vnode.
> >
> > In fact, sometimes filesystems, e.g. UFS, consciously break this rule
> > and read blocks of the user vnode through the disk cache. For instance,
> > this happens for the SU truncation of the indirect blocks.
> >
> This makes a lot more sense now. This basically means that no matter what
> the underlying entity (for the vnode) is, be it a device special file or a
> remote location, it always has to go through the VFS layer. So as you
> clearly said, the file-system has sole discretion over what to do with them.
> And as an example, UFS does break this rule by caching disk blocks
> which have already been cached by the VFS.
I do not understand what the 'it' is that has to go through the VFS layer.
>
> > > If you want device M, at offset N we will fetch it for you from the
> > > device, DMA'd directly into your address space,
> > > but there is no cached copy.
> > > Having said that, it would be trivial to add a 'caching' geom layer to
> > > the system but that has never been needed.
> > The useful interpretation of the claim that FreeBSD does not cache
> > disk blocks is that the cache is not accessible over the user-initiated
> > i/o (read(2) and write(2)) through the opened devfs nodes. If a program
> > issues such a request, it indeed goes directly to/from the disk driver,
> > which is supplied a kernel buffer formed by remapped user pages.
> So basically read(2) and write(2) calls on device nodes bypass the VFS
> buffer cache as well, and the IO indeed goes directly through the kernel
> memory pages (which, as you said, are in fact remapped user-land pages).
> So does that mean only the file-system code now uses the disk buffer
> cache?
Right now, in the tree, only filesystems call into vfs_bio.c.
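To make the direct devfs i/o (physio) path described above concrete, here is a
minimal userland sketch. The device name and sector size are only assumptions
for illustration, and such raw i/o is expected to require sector-aligned
offsets and lengths:

#include <sys/types.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
        const char *dev = "/dev/ada0";  /* example device node, an assumption */
        const size_t secsz = 512;       /* assumed sector size */
        void *buf;
        int error, fd;

        /* Raw devfs i/o wants sector-aligned buffers, offsets and sizes. */
        error = posix_memalign(&buf, secsz, secsz);
        if (error != 0)
                errc(1, error, "posix_memalign");
        if ((fd = open(dev, O_RDONLY)) == -1)
                err(1, "open %s", dev);
        /*
         * The read goes through physio: the user pages backing 'buf' are
         * handed to the disk driver, and no copy is left in the buffer cache.
         */
        if (pread(fd, buf, secsz, 0) != (ssize_t)secsz)
                err(1, "pread");
        printf("read one sector directly from %s\n", dev);
        close(fd);
        free(buf);
        return (0);
}

Running this leaves no cached copy of the sector behind; the buffer is only the
caller's own memory, remapped for the duration of the transfer.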
>
> > Note that if this device was or is mounted and the filesystem kept some
> > metadata in the buffer cache, then the devfs i/o would make the
> > cache inconsistent.
> >
> Device being mounted as a file-system you mean? Could you please
> elaborate?
Yes, a device which carries a volume, and that volume is mounted.
>
> > > The added complexity of carrying around two alternate interfaces to
> > > the same devices was judged by those who did the work to be not worth
> > > the small gain available to the very few people who used raw devices.
> > > Interestingly, since that time ZFS has implemented a block-layer cache
> > > for itself which is of course not integrated with the non-existing
> > > block level cache in the system :-).
> > We do carry two interfaces in the cdev drivers, which are lumped into
> > one. In particular, it is not easy to implement mapping of the block
> > devices exactly because the interfaces are mixed.
> By "mapping" of the block devices, you mean serving the IO intended for the
> said disk blocks right?
I mean using the mmap(2) interface on a file which references a device special
node.
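As a hedged illustration of why this is awkward, the sketch below simply tries
to mmap(2) a disk device node. The VM has only the cdevsw mapping hooks
(d_mmap and the later d_mmap_single) to consult here, which do not implement
disk block mapping for typical disk drivers, so the call is expected to fail.
The device name is only an example:

#include <sys/mman.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        const char *dev = "/dev/ada0";  /* example device node, an assumption */
        void *p;
        int fd;

        if ((fd = open(dev, O_RDONLY)) == -1)
                err(1, "open %s", dev);
        /*
         * For a disk cdev the VM would fall back on the cdevsw mapping
         * interfaces, which do not provide disk block mapping, so this
         * is expected to fail.
         */
        p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                warn("mmap of %s", dev);
        else
                printf("mmap of %s unexpectedly succeeded at %p\n", dev, p);
        close(fd);
        return (0);
}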
>
> So as you said, we can either serve the IO via the VFS directly using the
> buffer cache, or we could do that via the file system cache.
>
> > If a cdev disk device is mapped, VM would try to use cdevsw's d_mmap
> > or the later mapping interfaces to handle user page faults, which is
> > incorrect for the purpose of disk block mapping.
> Could you please elaborate?
Read the code, I do not see much sense in rewording things that are
stated in the code.
>
> > > I must add that I am getting confused specifically between two different
> > > things here:
> > > From the replies above it appears that all disk accesses have to go
> > > through the VM subsystem now (so no raw disk accesses), however the arch
> > > handbook says raw interfaces are the way to go for disks
> > > (https://www.freebsd.org/doc/en/books/arch-handbook/driverbasics-block.html)?
> > Do not mix the concept of raw disk access and using some VM code to
> > implement this access. See my other reply for some more explanation of
> > the raw disk access, physio in the kernel source files terminology,
> > sys/kern/kern_physio.c.
> >
> Yes, I have taken note of your earlier replies (thank you for being so
> elaborate) and, as I have reiterated earlier, I now understand that raw
> disk access is direct between the kernel memory and the underlying device.
Such i/o is always direct between memory and the device. The difference is
in who owns the memory used for the i/o, and how this is interpreted by the
system.
> So as you mentioned, the file system code (and perhaps only a few other
> entities) uses the disk buffer cache that the VM implements. So an end user
> cannot interact with the buffer cache in any way, is that what it is?
This question does not make any sense. The buffer cache is a kernel
subsystem, used as a library by other parts of the kernel.
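As a hedged, schematic illustration of that "library" use, the fragment below
shows the bread()/brelse() pattern in-tree filesystems use to pull a metadata
block through the buffer cache. It is a sketch, not a complete compilable
module, and the function name is made up:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/buf.h>
#include <sys/ucred.h>
#include <sys/vnode.h>

/*
 * Hypothetical helper: read one block of a vnode through the buffer cache,
 * the way in-tree filesystems typically read their metadata.
 */
static int
example_read_meta_block(struct vnode *vp, daddr_t lbn, int size)
{
        struct buf *bp;
        int error;

        /*
         * bread() either finds the block among the cached, wired buffer
         * pages or issues the disk read itself.
         */
        error = bread(vp, lbn, size, NOCRED, &bp);
        if (error != 0)
                return (error);
        /* ... the caller would look at bp->b_data here ... */
        brelse(bp);     /* hand the buffer back to the cache */
        return (0);
}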
>
> > > Secondly, I presume that the VM subsystem has its own caching and
> > > buffering mechanism that is independent of the file system, so an IO can
> > > choose to skip buffering at the file-system layer, however it will still
> > > be served by the VM cache irrespective of whatever the VM object maps to.
> > > Is that true? I believe this is what is meant by 'caching' at the VM layer.
> > First, the term page cache has a different meaning in the kernel code,
> > and that page cache was removed from the kernel very recently.
> > A more correct but much longer term is 'page queue of the vm object'. If a
> > given vnode has a vm object associated with it, then the buffer cache
> > ensures that buffers for the given chunk of the vnode data range are
> > created from appropriately indexed pages from the queue. This way, the
> > buffer cache becomes consistent with the page queue.
> > The vm object is created on the first vnode open by filesystem-specific
> > code, at least for UFS/devfs/msdosfs/cd9660 etc.
> I understand the page cache as a cache implemented by the file system to
> speed up IO access (at least this is how Linux defines it). Does it have a
> different meaning in FreeBSD?
I explicitly answered this question in advance, above.
>
> So a vnode is a VFS entity, right? And I presume a VM object is any object
> from the perspective of the virtual memory subsystem.
No, a vm object is a struct vm_object.
> Since we no longer run in real mode, isn't every vnode actually supposed to
> have an entity in the VM subsystem? Maybe I am not understanding what 'page
> cache' means in FreeBSD?
At this point, I am not able to add any more information. Unless you read
the code, further explanations would not be of much use.
>
> I mean every vnode in the VFS layer must have a backing VM object right?
No.
> Maybe only mmap(2)'ed device nodes don't have a backing VM object, or do
> they?
Device vnodes do have backing VM object, but they cannot be mapped.
> If this assumption is correct, then I cannot get my mind around what you
> mentioned regarding buffer caches coming into play for vnodes *only* if
> they have a backing vm object.
I never said this.
>
> >
> > Caching policy for buffers is determined both by the buffer cache and by
> > (somewhat strong) hints from the filesystems interfacing with the cache.
> > The pages constituting the buffer are wired, i.e. the VM subsystem is
> > informed by the buffer cache not to reclaim pages while the buffer is alive.
> >
> > VM page caching, i.e. storing pages in the vnode page queue, is only
> > independent from the buffer cache when VM needs to/can handle something
> > that does not involve the buffer cache. E.g. on a page fault in a
> > region backed by a file, VM allocates the necessary fresh (AKA without
> > valid content) pages and issues a read request into the filesystem which
> > owns the vnode. It is up to the filesystem to implement read in any
> > reasonable way.
> >
> > Until recently, UFS and other local filesystems provided raw disk block
> > indexes for the generic VM code, which then read content from the disk
> > blocks into the pages. This has its own share of problems (but not
> > the consistency issue, since pages are allocated in the vnode vm
> > object page queue). I changed that path to go through the buffer cache
> > explicitly several months ago.
> >
> > But all this is highly dependent on the filesystem. As the polar case,
> > tmpfs reuses the swap-backed object, which holds the file data, as the
> > vnode's vm object. The result is that paging requests for the tmpfs
> > mapped file are handled as if it were swap-backed anonymous memory.
> >
> > ZFS cannot reuse the vm object page queue for its very special cache, the
> > ARC. So it keeps the consistency between writes and mmaps by copying the
> > data on write(2) both into the ARC buffer and into the pages of the vm object.
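To illustrate the page-fault path described above, here is a minimal userland
sketch that maps a regular file and touches it, which makes VM allocate pages
on the vnode's vm object page queue and ask the owning filesystem to fill
them. The file path is only an example:

#include <sys/mman.h>
#include <sys/stat.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        const char *path = "/etc/motd"; /* example regular file, an assumption */
        struct stat st;
        char *p;
        int fd;

        if ((fd = open(path, O_RDONLY)) == -1)
                err(1, "open %s", path);
        if (fstat(fd, &st) == -1)
                err(1, "fstat");
        if (st.st_size == 0)
                errx(1, "%s is empty", path);
        p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                err(1, "mmap");
        /*
         * The first access faults: VM allocates fresh pages in the vnode's
         * vm object page queue and issues a read into the filesystem that
         * owns the vnode to fill them.
         */
        printf("first byte of %s: 0x%02x\n", path, (unsigned char)p[0]);
        munmap(p, (size_t)st.st_size);
        close(fd);
        return (0);
}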
> Well this is rather confusing (sorry again), maybe too much detail for a
> noob like myself to appreciate at this stage of my journey.
>
> Nevertheless, to summarize this, raw disk block access bypasses the
> buffer cache (as you had so painstakingly explained about read(2) and
> write(2) above) but is still cached by the VFS in the page queue. However,
> this is also at the sole discretion of the file-system, right?
>
> To summarize, the page cache (or rather the page queue for a given vnode)
> and the buffer cache are in fact separate entities, although they are very
> tightly coupled for the most part, except in cases like what you mentioned
> (about file-backed data).
>
> So if we think of these as vertical layers, how would they look? From what
> you say about page faults, it appears that the VM subsystem is placed above
> or perhaps adjacent to the VFS layer. Is that correct? Also, about these
> caches, the buffer cache is a global cache available to both the VM
> subsystem and the VFS layer, whereas the page queue for the vnode is the
> responsibility of the underlying file-system. Is that true?
>
> > Hope this helps.
> Of course this has helped. Although it has raised a lot more questions, as
> you can see, at least it has got me thinking in (hopefully) the right
> direction. Once again, a very big thank you!!
>
> Best Regards,
> Aijaz Baig