Understanding the rationale behind dropping of "block devices"
Aijaz Baig
aijazbaig1 at gmail.com
Tue Jan 17 11:45:42 UTC 2017
First of all, a very big thank you to Konstantin for such a detailed reply!
It has been a pleasure to go through. My replies are inline.
On Mon, Jan 16, 2017 at 4:30 PM, Konstantin Belousov <kostikbel at gmail.com>
wrote:
> This is not true.
>
> We do have buffer cache of the blocks read through the device (special)
> vnode. This is how, typically, the metadata for filesystems which are
> clients of the buffer cache is handled, e.g. UFS, msdosfs, cd9660, etc.
> It is up to the filesystem to not create aliased cached copies of the
> blocks both in the device vnode buffer list and in the filesystem vnode.
>
> In fact, sometimes filesystems, e.g. UFS, consciously break this rule
> and read blocks of the user vnode through the disk cache. For instance,
> this happens for the SU truncation of the indirect blocks.
>
This makes a lot more sense now. This basically means that no matter what
the underlying entity (for the vnode) is, be it a device special file or a
remote location, it always has to go through the VFS layer. So as you
clearly said, the file-system has sole discretion over what to do with them.
And as an example, UFS consciously breaks this rule by reading blocks of a
user vnode through the disk (device vnode) cache.
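Just to check my own understanding, here is a rough sketch (entirely my own
and untested, with placeholder names like devvp, blkno and bsize) of what I
imagine a filesystem doing when it pulls a metadata block through the device
vnode's buffer cache:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/bio.h>
    #include <sys/buf.h>
    #include <sys/ucred.h>
    #include <sys/vnode.h>

    /*
     * Hypothetical helper, not real UFS code: read one metadata block of
     * 'bsize' bytes at disk block 'blkno' through the device (special)
     * vnode 'devvp', i.e. through the buffer cache attached to that vnode.
     * Error handling is simplified for the sketch.
     */
    static int
    read_meta_block(struct vnode *devvp, daddr_t blkno, int bsize)
    {
            struct buf *bp;
            int error;

            error = bread(devvp, blkno, bsize, NOCRED, &bp);
            if (error != 0)
                    return (error);
            /* bp->b_data now holds the cached copy of the on-disk block. */
            /* ... interpret it as filesystem metadata here ... */
            brelse(bp);
            return (0);
    }

Is that roughly the path you were describing, i.e. the cached copy lives on
the buffer list of the device vnode rather than of the file vnode?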
> > If you want device M, at offset N we will fetch it for you from the
> > device, DMA'd directly into your address space,
> > but there is no cached copy.
> > Having said that, it would be trivial to add a 'caching' geom layer to
> > the system but that has never been needed.
> The useful interpretation of the claim that FreeBSD does not cache
> disk blocks is that the cache is not accessible over the user-initiated
> i/o (read(2) and write(2)) through the opened devfs nodes. If a program
> issues such request, it indeed goes directly to/from disk driver, which
> is supplied a kernel buffer formed by remapped user pages.
So basically read(2) and write(2) calls on device nodes bypass the VFS
buffer cache as well, and the I/O indeed goes directly through kernel
memory pages (which, as you said, are in fact remapped user-land pages).
So does that mean only the file-system code now uses the disk buffer
cache?
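For instance, if I am reading you right, a trivial userland program like the
one below (entirely my own sketch, untested; /dev/ada0 is just a placeholder
device name) would have its read served by physio directly, with no cached
copy left behind:

    /*
     * Minimal sketch: raw read from a device node, which as I understand it
     * bypasses the buffer cache and goes through physio.
     */
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
            void *buf;
            int fd;
            ssize_t n;

            /* Device I/O must be sector-sized and sector-aligned. */
            if (posix_memalign(&buf, 512, 512) != 0)
                    err(1, "posix_memalign");
            fd = open("/dev/ada0", O_RDONLY);
            if (fd == -1)
                    err(1, "open");
            n = read(fd, buf, 512);  /* served by physio, not the buffer cache */
            if (n == -1)
                    err(1, "read");
            printf("read %zd bytes directly from the device\n", n);
            close(fd);
            free(buf);
            return (0);
    }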
> Note that if this device was or is mounted and filesystem kept some
> metadata in the buffer cache, then the devfs i/o would make the
> cache inconsistent.
>
You mean when the device is mounted as a file-system? Could you please
elaborate?
> > The added complexity of carrying around two alternate interfaces to
> > the same devices was judged by those who did the work to be not worth
> > the small gain available to the very few people who used raw devices.
> > Interestingly, since that time ZFS has implemented a block-layer cache
> > for itself which is of course not integrated with the non-existing
> > block level cache in the system :-).
> We do carry two interfaces in the cdev drivers, which are lumped into
> one. In particular, it is not easy to implement mapping of the block
> devices exactly because the interfaces are mixed.
By "mapping" of the block devices, you mean serving the IO intended for the
said disk blocks right? So as you said, we can either serve the IO via the
VFS
directly using buffer cache or we could do that via the file system cache
> If cdev disk device is mapped, VM would try to use cdevsw's d_mmap
> or later mapping interfaces to handle user page faults, which is incorrect
> for the purpose of the disk block mapping.
Could you please elaborate?
> > I must add that I am getting confused specifically between two different
> > things here:
> > From the replies above it appears that all disk accesses have to go through
> > the VM subsystem now (so no raw disk accesses) however the arch handbook
> > says raw interfaces are the way to go for disks
> > (https://www.freebsd.org/doc/en/books/arch-handbook/driverbasics-block.html)?
> Do not mix the concept of raw disk access and using some VM code to
> implement this access. See my other reply for some more explanation of
> the raw disk access, physio in the kernel source files terminology,
> sys/kern/kern_physio.c.
>
Yes, I have taken note of your earlier replies (thank you for being so
thorough), and as I reiterated earlier, I now understand that raw disk
access goes directly between kernel memory and the underlying device. So as
you mentioned, the file-system code (and perhaps only a few other entities)
uses the disk buffer cache (which, as I gather, is built on top of VM
pages). So an end user cannot interact with the buffer cache in any way; is
that right?
> > Secondly, I presume that the VM subsystem has its own caching and
> > buffering mechanism that is independent of the file system, so an IO can
> > choose to skip buffering at the file-system layer; however it will still be
> > served by the VM cache irrespective of whatever the VM object maps to. Is
> > that true? I believe this is what is meant by 'caching' at the VM layer.
> First, the term page cache has a different meaning in the kernel code,
> and that page cache was removed from the kernel very recently.
> More correct but much longer term is 'page queue of the vm object'. If a
> given vnode has a vm object associated with it, then buffer cache ensures
> that buffers for the given chunk of the vnode data range are created from
> appropriately indexed pages from the queue. This way, buffer cache becomes
> consistent with the page queue.
> The vm object is created on the first vnode open by filesystem-specific
> code, at least for UFS/devfs/msdosfs/cd9660 etc.
I understand the page cache as a cache implemented by the file system to
speed up IO access (at least this is how Linux defines it). Does it have a
different meaning in FreeBSD?
So a vnode is a VFS entity, right? And I presume a VM object is any object
from the perspective of the virtual memory subsystem. Since we no longer run
in real mode, isn't every vnode actually supposed to have an entity in the
VM subsystem? Maybe I am not understanding what 'page cache' means in
FreeBSD?
I mean, every vnode in the VFS layer must have a backing VM object, right?
Maybe only mmap(2)'ed device nodes don't have a backing VM object, or do
they?
If this assumption is correct, then I cannot get my mind around what you
mentioned regarding buffer caches coming into play for a vnode *only* if it
has a backing vm object.
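To at least picture the mmap side of this, I tried to sketch it for myself
with a tiny userland program (my own, untested; the file path is a
placeholder and the file is assumed to be non-empty):

    /*
     * Minimal sketch: map a regular file and touch it. The page fault on
     * first access is, as I understand it, what drives the kernel to fill
     * pages of the vnode's vm object through the filesystem's getpages path.
     */
    #include <sys/mman.h>
    #include <sys/stat.h>

    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            struct stat st;
            char *p;
            int fd;

            fd = open("/tmp/example.dat", O_RDONLY);
            if (fd == -1)
                    err(1, "open");
            if (fstat(fd, &st) == -1)
                    err(1, "fstat");
            p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    err(1, "mmap");
            /* First access faults a page into the vnode's vm object page queue. */
            printf("first byte: 0x%02x\n", (unsigned char)p[0]);
            munmap(p, (size_t)st.st_size);
            close(fd);
            return (0);
    }

If I follow you, the fault on p[0] is the point where VM issues the read
request into the filesystem which owns the vnode.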
>
> Caching policy for buffers is determined both by buffer cache and by
> (somewhat strong) hints from the filesystems interfacing with the cache.
> The pages constituting the buffer are wired, i.e. VM subsystem is informed
> by buffer cache to not reclaim pages while the buffer is alive.
>
> VM page caching, i.e. storing pages in the vnode page queue, is only
> independent from the buffer cache when VM needs to/can handle something
> that does not involve the buffer cache. E.g. on a page fault in the
> region backed by the file, VM allocates necessary fresh (AKA without
> valid content) pages and issues read request into the filesystem which
> owns the vnode. It is up to the filesystem to implement read in any
> reasonable way.
>
> Until recently, UFS and other local filesystems provided raw disk block
> indexes for the generic VM code, which then read content from the disk
> blocks into the pages. This has its own share of problems (but not
> the consistency issue, since pages are allocated in the vnode vm
> object page queue). I changed that path to go through the buffer cache
> explicitly several months ago.
>
> But all this is highly dependent on the filesystem. As the polar case,
> tmpfs reuses the swap-backed object, which holds the file data, as the
> vnode's vm object. The result is that paging requests from a tmpfs
> mapped file are handled as if it were swap-backed anonymous memory.
>
> ZFS cannot reuse vm object page queue for its very special cache ARC.
> So it keeps the consistency between writes and mmaps by copying the
> data on write(2) both into ARC buffer, and into the pages from vm object.
Well, this is rather confusing (sorry again); maybe too much detail for a
noob like myself to appreciate at this stage of my journey.
Nevertheless, to summarize this: raw disk block access bypasses the
buffer cache (as you had so painstakingly explained about read(2) and
write(2) above) but is still cached by the VFS in the page queue. However,
this is also at the sole discretion of the file-system, right?
To summarize, the page cache (or rather the page queue for a given vnode)
and the buffer cache are in fact separate entities, although they are very
tightly coupled for the most part, except in cases like the one you
mentioned (about file-backed data).
So if we think of these as vertical layers, how would they look? From what
you say about page faults, it appears that the VM subsystem is placed above,
or perhaps adjacent to, the VFS layer. Is that correct? Also, about these
caches: the buffer cache is a global cache available to both the VM
subsystem and the VFS layer, whereas the page queue for the vnode is the
responsibility of the underlying file-system. Is that true?
> Hope this helps.
Of course this has helped. Although it has raised a lot more questions, as
you can see, at least it has got me thinking in (hopefully) the right
direction. Once again, a very big thank you!!
Best Regards,
Aijaz Baig