Understanding the rationale behind dropping of "block devices"
Konstantin Belousov
kostikbel at gmail.com
Mon Jan 16 11:15:43 UTC 2017
On Mon, Jan 16, 2017 at 04:19:59PM +0530, Aijaz Baig wrote:
> I must add that I am getting confused specifically between two different
> things here:
> From the replies above it appears that all disk accesses have to go through
> the VM subsystem now (so no raw disk accesses) however the arch handbook
> says raw interfaces are the way to go for disks (
> https://www.freebsd.org/doc/en/books/arch-handbook/driverbasics-block.html)?
Do not conflate the concept of raw disk access with the use of some VM
code to implement that access. See my other reply for a more detailed
explanation of raw disk access, which the kernel sources call physio;
see sys/kern/kern_physio.c.
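For illustration, a minimal userland sketch of that raw access path:
reading one sector straight from the disk's character device, which
ends up in physio() in the kernel. The device name /dev/ada0 is an
assumption for the example; raw transfers must be whole, sector-sized
and sector-aligned units.

#include <sys/types.h>
#include <sys/disk.h>           /* DIOCGSECTORSIZE */
#include <sys/ioctl.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
        const char *dev = "/dev/ada0";  /* assumption: adjust as needed */
        u_int secsize;
        char *buf;
        ssize_t n;
        int fd;

        fd = open(dev, O_RDONLY);
        if (fd == -1)
                err(1, "open %s", dev);
        if (ioctl(fd, DIOCGSECTORSIZE, &secsize) == -1)
                err(1, "DIOCGSECTORSIZE");

        /* Raw I/O is done in whole, sector-aligned units, uncached. */
        buf = aligned_alloc(secsize, secsize);
        if (buf == NULL)
                err(1, "aligned_alloc");
        n = pread(fd, buf, secsize, 0); /* sector 0, served by physio */
        if (n == -1)
                err(1, "pread");
        printf("read %zd bytes from %s\n", n, dev);

        free(buf);
        close(fd);
        return (0);
}

Note that data read this way does not pass through the buffer cache at
all, and opening the disk device node normally requires root.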
>
> Secondly, I presume that the VM subsystem has its own caching and
> buffering mechanism that is independent of the file system, so an IO can
> choose to skip buffering at the file-system layer; however it will still be
> served by the VM cache irrespective of whatever the VM object maps to. Is
> that true? I believe this is what is meant by 'caching' at the VM layer.
First, the term 'page cache' has a different meaning in the kernel code,
and that page cache was removed from the kernel very recently. A more
correct, but much longer, term is 'the page queue of the vm object'. If
a given vnode has a vm object associated with it, the buffer cache
ensures that buffers for a given chunk of the vnode's data range are
built from the appropriately indexed pages of that queue. This way the
buffer cache stays consistent with the page queue.
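A small userland demonstration of that consistency (the file name
/tmp/consistency-demo is only an assumption for the example): data
written with write(2), i.e. through the buffer cache, is immediately
visible through an existing shared mapping of the same file, because
both are backed by the same page queue.

#include <sys/mman.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        const char *path = "/tmp/consistency-demo";
        char *map;
        int fd;

        fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd == -1)
                err(1, "open");
        if (ftruncate(fd, 4096) == -1)
                err(1, "ftruncate");

        map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                err(1, "mmap");

        /* Write through the buffer cache ... */
        if (pwrite(fd, "hello", 5, 0) != 5)
                err(1, "pwrite");

        /* ... and the mapping sees it at once, no msync(2) needed. */
        printf("via mmap: %.5s\n", map);

        munmap(map, 4096);
        close(fd);
        unlink(path);
        return (0);
}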
The vm object is created on the first open of the vnode by
filesystem-specific code, at least for UFS/devfs/msdosfs/cd9660 etc.
The caching policy for buffers is determined both by the buffer cache
and by (somewhat strong) hints from the filesystems interfacing with
the cache. The pages constituting a buffer are wired, i.e. the buffer
cache tells the VM subsystem not to reclaim the pages while the buffer
is alive.
VM page caching, i.e. storing pages in the vnode's page queue, is only
independent of the buffer cache when the VM needs to, and can, handle
something that does not involve the buffer cache. E.g. on a page fault
in a region backed by a file, the VM allocates the necessary fresh
pages (i.e. pages without valid content) and issues a read request to
the filesystem which owns the vnode. It is up to the filesystem to
implement that read in any reasonable way.
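Seen from userland, that fault-driven path looks like the sketch below
(the file name is an assumption; the file must already exist and be at
least one page long). Touching the mapping makes the VM allocate a
fresh page and ask the filesystem to fill it; mincore(2) reports the
page's residency before and after the touch, so on a file that is not
already cached the first line prints 0 and the second prints 1.

#include <sys/mman.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        const char *path = "/tmp/fault-demo";   /* assumption */
        volatile char c;
        char *map;
        char vec;
        long pagesize;
        int fd;

        pagesize = sysconf(_SC_PAGESIZE);
        fd = open(path, O_RDONLY);
        if (fd == -1)
                err(1, "open %s", path);

        map = mmap(NULL, pagesize, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                err(1, "mmap");

        if (mincore(map, pagesize, &vec) == -1)
                err(1, "mincore");
        printf("resident before touch: %d\n", (vec & MINCORE_INCORE) != 0);

        c = map[0];             /* page fault: the VM issues the read */
        (void)c;

        if (mincore(map, pagesize, &vec) == -1)
                err(1, "mincore");
        printf("resident after touch:  %d\n", (vec & MINCORE_INCORE) != 0);

        munmap(map, pagesize);
        close(fd);
        return (0);
}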
Until recently, UFS and the other local filesystems provided raw disk
block indexes to the generic VM code, which then read the content of
those disk blocks into the pages. This had its own share of problems
(though not the consistency issue, since the pages are allocated in the
page queue of the vnode's vm object). I changed that path to go through
the buffer cache explicitly several months ago.
But all of this is highly dependent on the filesystem. As the polar
case, tmpfs reuses the swap-backed object which holds the file data as
the vnode's vm object. The result is that paging requests for a mapped
tmpfs file are handled as if it were swap-backed anonymous memory.
ZFS cannot reuse the vm object page queue for its own specialized
cache, the ARC. So it keeps writes and mmaps consistent by copying the
data on write(2) both into the ARC buffer and into the pages of the vm
object.
Hope this helps.