Understanding the rationale behind dropping of "block devices"
Julian Elischer
julian at freebsd.org
Fri Jan 27 02:25:43 UTC 2017
On 16/1/17 7:00 pm, Konstantin Belousov wrote:
> On Mon, Jan 16, 2017 at 05:20:25PM +0800, Julian Elischer wrote:
>> On 16/01/2017 4:49 PM, Aijaz Baig wrote:
>>> Oh yes I was actually running an old release inside a VM and yes I had
>>> changed the device names myself while jotting down notes (to give it a more
>>> descriptive name like what OS X does). So now I've checked it on a
>>> recent release and yes there is indeed no block device.
>>>
>>> root@bsd-client:/dev # gpart show
>>> =>        34  83886013  da0  GPT  (40G)
>>>           34      1024    1  freebsd-boot  (512K)
>>>         1058  58719232    2  freebsd-ufs  (28G)
>>>     58720290   3145728    3  freebsd-swap  (1.5G)
>>>     61866018  22020029       - free -  (10G)
>>>
>>> root@bsd-client:/dev # ls -lrt da*
>>> crw-r----- 1 root operator 0x4d Dec 19 17:49 da0p1
>>> crw-r----- 1 root operator 0x4b Dec 19 17:49 da0
>>> crw-r----- 1 root operator 0x4f Dec 19 23:19 da0p3
>>> crw-r----- 1 root operator 0x4e Dec 19 23:19 da0p2
>>>
>>> So this shows that I have a single SATA or SAS drive and there are
>>> apparently 3 partitions (or is it four?? Why does it show unused space
>>> when I had used the entire disk?)
>>>
>>> Nevertheless my question still holds. What does 'removing support for
>>> block devices' mean in this context? Was my earlier understanding
>>> correct, viz. that all disk devices now have a character (or raw)
>>> interface and are no longer served via the "page cache" but rather the
>>> "buffer cache"? Does that mean all disk accesses are now direct,
>>> bypassing the file system??
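(As an aside: the gpart output above does add up, and the "- free -" line
is unpartitioned space, not a fourth partition. The arithmetic, in
512-byte sectors:

    34 + 1024 + 58719232 + 3145728  = 61866018  (first sector after swap)
    61866018 + 22020029             = 83886047  = 34 + 83886013
    22020029 sectors * 512 bytes    ~= 10.5 GB of unallocated space

so there are three partitions, and presumably whatever created the layout
was told to leave ~10G of the 40G disk unused.)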
>> Basically, FreeBSD never really buffered/cached by device.
>>
>> Buffering and caching are done by vnode in the filesystem.
>> We have no device-based block cache. If you want file X at offset Y,
>> then we can satisfy that from cache.
>> VM objects map closely to vnode objects, so the VM system IS the file
>> system buffer cache.
> This is not true.
>
> We do have a buffer cache for blocks read through the device (special)
> vnode. This is how the metadata of filesystems that are clients of the
> buffer cache (i.e. UFS, msdosfs, cd9660, etc.) is typically handled.
> It is up to the filesystem to not create aliased cached copies of the
> blocks both in the device vnode buffer list and in the filesystem vnode.
>
> In fact, sometimes filesystems, e.g. UFS, consciously break this rule
> and read blocks of the user vnode through the disk cache. For instance,
> this happens for the soft updates (SU) truncation of the indirect blocks.
yes, this caches blocks by their offset into a device, but it is still
really part of the system that provides caching services to vnodes.
(at least that is how it was the last time I looked)
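To make that concrete, here is a rough sketch of the pattern being
described, using the in-kernel bread(9) interface. This is illustrative
only (the function is made up, not lifted from the UFS sources): the
point is that the cache identity is (device vnode, block number), which
is also where the aliasing hazard Kostik mentions comes from.

    /* Hypothetical sketch: read one filesystem metadata block through
     * the device vnode, so it is cached hanging off devvp rather than
     * off any regular file's vnode. */
    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/buf.h>
    #include <sys/ucred.h>
    #include <sys/vnode.h>

    static int
    read_meta_block(struct vnode *devvp, daddr_t blkno, int size)
    {
            struct buf *bp;
            int error;

            /* bread(9) returns the cached buffer if the block is
             * already resident; otherwise it reads from the device. */
            error = bread(devvp, blkno, size, NOCRED, &bp);
            if (error != 0)
                    return (error);
            /* ... inspect or modify bp->b_data here ... */
            brelse(bp);             /* release; block stays cached */
            return (0);
    }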
>
>> If you want device M at offset N, we will fetch it for you from the
>> device, DMA'd directly into your address space, but there is no
>> cached copy.
>> Having said that, it would be trivial to add a 'caching' geom layer to
>> the system, but that has never been needed.
> The useful interpretation of the claim that FreeBSD does not cache
> disk blocks is that the cache is not accessible over the user-initiated
> i/o (read(2) and write(2)) through the opened devfs nodes. If a program
> issues such a request, it indeed goes directly to/from the disk driver,
> which is supplied a kernel buffer formed by remapped user pages. Note
> that if the device was or is mounted and the filesystem kept some
> metadata in the buffer cache, then the devfs i/o would make the cache
> inconsistent.
>
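To illustrate, a minimal user-level sketch of such a direct read (the
device name /dev/da0 is an assumption; substitute your disk and run it
as root). Note the node is a character device, and the transfer must be
in sector-sized, sector-aligned units:

    /* Sketch: read one sector straight from a disk's character device.
     * The i/o goes directly to the driver via remapped user pages;
     * no copy is left in any system-wide block cache. */
    #include <sys/types.h>
    #include <sys/disk.h>           /* DIOCGSECTORSIZE */
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
            struct stat sb;
            u_int secsize;
            char *buf;
            int fd;

            if ((fd = open("/dev/da0", O_RDONLY)) == -1)
                    err(1, "open");
            if (fstat(fd, &sb) == -1)
                    err(1, "fstat");
            printf("S_ISCHR %d, S_ISBLK %d\n",      /* expect 1, 0 */
                S_ISCHR(sb.st_mode), S_ISBLK(sb.st_mode));
            if (ioctl(fd, DIOCGSECTORSIZE, &secsize) == -1)
                    err(1, "DIOCGSECTORSIZE");
            if ((buf = aligned_alloc(secsize, secsize)) == NULL)
                    err(1, "aligned_alloc");
            if (pread(fd, buf, secsize, 0) != (ssize_t)secsize)
                    err(1, "pread");
            printf("read %u bytes from sector 0\n", secsize);
            free(buf);
            close(fd);
            return (0);
    }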
>> The added complexity of carrying around two alternate interfaces to
>> the same devices was judged by those who did the work to be not worth
>> the small gain available to the very few people who used block devices.
>> Interestingly, since that time ZFS has implemented a block-layer cache
>> for itself (the ARC), which is of course not integrated with the
>> non-existent block-level cache in the system :-).
> We do carry two interfaces in the cdev drivers, which are lumped into
> one. In particular, it is not easy to implement mapping of the block
> devices exactly because the interfaces are mixed. If a cdev disk device
> is mapped, the VM would try to use the cdevsw d_mmap or the later
> mapping interfaces to handle user page faults, which is incorrect for
> the purpose of disk block mapping.
>
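A quick way to see the consequence of that mixing from userland is to
try to map the disk device; a sketch (again /dev/da0 is an assumption,
and the exact failure mode depends on the driver and release):

    /* Sketch: attempt to mmap(2) a disk character device.  On FreeBSD
     * this is expected to fail, for the reason given above: there is
     * no sane d_mmap/pager path for disk blocks behind the cdev. */
    #include <sys/mman.h>
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            void *p;
            int fd;

            if ((fd = open("/dev/da0", O_RDONLY)) == -1)
                    err(1, "open");
            p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    warn("mmap of disk cdev failed (expected)");
            else
                    printf("mapped at %p\n", p);
            close(fd);
            return (0);
    }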