MAXBSIZE increase
Bruce Evans
brde at optusnet.com.au
Sat Mar 28 02:45:07 UTC 2015
> Experimenting with NFS and ZFS I found an inter-operation issue: ZFS by
> default uses blocks of 128KB, while FreeBSD NFS (both client and server)
> is limited to 64KB requests by the value of MAXBSIZE. On file rewrite
> that limitation makes ZFS do slow read-modify-write cycles for every
> write operation, instead of just writing the new data. A trivial iozone
> test shows a major difference between initial write and rewrite speeds
> because of this issue.
>
> Looking through the sources I've found, and in r280347 fixed, a number
> of improper MAXBSIZE use cases in device drivers. After that I see no
> reason why MAXBSIZE can not be increased to at least 128KB to match the
> ZFS default (ZFS now supports blocks up to 1MB, but that is not the
> default and is so far rare). I've made a test build and also successfully
> created a UFS file system with a 128KB block size -- not sure it is
> needed, but it seems to survive this change well too.
>
> Is there anything I am missing, or is it safe to raise this limit now?
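To put rough numbers on the slowdown described above (a sketch only; it
assumes the record is not cached and ignores compression and other zfs
details):
X #include <stdio.h>
X
X int
X main(void)
X {
X 	long recordsize = 128 * 1024;	/* zfs default record size */
X 	long iosize = 64 * 1024;	/* nfs i/o size, capped by MAXBSIZE */
X
X 	/*
X 	 * A partial-record overwrite makes zfs read the whole record,
X 	 * patch in the new bytes and write the whole record back, so
X 	 * each 64K rewrite moves about 4 times as much data as it
X 	 * writes.
X 	 */
X 	printf("%ld bytes of fs i/o per %ld-byte rewrite\n",
X 	    2 * recordsize, iosize);
X 	return (0);
X }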
I see the following minor problems:
- static and dynamic allocation of MAXBSIZE bytes would be more wasteful
than before.
- boot blocks used to do static allocation of MAXBSIZE bytes. Now they
just do ffs's sanity check that the block size is less than that.
A block size larger than this is not necessarily invalid, but just
unsupported by the running instance of the buffer cache layer (so
unsupported by the running instance of ffs too). Another or the
same OS may have a larger MAXBSIZE, and the user may have broken
portability by actually using this to create a file system that
cannot be read by OS's with the historical MAXBSIZE. This check is
bogus for boot blocks, since they don't use the buffer cache layer.
ufsread.c uses a sort of anti-buffer-cache to avoid problems but
give extreme slowness. It uses a virtual block size of 4K and does
i/o 4K at a time with no caching. The buffer must not cross a 64K
boundary on x86, and the MI code states this requirement for all
arches. In i386/boot2, dmadat is 64K-aligned so the virtual buffer
size could be up to 64K, except dmadat is used for 3 other buffers
and only 4K is used for the data buffer. The data structure for this
is:
X /* Buffers that must not span a 64k boundary. */
X struct dmadat {
X char blkbuf[VBLKSIZE]; /* filesystem blocks */
X char indbuf[VBLKSIZE]; /* indir blocks */
X char sbbuf[SBLOCKSIZE]; /* superblock */
X char secbuf[DEV_BSIZE]; /* for MBR/disklabel */
X };
X static struct dmadat *dmadat;
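Roughly, the anti-buffer-cache amounts to the following sketch (not the
actual ufsread.c code; dskread() stands in for the boot block's low-level
512-block read routine):
X #include <sys/types.h>
X #include <string.h>
X
X /* VBLKSIZE (4K) and dmadat come from the declarations quoted above. */
X extern int dskread(void *buf, unsigned lba, unsigned nblk);
X
X /*
X  * Read nbytes starting at 512-block lba, one VBLKSIZE piece at a time
X  * through dmadat->blkbuf, with nothing cached between calls.
X  */
X static ssize_t
X vread(unsigned lba, char *dst, size_t nbytes)
X {
X 	size_t done, n;
X
X 	for (done = 0; done < nbytes; done += n) {
X 		n = nbytes - done < VBLKSIZE ? nbytes - done : VBLKSIZE;
X 		/* One small transfer per iteration; no read-ahead. */
X 		if (dskread(dmadat->blkbuf, lba + (done >> 9),
X 		    VBLKSIZE >> 9) != 0)
X 			return (-1);
X 		memcpy(dst + done, dmadat->blkbuf, n);
X 	}
X 	return ((ssize_t)done);
X }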
I don't like the FreeBSD boot code, and use my version of biosboot
if possible. I expanded its buffers and improved its caching a year
or 2 ago. Old versions have 2 buffers of size MAXBSIZE in commented-
out code since this doesn't work, especially when written in C. The
commented-out code also sets a size of 4K for one of these buffers.
This last worked, for the default ffs block size only, in about 1990
(this code is from Mach). The old code actually uses 3 buffers of
size 8K, corresponding to 3 of the 4 buffers in dmadat. This broke
about 15 years ago when the default and normal ffs block size was
increased to 16K. I fixed this by allocating all of the buffers in
asm. From start.S:
X ENTRY(disklabel)
X . = EXT(boot1) + 0x200 + 0*276 + 1*0x200
X
X .globl EXT(buf)
X .set EXT(buf), EXT(boot1) + 0x20000
X .globl EXT(iobuf)
X .set EXT(iobuf), EXT(boot1) + 0x30000
X .globl EXT(mapbuf)
X .set EXT(mapbuf), EXT(boot1) + 0x40000
boot1 is loaded at a 64K boundary and overlaid with boot2, the same
as in the -current boot2. The above bits off 64K pieces of the
heap for all large data structures. -current's boot2 only reserves (64K?)
for dmadat instead, using hackish C code.
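For reference, the 64K rule that these layouts work around can be written
as a one-line check (a generic sketch, not code from boot2 or from my
biosboot):
X #include <stdbool.h>
X #include <stdint.h>
X
X /*
X  * True if [addr, addr + size) crosses a 64K physical boundary and so
X  * cannot be covered by one BIOS/ISA-DMA style transfer.  Starting each
X  * buffer on a 64K boundary and keeping it <= 64K, as the start.S
X  * layout above does, makes this trivially false.
X  */
X static bool
X crosses_64k(uint32_t addr, uint32_t size)
X {
X 	return (size != 0 && (uint64_t)(addr & 0xffff) + size > 0x10000);
X }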
Then I improved the caching. biosboot was using my old code which
did caching mainly for floppies, since old systems were too slow to
keep up with reading even floppies 1 512-block at a time. It used
a read-ahead buffer of size 18*512 = 9K to optimize for floppies up
to size 1440K. This worked OK for old hard disks with the old default
ffs block size of 8K too. But it gives much the same anti-caching
as -current's virtual 4K buffers when the ffs block size is 16K or
larger. I didn't expand the cache to a large one on the heap, but
just changed it to 32*512 = 16K to work well with my default ffs
block size of 16K (32K is pessimal for my old disk), and fixed some
alignment problems (the old code attempts to align on track boundaries
but tracks don't exist for modern hard disks, and the alignment needs
to be to ffs block boundaries else 16K-blocks would be split every
time in the 16K "cache").
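The alignment part of that fix is just rounding each fill of the 16K
"cache" down to an fs block boundary instead of a track boundary, roughly
as in this sketch (made-up names, not the actual biosboot code):
X #include <stdint.h>
X
X #define RA_BLKS	32		/* read-ahead size in 512-blocks (16K) */
X
X /*
X  * Pick the first 512-block of the next read-ahead fill so that whole
X  * fs blocks are never split across two fills.  fs_bsize_db is the fs
X  * block size in 512-blocks (32 for 16K blocks).
X  */
X static uint64_t
X ra_start(uint64_t want_blk, unsigned fs_bsize_db)
X {
X 	return (want_blk - want_blk % fs_bsize_db);
X }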
Summary for the boot blocks: they seem to be unaffected by increasing
MAXBSIZE, but their anti-cache works even better at fragmenting larger
blocks.
- the buffer cache is still optimized for i386 with BKVASIZE = 8K. 64-bit
systems don't need the complications and pessimizations to fit in
i386's limited kva, but have them anyway. When the default ffs block
size was doubled to 16K, BKVASIZE was doubled to match, but the tuning
wasn't doubled to match. This reduced the effective number of buffers
by a factor of 2. This pessimization was mostly unnoticeable, since
memory sizes grew by more than a factor of 2 and nbuf grew by about
a factor of 2. But increasing (nbuf*BKVASIZE) much more isn't possible
on i386 since it reaches a kva limit. Then when ffs's default block
size was doubled to 32K, BKVASIZE wasn't even doubled to match. If
anyone actually uses the doubled MAXBSIZE, then BKVASIZE will be mistuned
by another factor of 2. They probably shouldn't do that. A block size
of 64K already works poorly in ffs. Relative to a block size of 32K,
it mainly doubles the size for metadata i/o without making much
difference for data i/o, since data i/o is clustered. An fs block size
equal to MAXPHYS also makes clustering useless, by limiting the maximum
number of blocks per cluster to 1. That is better than the ratio of
4/32 in ufsread and 9/{8,16,32} in old biosboot, but still silly. Large
fs block sizes (where "large" is about 2K) are only good when clustering
doesn't work and the disk doesn't like small blocks. This may be the
case for ffs on newer hard disks. Metadata is not clustered for ffs.
My old hard disks like any size larger than 16K, but my not so old hard
disks prefer 32K or above.
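To put the BKVASIZE mistuning in concrete terms (illustrative numbers, not
the actual autotuning):
X #include <stdio.h>
X
X int
X main(void)
X {
X 	long buf_kva = 200L << 20;	/* pretend kva budget for buffers */
X 	long bkva_tuned = 8 << 10;	/* BKVASIZE that the tuning assumed */
X 	long bkva_now = 16 << 10;	/* BKVASIZE after doubling */
X
X 	/*
X 	 * nbuf is roughly the buffer kva divided by BKVASIZE, so
X 	 * doubling BKVASIZE without doubling the kva budget halves the
X 	 * effective number of buffers.
X 	 */
X 	printf("nbuf: %ld -> %ld\n",
X 	    buf_kva / bkva_tuned, buf_kva / bkva_now);
X 	return (0);
X }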
nfs for zfs will actually use the new MAXBSIZE. I don't like it using
a hard-coded size. It gives buffer cache fragmentation. The new
MAXBSIZE will non-accidentally match the fs block size for zfs, but even
the old MAXBSIZE doesn't match the usual fs block size for any file
system.
- cd9660 mount uses MAXBSIZE for a sanity check. It can only support
block sizes up to that, but there must be an fs limit. It should
probably use min(CD9660_MAXBSIZE, MAXBSIZE).
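Something like this sketch is what is meant (CD9660_MAXBSIZE is a
hypothetical name and value for whatever the fs-level limit is):
X #include <sys/param.h>		/* MIN(), MAXBSIZE */
X #include <sys/errno.h>
X
X /* Hypothetical spelling and value of the iso9660-level limit. */
X #define CD9660_MAXBSIZE	(64 * 1024)
X
X static int
X cd9660_check_bsize(u_long logical_block_size)
X {
X 	/*
X 	 * Check against both the fs limit and the running kernel's
X 	 * buffer cache limit instead of using MAXBSIZE for both jobs.
X 	 */
X 	if (logical_block_size == 0 ||
X 	    logical_block_size > MIN(CD9660_MAXBSIZE, MAXBSIZE))
X 		return (EINVAL);
X 	return (0);
X }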
- similarly in msdosfs, except I'm sure that there is an fs limit of
64K. Microsoft specifies that the limit is 32K, but 64K works in
FreeBSD and perhaps even in Microsoft OS's.
- similarly in ffs, except the ffs limit is historically identical
to MAXBSIZE. I think it goes the other way -- MAXBSIZE = 64K is
the historical ffs limit, and the buffer cache has to support that.
Perhaps ffs should remain at its historical limit. The lower
limit is still local in ffs. It is named MINBSIZE. Its value
is 4K in -current but 512 in my version. ffs has no fundamental
limit at either 4K or 64K, and can support any size supported by
the hardware after fixing some bugs involving assumptions that the
superblock fits in an ffs block.
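The range check under discussion is roughly this (a sketch, not the exact
ffs code):
X #include <sys/param.h>		/* MAXBSIZE */
X #include <ufs/ufs/dinode.h>
X #include <ufs/ffs/fs.h>		/* struct fs, MINBSIZE */
X
X /*
X  * fs_bsize must be within ffs's own limits, must not exceed what the
X  * running kernel's buffer cache supports, and must be able to hold the
X  * superblock (one of the assumptions mentioned above).
X  */
X static int
X ffs_bsize_ok(const struct fs *fs)
X {
X 	return (fs->fs_bsize >= MINBSIZE && fs->fs_bsize <= MAXBSIZE &&
X 	    (size_t)fs->fs_bsize >= sizeof(struct fs));
X }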
- many file systems use MAXBSIZE to limit the read-ahead for cluster_read().
This seems wrong. cluster_read() has a natural limit of geom's virtual
"best" i/o size (normally MAXPHYS). The decision about the amount of
read-ahead should be left to the clustering code if possible. But it
is unclear what this should be. The clustering code gets this wrong
anyway. It has sysctls vfs.read_max (default 64) and vfs.write_behind
(default 1). The units for these are broken. They are fs-blocks.
A read-ahead of 64 fs-blocks of size 512 is too different from a
read-ahead of 64 fs-blocks of size MAXBSIZE, whatever the latter is.
My version uses a read-ahead scaled in 512-blocks (default 256 blocks
= MAXPHYS bytes). The default read-ahead shouldn't vary much with
either MAXPHYS, MAXBSIZE or the fs block size, but should vary with
the device (don't read-ahead 64 large fs blocks on a floppy disk
device, as asked for by -current's read_max ...).
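The unit problem is easy to see by converting both policies to bytes
(illustrative only):
X #include <stdio.h>
X
X int
X main(void)
X {
X 	long read_max = 64;		/* -current default, in fs blocks */
X 	long ra_512 = 256;		/* my default, in 512-blocks */
X 	long bsize[] = { 512, 16 * 1024, 64 * 1024 };
X 	int i;
X
X 	/* The fs-block-scaled read-ahead swings from 32K to 4M across */
X 	/* these block sizes; the 512-block-scaled one stays at 128K. */
X 	for (i = 0; i < 3; i++)
X 		printf("fs block %6ld: read_max %8ld bytes, fixed %ld bytes\n",
X 		    bsize[i], read_max * bsize[i], ra_512 * 512);
X 	return (0);
X }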
- ffs utilities like fsck are broken by limiting themselves to the buffer
cache limit of MAXBSIZE, like the boot blocks but with less reason, since
they don't have space constraints and not being limited by the current
OS is more useful. Unless MAXBSIZE = 64K is considered to be a private
ffs limit that escaped. Then the ffs code should spell it FFS_MAXBSIZE
or 64K.
Bruce