DFLTPHYS vs MAXPHYS
Bruce Evans
brde at optusnet.com.au
Tue Jul 7 21:12:44 UTC 2009
On Tue, 7 Jul 2009, Matthew Dillon wrote:
> :All I wanted to say is that it is the FS's privilege to decide how much data
> :it needs. But when it really needs a lot of data, it is better
> :transferred with a smaller number of bigger transactions, without the strict
> :MAXPHYS limitation.
> :
> :--
> :Alexander Motin
>
> We are in agreement. That's essentially what I mean by all my
> cluster_read() comments.
I did not disagree. One of my points is that fs's are currently limited
by MAXPHYS and that simply increasing MAXPHYS isn't free.
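For reference, these are the constants in <sys/param.h> that set the limits
being argued about. A sketch of their conventional values at the time (they
can differ by arch and by tree, so check yours):

    /* Excerpt (approximate) from <sys/param.h>. */
    #define DFLTPHYS        (64 * 1024)     /* default max raw I/O transfer size */
    #define MAXPHYS         (128 * 1024)    /* max raw I/O transfer size */

The thread is essentially about whether raising the second number towards
512K or 1M is worth its costs.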
> What matters the most is how much read-ahead
> the cluster code does, and how well matched the read-ahead is on
> reducing future transactions, and not so much on anything else (such as
> cpu caches).
I will disagree with most of this:
- the amount of read-ahead/clustering is not very important. fs's already
depend on the drive doing significant buffering, so that when the fs gets
things and seeks around a lot, not all the seeks are physical. Locality
is much more important.
- cpu caches are already of more than minor importance and will become even
more important as drives become faster.
> The cluster heuristics are pretty good but they do break down under
> certain circumstances. For example, for UFS they break down when there
> is file data adjacency between different inodes. That is often why one
> sees the KB/t sizes go down (and the TPS rate go up) when tar'ing up a
> large number of small files. Tarring up /usr/src is a good example of
> this. KB/t can drop all the way down to 8K and performance is noticeably
> degraded.
At least for ffs in FreeBSD, this is mostly locality, not clustering.
Tarring up /usr/src to test optimizations of locality is one of my
favourite benchmarks. Since ffs does no inter-file or inode clustering,
the average i/o size is smaller than the average file size. Since
files in /usr/src are small, you are lucky if the average i/o size is
8K (the average file size is actually between 8K and 16K). Since
the ffs block size is larger than the file size, most file data fits
in a single block and clustering has no effect. (But I also like to
optimize and test file systems with a small block size. Clustering
makes a big difference for msdosfs with a block size of 512, and in
this benchmark, after my optimizations, msdosfs with a block size of
512 is slightly faster than unoptimized ffs with a block size of 16K.
The smaller block size just takes more CPU. msdosfs is fundamentally
faster than ffs for small files since it has better locality (no inodes,
and better locality for the FAT than for indirect blocks).)
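To put rough numbers on the block-size point, here is a sketch (the 12K file
size is just a hypothetical figure in the 8K-16K range mentioned above): with
a 512-byte block size a small file spans dozens of blocks, so clustering can
merge dozens of i/o's into one, while with a 16K block size the whole file is
one block and there is nothing to cluster.

    /* Sketch: blocks per small file at various block sizes, i.e. the
     * maximum number of i/o's that clustering could merge per file.
     * The 12K file size is hypothetical (in the 8K-16K range above). */
    #include <stdio.h>

    int
    main(void)
    {
        const long filesize = 12 * 1024;
        const long bsizes[] = { 512, 4096, 16384 };

        for (int i = 0; i < 3; i++) {
            long nblocks = (filesize + bsizes[i] - 1) / bsizes[i];
            printf("bsize %5ld: %2ld block(s) per file\n", bsizes[i], nblocks);
        }
        return (0);
    }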
> The cluster heuristic also tends to break down on the initial read() from
> a newly constituted vnode, because it has no prior history to work with
> and so does not immediately issue a read-ahead even though the I/O may
> end up being linear.
This is harmful for random file access, but for tarring up /usr/src there
is a good chance that file locality (in directory traversal order) combined
with read-ahead in the drive will compensate for this.
> For command latency issues Julian pointed out a very interesting contrast
> between a HD and a (SATA) SSD. With no seek times to speak of, command
> overhead becomes a bigger deal when trying to maximize the performance
> of a SSD. I would guess that larger DMA transactions (from the point of
> view of the host cpu anyhow) would be more highly desired once we start
> hitting bandwidth ceilings of 300 MBytes/sec for SATA II and
> 600 MBytes/sec beyond that.
It is actually already a problem (the problem of this thread). Even at
50MB/S, I see some slowness due to command latency (I see increased CPU
but that is similar to latency in the context of this thread). Alexander
has 200MB/S disks so he sees larger problems. My CPU overhead (on a ~2GHz
CPU) is about 50 uS/block. With 64K-blocks at 50MB/S, this gives a CPU
overhead of 40 mS/S or 4%. Not significant. With 16K-blocks at 50MB/S,
this gives a CPU overhead of 16%. This is becoming significant. At
200MB/S, the overhead would be 16% even for 64K-blocks. Alexander
reported savings of 10-15% using 512K-blocks. This is consistent.
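Spelled out as a sketch (the 50 uS/block host CPU cost is my ballpark
measurement above, not a universal constant):

    /* Sketch: host CPU consumed per second by per-command overhead,
     * for the block size / disk speed combinations discussed above.
     * Assumes ~50e-6 s of CPU per command (ballpark measured figure). */
    #include <stdio.h>

    int
    main(void)
    {
        const double per_block = 50e-6;                 /* CPU seconds per command */
        const double rates_mb[] = { 50, 200 };          /* disk speed in MB/S */
        const long bsizes[] = { 16384, 65536, 524288 }; /* block sizes in bytes */

        for (int r = 0; r < 2; r++)
            for (int b = 0; b < 3; b++) {
                double blocks_per_s = rates_mb[r] * 1024 * 1024 / bsizes[b];
                double cpu_frac = blocks_per_s * per_block;
                printf("%3.0fMB/S, %4ldK-blocks: %4.1f%% CPU\n",
                    rates_mb[r], bsizes[b] / 1024, cpu_frac * 100);
            }
        return (0);
    }

This reproduces the 4%/16%/16% figures above, and shows 512K-blocks bringing
the 200MB/S case back down to about 2%, roughly consistent with Alexander's
reported savings.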
> If in my example the bandwidth ceiling for a HD capable of doing 60MB/s
> is hit at the 8K mark then presumably the block size needed to hit the
> bandwidth ceiling for a HD or SSD capable of 200MB/s, or 300MB/s, or
> higher, will also have to be larger. 16K, 32K, etc. This is fast
> approaching the 64K mark people are arguing about.
I thought we were arguing about the 512K and 1M marks :-).
I haven't been worrying about command latency and didn't notice that we
were discussing an SSD before. At hundreds of MB/S, or for zero-latency
hardware, the command overhead becomes a limiting factor for throughput.
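Turning that around as a sketch (the 5% CPU budget is an arbitrary assumption
of mine, and the 50 uS/block figure is my ballpark measurement from earlier):
the block size needed to keep per-command overhead under a fixed budget grows
in proportion to the transfer rate.

    /* Sketch: minimum block size so that per-command CPU overhead stays
     * under a budget.  50e-6 s/command is a ballpark measured figure;
     * the 5% budget is an arbitrary assumption for illustration. */
    #include <stdio.h>

    int
    main(void)
    {
        const double per_block = 50e-6;         /* CPU seconds per command */
        const double budget = 0.05;             /* allow 5% of one CPU */
        const double rates_mb[] = { 50, 200, 300, 600 };

        for (int r = 0; r < 4; r++) {
            double minbs = rates_mb[r] * 1024 * 1024 * per_block / budget;
            printf("%3.0fMB/S: blocks of at least %3.0fK\n",
                rates_mb[r], minbs / 1024);
        }
        return (0);
    }

At SATA II and SATA III rates this lands in the hundreds of kilobytes, which
is one way to see why the 512K and 1M marks come up at all.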
> In any case, the main reason I posted is to try to correct people's
> assumptions on the importance of various parameters, particularly the
> irrelevancy of cpu caches in the bigger picture.
My examples show that the CPU cache can be relevant even with a 50MB/S
disk. With faster disks it becomes even more relevant. It is hard to
keep up with 200MB/S, and harder if you double the number of cache misses
using large buffers.
Bruce