DFLTPHYS vs MAXPHYS
Matthew Dillon
dillon at apollo.backplane.com
Mon Jul 6 18:12:48 UTC 2009
Linear dd
       tty          da0              cpu
 tin tout   KB/t   tps   MB/s  us ni sy in id
   0   11   0.50 17511   8.55   0  0 15  0 85   bs=512
   0   11   1.00 16108  15.73   0  0 12  0 87   bs=1024
   0   11   2.00 14758  28.82   0  0 11  0 89   bs=2048
   0   11   4.00 12195  47.64   0  0  7  0 93   bs=4096
   0   11   8.00  8026  62.70   0  0  5  0 95   bs=8192    << MB/s breakpt
   0   11  16.00  4018  62.78   0  0  4  0 96   bs=16384
   0   11  32.00  2025  63.28   0  0  2  0 98   bs=32768   << id breakpt
   0   11  64.00  1004  62.75   0  0  1  0 99   bs=65536
   0   11 128.00   506  63.25   0  0  1  0 99   bs=131072
Random seek/read
       tty          da0              cpu
 tin tout   KB/t   tps   MB/s  us ni sy in id
   0   11   0.50   189   0.09   0  0  0  0 100   bs=512
   0   11   1.00   184   0.18   0  0  0  0 100   bs=1024
   0   11   2.00   177   0.35   0  0  0  0 100   bs=2048
   0   11   4.00   175   0.68   0  0  0  0 100   bs=4096
   0   11   8.00   172   1.34   0  0  0  0 100   bs=8192
   0   11  16.00   166   2.59   0  0  0  0 100   bs=16384
   0   11  32.00   159   4.97   0  0  1  0  99   bs=32768
   0   11  64.00   142   8.87   0  0  0  0 100   bs=65536
   0   11 128.00   117  14.62   0  0  0  0 100   bs=131072
                 ^^^^^  ^^^^^^
                 note TPS rate and MB/s
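For context on how numbers like these are typically gathered: the linear pass is just dd from the raw device at each block size, and the random pass presumably seeks to a random offset before every read. A minimal sketch of such a random seek/read loop in C (the device path, block size, assumed device size and iteration count are placeholders, not the exact parameters of the test above):

    /*
     * Rough sketch of a random seek/read pass.  Device path, block size,
     * device size and iteration count are placeholder assumptions.
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLKSIZE  32768                          /* vary: 512 .. 131072 */
    #define NREADS   1000
    #define DEVSIZE  (40LL * 1024 * 1024 * 1024)    /* assumed device size */

    int
    main(void)
    {
        char *buf = malloc(BLKSIZE);
        int fd = open("/dev/da0", O_RDONLY);        /* raw device */

        if (fd < 0 || buf == NULL) {
            perror("setup");
            return 1;
        }
        for (int i = 0; i < NREADS; ++i) {
            /* pick a random block-aligned offset, read one block */
            off_t off = (random() % (DEVSIZE / BLKSIZE)) * (off_t)BLKSIZE;
            if (pread(fd, buf, BLKSIZE, off) != BLKSIZE) {
                perror("pread");
                break;
            }
        }
        close(fd);
        free(buf);
        return 0;
    }

Watching iostat at one-second intervals while a loop like that runs is what produces columns like the ones above.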
Which is the more important tuning consideration: efficiency of linear
reads, or saving re-seeks by buffering more data? If you didn't choose
saving re-seeks, you lose.
To go from 16K to 32K requires saving 5% of future re-seeks to break even.
To go from 32K to 64K requires saving 11% of future re-seeks.
To go from 64K to 128K requires saving 18% of future re-seeks.
(at least with this particular disk)
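One way to read those percentages off the random seek/read table: going from block size A to B makes each random I/O slower by the ratio tps_A/tps_B, so the larger size breaks even once it eliminates a fraction 1 - tps_B/tps_A of future re-seeks. A quick sketch of that arithmetic, using the TPS column above, lands close to the quoted 5%/11%/18%:

    /* Break-even re-seek savings implied by the random-read TPS column:
     * f = 1 - tps_large / tps_small. */
    #include <stdio.h>

    int
    main(void)
    {
        double tps[] = { 166, 159, 142, 117 };   /* bs = 16K, 32K, 64K, 128K */
        int    kb[]  = { 16, 32, 64, 128 };

        for (int i = 1; i < 4; ++i) {
            double f = 1.0 - tps[i] / tps[i - 1];
            printf("%3dK -> %3dK: break even at %.1f%% of re-seeks saved\n",
                   kb[i - 1], kb[i], f * 100.0);
        }
        return 0;
    }

(It prints roughly 4.2%, 10.7% and 17.6%, i.e. the same ballpark as the figures quoted above.)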
At the point where the block size exceeds 32768, if you aren't saving
re-seeks via locality of reference in the additional cached data,
you lose. If you are saving re-seeks, you win. CPU caches do not enter
into the equation at all.
For most filesystems the re-seeks being saved depend on the access
pattern. For example, if you are doing an ls -lR or a find, the re-seek
pattern will be related to inode and directory lookups. The number of
inodes which fit in a cluster_read(), assuming reasonable locality of
reference, will wind up determining the performance.
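As a rough illustration of that inode density (assuming UFS-style on-disk inodes of 128 bytes for UFS1 or 256 bytes for UFS2; the actual filesystem in question may differ):

    /* Inodes pulled in per cluster_read() at various cluster sizes,
     * assuming 128-byte (UFS1) or 256-byte (UFS2) on-disk inodes. */
    #include <stdio.h>

    int
    main(void)
    {
        int sizes[] = { 8192, 16384, 32768, 65536, 131072 };

        for (int i = 0; i < 5; ++i)
            printf("%6d-byte cluster: %4d UFS1 / %3d UFS2 inodes\n",
                   sizes[i], sizes[i] / 128, sizes[i] / 256);
        return 0;
    }

The more of those inodes the ls -lR or find actually touches before the buffer is recycled, the more re-seeks the larger cluster saves.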
However, as the buffer size grows, the total number of bytes you are
able to cache becomes the dominant factor in the re-seek efficiency.
I don't have a graph for that, but ultimately it means that reading
very large blocks (e.g. 1MB) with a non-linear access pattern is bad,
because most of the additional data cached will never be used before
the memory winds up being re-used to cache some other cluster.
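A rough feel for why, under a fixed memory budget (the 256MB cache size below is just an assumed figure for illustration): every doubling of the cluster size halves the number of distinct hot spots that can stay cached at once.

    /* Distinct clusters that fit in a fixed-size cache at each cluster
     * size.  The 256MB cache figure is an assumption for illustration. */
    #include <stdio.h>

    int
    main(void)
    {
        long long cache = 256LL * 1024 * 1024;
        int sizes[] = { 8192, 32768, 131072, 1048576 };

        for (int i = 0; i < 4; ++i)
            printf("%8d-byte clusters: %6lld cacheable at once\n",
                   sizes[i], cache / sizes[i]);
        return 0;
    }

At 1MB clusters only a few hundred distinct locations stay resident, so a non-linear access pattern mostly evicts data before it is ever reused.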
Another thing to note here is that command transfer overhead also becomes
mostly irrelevant once you hit 32K, even if you have a lot of discrete
disks. I/Os of less than 8KB are clearly wasteful of resources (in my
test even a linear transfer couldn't reach the bandwidth ceiling of the
device). I/Os greater than 32K are clearly dependent on saving re-seeks.
Note in particular that with a random access pattern the data transfer
rate nearly doubles as the buffer size doubles (because seek times are
so long). In other words, it's a huge win if you are actually able to
save future re-seeks by caching the additional data.
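The shape of that curve falls out of a simple seek-plus-transfer model: each random I/O costs one re-seek plus the time to move the payload, and the re-seek dominates. The ~5.8ms seek and ~63 MB/s media rate below are eyeballed from the tables above, so treat the output as an approximation of the random-read column, not a reproduction of it.

    /* Per-I/O model: t = seek + bs / bandwidth.  Seek time and media
     * rate are eyeballed from the tables above (rough assumptions). */
    #include <stdio.h>

    int
    main(void)
    {
        double seek = 5.8e-3;              /* seconds per re-seek (assumed) */
        double bw   = 63.0e6;              /* bytes/sec linear ceiling */

        for (int bs = 512; bs <= 131072; bs *= 2) {
            double t = seek + bs / bw;     /* time for one random I/O */
            printf("bs=%6d  %5.0f tps  %5.2f MB/s\n",
                   bs, 1.0 / t, (bs / t) / 1e6);
        }
        return 0;
    }

Since transfer time only reaches a couple of milliseconds at 128K against a roughly 6ms seek, each doubling of the buffer nearly doubles the delivered MB/s.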
What this all means is that CPU caches are basically irrelevant when it
comes to hard drive I/O. You are either saving enough re-seeks to make up
for the greater per-I/O latency of the larger reads or you aren't. One
re-seek is something like 7ms. 7ms is a LONG time, which is why the CPU
caches are irrelevant for choosing the block size. One can bean-count
cache misses all day long but it won't make the machine perform any
better in this case.
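For a sense of scale, assuming a ballpark figure of roughly 100ns per memory-cache miss (an assumption, not a measurement), one re-seek costs about as much as tens of thousands of misses:

    /* One 7ms re-seek expressed in cache-miss units.  The 100ns miss
     * cost is a ballpark assumption, not a measured value. */
    #include <stdio.h>

    int
    main(void)
    {
        double reseek = 7e-3;      /* one re-seek, per the text above */
        double miss   = 100e-9;    /* assumed cost of a cache miss */

        printf("one re-seek ~= %.0f cache misses\n", reseek / miss);
        return 0;
    }

Against that, shaving a few misses per block simply doesn't register.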
-Matt