Cached file read performance
Bruce Evans
bde at zeta.org.au
Sat Dec 23 01:38:10 PST 2006
On Sat, 23 Dec 2006, Mark Kirkwood wrote:
> Bruce Evans wrote:
>>
>> None was attached.
>>
>
> (meaning the c prog yes?) I notice that it is stripped out from the web
> archive... so here's a link:
>
> http://homepages.paradise.net.nz/markir/download/freebsd/readtest.c
>> However, I
>> couldn't see much difference between block sizes of 16, 32 and 64K for
>> a small (32MB) md-malloced file system with a simple test program.
>> All versions got nearly 1/4 of bandwidth of main memory (800MB/S +-10%
>> on an AthlonXP with ~PC3200 memory).
Now I see the problem with a normal file system. The main difference
in my quick test was probably that 32MB is too small to show the
problem. 32MB fits in the buffer cache, but slightly larger files
only fit in the VMIO cache, and the main problem is the interaction
of these caches.
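Something like the following is all the test needs to be (this is not
the actual readtest.c from the URL above; the file name, defaults and
output format are placeholders): read every block of a large file once,
either in order or at block-aligned random offsets, and report the
bandwidth.
%%%
/*
 * Minimal sketch of a cached-read benchmark (not the actual readtest.c).
 */
#include <sys/time.h>

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        const char *file = argc > 1 ? argv[1] : "testfile";
        size_t bsize = argc > 2 ? (size_t)atol(argv[2]) : 4096;
        int randorder = argc > 3 && strcmp(argv[3], "random") == 0;
        char *buf = malloc(bsize);
        struct timeval t0, t1;
        off_t fsize, nblocks, blk, i;
        double mb, secs;
        int fd;

        fd = open(file, O_RDONLY);
        if (fd < 0 || buf == NULL) {
                perror("setup");
                return (1);
        }
        fsize = lseek(fd, 0, SEEK_END);
        nblocks = fsize / (off_t)bsize;
        gettimeofday(&t0, NULL);
        for (i = 0; i < nblocks; i++) {
                /* Sequential: block i; random: a block-aligned offset. */
                blk = randorder ? arc4random() % nblocks : i;
                if (pread(fd, buf, bsize, blk * (off_t)bsize) !=
                    (ssize_t)bsize) {
                        perror("pread");
                        return (1);
                }
        }
        gettimeofday(&t1, NULL);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        mb = nblocks * (double)bsize / (1024 * 1024);
        printf("%.0f MB in %.2f s: %.1f MB/s\n", mb, secs, mb / secs);
        return (0);
}
%%%
Run it once without and once with "random" as the third argument to get
the two cases compared below.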
This behaviour is easy to understand using kernel profiling:
Part of a profile for the random case (reading 400MB with a block
size of 4K -- smaller block sizes make larger differences):
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.70 seconds
  %   cumulative    self               self    total
 time    seconds  seconds     calls  ns/call  ns/call  name
 22.5      0.608    0.608    102466     5933     5933  copyout [13]
 21.2      1.180    0.573         0  100.00%           mcount [14]
 10.2      1.457    0.277         0  100.00%           mexitcount [17]
 10.2      1.733    0.276    450823      612      612  buf_splay [18]
  9.7      1.995    0.262    348917      751      751  vm_page_splay [20]
  5.7      2.148    0.153         0  100.00%           cputime [22]
  2.0      2.202    0.054    348017      154      179  vm_page_unwire [26]
  1.8      2.252    0.050     87127      573     3487  getnewbuf [16]
  1.7      2.298    0.047         0  100.00%           user [29]
  1.3      2.332    0.034     87132      388      388  pmap_qremove [31]
  1.1      2.363    0.031     87127      351     4025  allocbuf [15]
  1.0      2.388    0.026    348505       74      117  vm_page_wire [30]
%%%
Sequential case:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 1.35 seconds
  %   cumulative    self               self    total
 time    seconds  seconds     calls  ns/call  ns/call  name
 39.3      0.530    0.530    102443     5178     5178  copyout [11]
 23.7      0.850    0.320         0  100.00%           mcount [12]
 11.5      1.004    0.154         0  100.00%           mexitcount [13]
  6.3      1.090    0.085         0  100.00%           cputime [16]
  3.0      1.130    0.040    102816      389      389  vm_page_splay [19]
  1.6      1.151    0.021    409846       52       59  _lockmgr [22]
  1.3      1.168    0.017         0  100.00%           user [23]
  0.9      1.180    0.012    102617      117      117  buf_splay [26]
 ...
  0.7      1.200    0.009     25603      356     1553  getnewbuf [20]
 ...
  0.6      1.208    0.009     25603      337     2197  allocbuf [17]
 ...
  0.6      1.224    0.008    101915       78       96  vm_page_unwire [29]
 ...
  0.5      1.239    0.007     25608      274      274  pmap_qremove [32]
 ...
  0.2      1.316    0.002    102409       20       35  vm_page_wire [44]
%%%
It is a buffer-cache/vm problem, as I suspected. The file system
block size is 16K, so with a read size of 4K, random reads allocate a
new buffer about 16K/4K = 4 times more often than sequential reads.
Allocation involves vm operations which are very expensive (they take
about 1.25 times as long as the actual copying). I believe it was even
more expensive before the vm code used splay trees.
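Spelling out the arithmetic (trivial, but it matches the call counts
in the profiles above reasonably well):
%%%
/*
 * The arithmetic behind the "4 times more often", for the 400MB file
 * read 4K at a time on a file system with 16K blocks.
 */
#include <stdio.h>

int
main(void)
{
        long file_size = 400L << 20;            /* 400MB */
        long fs_block = 16L << 10;              /* 16K ffs block size */
        long read_size = 4L << 10;              /* 4K read size */
        long reads = file_size / read_size;     /* 102400 getblk() calls */
        long blocks = file_size / fs_block;     /* 25600 distinct buffers */

        /*
         * Sequential: 4 consecutive reads hit the same 16K buffer, so
         * only 1 read in 4 allocates (the profile shows 25603 calls to
         * allocbuf()/getnewbuf()).
         */
        printf("sequential: %ld reads, ~%ld allocations\n", reads, blocks);

        /*
         * Random: nearly every read lands on a block whose buffer has
         * already been recycled, so nearly all reads allocate (the
         * profile shows 87127; the shortfall is the small hit rate in
         * the buffer pool, discussed below).
         */
        printf("random: %ld reads, up to %ld allocations (%ldx more)\n",
            reads, reads, fs_block / read_size);
        return (0);
}
%%%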
More details from separate runs:
Random:
%%%
-----------------------------------------------
                0.00    0.00       5/102404      breadn [237]
                0.01    0.84  102399/102404      cluster_read [10]
[11]    31.3    0.01    0.84  102404             getblk [11]
                0.03    0.32   87126/87126       allocbuf [15]
                0.05    0.25   87126/87126       getnewbuf [16]
                0.01    0.12  189530/189530      gbincore [23]
                0.00    0.06   87126/87126       bgetvp [27]
                0.00    0.00   15278/409852      _lockmgr [32]
                0.00    0.00   15278/15278       bremfree [144]
-----------------------------------------------
%%%
Sequential:
%%%
-----------------------------------------------
                0.00    0.00       6/102404      breadn [371]
                0.01    0.12  102398/102404      cluster_read [14]
[15]     9.5    0.01    0.12  102404             getblk [15]
                0.01    0.05   25603/25605       allocbuf [17]
                0.01    0.03   25603/25603       getnewbuf [18]
                0.00    0.01  128007/128007      gbincore [31]
                0.00    0.00   76801/409846      _lockmgr [22]
                0.00    0.00   25603/25603       bgetvp [39]
                0.00    0.00   76801/76801       bremfree [66]
-----------------------------------------------
%%%
getblk() is called the same number of times in each case. In the
sequential case, it reuses a previously allocated buffer (almost always
one allocated just before) with a probability of almost exactly 0.75,
but in the random case it reuses a previously allocated buffer with a
probability of about 0.13. The second probability is only larger than
epsilon because there is a buffer pool with a size of a few thousand.
Sometimes you get a hit in this pool, but for large working data sets
you mostly don't; then the buffer must be constituted from vm (or the
disk).
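A toy strict-LRU model of the pool gives about the same numbers. The
pool size of 3300 below is only a guess at "a few thousand", and the
kernel's real pool and replacement policy are more complicated, but the
simulation comes out near the measured 0.75 and 0.13:
%%%
/*
 * Toy strict-LRU model: POOL buffers over a 400MB file of NBLOCKS 16K
 * blocks, read 4K at a time.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define POOL    3300
#define NBLOCKS 25600                   /* 400MB / 16K */
#define READS   (NBLOCKS * 4)           /* 4K reads */

static long pool[POOL];
static int used;

/* Return 1 on a pool hit; either way make blk the most recently used. */
static int
access_block(long blk)
{
        int hit, i;

        hit = 0;
        for (i = 0; i < used; i++)
                if (pool[i] == blk) {
                        hit = 1;
                        break;
                }
        if (!hit && used < POOL)
                used++;
        if (i >= used)
                i = used - 1;           /* evict the LRU entry */
        memmove(&pool[1], &pool[0], i * sizeof(pool[0]));
        pool[0] = blk;
        return (hit);
}

int
main(void)
{
        long hits, r;

        for (hits = 0, r = 0; r < READS; r++)
                hits += access_block(r / 4);            /* sequential */
        printf("sequential reuse: %.2f\n", (double)hits / READS);

        used = 0;
        for (hits = 0, r = 0; r < READS; r++)
                hits += access_block(arc4random() % NBLOCKS); /* random */
        printf("random reuse:     %.2f\n", (double)hits / READS);
        return (0);
}
%%%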
This problem is (now) fairly small because most working data sets
aren't large compared with the buffer pool. It was much larger 10
years ago, when the buffer pool held only a few hundred buffers, and
larger still more than about 12 years ago, before FreeBSD's buffer
cache was merged with vm. Then there was only the buffer pool, with
nothing between it and the disk, and it was too small.
Linux might not have this problem because it is still using a simple
and better buffer cache. At least 10-15 years ago, its buffer cache
had a fixed block size of 1K where FreeBSD's buffer cache had a variable
block size with the usual size equal to the ffs file system block size
of 4K or 8K. With a block size of 1K, at least 4 times as many buffers
are needed to compete on storage with a block size of 4K, and the
buffer allocation routines need to be at least 4 times as efficient
to compete on efficiency. Linux actually had a much larger multiple
than 4 for the storage. I'm not sure about the efficiency factor, but
it wasn't too bad (any in-memory buffer management is better than
waiting for the disk, and a small fixed size of 1K is much easier to
manage than larger, variable sizes).
The FreeBSD buffer management was and is especially unsuited to file
systems with small block sizes like msdosfs floppies (512-byte blocks)
and the original version of Linux's extfs (1K blocks). With a buffer
cache (pool) size of 256, you could manage a whole 128KB of 512-byte
blocks and got enormous thrashing when accessing a 1200KB floppy. With
vm backing and a buffer cache size of a few thousand, the thrashing only
occurs in memory, and a 1200KB floppy now barely fits in the buffer
cache (pool). Also, no one uses 1200KB floppies. More practically, this
problem makes msdosfs on hard disks (normally 4K blocks) and ext2fs on
hard disks (1K or 4K blocks) slower than they should be under FreeBSD.
vm backing and clustering mask only some of the slowness.
The problem becomes smaller as the read block size approaches the file
system block size and vanishes when the sizes are identical. Then
there is apparently a different (smaller) problem:
Read size 16K, random:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 1.15 seconds
  %   cumulative    self               self    total
 time    seconds  seconds     calls  ns/call  ns/call  name
 49.1      0.565    0.565     25643    22037    22037  copyout [11]
 12.6      0.710    0.145         0  100.00%           mcount [14]
  8.8      0.811    0.101     87831     1153     1153  vm_page_splay [17]
  7.0      0.892    0.081    112906      715      715  buf_splay [19]
  6.1      0.962    0.070         0  100.00%           mexitcount [20]
  3.4      1.000    0.039         0  100.00%           cputime [22]
  1.2      1.013    0.013     86883      153      181  vm_page_unwire [28]
  1.1      1.027    0.013         0  100.00%           user [29]
  1.1      1.040    0.013     21852      595     3725  getnewbuf [18]
%%%
Read size 16K, sequential:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 0.96 seconds
  %   cumulative    self               self    total
 time    seconds  seconds     calls  ns/call  ns/call  name
 57.1      0.550    0.550     25643    21464    21464  copyout [11]
 14.2      0.687    0.137         0  100.00%           mcount [12]
  6.9      0.754    0.066         0  100.00%           mexitcount [15]
  4.2      0.794    0.040    102830      391      391  vm_page_splay [19]
  3.8      0.830    0.037         0  100.00%           cputime [20]
  1.4      0.844    0.013    102588      130      130  buf_splay [22]
  1.3      0.856    0.012     25603      488     1920  getnewbuf [17]
  1.0      0.866    0.009     25606      368      368  pmap_qremove [24]
%%%
Now the splay routines are called almost the same number of times, but
take much longer in the random case. buf_splay() seems to be unrelated
to vm -- it is called from gbincore() even if the buffer is already
in the buffer cache. It seems quite slow for that -- almost 1 us just
for the lookup, compared with 21 us to copyout a 16K buffer. Copying
out Linux-sized buffers would take only about 1.5 us, and then 1 us to
look them up is clearly too much. Another benchmark shows gbincore()
taking 501 ns each to look up 64 in-buffer-cache buffers for a 1MB
file -- this must be the best case for it (all these times are for
-current on an Athlon XP2700 overclocked to 2025MHz). The generic hash
function used in my compiler takes 40 ns to hash a 16-byte string on
this machine.
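For comparison, a chained hash keyed on (vnode, logical block number)
would look something like this. It is not gbincore() (which walks and
re-balances a per-vnode splay tree), and the names below are made up,
but it shows that the common in-cache lookup need not cost much more
than the hash function itself:
%%%
/*
 * Sketch of a chained-hash buffer lookup keyed on (vnode, lblkno).
 * Purely illustrative; the struct and function names are invented.
 */
#include <stddef.h>
#include <stdio.h>

#define BUFHASH_SIZE    4096            /* power of 2, chosen arbitrarily */

struct xbuf {
        struct xbuf     *b_hashnext;    /* hash chain link */
        void            *b_vp;          /* vnode owning the buffer */
        long             b_lblkno;      /* logical block number */
};

static struct xbuf *bufhash[BUFHASH_SIZE];

static unsigned
bufhashfn(void *vp, long lblkno)
{
        /* Cheap mixing of the vnode pointer and the block number. */
        return (((unsigned long)vp / 64 + (unsigned long)lblkno) &
            (BUFHASH_SIZE - 1));
}

/* O(1) expected lookup, and no tree rotations on the lookup path. */
static struct xbuf *
xbuf_lookup(void *vp, long lblkno)
{
        struct xbuf *bp;

        for (bp = bufhash[bufhashfn(vp, lblkno)]; bp != NULL;
            bp = bp->b_hashnext)
                if (bp->b_vp == vp && bp->b_lblkno == lblkno)
                        return (bp);
        return (NULL);
}

/* Insert a buffer at the head of its hash chain. */
static void
xbuf_insert(struct xbuf *bp)
{
        unsigned h = bufhashfn(bp->b_vp, bp->b_lblkno);

        bp->b_hashnext = bufhash[h];
        bufhash[h] = bp;
}

int
main(void)
{
        struct xbuf b = { NULL, (void *)0x1000, 42 };

        xbuf_insert(&b);
        printf("lookup: %s\n",
            xbuf_lookup((void *)0x1000, 42) == &b ? "hit" : "miss");
        return (0);
}
%%%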
The merged vm/buffer cache is clearly implemented suboptimally. Direct
access to VMIO pages might be better, but it isn't clear how to implement
it without getting the slowest parts of vm for all accesses. The
buffer cache is now essentially just a cache of vm mappings, with vm
mapping being so slow that it needs to be cached. The last thing that
you want to do is throw away this cache and have to do a slow mapping
for every access. I think the correct method is to wait for larger
virtual address spaces (already here) and use sparse mappings more.
Bruce