Cached file read performance
Bruce Evans
bde at zeta.org.au
Sat Dec 23 01:38:10 PST 2006
On Sat, 23 Dec 2006, Mark Kirkwood wrote:
> Bruce Evans wrote:
>>
>> None was attached.
>>
>
> (meaning the c prog yes?) I notice that it is stripped out from the web
> archive... so here's a link:
>
> http://homepages.paradise.net.nz/markir/download/freebsd/readtest.c
>> However, I
>> couldn't see much difference between block sizes of 16, 32 and 64K for
>> a small (32MB) md-malloced file system with a simple test program.
>> All versions got nearly 1/4 of bandwidth of main memory (800MB/S +-10%
>> on an AthlonXP with ~PC3200 memory).
Now I see the problem with a normal file system. The main difference
in my quick test was probably that 32MB is too small to show the
problem. 32MB fits in the buffer cache, but slightly larger files
only fit in the VMIO cache, and the main problem is the interaction
of these caches.
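Something like the following is all the test needs to be (this is not
the actual readtest.c from the URL above; the file name, defaults and
output format are placeholders): read every block of a large file once,
either in order or at block-aligned random offsets, and report the
bandwidth.
%%%
/*
 * Minimal sketch of a cached-read benchmark (not the actual readtest.c).
 */
#include <sys/time.h>

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        const char *file = argc > 1 ? argv[1] : "testfile";
        size_t bsize = argc > 2 ? (size_t)atol(argv[2]) : 4096;
        int randorder = argc > 3 && strcmp(argv[3], "random") == 0;
        char *buf = malloc(bsize);
        struct timeval t0, t1;
        off_t fsize, nblocks, blk, i;
        double mb, secs;
        int fd;

        fd = open(file, O_RDONLY);
        if (fd < 0 || buf == NULL) {
                perror("setup");
                return (1);
        }
        fsize = lseek(fd, 0, SEEK_END);
        nblocks = fsize / (off_t)bsize;
        gettimeofday(&t0, NULL);
        for (i = 0; i < nblocks; i++) {
                /* Sequential: block i; random: a block-aligned offset. */
                blk = randorder ? arc4random() % nblocks : i;
                if (pread(fd, buf, bsize, blk * (off_t)bsize) !=
                    (ssize_t)bsize) {
                        perror("pread");
                        return (1);
                }
        }
        gettimeofday(&t1, NULL);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        mb = nblocks * (double)bsize / (1024 * 1024);
        printf("%.0f MB in %.2f s: %.1f MB/s\n", mb, secs, mb / secs);
        return (0);
}
%%%
Run it once without and once with "random" as the third argument to get
the two cases compared below.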
This behaviour is easy to understand using kernel profiling:
Part of a profile for the random case (reading 400MB with a block
size of 4K -- smaller block sizes make larger differences):
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.70 seconds
  %   cumulative    self               self    total
 time    seconds  seconds     calls  ns/call  ns/call  name
 22.5      0.608    0.608    102466     5933     5933  copyout [13]
 21.2      1.180    0.573         0  100.00%           mcount [14]
 10.2      1.457    0.277         0  100.00%           mexitcount [17]
 10.2      1.733    0.276    450823      612      612  buf_splay [18]
  9.7      1.995    0.262    348917      751      751  vm_page_splay [20]
  5.7      2.148    0.153         0  100.00%           cputime [22]
  2.0      2.202    0.054    348017      154      179  vm_page_unwire [26]
  1.8      2.252    0.050     87127      573     3487  getnewbuf [16]
  1.7      2.298    0.047         0  100.00%           user [29]
  1.3      2.332    0.034     87132      388      388  pmap_qremove [31]
  1.1      2.363    0.031     87127      351     4025  allocbuf [15]
  1.0      2.388    0.026    348505       74      117  vm_page_wire [30]
%%%
Sequential case:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 1.35 seconds
  %   cumulative    self               self    total
 time    seconds  seconds     calls  ns/call  ns/call  name
 39.3      0.530    0.530    102443     5178     5178  copyout [11]
 23.7      0.850    0.320         0  100.00%           mcount [12]
 11.5      1.004    0.154         0  100.00%           mexitcount [13]
  6.3      1.090    0.085         0  100.00%           cputime [16]
  3.0      1.130    0.040    102816      389      389  vm_page_splay [19]
  1.6      1.151    0.021    409846       52       59  _lockmgr [22]
  1.3      1.168    0.017         0  100.00%           user [23]
  0.9      1.180    0.012    102617      117      117  buf_splay [26]
 ...
  0.7      1.200    0.009     25603      356     1553  getnewbuf [20]
 ...
  0.6      1.208    0.009     25603      337     2197  allocbuf [17]
 ...
  0.6      1.224    0.008    101915       78       96  vm_page_unwire [29]
 ...
  0.5      1.239    0.007     25608      274      274  pmap_qremove [32]
 ...
  0.2      1.316    0.002    102409       20       35  vm_page_wire [44]
%%%
It is a buffer-cache/vm problem, as I suspected. The file system
block size is 16K, so with a read size of 4K, random reads allocate a
new buffer about 16K/4K = 4 times more often than sequential reads.
Allocation involves vm operations which are very expensive (they take
about 1.25 times as long as the actual copying). I believe it was even
more expensive before the vm code used splay trees.
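Spelling out the arithmetic (trivial, but it matches the call counts
in the profiles above reasonably well):
%%%
/*
 * The arithmetic behind the "4 times more often", for the 400MB file
 * read 4K at a time on a file system with 16K blocks.
 */
#include <stdio.h>

int
main(void)
{
        long file_size = 400L << 20;            /* 400MB */
        long fs_block = 16L << 10;              /* 16K ffs block size */
        long read_size = 4L << 10;              /* 4K read size */
        long reads = file_size / read_size;     /* 102400 getblk() calls */
        long blocks = file_size / fs_block;     /* 25600 distinct buffers */

        /*
         * Sequential: 4 consecutive reads hit the same 16K buffer, so
         * only 1 read in 4 allocates (the profile shows 25603 calls to
         * allocbuf()/getnewbuf()).
         */
        printf("sequential: %ld reads, ~%ld allocations\n", reads, blocks);

        /*
         * Random: nearly every read lands on a block whose buffer has
         * already been recycled, so nearly all reads allocate (the
         * profile shows 87127; the shortfall is the small hit rate in
         * the buffer pool, discussed below).
         */
        printf("random: %ld reads, up to %ld allocations (%ldx more)\n",
            reads, reads, fs_block / read_size);
        return (0);
}
%%%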
More details from separate runs:
Random:
%%%
-----------------------------------------------
                0.00    0.00       5/102404      breadn [237]
                0.01    0.84  102399/102404      cluster_read [10]
[11]    31.3    0.01    0.84  102404             getblk [11]
                0.03    0.32   87126/87126       allocbuf [15]
                0.05    0.25   87126/87126       getnewbuf [16]
                0.01    0.12  189530/189530      gbincore [23]
                0.00    0.06   87126/87126       bgetvp [27]
                0.00    0.00   15278/409852      _lockmgr [32]
                0.00    0.00   15278/15278       bremfree [144]
-----------------------------------------------
%%%
Sequential:
%%%
-----------------------------------------------
                0.00    0.00       6/102404      breadn [371]
                0.01    0.12  102398/102404      cluster_read [14]
[15]     9.5    0.01    0.12  102404             getblk [15]
                0.01    0.05   25603/25605       allocbuf [17]
                0.01    0.03   25603/25603       getnewbuf [18]
                0.00    0.01  128007/128007      gbincore [31]
                0.00    0.00   76801/409846      _lockmgr [22]
                0.00    0.00   25603/25603       bgetvp [39]
                0.00    0.00   76801/76801       bremfree [66]
-----------------------------------------------
%%%
getblk() is called the same number of times in each case. In the
sequential case, it reuses a previously allocated buffer (almost always
one allocated just before) with a probability of almost exactly 0.75,
but in the random case it reuses a previously allocated buffer with a
probability of about 0.13. The second probability is only larger than
epsilon because there is a buffer pool with a size of a few thousand.
Sometimes you get a hit in this pool, but for large working data sets
you mostly don't; then the buffer must be constituted from vm (or the
disk).
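A toy strict-LRU model of the pool gives about the same numbers. The
pool size of 3300 below is only a guess at "a few thousand", and the
kernel's real pool and replacement policy are more complicated, but the
simulation comes out near the measured 0.75 and 0.13:
%%%
/*
 * Toy strict-LRU model: POOL buffers over a 400MB file of NBLOCKS 16K
 * blocks, read 4K at a time.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define POOL    3300
#define NBLOCKS 25600                   /* 400MB / 16K */
#define READS   (NBLOCKS * 4)           /* 4K reads */

static long pool[POOL];
static int used;

/* Return 1 on a pool hit; either way make blk the most recently used. */
static int
access_block(long blk)
{
        int hit, i;

        hit = 0;
        for (i = 0; i < used; i++)
                if (pool[i] == blk) {
                        hit = 1;
                        break;
                }
        if (!hit && used < POOL)
                used++;
        if (i >= used)
                i = used - 1;           /* evict the LRU entry */
        memmove(&pool[1], &pool[0], i * sizeof(pool[0]));
        pool[0] = blk;
        return (hit);
}

int
main(void)
{
        long hits, r;

        for (hits = 0, r = 0; r < READS; r++)
                hits += access_block(r / 4);            /* sequential */
        printf("sequential reuse: %.2f\n", (double)hits / READS);

        used = 0;
        for (hits = 0, r = 0; r < READS; r++)
                hits += access_block(arc4random() % NBLOCKS); /* random */
        printf("random reuse:     %.2f\n", (double)hits / READS);
        return (0);
}
%%%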
This problem is (now) fairly small because most working data sets
aren't large compared with the buffer pool. It was much larger 10
years ago, when the buffer pool held only a few hundred buffers, and
larger still more than about 12 years ago, before FreeBSD's buffer
cache was merged with vm. Then there was only the buffer pool, with
nothing between it and the disk, and it was too small.
Linux might not have this problem because it is still using a simple
and better buffer cache. At least 10-15 years ago, its buffer cache
had a fixed block size of 1K where FreeBSD's buffer cache had a variable
block size with the usual size equal to the ffs file system block size
of 4K or 8K. With a block size of 1K, at least 4 times as many buffers
are needed to compete on storage with a block size of 4K, and the
buffer allocation routines need to be at least 4 times as efficient
to compete on efficiency. Linux actually had a much larger multiple
than 4 for the storage. I'm not sure about the efficiency factor, but
it wasn't too bad (any in-memory buffer management is better than
waiting for the disk, and a small fixed size of 1K is much easier to
manage than larger, variable sizes).
The FreeBSD buffer management was and is especially unsuited to file
systems with small block sizes like msdosfs floppies (512-byte blocks)
and the original version of Linux's extfs (1K blocks). With a buffer
cache (pool) size of 256, you could manage a whole 128KB of 512-byte
blocks and got enormous thrashing when accessing a 1200KB floppy. With
vm backing and a buffer cache size of a few thousand, the thrashing only
occurs in memory, and a 1200KB floppy now barely fits in the buffer
cache (pool). Also, no one uses 1200KB floppies. More practically, this
problem makes msdosfs on hard disks (normally 4K blocks) and ext2fs on
hard disks (1K or 4K blocks) slower than they should be under FreeBSD.
vm backing and clustering mask only some of the slowness.
The problem becomes smaller as the read block size approaches the file
system block size and vanishes when the sizes are identical. Then
there is apparently a different (smaller) problem:
Read size 16K, random:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 1.15 seconds
  %   cumulative    self               self    total
 time    seconds  seconds     calls  ns/call  ns/call  name
 49.1      0.565    0.565     25643    22037    22037  copyout [11]
 12.6      0.710    0.145         0  100.00%           mcount [14]
  8.8      0.811    0.101     87831     1153     1153  vm_page_splay [17]
  7.0      0.892    0.081    112906      715      715  buf_splay [19]
  6.1      0.962    0.070         0  100.00%           mexitcount [20]
  3.4      1.000    0.039         0  100.00%           cputime [22]
  1.2      1.013    0.013     86883      153      181  vm_page_unwire [28]
  1.1      1.027    0.013         0  100.00%           user [29]
  1.1      1.040    0.013     21852      595     3725  getnewbuf [18]
%%%
Read size 16K, sequential:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 0.96 seconds
  %   cumulative    self               self    total
 time    seconds  seconds     calls  ns/call  ns/call  name
 57.1      0.550    0.550     25643    21464    21464  copyout [11]
 14.2      0.687    0.137         0  100.00%           mcount [12]
  6.9      0.754    0.066         0  100.00%           mexitcount [15]
  4.2      0.794    0.040    102830      391      391  vm_page_splay [19]
  3.8      0.830    0.037         0  100.00%           cputime [20]
  1.4      0.844    0.013    102588      130      130  buf_splay [22]
  1.3      0.856    0.012     25603      488     1920  getnewbuf [17]
  1.0      0.866    0.009     25606      368      368  pmap_qremove [24]
%%%
Now the splay routines are called almost the same number of times, but
take much longer in the random case. buf_splay() seems to be unrelated
to vm -- it is called from gbincore() even if the buffer is already
in the buffer cache. It seems quite slow for that -- almost 1 us just
for the lookup, compared with 21 us to copyout a 16K buffer. Copying
out Linux-sized buffers would take only about 1.5 us, and then 1 us to
look them up is clearly too much. Another benchmark shows gbincore()
taking 501 ns each to look up 64 in-buffer-cache buffers for a 1MB
file -- this must be the best case for it (all these times are for
-current on an Athlon XP2700 overclocked to 2025MHz). The generic hash
function used in my compiler takes 40 ns to hash a 16-byte string on
this machine.
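For comparison, a chained hash keyed on (vnode, logical block number)
would look something like this. It is not gbincore() (which walks and
re-balances a per-vnode splay tree), and the names below are made up,
but it shows that the common in-cache lookup need not cost much more
than the hash function itself:
%%%
/*
 * Sketch of a chained-hash buffer lookup keyed on (vnode, lblkno).
 * Purely illustrative; the struct and function names are invented.
 */
#include <stddef.h>
#include <stdio.h>

#define BUFHASH_SIZE    4096            /* power of 2, chosen arbitrarily */

struct xbuf {
        struct xbuf     *b_hashnext;    /* hash chain link */
        void            *b_vp;          /* vnode owning the buffer */
        long             b_lblkno;      /* logical block number */
};

static struct xbuf *bufhash[BUFHASH_SIZE];

static unsigned
bufhashfn(void *vp, long lblkno)
{
        /* Cheap mixing of the vnode pointer and the block number. */
        return (((unsigned long)vp / 64 + (unsigned long)lblkno) &
            (BUFHASH_SIZE - 1));
}

/* O(1) expected lookup, and no tree rotations on the lookup path. */
static struct xbuf *
xbuf_lookup(void *vp, long lblkno)
{
        struct xbuf *bp;

        for (bp = bufhash[bufhashfn(vp, lblkno)]; bp != NULL;
            bp = bp->b_hashnext)
                if (bp->b_vp == vp && bp->b_lblkno == lblkno)
                        return (bp);
        return (NULL);
}

/* Insert a buffer at the head of its hash chain. */
static void
xbuf_insert(struct xbuf *bp)
{
        unsigned h = bufhashfn(bp->b_vp, bp->b_lblkno);

        bp->b_hashnext = bufhash[h];
        bufhash[h] = bp;
}

int
main(void)
{
        struct xbuf b = { NULL, (void *)0x1000, 42 };

        xbuf_insert(&b);
        printf("lookup: %s\n",
            xbuf_lookup((void *)0x1000, 42) == &b ? "hit" : "miss");
        return (0);
}
%%%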
The merged vm/buffer cache is clearly implemented suboptimally. Direct
access to VMIO pages might be better, but it isn't clear how to implement
it without getting the slowest parts of vm for all accesses. The
buffer cache is now essentially just a cache of vm mappings, with vm
mapping being so slow that it needs to be cached. The last thing that
you want to do is throw away this cache and have to do a slow mapping
for every access. I think the correct method is to wait for larger
virtual address spaces (already here) and use sparse mappings more.
Bruce