4.8 ffs_dirpref problem
Don Lewis
truckman at FreeBSD.org
Mon Nov 17 13:27:47 PST 2003
On 16 Nov, Don Lewis wrote:
> On 16 Nov, Don Lewis wrote:
>
>>> I'm somewhat tempted to change the calculation to:
>>> min(avgbfree, max(1, (avgbfree - avgbfree/4), (dirsize/fs->fs_bsize)))
>>> where the last term works out to 4500 with your tunefs parameters.
>>
>> I tried a variation of this on my -CURRENT box and it benchmarked
>> consistently worse. I've got a "spare" 10 GB partition which I first
>> copied my /usr/ports/packages to, and then filled by repeatedly tarring
>> my /usr/ports tree over to it. The partition was 100% full, including
>> the reserve space, after four iterations.
>
> I just looked again, and it is more than 100% full, but only slightly
> into the reserve space.
>
>> With minbfree set to max((avgbfree - avgbfree/4), 1) here are two
>> iterations (the fifth line of timing data is for the 'rm -rf' command):
>>
>> 1310.47 real 5.48 user 141.90 sys
>> 1336.78 real 5.62 user 152.27 sys
>> 1368.84 real 6.02 user 151.75 sys
>> 1359.70 real 5.55 user 154.01 sys
>> 423.44 real 2.25 user 107.26 sys
>>
>> 1300.56 real 5.65 user 148.82 sys
>> 1372.20 real 5.79 user 152.25 sys
>> 1359.01 real 6.03 user 152.63 sys
>> 1380.90 real 5.31 user 153.71 sys
>> 437.22 real 2.20 user 105.61 sys
>>
>> With minbfree set to
>> max(min(max(avgbfree - avgbfree / 4, dirsize / fs->fs_bsize),
>> avgbfree), 1)
>> I get the following:
>>
>> 1314.61 real 5.66 user 175.43 sys
>> 1350.40 real 6.12 user 179.15 sys
>> 1386.86 real 6.32 user 179.12 sys
>> 1418.60 real 5.74 user 181.64 sys
>> 508.67 real 2.67 user 119.66 sys
>>
>> 1361.19 real 5.97 user 176.94 sys
>> 1327.63 real 5.72 user 179.60 sys
>> 1376.16 real 6.33 user 179.72 sys
>> 1356.47 real 6.07 user 180.24 sys
>> 462.67 real 2.30 user 119.18 sys
>>
>> I'm using the newfs defaults, but dirsize is recalculated as the
>> filesystem fills if the appropriate value is larger than what is
>> calculated from the parameters set by newfs.
>
> I filled up the file system again with the
> minbfree = max((avgbfree - avgbfree/4), 1)
> version of the code.
>
> Based on the output of df and dumpfs, I calculate:
> avgfilesize = 18K
> curdirsize = 83K
> avgbfree = 864
> avgifree = 14631
>
> What surprises me is the poor distribution of free space across the
> cylinder groups in the file system. I now suspect the culprit is
> minifree. The current code calculates minifree as 75% of avgifree, or
> about 10973. There are some cylinder groups that are less than half
> full (capacity is 11761 blocks/group) in this filesystem, but their free
> inode counts are near the 10K minifree limit. It looks like the free
> inode count should be de-emphasized if the filesystem will run out of
> blocks before it runs out of inodes, and vice-versa if inodes are likely
> to be exhausted first. I now suspect that the other version of the
> minbfree code was more likely to bail out because it could not find any
> cylinder groups that met both selection criteria and used the fallback
> code, which probably selected the cylinder groups that were already full
> but had a large number of free inodes. Something to ponder ...

I ran another test with minifree set to a small value, which effectively
removed it from the cylinder group selection criteria. I used

	max(min(max(avgbfree - avgbfree / 4, dirsize / fs->fs_bsize),
	    avgbfree), 1)

for minbfree. The results were similar to the previous

	max((avgbfree - avgbfree/4), 1)

tests.

1337.34 real 5.69 user 150.63 sys
1323.58 real 5.87 user 157.96 sys
1347.14 real 5.52 user 159.77 sys
1361.57 real 5.37 user 160.50 sys
419.49 real 2.52 user 114.75 sys

1344.53 real 5.47 user 157.03 sys
1326.97 real 4.77 user 151.57 sys
1322.67 real 4.69 user 153.00 sys
1367.49 real 5.91 user 160.45 sys
409.95 real 2.59 user 114.20 sys

1330.93 real 5.37 user 156.93 sys
1374.03 real 5.59 user 159.14 sys
1367.17 real 5.41 user 160.84 sys
1318.14 real 5.50 user 159.75 sys
411.94 real 2.22 user 114.86 sys
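
For reference, the two minbfree expressions compared in these runs can be sketched as below. This is only an illustration in Python, not the kernel code: the function names are made up, the 16K block size is an assumption, and dirsize is chosen so that dirsize / fs_bsize equals the 4500 mentioned earlier in the thread for a different set of tunefs parameters. avgbfree = 864 and avgifree = 14631 are the values calculated above.

```python
def three_quarters(avg):
    """75% of the average (integer arithmetic), but never less than 1."""
    return max(avg - avg // 4, 1)


def minbfree_simple(avgbfree):
    """max((avgbfree - avgbfree/4), 1)"""
    return three_quarters(avgbfree)


def minbfree_dirsize(avgbfree, dirsize, bsize):
    """max(min(max(avgbfree - avgbfree/4, dirsize/bsize), avgbfree), 1)"""
    return max(min(max(avgbfree - avgbfree // 4, dirsize // bsize), avgbfree), 1)


avgbfree = 864           # from the df/dumpfs calculation in this thread
avgifree = 14631         # likewise
bsize = 16384            # ASSUMPTION: 16K blocks
dirsize = 4500 * bsize   # ILLUSTRATIVE: makes dirsize/bsize come out to 4500

print(minbfree_simple(avgbfree))                   # 648
print(minbfree_dirsize(avgbfree, dirsize, bsize))  # 864
print(three_quarters(avgifree))                    # 10974
```

Whenever dirsize / fs_bsize exceeds avgbfree, the second expression collapses to avgbfree itself, so it demands a full average's worth of free blocks rather than 75% of it.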
I took a snapshot of the cylinder group state at about 75% full as well
as at 100%. Even at 75%, there are a number of cylinder groups that are
totally full. I think that one of the problems is that the dirpref
allocator lingers too long on a given cylinder group. It should
probably move to a new cylinder group before the old one is totally
full, somewhere around the minfree reserve level. Also, as the file
system fills and a large number of the cylinder groups are totally
filled, the average free space per cylinder group will be quite small,
so the dirpref code will consider cylinder groups with only a small
amount of free space as candidates even though there may be other
cylinder groups that are nearly empty that would be better choices.

75% full:
dumpfs /dev/da0s2a | grep nbfree
nbfree 191340 ndir 94629 nifree 994237 nffree 1232
cs[].cs_(nbfree,ndir,nifree,nffree):
nbfree 7256 ndir 1976 nifree 14679 nffree 5
nbfree 7592 ndir 1976 nifree 14853 nffree 7
nbfree 35 ndir 663 nifree 20677 nffree 32
nbfree 5992 ndir 35 nifree 23096 nffree 3
nbfree 0 ndir 2965 nifree 10371 nffree 29
nbfree 0 ndir 2465 nifree 12592 nffree 83
nbfree 38 ndir 2463 nifree 12630 nffree 39
nbfree 115 ndir 2461 nifree 12736 nffree 44
nbfree 45 ndir 2462 nifree 12440 nffree 31
nbfree 16 ndir 2461 nifree 12778 nffree 36
nbfree 644 ndir 408 nifree 21729 nffree 56
nbfree 65 ndir 2966 nifree 10759 nffree 58
nbfree 2516 ndir 2462 nifree 12452 nffree 1
nbfree 2859 ndir 2964 nifree 10626 nffree 7
nbfree 723 ndir 2964 nifree 10517 nffree 18
nbfree 2678 ndir 2967 nifree 10184 nffree 24
nbfree 4279 ndir 2983 nifree 10730 nffree 0
nbfree 0 ndir 2982 nifree 10215 nffree 40
nbfree 0 ndir 549 nifree 20947 nffree 44
nbfree 0 ndir 0 nifree 23552 nffree 10
nbfree 0 ndir 724 nifree 20416 nffree 16
nbfree 38 ndir 0 nifree 23552 nffree 67
nbfree 0 ndir 1200 nifree 17872 nffree 12
nbfree 0 ndir 2963 nifree 10769 nffree 7
nbfree 0 ndir 2963 nifree 10506 nffree 17
nbfree 0 ndir 0 nifree 23552 nffree 17
nbfree 0 ndir 2963 nifree 10765 nffree 4
nbfree 2 ndir 2963 nifree 10240 nffree 18
nbfree 4266 ndir 2983 nifree 10137 nffree 1
nbfree 9442 ndir 2982 nifree 10321 nffree 0
nbfree 9415 ndir 2963 nifree 10476 nffree 4
nbfree 10594 ndir 1194 nifree 18382 nffree 4
nbfree 2 ndir 0 nifree 23552 nffree 39
nbfree 8212 ndir 3050 nifree 10268 nffree 1
nbfree 10508 ndir 1288 nifree 17943 nffree 6
nbfree 1 ndir 0 nifree 23552 nffree 4
nbfree 11381 ndir 0 nifree 23552 nffree 0
nbfree 11391 ndir 0 nifree 23552 nffree 0
nbfree 0 ndir 2 nifree 23321 nffree 51
nbfree 0 ndir 0 nifree 23552 nffree 18
nbfree 7902 ndir 40 nifree 22960 nffree 3
nbfree 91 ndir 0 nifree 23552 nffree 46
nbfree 7862 ndir 0 nifree 23552 nffree 0
nbfree 8433 ndir 0 nifree 23552 nffree 0
nbfree 9341 ndir 0 nifree 23552 nffree 0
nbfree 5 ndir 0 nifree 23552 nffree 17
nbfree 8880 ndir 0 nifree 23552 nffree 0
nbfree 11 ndir 1958 nifree 14708 nffree 58
nbfree 12 ndir 1962 nifree 15043 nffree 54
nbfree 2151 ndir 1957 nifree 14900 nffree 20
nbfree 40 ndir 1958 nifree 15136 nffree 29
nbfree 5764 ndir 1957 nifree 14470 nffree 31
nbfree 6517 ndir 1959 nifree 15192 nffree 1
nbfree 8163 ndir 1976 nifree 14941 nffree 6
nbfree 4107 ndir 1956 nifree 15229 nffree 8
nbfree 3 ndir 1975 nifree 14289 nffree 37
nbfree 0 ndir 1974 nifree 15026 nffree 18
nbfree 6475 ndir 1976 nifree 14747 nffree 7
nbfree 0 ndir 1974 nifree 14882 nffree 43
nbfree 5200 ndir 1975 nifree 14912 nffree 1

100% full:
dumpfs /dev/da0s2a | grep nbfree
nbfree 51875 ndir 120875 nifree 877882 nffree 1443
cs[].cs_(nbfree,ndir,nifree,nffree):
nbfree 3167 ndir 2963 nifree 10330 nffree 6
nbfree 3583 ndir 2982 nifree 10562 nffree 4
nbfree 52 ndir 663 nifree 20677 nffree 39
nbfree 4265 ndir 2982 nifree 10131 nffree 0
nbfree 4185 ndir 2982 nifree 10340 nffree 7
nbfree 9 ndir 2465 nifree 12592 nffree 60
nbfree 2 ndir 2463 nifree 12630 nffree 34
nbfree 1642 ndir 2461 nifree 12736 nffree 19
nbfree 38 ndir 2462 nifree 12440 nffree 31
nbfree 3008 ndir 2461 nifree 12778 nffree 36
nbfree 0 ndir 633 nifree 20564 nffree 42
nbfree 0 ndir 2963 nifree 10778 nffree 22
nbfree 0 ndir 2460 nifree 12459 nffree 12
nbfree 0 ndir 2963 nifree 10667 nffree 7
nbfree 0 ndir 2963 nifree 10491 nffree 3
nbfree 51 ndir 2963 nifree 10626 nffree 35
nbfree 0 ndir 2963 nifree 10547 nffree 18
nbfree 2 ndir 2963 nifree 10673 nffree 38
nbfree 0 ndir 549 nifree 20947 nffree 40
nbfree 0 ndir 0 nifree 23552 nffree 11
nbfree 3 ndir 0 nifree 23552 nffree 0
nbfree 87 ndir 0 nifree 23552 nffree 51
nbfree 0 ndir 1319 nifree 17311 nffree 5
nbfree 30 ndir 2963 nifree 10498 nffree 17
nbfree 4586 ndir 2983 nifree 10062 nffree 2
nbfree 0 ndir 0 nifree 23552 nffree 19
nbfree 9401 ndir 388 nifree 21774 nffree 5
nbfree 2 ndir 3473 nifree 8167 nffree 113
nbfree 103 ndir 3470 nifree 8345 nffree 28
nbfree 395 ndir 3471 nifree 7913 nffree 64
nbfree 1 ndir 3467 nifree 8476 nffree 5
nbfree 1690 ndir 3486 nifree 8049 nffree 7
nbfree 5065 ndir 3486 nifree 8302 nffree 2
nbfree 5762 ndir 3485 nifree 8214 nffree 4
nbfree 5 ndir 3472 nifree 8363 nffree 9
nbfree 0 ndir 2356 nifree 13130 nffree 33
nbfree 0 ndir 0 nifree 23552 nffree 6
nbfree 0 ndir 0 nifree 23552 nffree 11
nbfree 0 ndir 2 nifree 23321 nffree 51
nbfree 0 ndir 0 nifree 23552 nffree 18
nbfree 0 ndir 40 nifree 22960 nffree 6
nbfree 6 ndir 0 nifree 23552 nffree 48
nbfree 0 ndir 0 nifree 23552 nffree 51
nbfree 506 ndir 0 nifree 23552 nffree 22
nbfree 0 ndir 2965 nifree 10371 nffree 52
nbfree 0 ndir 0 nifree 23552 nffree 17
nbfree 139 ndir 2969 nifree 10603 nffree 63
nbfree 0 ndir 1958 nifree 14708 nffree 43
nbfree 37 ndir 1962 nifree 15043 nffree 57
nbfree 237 ndir 1957 nifree 14900 nffree 17
nbfree 0 ndir 1958 nifree 15136 nffree 21
nbfree 0 ndir 2964 nifree 10118 nffree 12
nbfree 805 ndir 3005 nifree 10331 nffree 6
nbfree 561 ndir 2964 nifree 10525 nffree 10
nbfree 5 ndir 2199 nifree 14133 nffree 19
nbfree 0 ndir 1975 nifree 14289 nffree 25
nbfree 2 ndir 1974 nifree 15026 nffree 11
nbfree 2437 ndir 2923 nifree 10441 nffree 5
nbfree 4 ndir 1974 nifree 14882 nffree 36
nbfree 2 ndir 2963 nifree 10451 nffree 8
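
Listings like the two above can be summarized mechanically. The sketch below is a hypothetical helper, not part of dumpfs: it parses the per-group lines, computes the average free block count and the 75% threshold, and counts how many groups are completely full or would pass the block criterion alone (the inode and directory-count criteria are ignored). The 100% listing has 60 per-group lines, and 51875 / 60 rounds down to the avgbfree = 864 quoted earlier.

```python
import re

# Matches lines such as "nbfree 3167 ndir 2963 nifree 10330 nffree 6".
# The "cs[].cs_(nbfree,...)" header does not match. The filesystem-wide
# totals line does match, so the caller must drop it before summarizing.
CG_LINE = re.compile(
    r"nbfree\s+(\d+)\s+ndir\s+(\d+)\s+nifree\s+(\d+)\s+nffree\s+(\d+)")


def parse_groups(text):
    """Return one (nbfree, ndir, nifree, nffree) tuple per matching line."""
    return [tuple(int(x) for x in m.groups()) for m in CG_LINE.finditer(text)]


def summarize(groups):
    """Average free blocks, the 75% threshold, and simple group counts."""
    ncg = len(groups)
    avgbfree = sum(g[0] for g in groups) // ncg
    minbfree = max(avgbfree - avgbfree // 4, 1)
    return {
        "ncg": ncg,
        "avgbfree": avgbfree,
        "minbfree": minbfree,
        "full": sum(1 for g in groups if g[0] == 0),
        "eligible": sum(1 for g in groups if g[0] >= minbfree),
    }


# First five per-group lines of the 100% listing, used as a small sample.
sample = """\
nbfree 3167 ndir 2963 nifree 10330 nffree 6
nbfree 3583 ndir 2982 nifree 10562 nffree 4
nbfree 52 ndir 663 nifree 20677 nffree 39
nbfree 4265 ndir 2982 nifree 10131 nffree 0
nbfree 4185 ndir 2982 nifree 10340 nffree 7
"""

print(summarize(parse_groups(sample)))
```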

I think it would work better if dirpref were converted to a two pass
algorithm. The first pass would only consider those cylinder groups
that have more free space than the minfree reserve. If this first pass
failed, the second pass would look at all cylinder groups.
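
A minimal sketch of the two pass idea, under stated assumptions: the function name, the dict layout, and the per-group reserve are hypothetical, and the reserve is taken as 8% of the 11761 blocks/group capacity mentioned earlier (the 8% figure is an assumption about the minfree setting, not something stated in this thread). The existing minbfree/minifree test is kept in both passes.

```python
def dirpref_two_pass(groups, start, minbfree, minifree, reserve):
    """Return the index of the chosen cylinder group, or None.

    groups   -- list of dicts with 'nbfree' and 'nifree' counts
    start    -- index at which the scan begins (wraps around)
    reserve  -- per-group free block count standing in for minfree
    """
    ncg = len(groups)
    order = [(start + i) % ncg for i in range(ncg)]

    def meets_thresholds(g):
        return g["nbfree"] >= minbfree and g["nifree"] >= minifree

    # Pass 1: only groups that still have more than the reserve free.
    for cg in order:
        if groups[cg]["nbfree"] > reserve and meets_thresholds(groups[cg]):
            return cg

    # Pass 2: fall back to considering every group.
    for cg in order:
        if meets_thresholds(groups[cg]):
            return cg

    return None


reserve = 11761 * 8 // 100   # ASSUMPTION: 8% of the per-group capacity

# Two groups taken from the 100% listing above.
groups = [
    {"nbfree": 805, "nifree": 10331},
    {"nbfree": 9401, "nifree": 21774},
]
print(dirpref_two_pass(groups, 0, minbfree=648, minifree=1, reserve=reserve))  # 1
```

A single pass starting at group 0 would accept the group with 805 free blocks, since 805 >= 648. The first pass here skips it because it is below the reserve and moves on to the nearly empty group.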

Another change that I suspect would help: rather than comparing
cylinder groups to minbfree and minifree, calculate how many directories
containing avgfilesperdir files of size avgfilesize each group could
hold, and then derive the average and minimum threshold values from that.
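
One way to read this suggestion, as a rough sketch only: express each group's free blocks and free inodes as a single "directories that still fit" number, limited by whichever resource runs out first, and apply the average and 75% threshold to that number. The function name is hypothetical, fragments are ignored, avgfilesize = 18K comes from the figures above, and avgfilesperdir = 5 and the 16K block size are assumptions.

```python
def dirs_that_fit(nbfree, nifree, bsize, avgfilesize, avgfilesperdir):
    """How many average-sized directories a cylinder group could still hold."""
    # Blocks for one directory's files, rounded up to whole blocks.
    blocks_per_dir = max((avgfilesize * avgfilesperdir + bsize - 1) // bsize, 1)
    # One inode per file plus one for the directory itself.
    inodes_per_dir = avgfilesperdir + 1
    return min(nbfree // blocks_per_dir, nifree // inodes_per_dir)


bsize = 16384             # ASSUMPTION: 16K blocks
avgfilesize = 18 * 1024   # 18K, from the df/dumpfs calculation above
avgfilesperdir = 5        # ASSUMPTION, for illustration only

# (nbfree, nifree) for three groups taken from the 100% listing.
groups = [(0, 23552), (9401, 21774), (805, 10331)]

fits = [dirs_that_fit(b, i, bsize, avgfilesize, avgfilesperdir) for b, i in groups]
avgfit = sum(fits) // len(fits)
minfit = max(avgfit - avgfit // 4, 1)

print(fits)            # [0, 1566, 134]
print(avgfit, minfit)  # 566 425
```

With this metric the group that has no free blocks but 23552 free inodes scores zero, so a large free inode count no longer makes a full group look attractive.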

It would be an interesting project to write a filesystem allocation
simulator to test different allocation algorithms without having to bang
on physical disks.
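
As a starting point, such a simulator could be as small as the toy below. Everything in it is hypothetical: it tracks only free block counts per cylinder group, ignores inodes, fragments and deletions, and draws directory sizes from an arbitrary uniform distribution. It is meant to show the shape of the experiment (swap in a placement policy, compare the resulting free space distribution), not to model FFS.

```python
import random


def simulate(policy, ncg=8, blocks_per_cg=1000, ndirs=200, seed=0):
    """Fill a toy filesystem and return the free block count of each group.

    Each new directory is placed by policy(free, last_cg). Its files are
    allocated from that group, spilling into following groups when full.
    """
    rng = random.Random(seed)
    free = [blocks_per_cg] * ncg
    last = 0
    for _ in range(ndirs):
        cg = policy(free, last)
        if cg is None:
            break                      # no group has any free blocks left
        last = cg
        need = rng.randint(1, 60)      # blocks used by this directory's files
        for i in range(ncg):
            g = (cg + i) % ncg
            take = min(need, free[g])
            free[g] -= take
            need -= take
            if need == 0:
                break
    return free


def threshold_policy(free, last):
    """Single pass: first group with at least 75% of the average free."""
    avg = sum(free) // len(free)
    minb = max(avg - avg // 4, 1)
    for i in range(len(free)):
        g = (last + i) % len(free)
        if free[g] >= minb:
            return g
    return None


def emptiest_policy(free, last):
    """Always pick the group with the most free blocks."""
    g = max(range(len(free)), key=free.__getitem__)
    return g if free[g] > 0 else None


for policy in (threshold_policy, emptiest_policy):
    print(policy.__name__, simulate(policy))
```

Because the random sequence depends only on the seed, both policies see the same stream of directory sizes, which makes the resulting per-group distributions directly comparable.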