4.8 ffs_dirpref problem
Ken Marx
kmarx at vicor.com
Thu Oct 30 11:12:39 PST 2003
Don Lewis wrote:
> On 29 Oct, Ken Marx wrote:
>
>>Don Lewis wrote:
>
>
>>>I think the real problem is the following code in ffs_dirpref():
>>>
>>> avgifree = fs->fs_cstotal.cs_nifree / fs->fs_ncg;
>>> avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg;
>>> avgndir = fs->fs_cstotal.cs_ndir / fs->fs_ncg;
>>>[snip]
>>> maxndir = min(avgndir + fs->fs_ipg / 16, fs->fs_ipg);
>>> minifree = avgifree - fs->fs_ipg / 4;
>>> if (minifree < 0)
>>> minifree = 0;
>>> minbfree = avgbfree - fs->fs_fpg / fs->fs_frag / 4;
>>> if (minbfree < 0)
>>> minbfree = 0;
>>>[snip]
>>> prefcg = ino_to_cg(fs, pip->i_number);
>>> for (cg = prefcg; cg < fs->fs_ncg; cg++)
>>> if (fs->fs_cs(fs, cg).cs_ndir < maxndir &&
>>> fs->fs_cs(fs, cg).cs_nifree >= minifree &&
>>> fs->fs_cs(fs, cg).cs_nbfree >= minbfree) {
>>> if (fs->fs_contigdirs[cg] < maxcontigdirs)
>>> return ((ino_t)(fs->fs_ipg * cg));
>>> }
>>> for (cg = 0; cg < prefcg; cg++)
>>> if (fs->fs_cs(fs, cg).cs_ndir < maxndir &&
>>> fs->fs_cs(fs, cg).cs_nifree >= minifree &&
>>> fs->fs_cs(fs, cg).cs_nbfree >= minbfree) {
>>> if (fs->fs_contigdirs[cg] < maxcontigdirs)
>>> return ((ino_t)(fs->fs_ipg * cg));
>>> }
>>>
>>>If the file system is more than 75% full, minbfree will be zero, which
>>>will allow new directories to be created in cylinder groups that have no
>>>free blocks for either the directory itself, or for any files created in
>>>that directory. If this happens, allocating the blocks for the
>>>directory and its files will require ffs_alloc() to do an expensive
>>>search across the cylinder groups for each block. It looks to me like
>>>minbfree needs to equal, or at least be a lot closer to, avgbfree.
>
>
> Actually, I think the expensive search will only happen for the first
> block in each file (and the other blocks will be allocated in the same
> cylinder group), but if you are creating tons of files that are only one
> block long ...
>
>
>>>A similar situation exists with minifree. Please note that the fallback
>>>algorithm uses the condition:
>>> fs->fs_cs(fs, cg).cs_nifree >= avgifree
>>>
>>>
>>>
>>
>>Interesting. We (Vicor) will defer to experts here, but are very willing to
>>test anything you come up with.
>
>
> You might try the lightly tested patch below. It tweaks the dirpref
> algorithm so that cylinder groups with free space >= 75% of the average
> free space and free inodes >= 75% of the average number of free inodes
> are candidates for allocating the directory. It will not choose a
> cylinder group that does not have at least one free block and one free
> inode.
>
> It also decreases maxcontigdirs as the free space decreases so that a
> cluster of directories is less likely to cause the cylinder group to
> overflow. I think it would be better to tune maxcontigdirs individually
> for each cylinder group, based on the free space in that cylinder group,
> but that is more complex ...
>
> Index: sys/ufs/ffs/ffs_alloc.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_alloc.c,v
> retrieving revision 1.64.2.2
> diff -u -r1.64.2.2 ffs_alloc.c
> --- sys/ufs/ffs/ffs_alloc.c 21 Sep 2001 19:15:21 -0000 1.64.2.2
> +++ sys/ufs/ffs/ffs_alloc.c 30 Oct 2003 06:01:38 -0000
> @@ -696,18 +696,18 @@
> * optimal allocation of a directory inode.
> */
> maxndir = min(avgndir + fs->fs_ipg / 16, fs->fs_ipg);
> - minifree = avgifree - fs->fs_ipg / 4;
> - if (minifree < 0)
> - minifree = 0;
> - minbfree = avgbfree - fs->fs_fpg / fs->fs_frag / 4;
> - if (minbfree < 0)
> - minbfree = 0;
> + minifree = avgifree - avgifree / 4;
> + if (minifree < 1)
> + minifree = 1;
> + minbfree = avgbfree - avgbfree / 4;
> + if (minbfree < 1)
> + minbfree = 1;
> cgsize = fs->fs_fsize * fs->fs_fpg;
> dirsize = fs->fs_avgfilesize * fs->fs_avgfpdir;
> curdirsize = avgndir ? (cgsize - avgbfree * fs->fs_bsize) / avgndir : 0;
> if (dirsize < curdirsize)
> dirsize = curdirsize;
> - maxcontigdirs = min(cgsize / dirsize, 255);
> + maxcontigdirs = min((avgbfree * fs->fs_bsize) / dirsize, 255);
> if (fs->fs_avgfpdir > 0)
> maxcontigdirs = min(maxcontigdirs,
> fs->fs_ipg / fs->fs_avgfpdir);
>
>
Thanks Don,
re:
...
> cylinder group), but if you are creating tons of files that are only one
> block long ...
Not terribly scientific, but when our test bogs down, it's often
in a directory with 6400 1-block files. So, your comment seems plausible.
Anyway - I just tested your patch. Again: unloaded system, repeatedly
untarring a 1.5 GB tar file, starting at 97% capacity, with:
tunefs: average file size: (-f) 49152
tunefs: average number of files in a directory: (-s) 1500
...
Takes about 74 system seconds per 1.5 GB untar:
-------------------------------------------
/dev/da0s1e 558889580 497843972 16334442 97% 6858407 63316311 10% /raid
119.23 real 1.28 user 73.09 sys
/dev/da0s1e 558889580 499371100 14807314 97% 6879445 63295273 10% /raid
111.69 real 1.32 user 73.65 sys
/dev/da0s1e 558889580 500898228 13280186 97% 6900483 63274235 10% /raid
116.67 real 1.44 user 74.19 sys
/dev/da0s1e 558889580 502425356 11753058 98% 6921521 63253197 10% /raid
114.73 real 1.25 user 75.01 sys
/dev/da0s1e 558889580 503952484 10225930 98% 6942559 63232159 10% /raid
116.95 real 1.30 user 74.10 sys
/dev/da0s1e 558889580 505479614 8698800 98% 6963597 63211121 10% /raid
115.29 real 1.39 user 74.25 sys
/dev/da0s1e 558889580 507006742 7171672 99% 6984635 63190083 10% /raid
114.01 real 1.16 user 74.04 sys
/dev/da0s1e 558889580 508533870 5644544 99% 7005673 63169045 10% /raid
119.95 real 1.32 user 75.05 sys
/dev/da0s1e 558889580 510060998 4117416 99% 7026711 63148007 10% /raid
114.89 real 1.33 user 74.66 sys
/dev/da0s1e 558889580 511588126 2590288 99% 7047749 63126969 10% /raid
114.91 real 1.58 user 74.64 sys
/dev/da0s1e 558889580 513115254 1063160 100% 7068787 63105931 10% /raid
tot: 1161.06 real 13.45 user 742.89 sys
That compares pretty favorably to our naive, retro-4.4 dirpref hack,
which averages in the mid-to-high 60s:
--------------------------------------------------------------------
/dev/da0s1e 558889580 497843952 16334462 97% 6858406 63316312 10% /raid
110.19 real 1.42 user 65.54 sys
/dev/da0s1e 558889580 499371080 14807334 97% 6879444 63295274 10% /raid
105.47 real 1.47 user 65.09 sys
/dev/da0s1e 558889580 500898208 13280206 97% 6900482 63274236 10% /raid
110.17 real 1.48 user 64.98 sys
/dev/da0s1e 558889580 502425336 11753078 98% 6921520 63253198 10% /raid
131.88 real 1.49 user 71.20 sys
/dev/da0s1e 558889580 503952464 10225950 98% 6942558 63232160 10% /raid
111.61 real 1.62 user 67.47 sys
/dev/da0s1e 558889580 505479594 8698820 98% 6963596 63211122 10% /raid
131.36 real 1.67 user 90.79 sys
/dev/da0s1e 558889580 507006722 7171692 99% 6984634 63190084 10% /raid
115.34 real 1.49 user 65.61 sys
/dev/da0s1e 558889580 508533850 5644564 99% 7005672 63169046 10% /raid
110.26 real 1.39 user 65.26 sys
/dev/da0s1e 558889580 510060978 4117436 99% 7026710 63148008 10% /raid
116.15 real 1.51 user 65.47 sys
/dev/da0s1e 558889580 511588106 2590308 99% 7047748 63126970 10% /raid
112.74 real 1.37 user 65.01 sys
/dev/da0s1e 558889580 513115234 1063180 100% 7068786 63105932 10% /raid
tot: 1158.36 real 15.01 user 686.57 sys
Without either, we'd expect timings of 5-20 minutes when things are
going poorly.
Happy to test further if you have tweaks to your patch, or if there are
particular things you'd like us to try - e.g., load, newfs settings, etc.
k.
--
Ken Marx, kmarx at vicor-nb.com
As a company we must not put the cart before the horse and set up weekly
meetings on the solution space.
- http://www.bigshed.com/cgi-bin/speak.cgi
More information about the freebsd-fs mailing list