[patch] zfs kmem fragmentation
Ben Kelly
ben at wanderview.com
Sat May 2 04:49:49 UTC 2009
Hello all,
Lately I've been looking into the "kmem too small" panics that often
occur with zfs if you don't restrict the arc. What I found in my test
environment was that everything works well until the kmem usage hits
the 75% limit set in arc.c. At this point the arc is shrunk and slabs
are reclaimed from uma. Unfortunately, every time this reclamation
process runs the kmem space becomes more fragmented. The vast
majority of the time my machine hits the "kmem too small" panic it has
over 200MB of kmem space available, but the largest fragment is less
than 128KB.
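
To illustrate the difference between free space and usable space, here is a minimal sketch in C. It is not code from the patch or from the kernel; the free_extent structure and the frag_* names are hypothetical and only model a list of free kmem ranges.

```c
#include <stddef.h>

/* One free range in a hypothetical kmem address-space map. */
struct free_extent {
    size_t start;   /* offset of the free range */
    size_t len;     /* length of the free range in bytes */
};

/* Summary of a free list: total free bytes and the largest single range. */
struct frag_stats {
    size_t total_free;
    size_t largest_free;
};

/* Walk the free list once and record both figures. */
static struct frag_stats
frag_summarize(const struct free_extent *ext, size_t n)
{
    struct frag_stats s = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        s.total_free += ext[i].len;
        if (ext[i].len > s.largest_free)
            s.largest_free = ext[i].len;
    }
    return s;
}

/*
 * A contiguous request can only succeed if it fits in the largest
 * free range, no matter how large total_free is.
 */
static int
frag_can_alloc(const struct frag_stats *s, size_t request)
{
    return request <= s->largest_free;
}
```

With four separate 64KB free ranges, total_free is 256KB, yet frag_can_alloc() rejects a 128KB request. The panic described above is the same situation at a larger scale.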
Ideally things would be arranged to free memory without
fragmentation. I have tried a few things along those lines, but none
of them have been successful so far. I'm going to continue that work,
but in the meantime I've put together a patch that tries to avoid
fragmentation by slowing kmem growth before the aggressive reclamation
process is required:
http://www.wanderview.com/svn/public/misc/zfs/zfs_kmem_limit.diff
It uses the following heuristics to do this:
- Start arc_c at arc_c_min instead of arc_c_max. This causes the
system to warm up more slowly.
- Halve the rate at which arc_c grows when kmem usage exceeds
kmem_slow_growth_thresh
- Stop arc_c growth when kmem usage exceeds kmem_target
- Evict arc data when kmem usage exceeds kmem_target
- If kmem usage exceeds kmem_target then ask the pagedaemon to
reclaim pages
- If the largest kmem fragment is less than kmem_fragment_target
then ask the pagedaemon to reclaim pages
- If the largest kmem fragment is less than kmem_fragment_thresh
then force the aggressive kmem/arc reclamation process
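
The runtime heuristics above can be sketched as a single decision function. This is a rough model under stated assumptions, not the patch's implementation: struct kmem_limits, struct kmem_actions and kmem_decide() are hypothetical names, and the real code in arc.c is spread over several paths. The first heuristic (starting arc_c at arc_c_min) is an initialization choice and is not modelled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the limits named in the list above. */
struct kmem_limits {
    size_t slow_growth_thresh;  /* halve arc_c growth above this usage */
    size_t target;              /* stop growth and evict above this usage */
    size_t fragment_target;     /* wake pagedaemon below this fragment size */
    size_t fragment_thresh;     /* force aggressive reclaim below this size */
};

/* Hypothetical result: what the caller should do this round. */
struct kmem_actions {
    size_t growth;          /* bytes arc_c is allowed to grow */
    bool evict;             /* evict arc data */
    bool wake_pagedaemon;   /* ask the pagedaemon to reclaim pages */
    bool force_reclaim;     /* run the aggressive kmem/arc reclamation */
};

static struct kmem_actions
kmem_decide(const struct kmem_limits *lim, size_t used,
    size_t largest_frag, size_t requested_growth)
{
    struct kmem_actions a = { requested_growth, false, false, false };

    /* Slow down before the hard limit is reached. */
    if (used > lim->slow_growth_thresh)
        a.growth /= 2;

    /* Above the target: no growth, evict, and ask for pages back. */
    if (used > lim->target) {
        a.growth = 0;
        a.evict = true;
        a.wake_pagedaemon = true;
    }

    /* Fragmentation checks are independent of total usage. */
    if (largest_frag < lim->fragment_target)
        a.wake_pagedaemon = true;
    if (largest_frag < lim->fragment_thresh)
        a.force_reclaim = true;

    return a;
}
```

Note that the two fragmentation checks can fire even when usage is low, which matches the observation that the panic occurs with plenty of kmem still free.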
The defaults for the various targets and thresholds are:
kmem_reclaim_threshold = 7/8 kmem
kmem_target = 3/4 kmem
kmem_slow_growth_threshold = 5/8 kmem
kmem_fragment_target = 1/8 kmem
kmem_fragment_thresh = 1/16 kmem
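
As a worked example of these fractions, the following sketch computes the five defaults from a given kmem size. The struct and function names are hypothetical and the arithmetic is only an illustration of the table above, not the patch's code.

```c
#include <stddef.h>

/* Hypothetical container for the defaults listed above. */
struct kmem_defaults {
    size_t reclaim_threshold;       /* 7/8 of kmem */
    size_t target;                  /* 3/4 of kmem */
    size_t slow_growth_threshold;   /* 5/8 of kmem */
    size_t fragment_target;         /* 1/8 of kmem */
    size_t fragment_thresh;         /* 1/16 of kmem */
};

static struct kmem_defaults
kmem_defaults_for(size_t kmem_size)
{
    struct kmem_defaults d;

    /*
     * Divide before multiplying so the intermediate value cannot
     * overflow a 32-bit size_t (e.g. 512MB * 7 does not fit).
     */
    d.reclaim_threshold = kmem_size / 8 * 7;
    d.target = kmem_size / 4 * 3;
    d.slow_growth_threshold = kmem_size / 8 * 5;
    d.fragment_target = kmem_size / 8;
    d.fragment_thresh = kmem_size / 16;
    return d;
}
```

For a 512MB kmem this gives 448MB, 384MB, 320MB, 64MB and 32MB respectively.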
With this patch I've been able to run my load tests with the default
arc size with kmem values of 512MB to 700MB. I tried one loaded run
with a 300MB kmem, but it panicked due to legitimate, non-fragmented
kmem exhaustion.
Please note that you may still encounter some fragmentation. It's
possible for the system to get stuck in a degraded state where it's
constantly trying to free pages and memory in an attempt to fix the
fragmentation. If the system is in this state the
kstat.zfs.misc.arcstats.fragmented_kmem_count sysctl will be
increasing at a fairly rapid rate.
Anyway, I just thought I would put this out there in case anyone
wanted to try to test with it. I've mainly been loading it using
rsync between two pools on a non-SMP i386 machine with 2GB of memory.
Also, if anyone is interested in helping with the fragmentation
problem please let me know. At this point I think the best odds are
to modify UMA to allow some zones to use a custom slab size of 128KB
(max zfs buffer size) so that most of the allocations from kmem are
the same size. It also occurred to me that much of this mess would be
simpler if kmem information were passed up through the vnode so that
the top layer entities like pagedaemon could make better choices for
the overall memory usage of the system. Right now we have a subsystem
two or three layers down making decisions for everyone.
Anyway, suggestions and insights are more than welcome.
Thanks!
- Ben
More information about the freebsd-current
mailing list