Freeing vnodes.
Matthew Dillon
dillon at apollo.backplane.com
Tue Mar 15 11:11:36 PST 2005
:> I think you did not intend this. Didn't you just want to destroy
:> enough vnodes to have 'wantfreevnodes' worth of slop so getnewvnode()
:> could allocate new vnodes? In that case the calculation would be:
:
:On my system wantfreevnodes is at 2500. Let's say I have 4500 free
:vnodes. 4500 - 2500 = 2000. Divide by 2 gives you 1000. I don't think
:you read the whole patch.
I'm not trying to be confrontational here, Jeff. Please remember that
I'm the one who has done most of the algorithmic work on these
subsystems. I designed the whole 'trigger' mechanism, for example.
The wantfreevnodes calculation is: minvnodes / 10. That's a very small
number. The 'freevnodes' value is typically a much larger value,
especially if a program is running through stat()ing things. It is possible
to have tens of thousands of free vnodes. This makes your current
count calculation effectively 'freevnodes / 2'. I really don't think
you want to destroy half the current freevnodes on each pass, do you?
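To make the numbers concrete, here's a quick throwaway userland sketch of
that count calculation as I read the patch (the
count = (freevnodes - wantfreevnodes) / 2 form is my reading of it, and the
numbers are just the ones from this thread):

#include <stdio.h>

/*
 * Illustrative only: the count calculation as I read the patch,
 * count = (freevnodes - wantfreevnodes) / 2, where wantfreevnodes
 * works out to minvnodes / 10.  The numbers below are just the ones
 * from this thread plus a stat()-heavy example.
 */
static long
patch_count(long freevnodes, long wantfreevnodes)
{
    return (freevnodes - wantfreevnodes) / 2;
}

int
main(void)
{
    long wantfreevnodes = 2500;         /* Jeff's reported value */

    /* Jeff's example: 4500 free vnodes -> destroy 1000. */
    printf("freevnodes=4500  -> count=%ld\n",
        patch_count(4500, wantfreevnodes));

    /*
     * A box that has been stat()ing a large tree and has tens of
     * thousands of free vnodes: the subtraction barely matters and
     * you wind up destroying roughly half the free list per pass.
     */
    printf("freevnodes=40000 -> count=%ld (~freevnodes/2)\n",
        patch_count(40000, wantfreevnodes));
    return 0;
}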
:> can be a HUGE load on getnewvnode() (think of cvsupd and find, or
:> a cvs update, etc...). This load can easily outstrip vnlru_proc()'s
:> new ability to free vnodes and potentially cause a lot of unnecessarily
:> blockages.
:
:We have one buf daemon, one page daemon, one syncer, one vnlru proc, etc.
:In all these cases it would be nice if they gained new contexts when they
:had a lot of work to do, but they don't, and it doesn't seem to be a huge
:problem today. On my system one vnlruproc easily keeps up with the job of
That's because they are carefully written (mostly by me) to not be
subject to pure cpu loads.
buf_daemon: Is primarily only responsible for flushing DIRTY buffers.
The buffer allocation code will happily reuse clean buffers
in-line. Dirty buffers are subject to the I/O limitations
of the system (and they are flushed asynchronously for the
most part), which means that one daemon should have no
trouble handling the buffer load on an MP system. Since a
system naturally has many more clean buffers than dirty
buffers (even without algorithmic limitations), except in
certain particular large-write cases which are handled
elsewhere, the buf_daemon usually has very little effect
on the buffer cache's ability to allocate a new buffer.
page_daemon: Same deal. The page daemon is primarily responsible for
flushing out dirty pages and for rebalancing the lists
if they get really out of whack. Pages in the VM page
cache (PQ_CACHE) can be reused on the fly and there are
several *natural* ways for a page to go directly to the
VM page cache without having to pass through the page
daemon. In fact, MOST of the pages that get onto the
PQ_CACHE or PQ_FREE queues are placed there directly by
mechanisms unrelated to the page daemon.
syncer: I've always wanted to rewrite the syncer to be per-mount
or per-physical-device so it could sync out to multiple
physical devices simultaneously.
vnlru_proc: Prior to your patch, vnlru_proc was only responsible for
rebalancing the freevnode list. Typically the ONLY case
where a vnode needs to be forcefully put on the freevnode
list is if there are a lot of vnodes which have VM objects
which still have just one or two VM pages associated with
them, because otherwise a vnode either gets put on the
freevnode list directly by the vnode release code, or it
has enough associated pages for us to not want to recycle
it anyway (which is what the trigger code handles; see the
rough sketch after these descriptions). The
mechanism that leads to the creation of such vnodes also
typically requires a lot of random I/O, which makes
vnlru_proc() immune to cpu load. This means that
vnlru_proc is only PARTIALLY responsible for maintaining
the freevnode list, and the part it is responsible for
tends to be unrelated to pure cpu loads.
There are a ton of ways for a vnode to make it to that
list WITHOUT passing through vnlru_proc, which means that
prior to your patch getnewvnode() typically only has to
wait for vnlru_proc() in the most extreme situations.
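Here's the rough sketch of the trigger idea I referred to above. It is a
userland illustration only; the formula, names, and numbers are made up,
not the real kernel code:

#include <stdbool.h>
#include <stdio.h>

/*
 * Rough userland illustration of the trigger idea, NOT the real
 * kernel code.  The formula, names, and numbers are made up for
 * illustration; the point is only that a vnode whose VM object
 * still holds more than a handful of resident pages gets skipped
 * rather than forcefully recycled.
 */
static bool
worth_recycling(int resident_pages, long v_page_count, long desiredvnodes)
{
    /* Roughly: how many pages can we "afford" per vnode? */
    long trigger = v_page_count / desiredvnodes;

    if (trigger < 1)
        trigger = 1;
    return resident_pages <= trigger;
}

int
main(void)
{
    long v_page_count = 262144;     /* e.g. 1GB of 4K pages (example) */
    long desiredvnodes = 65536;     /* example */

    printf("2 resident pages:   %s\n",
        worth_recycling(2, v_page_count, desiredvnodes) ? "recycle" : "keep");
    printf("100 resident pages: %s\n",
        worth_recycling(100, v_page_count, desiredvnodes) ? "recycle" : "keep");
    return 0;
}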
By my read, the changes you are currently contemplating for vnlru_proc
change its characteristics such that it is now COMPLETELY responsible
for freeing up vnodes for getnewvnode(). This was not the case before.
I can only repeat that getnewvnode() has a massive dynamic loading range,
one that is not necessarily dependent on or limited by I/O. For example,
when you are stat()ing a lot of files over and over again there is a good
chance that the related inodes are cached in the VM object representing
the backing store for the filesystem. This means that getnewvnode() can
cycle very quickly, on the order of tens of thousands of vnodes per
second in certain situations. By my read, you are forcing *ALL* the
vnode recycling activity to run through vnlru_proc() now. The only way
now for getnewvnode() to get a new vnode is by allocating it out of
the zone. This was not the case before.
:freeing free vnodes. Remember these vnodes have no pages associated with
:them, so at most you're freeing an inode for a deleted file, and in the
:common case the whole operation runs on memory without blocking for io.
:...
:We presently single thread the most critical case, where we have no free
:vnodes and are not allowed to allocate any more while we wait for
:vnlru_proc() to do io on vnodes with cached pages to reclaim some. I'm
:not convinced this is a real problem.
Which means that in systems with a large amount of memory (large VM page
cache), doing certain operations (such as stat()ing a large number
of files, e.g. a find or cvsupd) where the file set is larger than
the number of vnodes available will now have to cycle all of those
vnodes through a single thread in order to reuse them.
The current pre-patch case is very different. With your patch,
in addition to the issues already mentioned, the inode synchronization
is now being single-threaded, and while the writes are asynchronous,
the reads are not (if the inode happens to not be in the VM page cache
any more because it's been cached so long the system has decided to
throw away the page to accommodate other cached data).
In the current pre-patch case, that read load was distributed over ALL
processes trying to do a getnewvnode(), i.e. it was a parallel read
load that actually scaled fairly well to load.
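Back-of-the-envelope, and purely as an illustration (the per-read cost and
the process count below are assumptions, not measurements):

#include <stdio.h>

/*
 * Back-of-the-envelope model, not a measurement.  Assume, purely for
 * illustration, that each inode read that misses the VM page cache
 * blocks for read_ms milliseconds.
 */
int
main(void)
{
    double read_ms = 5.0;   /* assumed synchronous inode read cost */
    int nprocs = 16;        /* processes hammering getnewvnode() */

    /*
     * Pre-patch: each process eats its own read, the reads overlap,
     * and the aggregate recycle rate scales with the number of
     * processes.
     */
    double parallel_rate = nprocs * (1000.0 / read_ms);

    /*
     * Post-patch: every read funnels through the one vnlru_proc
     * thread, so the aggregate rate is capped at one read at a time.
     */
    double serial_rate = 1000.0 / read_ms;

    printf("pre-patch  (parallel): ~%.0f recycles/sec\n", parallel_rate);
    printf("post-patch (serial):   ~%.0f recycles/sec\n", serial_rate);
    return 0;
}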
:> I love the idea of being able to free vnodes in vnlru_proc() rather
:> then free-and-reuse them in allocvnode(), but I cannot figure out how
:> vnlru_proc() could possibly adapt to the huge load range that
:> getnewvnode() has to deal with. Plus keep in mind that the vnodes
:> being reused at that point are basically already dead except for
:> the vgonel().
:>
:> This brings up the true crux of the problem, where the true overhead
:> of reusing a vnode inline with the getnewvnode() call is... and that
:> is that vgonel() potentially has to update the related inode and could
:> cause an unrelated process to block inside getnewvnode(). But even
:
:Yes, this is kind of gross, and would cause lock order problems except
:that we LK_NOWAIT on the vn lock in vtryrecycle(). It'd be better if we
:didn't try doing io on unrelated vnodes while this deep in the stack.
I agree. It is gross, though I will note that the fact that the vnode
is ON the free list tends to mean that it isn't being referenced by
anyone so there should not be any significant lock ordering issues.
I haven't 'fixed' this in DragonFly because I haven't been able to
figure out how to distribute the recycling load and deal with the
huge dynamic loading range that getnewvnode() has.
I've been working on the buffer cache code since, what, 1998? These
are real issues. It's always very easy to design algorithms that
work for specific machine configurations, the trick is to make them
work across the board.
One thing I LIKE about your code is the concept of being able to reuse
a vnode (or in your case allocate a new vnode) without having to perform
any I/O. The re-use case in the old code always has the potential to
block an unrelated process if it has to do I/O recycling the vnode it
wants to reuse. But this is a very easy effect to accomplish simply by
leaving the recycling code in getnewvnode() intact but STILL adding new
code to vnlru_proc() to ensure that a minimum number of vnodes are
truly reusable without having to perform any I/O. This would enhance
light-load (light getnewvnode() load that is) performance. It would
have virtually no effect under heavier loads, which is why the vnode
re-use code in getnewvnode() would have to stay, but the light-load
benefit is undeniable.
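A minimal userland sketch of the shape of that idea, with a hypothetical
'dirty' flag standing in for "needs I/O to recycle"; none of the names
below exist in the tree:

#include <stdbool.h>
#include <stdio.h>

/*
 * Userland model of the idea, not a patch.  The array, the dirty
 * flag and MIN_CLEAN are hypothetical stand-ins; "dirty" here means
 * "would need I/O (inode writeback, page flush) to be recycled".
 */
#define NVNODES     8
#define MIN_CLEAN   3   /* low-water mark of I/O-free vnodes to keep */

struct model_vnode {
    bool on_freelist;
    bool dirty;
};

static struct model_vnode vnodes[NVNODES];
static int inline_cleans;   /* times getnewvnode() had to do "I/O" itself */

static void
clean_vnode(struct model_vnode *vp)
{
    vp->dirty = false;      /* stands in for the inode writeback etc. */
}

/* What I'm suggesting vnlru_proc() additionally do on each pass. */
static void
vnlru_side_work(void)
{
    int clean = 0, i;

    for (i = 0; i < NVNODES; i++)
        if (vnodes[i].on_freelist && !vnodes[i].dirty)
            clean++;
    for (i = 0; i < NVNODES && clean < MIN_CLEAN; i++)
        if (vnodes[i].on_freelist && vnodes[i].dirty) {
            clean_vnode(&vnodes[i]);
            clean++;
        }
}

/*
 * The existing inline recycle stays: prefer an already-clean vnode,
 * but fall back to cleaning one ourselves under heavy load.
 */
static struct model_vnode *
getnewvnode_model(void)
{
    int i;

    for (i = 0; i < NVNODES; i++)
        if (vnodes[i].on_freelist && !vnodes[i].dirty) {
            vnodes[i].on_freelist = false;
            return &vnodes[i];
        }
    for (i = 0; i < NVNODES; i++)
        if (vnodes[i].on_freelist) {
            clean_vnode(&vnodes[i]);    /* inline "I/O" */
            inline_cleans++;
            vnodes[i].on_freelist = false;
            return &vnodes[i];
        }
    return NULL;
}

int
main(void)
{
    int i;

    for (i = 0; i < NVNODES; i++)
        vnodes[i] = (struct model_vnode){ true, true };
    vnlru_side_work();              /* pre-clean a few vnodes */
    (void)getnewvnode_model();      /* light load: no inline I/O needed */
    printf("inline cleans under light load: %d\n", inline_cleans);
    return 0;
}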
-Matt
Matthew Dillon
<dillon at backplane.com>