Freeing vnodes.
Matthew Dillon
dillon at apollo.backplane.com
Tue Mar 15 11:11:36 PST 2005
:> I think you did not intend this. Didn't you just want to destroy
:> enough vnodes to have 'wantfreevnodes' worth of slop so getnewvnode()
:> could allocate new vnodes? In that case the calculation would be:
:
:On my system wantfreevnodes is at 2500. Let's say I have 4500 free
:vnodes. 4500 - 2500 = 2000. Divide by 2 gives you 1000. I don't think
:you read the whole patch.
I'm not trying to be confrontational here, Jeff. Please remember that
I'm the one who has done most of the algorithmic work on these
subsystems. I designed the whole 'trigger' mechanism, for example.
The wantfreevnodes calculation is: minvnodes / 10. That's a very small
number. The 'freevnodes' value is typically a much larger value,
especially if a program is running through stat()ing things. It is possible
to have tens of thousands of free vnodes. This makes your current
count calculation effectively 'freevnodes / 2'. I really don't think
you want to destroy half the current freevnodes on each pass, do you?
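To make the numbers concrete, here's a quick throwaway userland sketch of
that count calculation as I read the patch (the
count = (freevnodes - wantfreevnodes) / 2 form is my reading of it, and the
numbers are just the ones from this thread):

#include <stdio.h>

/*
 * Illustrative only: the count calculation as I read the patch,
 * count = (freevnodes - wantfreevnodes) / 2, where wantfreevnodes
 * works out to minvnodes / 10.  The numbers below are just the ones
 * from this thread plus a stat()-heavy example.
 */
static long
patch_count(long freevnodes, long wantfreevnodes)
{
    return (freevnodes - wantfreevnodes) / 2;
}

int
main(void)
{
    long wantfreevnodes = 2500;         /* Jeff's reported value */

    /* Jeff's example: 4500 free vnodes -> destroy 1000. */
    printf("freevnodes=4500  -> count=%ld\n",
        patch_count(4500, wantfreevnodes));

    /*
     * A box that has been stat()ing a large tree and has tens of
     * thousands of free vnodes: the subtraction barely matters and
     * you wind up destroying roughly half the free list per pass.
     */
    printf("freevnodes=40000 -> count=%ld (~freevnodes/2)\n",
        patch_count(40000, wantfreevnodes));
    return 0;
}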
:> can be a HUGE load on getnewvnode() (think of cvsupd and find, or
:> a cvs update, etc...). This load can easily outstrip vnlru_proc()'s
:> new ability to free vnodes and potentially cause a lot of unnecessarily
:> blockages.
:
:We have one buf daemon, one page daemon, one syncer, one vnlru proc, etc.
:In all these cases it would be nice if they gained new contexts when they
:had a lot of work to do, but they don't, and it doesn't seem to be a huge
:problem today. On my system one vnlruproc easily keeps up with the job of
That's because they are carefully written (mostly by me) to not be
subject to pure cpu loads.
buf_daemon: Is primarily only responsible for flushing DIRTY buffers.
The buffer allocation code will happily reuse clean buffers
in-line. Dirty buffers are subject to the I/O limitations
of the system (and they are flushed asynchronously for the
most part), which means that one daemon should have no
trouble handling the buffer load on an MP system. Since a
system naturally has many more clean buffers than dirty
buffers (even without algorithmic limitations), except in
certain particular large-write cases which are handled
elsewhere, the buf_daemon usually has very little effect
on the buffer cache's ability to allocate a new buffer.
page_daemon: Same deal. The page daemon is primarily responsible for
flushing out dirty pages and for rebalancing the lists
if they get really out of whack. Pages in the VM page
cache (PQ_CACHE) can be reused on the fly and there are
several *natural* ways for a page to go directly to the
VM page cache without having to pass through the page
daemon. In fact, MOST of the pages that get onto the
PQ_CACHE or PQ_FREE queues are placed there directly by
mechanisms unrelated to the page daemon.
syncer: I've always wanted to rewrite the syncer to be per-mount
or per-physical-device so it could sync out to multiple
physical devices simultaneously.
vnlru_proc: Prior to your patch, vnlru_proc was only responsible for
rebalancing the freevnode list. Typically the ONLY case
where a vnode needs to be forcefully put on the freevnode
list is if there are a lot of vnodes which have VM objects
which still have just one or two VM pages associated with
them, because otherwise a vnode either gets put on the
freevnode list directly by the vnode release code, or it
has enough associated pages for us to not want to recycle
it anyway (which is what the trigger code handles; see the
rough sketch after these descriptions). The
mechanism that leads to the creation of such vnodes also
typically requires a lot of random I/O, which makes
vnlru_proc() immune to cpu load. This means that
vnlru_proc is only PARTIALLY responsible for maintaining
the freevnode list, and the part it is responsible for
tends to be unrelated to pure cpu loads.
There are a ton of ways for a vnode to make it to that
list WITHOUT passing through vnlru_proc, which means that
prior to your patch getnewvnode() typically only has to
wait for vnlru_proc() in the most extreme situations.
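Here's the rough sketch of the trigger idea I referred to above. It is a
userland illustration only; the formula, names, and numbers are made up,
not the real kernel code:

#include <stdbool.h>
#include <stdio.h>

/*
 * Rough userland illustration of the trigger idea, NOT the real
 * kernel code.  The formula, names, and numbers are made up for
 * illustration; the point is only that a vnode whose VM object
 * still holds more than a handful of resident pages gets skipped
 * rather than forcefully recycled.
 */
static bool
worth_recycling(int resident_pages, long v_page_count, long desiredvnodes)
{
    /* Roughly: how many pages can we "afford" per vnode? */
    long trigger = v_page_count / desiredvnodes;

    if (trigger < 1)
        trigger = 1;
    return resident_pages <= trigger;
}

int
main(void)
{
    long v_page_count = 262144;     /* e.g. 1GB of 4K pages (example) */
    long desiredvnodes = 65536;     /* example */

    printf("2 resident pages:   %s\n",
        worth_recycling(2, v_page_count, desiredvnodes) ? "recycle" : "keep");
    printf("100 resident pages: %s\n",
        worth_recycling(100, v_page_count, desiredvnodes) ? "recycle" : "keep");
    return 0;
}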
By my read, the changes you are currently contemplating for vnlru_proc
change its characteristics such that it is now COMPLETELY responsible
for freeing up vnodes for getnewvnode(). This was not the case before.
I can only repeat that getnewvnode() has a massive dynamic loading range,
one that is not necessarily dependent on or limited by I/O. For example,
when you are stat()ing a lot of files over and over again there is a good
chance that the related inodes are cached in the VM object representing
the backing store for the filesystem. This means that getnewvnode() can
cycle very quickly, on the order of tens of thousands of vnodes per
second in certain situations. By my read, you are forcing *ALL* the
vnode recycling activity to run through vnlru_proc() now. The only way
now for getnewvnode() to get a new vnode is by allocating it out of
the zone. This was not the case before.
:freeing free vnodes. Remember these vnodes have no pages associated with
:them, so at most you're freeing an inode for a deleted file, and in the
:common case the whole operation runs on memory without blocking for io.
:...
:We presently single thread the most critical case, where we have no free
:vnodes and are not allowed to allocate any more while we wait for
:vnlru_proc() to do io on vnodes with cached pages to reclaim some. I'm
:not convinced this is a real problem.
Which means that in systems with a large amount of memory (large VM page
cache), doing certain operations (such as stat()ing a large number
of files, e.g. a find or cvsupd) where the file set is larger than
the number of vnodes available will now have to cycle all of those
vnodes through a single thread in order to reuse them.
The current pre-patch case is very different. With your patch,
in addition to the issues already mentioned, the inode synchronization
is now being single-threaded, and while the writes are asynchronous,
the reads are not (if the inode happens to not be in the VM page cache
any more because it's been cached so long the system has decided to
throw away the page to accommodate other cached data).
In the current pre-patch case, that read load was distributed over ALL
processes trying to do a getnewvnode(), i.e. it was a parallel read
load that actually scaled fairly well to load.
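Back-of-the-envelope, and purely as an illustration (the per-read cost and
the process count below are assumptions, not measurements):

#include <stdio.h>

/*
 * Back-of-the-envelope model, not a measurement.  Assume, purely for
 * illustration, that each inode read that misses the VM page cache
 * blocks for read_ms milliseconds.
 */
int
main(void)
{
    double read_ms = 5.0;   /* assumed synchronous inode read cost */
    int nprocs = 16;        /* processes hammering getnewvnode() */

    /*
     * Pre-patch: each process eats its own read, the reads overlap,
     * and the aggregate recycle rate scales with the number of
     * processes.
     */
    double parallel_rate = nprocs * (1000.0 / read_ms);

    /*
     * Post-patch: every read funnels through the one vnlru_proc
     * thread, so the aggregate rate is capped at one read at a time.
     */
    double serial_rate = 1000.0 / read_ms;

    printf("pre-patch  (parallel): ~%.0f recycles/sec\n", parallel_rate);
    printf("post-patch (serial):   ~%.0f recycles/sec\n", serial_rate);
    return 0;
}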
:> I love the idea of being able to free vnodes in vnlru_proc() rather
:> then free-and-reuse them in allocvnode(), but I cannot figure out how
:> vnlru_proc() could possibly adapt to the huge load range that
:> getnewvnode() has to deal with. Plus keep in mind that the vnodes
:> being reused at that point are basically already dead except for
:> the vgonel().
:>
:> This brings up the true crux of the problem, where the true overhead
:> of reusing a vnode inline with the getnewvnode() call is... and that
:> is that vgonel() potentially has to update the related inode and could
:> cause an unrelated process to block inside getnewvnode(). But even
:
:Yes, this is kind of gross, and would cause lock order problems except
:that we LK_NOWAIT on the vn lock in vtryrecycle(). It'd be better if we
:didn't try doing io on unrelated vnodes while this deep in the stack.
I agree. It is gross, though I will note that the fact that the vnode
is ON the free list tends to mean that it isn't being referenced by
anyone so there should not be any significant lock ordering issues.
I haven't 'fixed' this in DragonFly because I haven't been able to
figure out how to distribute the recycling load and deal with the
huge dynamic loading range that getnewvnode() has.
I've been working on the buffer cache code since, what, 1998? These
are real issues. It's always very easy to design algorithms that
work for specific machine configurations, the trick is to make them
work across the board.
One thing I LIKE about your code is the concept of being able to reuse
a vnode (or in your case allocate a new vnode) without having to perform
any I/O. The re-use case in the old code always has the potential to
block an unrelated process if it has to do I/O recycling the vnode it
wants to reuse. But this is a very easy effect to accomplish simply by
leaving the recycling code in getnewvnode() intact but STILL adding new
code to vnlru_proc() to ensure that a minimum number of vnodes are
truly reusable without having to perform any I/O. This would enhance
light-load (light getnewvnode() load that is) performance. It would
have virtually no effect under heavier loads, which is why the vnode
re-use code in getnewvnode() would have to stay, but the light-load
benefit is undeniable.
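A minimal userland sketch of the shape of that idea, with a hypothetical
'dirty' flag standing in for "needs I/O to recycle"; none of the names
below exist in the tree:

#include <stdbool.h>
#include <stdio.h>

/*
 * Userland model of the idea, not a patch.  The array, the dirty
 * flag and MIN_CLEAN are hypothetical stand-ins; "dirty" here means
 * "would need I/O (inode writeback, page flush) to be recycled".
 */
#define NVNODES     8
#define MIN_CLEAN   3   /* low-water mark of I/O-free vnodes to keep */

struct model_vnode {
    bool on_freelist;
    bool dirty;
};

static struct model_vnode vnodes[NVNODES];
static int inline_cleans;   /* times getnewvnode() had to do "I/O" itself */

static void
clean_vnode(struct model_vnode *vp)
{
    vp->dirty = false;      /* stands in for the inode writeback etc. */
}

/* What I'm suggesting vnlru_proc() additionally do on each pass. */
static void
vnlru_side_work(void)
{
    int clean = 0, i;

    for (i = 0; i < NVNODES; i++)
        if (vnodes[i].on_freelist && !vnodes[i].dirty)
            clean++;
    for (i = 0; i < NVNODES && clean < MIN_CLEAN; i++)
        if (vnodes[i].on_freelist && vnodes[i].dirty) {
            clean_vnode(&vnodes[i]);
            clean++;
        }
}

/*
 * The existing inline recycle stays: prefer an already-clean vnode,
 * but fall back to cleaning one ourselves under heavy load.
 */
static struct model_vnode *
getnewvnode_model(void)
{
    int i;

    for (i = 0; i < NVNODES; i++)
        if (vnodes[i].on_freelist && !vnodes[i].dirty) {
            vnodes[i].on_freelist = false;
            return &vnodes[i];
        }
    for (i = 0; i < NVNODES; i++)
        if (vnodes[i].on_freelist) {
            clean_vnode(&vnodes[i]);    /* inline "I/O" */
            inline_cleans++;
            vnodes[i].on_freelist = false;
            return &vnodes[i];
        }
    return NULL;
}

int
main(void)
{
    int i;

    for (i = 0; i < NVNODES; i++)
        vnodes[i] = (struct model_vnode){ true, true };
    vnlru_side_work();              /* pre-clean a few vnodes */
    (void)getnewvnode_model();      /* light load: no inline I/O needed */
    printf("inline cleans under light load: %d\n", inline_cleans);
    return 0;
}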
-Matt
Matthew Dillon
<dillon at backplane.com>