Cleaning up vgone.
Matthew Dillon
dillon at apollo.backplane.com
Thu Mar 10 02:26:36 PST 2005
:I've run into a few vclean races and some related problems with VOP_CLOSE
:not using locks. I've made some fairly major changes to the way vfs
:handles vnode teardown in the process of fixing this. I'll summarize what
:I've done here.
:
:The main problem with teardown was the two stage locking scheme involving
:the XLOCK. I got rid of the XLOCK and simply require the vnode lock
:throughout the whole operation. To accommodate this, VOP_INACTIVE,
:VOP_RECLAIM, VOP_CLOSE, and VOP_REVOKE all require the vnode lock. As
:does vgone().
:
:Prior to this, vgone() would set XLOCK and then do a LK_DRAIN to make
:sure there were no callers waiting in VOP_LOCK so that they would always
:see the VI_XLOCK and know that the vnode had changed identities. Now,
:vgone() sets VI_DOOMED, and all lockers who use vget() and vn_lock() check
:for VI_DOOMED before and after acquiring the vnode lock. To wait for the
:transition to complete, you simply wait on the vnode lock.
:
:This really only required minor changes to the filesystems in the tree.
:Most only required the removal of a VOP_UNLOCK in VOP_INACTIVE, and a few
:acquired the lock in VOP_CLOSE to do operations which they otherwise could
:not. There is one change to ffs and coda, which inspect v_data in their
:vop_lock routines. This is only safe with the interlock held; before,
:the XLOCK would have protected v_data in the one case that could lead
:to a panic.
:
:The patch is available at http://www.chesapeake.net/~jroberson/vgone.diff
:
:Cheers,
:Jeff
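In sketch form, the locker-side check Jeff describes looks something
like this (stand-in lock helpers and simplified signatures, not the
actual patch):

    #include <errno.h>

    /*
     * Sketch only: VI_DOOMED follows FreeBSD naming, but my_lock()
     * and my_unlock() are simplified stand-ins for vn_lock() and
     * VOP_UNLOCK().
     */
    struct vnode {
        int     v_iflag;                /* VI_DOOMED lives here */
        /* ... */
    };
    #define VI_DOOMED   0x0001          /* identity being torn down */

    int     my_lock(struct vnode *vp);      /* stand-in for vn_lock() */
    void    my_unlock(struct vnode *vp);    /* stand-in for VOP_UNLOCK() */

    int
    my_vget(struct vnode *vp)
    {
        /* Cheap pre-check: bail early if teardown already started. */
        if (vp->v_iflag & VI_DOOMED)
            return (ENOENT);
        if (my_lock(vp) != 0)
            return (ENOENT);
        /*
         * vgone() holds the vnode lock for the entire teardown, so
         * blocking in my_lock() above is how we wait for the
         * transition; if VI_DOOMED is set now, the vnode has changed
         * identity and we must back off.
         */
        if (vp->v_iflag & VI_DOOMED) {
            my_unlock(vp);
            return (ENOENT);
        }
        return (0);
    }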
I did basically the same thing in DragonFly. I can only caution that
you have to be very, *very* careful to ensure that there are no races
between entities trying to ffs_vget() through the inode versus the
related vnode undergoing a vgone(). Be aware that there is a major
non-atomicity problem among the destruction of the vnode, the clearing
of the inode's bit in the bitmap, and the removal of the inode from the
inode hash table.
In DragonFly there were races in the inode-reuse case where an inode
would be reused before it was entirely disconnected from the vnode.
This case can occur because the inode's bitmap bit is cleared before
the vnode is unlocked (and you have to do it that way; you can't clear
the inode's bitmap bit after the vnode has been unlocked without creating
yet more issues).
This means that a totally unrelated process creating a file or directory
could allocate the SAME inode number out of the inode bitmap and call
FFS_VGET() *BEFORE* the original vnode controlling that inode number
finishes being reclaimed. In fact, the original inode is still pointing
at the vnode undergoing destruction and the vnode's v_data is still
pointing at the inode. The result: bad things happen.
It got so hairy in DFly that I wound up reordering the code as follows:
(1) lock vnode
(2) free the inode in the bitmap (UFS_IFREE() equivalent)
(3) Set v_data to NULL
(4) Remove inode from the inode hash table (clear back pointer)
[inode now fully disassociated]
(5) unlock vnode.
*** Modify the bitmap allocation code for inodes to check whether the
candidate inode is still present in the inode hash and SKIP THAT
INODE if it is. (This is in the UFS code.)
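In code form, the ordering looks roughly like this (a sketch only:
my_ifree() and my_ihashrem() stand in for the UFS_IFREE() and
ufs_ihashrem() equivalents, and the lock helpers are placeholders):

    struct inode;
    struct vnode {
        void    *v_data;                /* -> inode until step (3) */
    };

    void    vnode_lock(struct vnode *vp);    /* stand-in lock helpers */
    void    vnode_unlock(struct vnode *vp);
    void    my_ifree(struct inode *ip);      /* clear inode bitmap bit */
    void    my_ihashrem(struct inode *ip);   /* drop from inode hash */

    void
    my_reclaim(struct vnode *vp)
    {
        struct inode *ip = vp->v_data;

        vnode_lock(vp);             /* (1) hold the lock throughout */
        my_ifree(ip);               /* (2) free the bit in the bitmap */
        vp->v_data = NULL;          /* (3) sever vnode -> inode */
        my_ihashrem(ip);            /* (4) remove from inode hash last;
                                     *     presence in the hash is the
                                     *     reuse-race indicator */
        /* inode now fully disassociated */
        vnode_unlock(vp);           /* (5) only now do waiters proceed */
    }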
If you do not do that you wind up with a case where the filesystem cannot
differentiate between someone trying to lock the original vnode and
someone trying to vget() the vnode related to a newly created file whose
inode was just allocated out of the inode bitmap.
The word 'nasty' doesn't even begin to describe the problem. There are
*THREE* races here: The vnode lock, the inode hash table, AND the
inode bitmap. Due to disk I/O it is possible for the system to block
in between any of the related operations so you have to make sure that
races against all three points are handled properly. My solution was
to end-run around the points by leaving the inode hash table intact until
the very end. In DragonFly I don't call ufs_ihashrem() until the inode
has been completely divorced from the vnode and that gives me a way
to check for the race (by checking whether the inode is present in the
inode hash table or not).
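The allocator-side check then amounts to something like this (again a
sketch; my_ihashlookup() stands in for the real inode-hash lookup and
the bitmap test is a placeholder):

    struct inode;

    typedef unsigned long my_ino_t;         /* stand-in inode number */

    int              ino_bit_is_free(my_ino_t ino);  /* bitmap test */
    struct inode    *my_ihashlookup(my_ino_t ino);   /* hash lookup */

    my_ino_t
    my_ialloc_scan(my_ino_t lo, my_ino_t hi)
    {
        my_ino_t ino;

        for (ino = lo; ino < hi; ++ino) {
            if (!ino_bit_is_free(ino))
                continue;
            /*
             * The bitmap says free, but if the inode is still in the
             * inode hash the old vnode has not finished reclaiming;
             * skip it rather than racing vgone().
             */
            if (my_ihashlookup(ino) != NULL)
                continue;
            return (ino);                   /* safe to hand out */
        }
        return (0);                         /* nothing safely allocatable */
    }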
Another thing I did in DragonFly was to get rid of the crazy handling of
the vnode's reference count during the reclaim. I don't know what
FreeBSD-HEAD is doing now, but FreeBSD-4 dropped the ref count to 0
and then did the crazy VXLOCK stuff. In DragonFly I changed that so
the ref count is left at 1 (not 0) during the reclaim. This required
fixing up a few cases that were checking the refcount against an absolute
0 or 1 (I forget the cases), but it made the code a whole lot easier to
understand.
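In sketch form the refcount convention is just this (vref()/vrele() as
in the stock kernel; everything else, including the shape of the
teardown, is illustrative):

    struct vnode;
    void    vref(struct vnode *vp);     /* ++v_usecount */
    void    vrele(struct vnode *vp);    /* --v_usecount, may inactivate */
    void    vnode_lock(struct vnode *vp);
    void    vnode_unlock(struct vnode *vp);

    void
    my_vgone(struct vnode *vp)
    {
        vref(vp);               /* reclaimer's own ref: the count stays
                                 * at >= 1 for the whole teardown,
                                 * never an ambiguous 0 */
        vnode_lock(vp);
        /* ... VOP_CLOSE/VOP_INACTIVE/VOP_RECLAIM under the lock ... */
        vnode_unlock(vp);
        vrele(vp);              /* drop it only after the identity
                                 * change is complete */
    }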
-Matt
Matthew Dillon
<dillon at backplane.com>