Cleaning up vgone.
Matthew Dillon
dillon at apollo.backplane.com
Thu Mar 10 02:26:36 PST 2005
:I've run into a few vclean races and some related problems with VOP_CLOSE
:not using locks. I've made some fairly major changes to the way vfs
:handles vnode teardown in the process of fixing this. I'll summarize what
:I've done here.
:
:The main problem with teardown was the two stage locking scheme involving
:the XLOCK. I got rid of the XLOCK and simply require the vnode lock
:throughout the whole operation. To accommodate this, VOP_INACTIVE,
:VOP_RECLAIM, VOP_CLOSE, and VOP_REVOKE all require the vnode lock. As
:does vgone().
:
:Prior to this, vgone() would set XLOCK and then do a LK_DRAIN to make
:sure there were no callers waiting in VOP_LOCK so that they would always
:see the VI_XLOCK and know that the vnode had changed identities. Now,
:vgone() sets VI_DOOMED, and all lockers who use vget() and vn_lock() check
:for VI_DOOMED before and after acquiring the vnode lock. To wait for the
:transition to complete, you simply wait on the vnode lock.
:
:This really only required minor changes to the filesystems in the tree.
:Most only required the removal of a VOP_UNLOCK in VOP_INACTIVE, and a few
:acquired the lock in VOP_CLOSE to do operations which they otherwise could
:not. There is one change to ffs and coda, which inspect v_data in their
:vop_lock routines. This is only safe with the interlock held; before,
:the XLOCK would have protected v_data in the one case that could lead
:to a panic.
:
:The patch is available at http://www.chesapeake.net/~jroberson/vgone.diff
:
:Cheers,
:Jeff
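In sketch form, the locker-side check Jeff describes looks something
like this (stand-in lock helpers and simplified signatures, not the
actual patch):

    #include <errno.h>

    /*
     * Sketch only: VI_DOOMED follows FreeBSD naming, but my_lock()
     * and my_unlock() are simplified stand-ins for vn_lock() and
     * VOP_UNLOCK().
     */
    struct vnode {
        int     v_iflag;                /* VI_DOOMED lives here */
        /* ... */
    };
    #define VI_DOOMED   0x0001          /* identity being torn down */

    int     my_lock(struct vnode *vp);      /* stand-in for vn_lock() */
    void    my_unlock(struct vnode *vp);    /* stand-in for VOP_UNLOCK() */

    int
    my_vget(struct vnode *vp)
    {
        /* Cheap pre-check: bail early if teardown already started. */
        if (vp->v_iflag & VI_DOOMED)
            return (ENOENT);
        if (my_lock(vp) != 0)
            return (ENOENT);
        /*
         * vgone() holds the vnode lock for the entire teardown, so
         * blocking in my_lock() above is how we wait for the
         * transition; if VI_DOOMED is set now, the vnode has changed
         * identity and we must back off.
         */
        if (vp->v_iflag & VI_DOOMED) {
            my_unlock(vp);
            return (ENOENT);
        }
        return (0);
    }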
I did basically the same thing in DragonFly. I can only caution that
you have to be very, *very* careful to ensure that there are no races
between entities trying to ffs_vget() through the inode versus the
related vnode undergoing a vgone(). Be aware that there is a major
non-atomicity problem among the destruction of the vnode, the clearing
of the inode's bit in the bitmap, and the removal of the inode from the
inode hash table.
In DragonFly there were races in the inode-reuse case where an inode
would be reused before it was entirely disconnected from the vnode.
This case can occur because the inode's bitmap bit is cleared before
the vnode is unlocked (and you have to do it that way; you can't clear
the inode's bitmap bit after the vnode has been unlocked without creating
yet more issues).
This means that a totally unrelated process creating a file or directory
could allocate the SAME inode number out of the inode bitmap and call
FFS_VGET() *BEFORE* the original vnode controlling that inode number
finishes being reclaimed. In fact, the original inode is still pointing
at the vnode undergoing destruction and the vnode's v_data is still
pointing at the inode. The result: bad things happen.
It got so hairy in DFly that I wound up reordering the code as follows:
(1) lock vnode
(2) free the inode in the bitmap (UFS_IFREE() equivalent)
(3) Set v_data to NULL
(4) Remove inode from the inode hash table (clear back pointer)
[inode now fully disassociated]
(5) unlock vnode.
*** Modify the bitmap allocation code for inodes to check whether the
candidate inode is still present in the inode hash and SKIP THAT
INODE if it is. (This is in the UFS code.)
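In code form, the ordering looks roughly like this (a sketch only:
my_ifree() and my_ihashrem() stand in for the UFS_IFREE() and
ufs_ihashrem() equivalents, and the lock helpers are placeholders):

    struct inode;
    struct vnode {
        void    *v_data;                /* -> inode until step (3) */
    };

    void    vnode_lock(struct vnode *vp);    /* stand-in lock helpers */
    void    vnode_unlock(struct vnode *vp);
    void    my_ifree(struct inode *ip);      /* clear inode bitmap bit */
    void    my_ihashrem(struct inode *ip);   /* drop from inode hash */

    void
    my_reclaim(struct vnode *vp)
    {
        struct inode *ip = vp->v_data;

        vnode_lock(vp);             /* (1) hold the lock throughout */
        my_ifree(ip);               /* (2) free the bit in the bitmap */
        vp->v_data = NULL;          /* (3) sever vnode -> inode */
        my_ihashrem(ip);            /* (4) remove from inode hash last;
                                     *     presence in the hash is the
                                     *     reuse-race indicator */
        /* inode now fully disassociated */
        vnode_unlock(vp);           /* (5) only now do waiters proceed */
    }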
If you do not do that you wind up with a case where the filesystem cannot
differentiate between someone trying to lock the original vnode and
someone trying to vget() the vnode related to a newly created file whose
inode was just allocated out of the inode bitmap.
The word 'nasty' doesn't even begin to describe the problem. There are
*THREE* races here: The vnode lock, the inode hash table, AND the
inode bitmap. Due to disk I/O it is possible for the system to block
in between any of the related operations so you have to make sure that
races against all three points are handled properly. My solution was
to end-run around the points by leaving the inode hash table intact until
the very end. In DragonFly I don't call ufs_ihashrem() until the inode
has been completely divorced from the vnode and that gives me a way
to check for the race (by checking whether the inode is present in the
inode hash table or not).
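The allocator-side check then amounts to something like this (again a
sketch; my_ihashlookup() stands in for the real inode-hash lookup and
the bitmap test is a placeholder):

    struct inode;

    typedef unsigned long my_ino_t;         /* stand-in inode number */

    int              ino_bit_is_free(my_ino_t ino);  /* bitmap test */
    struct inode    *my_ihashlookup(my_ino_t ino);   /* hash lookup */

    my_ino_t
    my_ialloc_scan(my_ino_t lo, my_ino_t hi)
    {
        my_ino_t ino;

        for (ino = lo; ino < hi; ++ino) {
            if (!ino_bit_is_free(ino))
                continue;
            /*
             * The bitmap says free, but if the inode is still in the
             * inode hash the old vnode has not finished reclaiming;
             * skip it rather than racing vgone().
             */
            if (my_ihashlookup(ino) != NULL)
                continue;
            return (ino);                   /* safe to hand out */
        }
        return (0);                         /* nothing safely allocatable */
    }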
Another thing I did in DragonFly was to get rid of the crazy handling of
the vnode's reference count during the reclaim. I don't know what
FreeBSD-HEAD is doing now, but FreeBSD-4 dropped the ref count to 0
and then did the crazy VXLOCK stuff. In DragonFly I changed that so
the ref count is left at 1 (not 0) during the reclaim. This required
fixing up a few cases that were checking the refcount against an absolute
0 or 1 (I forget the cases), but it made the code a whole lot easier to
understand.
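In sketch form the refcount convention is just this (vref()/vrele() as
in the stock kernel; everything else, including the shape of the
teardown, is illustrative):

    struct vnode;
    void    vref(struct vnode *vp);     /* ++v_usecount */
    void    vrele(struct vnode *vp);    /* --v_usecount, may inactivate */
    void    vnode_lock(struct vnode *vp);
    void    vnode_unlock(struct vnode *vp);

    void
    my_vgone(struct vnode *vp)
    {
        vref(vp);               /* reclaimer's own ref: the count stays
                                 * at >= 1 for the whole teardown,
                                 * never an ambiguous 0 */
        vnode_lock(vp);
        /* ... VOP_CLOSE/VOP_INACTIVE/VOP_RECLAIM under the lock ... */
        vnode_unlock(vp);
        vrele(vp);              /* drop it only after the identity
                                 * change is complete */
    }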
-Matt
Matthew Dillon
<dillon at backplane.com>