Possible softupdates bug when an indirect block buffer is reused
Stephan Uphoff
ups at tree.com
Sun Jul 31 19:47:42 GMT 2005
Hi Matthew,
We have been testing a fix for this for a few weeks.
I will check it in today.
Stephan
On Sun, 2005-07-31 at 14:40, Matthew Dillon wrote:
> Hi Kirk, hackers!
>
> I'm trying to track down a bug that leaves a buffer in a locked
> state and, as a consequence, causes the filesystem to lock up.
>
> The symptoms are that a heavily used filesystem suddenly starts running
> out of space. It isn't due to deleted files with open descriptors; it's
> due to the syncer getting stuck in a getblk state. This is in DragonFly,
> but I can't find anything DFlyish wrong, so I'm beginning to think it may
> be an actual bug in softupdates.
>
> I have wound up with a situation where a getblk()'d bp has been
> associated with an indirdep dependency, i.e. stored in
> indirdep->ir_savebp, but is never released. When something like
> the syncer comes along and tries to access it, it locks up. This
> of course leads to inodes not getting cleared, and the filesystem
> eventually runs out of space when a lot of files are being created and
> deleted.
>
> What has got me really confused is that the buffer in question seems to
> wind up with a D_INDIRDEP dependency that points back to itself.
>
> Here's the situation from a live gdb. Here is where the syncer is
> stuck:
>
> (kgdb) back
> #0 lwkt_switch () at thread2.h:95
> #1 0xc02a8a79 in tsleep (ident=0x0, flags=0, wmesg=0xc04eadb0 "getblk",
> timo=0) at /usr/src-125beta/sys/kern/kern_synch.c:428
> #2 0xc02956bb in acquire (lkp=0xc758b4e0, extflags=33554464, wanted=1536)
> at /usr/src-125beta/sys/kern/kern_lock.c:127
> #3 0xc0295a92 in lockmgr (lkp=0xc758b4e0, flags=33620002, interlkp=0x0,
> td=0xd68f6400) at /usr/src-125beta/sys/kern/kern_lock.c:354
> #4 0xc02d6828 in getblk (vp=0xc71b3058, blkno=94440240, size=8192, slpflag=0,
> slptimeo=0) at thread.h:79
> #5 0xc02d4404 in bread (vp=0xc71b3058, blkno=0, size=0, bpp=0x0)
> at /usr/src-125beta/sys/kern/vfs_bio.c:567
> #6 0xc03f24fe in indir_trunc (ip=0xe048fc0c, dbn=94440240, level=1, lbn=2060,
> countp=0xe048fbf8) at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:2221
> #7 0xc03f22df in handle_workitem_freeblocks (freeblks=0xe2fcef98)
> at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:2138
> #8 0xc03f0462 in process_worklist_item (matchmnt=0x0, flags=0)
> at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:726
> #9 0xc03f026c in softdep_process_worklist (matchmnt=0x0)
> at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:625
> #10 0xc02e5ff3 in sched_sync () at /usr/src-125beta/sys/kern/vfs_sync.c:244
> #11 0xc0294863 in kthread_create_stk (func=0, arg=0x0, tdp=0xff800000,
> stksize=0, fmt=0x0) at /usr/src-125beta/sys/kern/kern_kthread.c:104
> (kgdb)
>
> The buffer it is stuck on:
>
> (kgdb) print bp
> $62 = (struct buf *) 0xc758b4b8
> (kgdb) print *bp
> $63 = {
> b_hash = {
> le_next = 0x0,
> le_prev = 0xc7391348
> },
> b_vnbufs = {
> tqe_next = 0xc739b890,
> tqe_prev = 0xc76f32b8
> },
> b_freelist = {
> tqe_next = 0xc768d610,
> tqe_prev = 0xc0565ac0
> },
> b_act = {
> tqe_next = 0x0,
> tqe_prev = 0x0
> },
> b_flags = 536870912, <<<<<<<<< 0x20000000 (getblk with no bread, etc)
> b_qindex = 0,
> b_xflags = 2 '\002',
> b_lock = {
> lk_interlock = {
> t_cpu = 0xff800000,
> t_reqcpu = 0xff800000,
> t_unused01 = 0
> },
> lk_flags = 2098176,
> lk_sharecount = 0,
> lk_waitcount = 1,
> lk_exclusivecount = 1,
> lk_prio = 0,
> lk_wmesg = 0xc04eadb0 "getblk",
> lk_timo = 0,
> lk_lockholder = 0xfffffffe
> },
> b_error = 0,
> b_bufsize = 8192,
> b_runningbufspace = 0,
> b_bcount = 8192,
> b_resid = 0,
> b_dev = 0xde0f0e38,
> b_data = 0xcf824000 "\205\002",
> b_kvabase = 0xcf824000 "\205\002",
> b_kvasize = 16384,
> b_lblkno = 94440240,
> b_blkno = 94440240,
> b_offset = 48353402880,
> b_iodone = 0,
> b_iodone_chain = 0x0,
> b_vp = 0xc71b3058,
> b_dirtyoff = 0,
> b_dirtyend = 0,
> b_pblkno = 87503631,
> b_saveaddr = 0x0,
> b_driver1 = 0x0,
> b_caller1 = 0x0,
> b_pager = {
> pg_spc = 0x0,
> pg_reqpage = 0
> },
> b_cluster = {
> cluster_head = {
> tqh_first = 0x0,
> tqh_last = 0xc768d6bc
> ---Type <return> to continue, or q <return> to quit---
> },
> cluster_entry = {
> tqe_next = 0x0,
> tqe_prev = 0xc768d6bc
> }
> },
> b_xio = {
> xio_pages = 0xc758b584,
> xio_npages = 2,
> xio_offset = 0,
> xio_bytes = 0,
> xio_flags = 0,
> xio_error = 0,
> xio_internal_pages = {0xc34e5870, 0xc4aeb2b4, 0x0 <repeats 30 times>}
> },
> b_dep = {
> lh_first = 0xc7045040
> },
> b_chain = {
> parent = 0x0,
> count = 0
> }
> }
>
> As you can see from b_flags, which is 0x20000000, the buffer has been
> getblk()'d but not bread() or anything else. This is the typical state
> of a buffer that has been stored in indirdep->ir_savebp by
> setup_allocindir_phase2().
>
> The buffer's dependency looks like this:
>
> (kgdb) print bp
> $65 = (struct buf *) 0xc758b4b8
> (kgdb) print *(struct indirdep *)bp->b_dep.lh_first
> $66 = {
> ir_list = {
> wk_list = {
> le_next = 0x0,
> le_prev = 0xc758b604
> },
> wk_type = 5,
> wk_state = 33025 <<<<<<<<<<<<< ATTACHED|GOINGAWAY|ONWORKLIST
> },
> ir_saveddata = 0xdeadc0de "",
> ir_savebp = 0xc758b4b8, <<<<<<<<<<<<< points back to itself
> ir_donehd = {
> lh_first = 0x0
> },
> ir_deplisthd = {
> lh_first = 0x0
> }
> }
> (kgdb)
>
> As you can see, the buffer's dependency seems to point to itself.
> As you may know, ir_savebp buffers are left in a locked state, so a
> buffer that has gotten into this situation winds up being permanently
> deadlocked.
>
> This is on DragonFly, but I can't find anything else wrong. This is
> on a filesystem running a myriad of jails that are constantly creating
> and deleting files in parallel... so buffers are being reused quite often.
> It takes about a week of this heavy activity for the bug to rear its
> ugly head, which makes it fairly difficult to reproduce.
>
> When the problem does occur the system remains functional... the disk
> device continues to work, the filesystem continues to work, except for
> any blockages that chain down to the particular block that has deadlocked
> (usually the syncer, but as time goes by more parts of the system will
> deadlock). It IS possible to run a live gdb when the situation is
> caught early enough.
>
> I am trying to figure out how the buffer manages to get into this
> self-locked state. I'm pretty much stuck once I get to the
> deallocate_dependencies() procedure. This procedure seems to be trying
> to move a D_INDIRDEP dependency from the passed bp into the buffer
> pointed to by ir_savebp. As far as I can tell, this is what creates
> the self-referential dependency and deadlocks the syncer.
>
> I have noticed a number of FreeBSD bug reports related to blocking in
> getblk or to softupdates. I don't know if this is a similar problem
> but it's worth Ccing freebsd-hackers on it as well.
>
> -Matt
>