Possible softupdates bug when an indirect block buffer is reused
Stephan Uphoff
ups at tree.com
Sun Jul 31 19:47:42 GMT 2005
Hi Matthew,
We have been testing a fix for this for a few weeks.
I will check it in today.
Stephan
On Sun, 2005-07-31 at 14:40, Matthew Dillon wrote:
> Hi Kirk, hackers!
>
> I'm trying to track down a bug that leaves a buffer in a locked
> state and, as a consequence, causes the filesystem to lock up.
>
> The symptoms are that a heavily used filesystem suddenly starts running
> out of space. It isn't due to deleted files with open descriptors; it's
> due to the syncer getting stuck in a getblk state. This is in DragonFly,
> but I can't find anything DFlyish wrong, so I'm beginning to think it may
> be an actual bug in softupdates.
>
> I have wound up with a situation where a getblk()'d bp has been
> associated with an indirdep dependency, i.e. stored in
> indirdep->ir_savebp, but is never released. When something like
> the syncer comes along and tries to access it, it locks up. This
> of course leads to inodes not getting cleared, and the filesystem
> eventually runs out of space when a lot of files are being created and
> deleted.
>
> What has got me really confused is that the buffer in question seems to
> wind up with a D_INDIRDEP dependency that points back to itself.
>
> Here's the situation from a live gdb. Here is where the syncer is
> stuck:
>
> (kgdb) back
> #0 lwkt_switch () at thread2.h:95
> #1 0xc02a8a79 in tsleep (ident=0x0, flags=0, wmesg=0xc04eadb0 "getblk",
> timo=0) at /usr/src-125beta/sys/kern/kern_synch.c:428
> #2 0xc02956bb in acquire (lkp=0xc758b4e0, extflags=33554464, wanted=1536)
> at /usr/src-125beta/sys/kern/kern_lock.c:127
> #3 0xc0295a92 in lockmgr (lkp=0xc758b4e0, flags=33620002, interlkp=0x0,
> td=0xd68f6400) at /usr/src-125beta/sys/kern/kern_lock.c:354
> #4 0xc02d6828 in getblk (vp=0xc71b3058, blkno=94440240, size=8192, slpflag=0,
> slptimeo=0) at thread.h:79
> #5 0xc02d4404 in bread (vp=0xc71b3058, blkno=0, size=0, bpp=0x0)
> at /usr/src-125beta/sys/kern/vfs_bio.c:567
> #6 0xc03f24fe in indir_trunc (ip=0xe048fc0c, dbn=94440240, level=1, lbn=2060,
> countp=0xe048fbf8) at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:2221
> #7 0xc03f22df in handle_workitem_freeblocks (freeblks=0xe2fcef98)
> at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:2138
> #8 0xc03f0462 in process_worklist_item (matchmnt=0x0, flags=0)
> at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:726
> #9 0xc03f026c in softdep_process_worklist (matchmnt=0x0)
> at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:625
> #10 0xc02e5ff3 in sched_sync () at /usr/src-125beta/sys/kern/vfs_sync.c:244
> #11 0xc0294863 in kthread_create_stk (func=0, arg=0x0, tdp=0xff800000,
> stksize=0, fmt=0x0) at /usr/src-125beta/sys/kern/kern_kthread.c:104
> (kgdb)
>
> The buffer it is stuck on:
>
> (kgdb) print bp
> $62 = (struct buf *) 0xc758b4b8
> (kgdb) print *bp
> $63 = {
> b_hash = {
> le_next = 0x0,
> le_prev = 0xc7391348
> },
> b_vnbufs = {
> tqe_next = 0xc739b890,
> tqe_prev = 0xc76f32b8
> },
> b_freelist = {
> tqe_next = 0xc768d610,
> tqe_prev = 0xc0565ac0
> },
> b_act = {
> tqe_next = 0x0,
> tqe_prev = 0x0
> },
> b_flags = 536870912, <<<<<<<<< 0x20000000 (getblk with no bread, etc)
> b_qindex = 0,
> b_xflags = 2 '\002',
> b_lock = {
> lk_interlock = {
> t_cpu = 0xff800000,
> t_reqcpu = 0xff800000,
> t_unused01 = 0
> },
> lk_flags = 2098176,
> lk_sharecount = 0,
> lk_waitcount = 1,
> lk_exclusivecount = 1,
> lk_prio = 0,
> lk_wmesg = 0xc04eadb0 "getblk",
> lk_timo = 0,
> lk_lockholder = 0xfffffffe
> },
> b_error = 0,
> b_bufsize = 8192,
> b_runningbufspace = 0,
> b_bcount = 8192,
> b_resid = 0,
> b_dev = 0xde0f0e38,
> b_data = 0xcf824000 "\205\002",
> b_kvabase = 0xcf824000 "\205\002",
> b_kvasize = 16384,
> b_lblkno = 94440240,
> b_blkno = 94440240,
> b_offset = 48353402880,
> b_iodone = 0,
> b_iodone_chain = 0x0,
> b_vp = 0xc71b3058,
> b_dirtyoff = 0,
> b_dirtyend = 0,
> b_pblkno = 87503631,
> b_saveaddr = 0x0,
> b_driver1 = 0x0,
> b_caller1 = 0x0,
> b_pager = {
> pg_spc = 0x0,
> pg_reqpage = 0
> },
> b_cluster = {
> cluster_head = {
> tqh_first = 0x0,
> tqh_last = 0xc768d6bc
> ---Type <return> to continue, or q <return> to quit---
> },
> cluster_entry = {
> tqe_next = 0x0,
> tqe_prev = 0xc768d6bc
> }
> },
> b_xio = {
> xio_pages = 0xc758b584,
> xio_npages = 2,
> xio_offset = 0,
> xio_bytes = 0,
> xio_flags = 0,
> xio_error = 0,
> xio_internal_pages = {0xc34e5870, 0xc4aeb2b4, 0x0 <repeats 30 times>}
> },
> b_dep = {
> lh_first = 0xc7045040
> },
> b_chain = {
> parent = 0x0,
> count = 0
> }
> }
>
> As you can see from b_flags, which is 0x20000000, the buffer has been
> getblk()'d but not bread() or anything else. This is the typical state
> of a buffer that has been stored in indirdep->ir_savebp by
> setup_allocindir_phase2().
>
> The buffer's dependency looks like this:
>
> (kgdb) print bp
> $65 = (struct buf *) 0xc758b4b8
> (kgdb) print *(struct indirdep *)bp->b_dep.lh_first
> $66 = {
> ir_list = {
> wk_list = {
> le_next = 0x0,
> le_prev = 0xc758b604
> },
> wk_type = 5,
> wk_state = 33025 <<<<<<<<<<<<< ATTACHED|GOINGAWAY|ONWORKLIST
> },
> ir_saveddata = 0xdeadc0de "",
> ir_savebp = 0xc758b4b8, <<<<<<<<<<<<< points back to itself
> ir_donehd = {
> lh_first = 0x0
> },
> ir_deplisthd = {
> lh_first = 0x0
> }
> }
> (kgdb)
>
> As you can see, the buffer's dependency seems to point to itself.
> As you may know, ir_savebp buffers are left in a locked state, so a
> buffer that has gotten into this situation winds up being permanently
> deadlocked.
>
> This is on DragonFly, but I can't find anything else wrong. This is
> on a filesystem running a myriad of jails that are constantly creating
> and deleting files in parallel... so buffers are being reused quite often.
> It takes about a week of this heavy activity for the bug to rear its
> ugly head, which makes it fairly difficult to reproduce.
>
> When the problem does occur the system remains functional... the disk
> device continues to work, the filesystem continues to work, except for
> any blockages that chain down to the particular block that has deadlocked
> (usually the syncer, but as time goes by more parts of the system will
> deadlock). It IS possible to run a live gdb when the situation is
> caught early enough.
>
> I am trying to figure out how the buffer manages to get into this
> self-locked state. I'm pretty much stuck once I get to the
> deallocate_dependencies() procedure. This procedure seems to be trying
> to move a D_INDIRDEP dependency from the passed bp into the buffer
> pointed to by ir_savebp. As far as I can tell, this is what creates
> the self-referential dependency and deadlocks the syncer.
>
> I have noticed a number of FreeBSD bug reports related to blocking in
> getblk or to softupdates. I don't know if this is a similar problem
> but it's worth Ccing freebsd-hackers on it as well.
>
> -Matt
>