Possible softupdates bug when a indirect block buffer is reused
Matthew Dillon
dillon at apollo.backplane.com
Sun Jul 31 18:40:33 GMT 2005
Hi Kirk, hackers!
I'm trying to track down a bug that is causing a buffer to be left
in a locked state and then causes the filesystem to lock up because
of that.
The symptoms are that a heavily used filesystem suddenly starts running
out of space. It isn't due to deleted files with open descriptors, it's
due to the syncer getting stuck in a getblk state. This is in DragonFly,
but I can't find anything DFlyish wrong so I'm beginning to think it may
be an actual bug in softupdates.
I have wound up with a situation where a getblk()'d bp has been
associated with a indirdep dependancy, i.e. stored in
indirdep->ir_savebp, but is never released. When something like
the syncer comes along and tries to access it, it locks up, and this
of course leads to inodes not getting cleared and the filesystem
eventually runs out of space when a lot of files are being created and
deleted.
What has got me really confused is that the buffer in question seems to
wind up with a D_INDIRDEP dependancy that points back to itself.
Here's the situation from a live gdb. Here is where the syncer is
stuck:
(kgdb) back
#0 lwkt_switch () at thread2.h:95
#1 0xc02a8a79 in tsleep (ident=0x0, flags=0, wmesg=0xc04eadb0 "getblk",
timo=0) at /usr/src-125beta/sys/kern/kern_synch.c:428
#2 0xc02956bb in acquire (lkp=0xc758b4e0, extflags=33554464, wanted=1536)
at /usr/src-125beta/sys/kern/kern_lock.c:127
#3 0xc0295a92 in lockmgr (lkp=0xc758b4e0, flags=33620002, interlkp=0x0,
td=0xd68f6400) at /usr/src-125beta/sys/kern/kern_lock.c:354
#4 0xc02d6828 in getblk (vp=0xc71b3058, blkno=94440240, size=8192, slpflag=0,
slptimeo=0) at thread.h:79
#5 0xc02d4404 in bread (vp=0xc71b3058, blkno=0, size=0, bpp=0x0)
at /usr/src-125beta/sys/kern/vfs_bio.c:567
#6 0xc03f24fe in indir_trunc (ip=0xe048fc0c, dbn=94440240, level=1, lbn=2060,
countp=0xe048fbf8) at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:2221
#7 0xc03f22df in handle_workitem_freeblocks (freeblks=0xe2fcef98)
at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:2138
#8 0xc03f0462 in process_worklist_item (matchmnt=0x0, flags=0)
at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:726
#9 0xc03f026c in softdep_process_worklist (matchmnt=0x0)
at /usr/src-125beta/sys/vfs/ufs/ffs_softdep.c:625
#10 0xc02e5ff3 in sched_sync () at /usr/src-125beta/sys/kern/vfs_sync.c:244
#11 0xc0294863 in kthread_create_stk (func=0, arg=0x0, tdp=0xff800000,
stksize=0, fmt=0x0) at /usr/src-125beta/sys/kern/kern_kthread.c:104
(kgdb)
The buffer it is stuck on:
(kgdb) print bp
$62 = (struct buf *) 0xc758b4b8
(kgdb) print *bp
$63 = {
b_hash = {
le_next = 0x0,
le_prev = 0xc7391348
},
b_vnbufs = {
tqe_next = 0xc739b890,
tqe_prev = 0xc76f32b8
},
b_freelist = {
tqe_next = 0xc768d610,
tqe_prev = 0xc0565ac0
},
b_act = {
tqe_next = 0x0,
tqe_prev = 0x0
},
b_flags = 536870912, <<<<<<<<< 0x20000000 (getblk with no bread, etc)
b_qindex = 0,
b_xflags = 2 '\002',
b_lock = {
lk_interlock = {
t_cpu = 0xff800000,
t_reqcpu = 0xff800000,
t_unused01 = 0
},
lk_flags = 2098176,
lk_sharecount = 0,
lk_waitcount = 1,
lk_exclusivecount = 1,
lk_prio = 0,
lk_wmesg = 0xc04eadb0 "getblk",
lk_timo = 0,
lk_lockholder = 0xfffffffe
},
b_error = 0,
b_bufsize = 8192,
b_runningbufspace = 0,
b_bcount = 8192,
b_resid = 0,
b_dev = 0xde0f0e38,
b_data = 0xcf824000 "¨\205Ð\002",
b_kvabase = 0xcf824000 "¨\205Ð\002",
b_kvasize = 16384,
b_lblkno = 94440240,
b_blkno = 94440240,
b_offset = 48353402880,
b_iodone = 0,
b_iodone_chain = 0x0,
b_vp = 0xc71b3058,
b_dirtyoff = 0,
b_dirtyend = 0,
b_pblkno = 87503631,
b_saveaddr = 0x0,
b_driver1 = 0x0,
b_caller1 = 0x0,
b_pager = {
pg_spc = 0x0,
pg_reqpage = 0
},
b_cluster = {
cluster_head = {
tqh_first = 0x0,
tqh_last = 0xc768d6bc
---Type <return> to continue, or q <return> to quit---
},
cluster_entry = {
tqe_next = 0x0,
tqe_prev = 0xc768d6bc
}
},
b_xio = {
xio_pages = 0xc758b584,
xio_npages = 2,
xio_offset = 0,
xio_bytes = 0,
xio_flags = 0,
xio_error = 0,
xio_internal_pages = {0xc34e5870, 0xc4aeb2b4, 0x0 <repeats 30 times>}
},
b_dep = {
lh_first = 0xc7045040
},
b_chain = {
parent = 0x0,
count = 0
}
}
As you can see from b_flags, which is 0x20000000, the buffer has been
getblk()'d but not bread() or anything else. It is the typical state
that occurs when a buffer is placed in an indirdep->ir_savebp state
in setup_allocindir_phase2().
The buffer's dependancy looks like this:
(kgdb) print bp
$65 = (struct buf *) 0xc758b4b8
(kgdb) print *(struct indirdep *)bp->b_dep.lh_first
$66 = {
ir_list = {
wk_list = {
le_next = 0x0,
le_prev = 0xc758b604
},
wk_type = 5,
wk_state = 33025 <<<<<<<<<<<<< ATTACHED|GOINGAWAY|ONWORKLIST
},
ir_saveddata = 0xdeadc0de "",
ir_savebp = 0xc758b4b8, <<<<<<<<<<<<< points back to itself
ir_donehd = {
lh_first = 0x0
},
ir_deplisthd = {
lh_first = 0x0
}
}
(kgdb)
As you can see, the buffer's dependancy seems to point to itself.
As you may know, ir_savebp buffers are left in a locked state, so a
buffer that has gotten into this situation winds up being permanently
deadlocked.
This is on DragonFly, but I can't find anything else wrong. This is
on a filesystem running a myrid of jails which is constantly creating
and deleting files in parallel... so buffers are being reused quite often.
It takes about a week of this heavy activity for the bug to rear its
ugly head. It's fairly difficult to reproduce (takes about a week to
reproduce).
When the problem does occur the system remains functional... the disk
device continues to work, the filesystem continues to work, except for
any blockages that chain down to the particular block that has deadlocked
(usually the syncer, but as time goes by more parts of the system will
deadlock). It IS possible to run a live gdb when the situation is
caught early enough.
I am trying to figure out how the buffer manages to get into this
self-locked state. I'm pretty much stuck once I get to the
deallocate_dependancies() procedure. This procedure seems to be trying
to move a D_INDIRDEP dependancy from the passed bp into the buffer
pointed to by ir_savebp. As far as I can tell this is what is creating
the situation that makes the buffer's dependancy self-referential
and deadlocks the syncer.
I have noticed a number of FreeBSD bug reports related to blocking in
getblk or to softupdates. I don't know if this is a similar problem
but it's worth Ccing freebsd-hackers on it as well.
-Matt
More information about the freebsd-hackers
mailing list