Ufs dead-locks on freebsd 6.2
Andrew Edwards
aedwards at sandvine.com
Sat May 19 04:35:27 UTC 2007
Fsck didn't help, but below is a list of processes that were stuck in
disk wait. One other potential problem I've hit: I have MRTG scripts
that get launched from cron every minute. MRTG is supposed to have a
locking mechanism to prevent the same script from running twice at
once, but I suspect that since the filesystem was inaccessible the
cron jobs just kept piling up until the system would eventually
crash. I caught it when the load average was at 620 and killed all
the crons I could. That brought the load average down to under 1;
however, the system is still using 30% of the processor time and the
disks are basically idle. I can still do an ls -l on the root of all
my mounted UFS and NFS filesystems, but on one it takes considerably
longer than the rest. The rsync I was running is copying into the /d2
filesystem.
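
(As an aside, one way to keep those per-minute cron jobs from
stacking up while the filesystem is wedged, assuming MRTG is started
straight from a crontab entry, is to wrap the command in FreeBSD's
lockf(1). The paths and the config file name below are only
illustrative, not taken from this system:

    # /etc/crontab entry -- lockf -t 0 gives up immediately if another
    # instance still holds the lock, so hung runs cannot pile up
    # minute after minute.  Keep the lock file on a filesystem that is
    # not the one that hangs, or lockf itself will block on open().
    *  *  *  *  *  root  /usr/bin/lockf -s -t 0 /var/run/mrtg.lock /usr/local/bin/mrtg /usr/local/etc/mrtg/mrtg.cfg

This only limits the pile-up; it does not address the underlying
deadlock.)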
The system is still running: I can make TCP connections, and some
things I have running from inetd work, but ssh stops responding right
away and I can't log on via the console. So I've captured a core dump
of the system and rebooted so that I could use it again. Are there
any suggestions as to what to do next? I'm debating installing an
Adaptec RAID card and rebuilding the system to see if I get the same
problem; my worry is that it's the Intel RAID drivers that are
causing this, and I have 4 other systems with the same card.
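
(For reference, the data usually asked for in these UFS deadlock
threads comes from DDB at the console and from the crash dump. This
is only a sketch of those steps; the kernel and vmcore paths are the
defaults rather than anything confirmed for this box:

    # A debugging kernel makes the lock state much more informative;
    # these options go in the kernel config before rebuilding:
    options  KDB
    options  DDB
    options  INVARIANTS
    options  INVARIANT_SUPPORT
    options  WITNESS
    options  DEBUG_LOCKS
    options  DEBUG_VFS_LOCKS

    # The next time it wedges, break into DDB at the console and
    # collect the lock and thread state ('show alllocks' needs the
    # WITNESS option):
    db> ps
    db> show pcpu
    db> show lockedvnods
    db> show alllocks
    db> alltrace
    db> call doadump
    # doadump writes a crash dump; after the reboot savecore(8) saves
    # it under /var/crash.

    # The dump already captured can then be opened with kgdb (use the
    # kernel.debug with symbols if it was kept):
    kgdb /boot/kernel/kernel /var/crash/vmcore.0
)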
PID TT STAT TIME COMMAND
2 ?? DL 0:04.86 [g_event]
3 ?? DL 2:05.90 [g_up]
4 ?? DL 1:07.95 [g_down]
5 ?? DL 0:00.00 [xpt_thrd]
6 ?? DL 0:00.00 [kqueue taskq]
7 ?? DL 0:00.00 [thread taskq]
8 ?? DL 0:06.96 [pagedaemon]
9 ?? DL 0:00.00 [vmdaemon]
15 ?? DL 0:22.28 [yarrow]
24 ?? DL 0:00.01 [usb0]
25 ?? DL 0:00.00 [usbtask]
27 ?? DL 0:00.01 [usb1]
29 ?? DL 0:00.01 [usb2]
36 ?? DL 1:28.73 [pagezero]
37 ?? DL 0:08.76 [bufdaemon]
38 ?? DL 0:00.54 [vnlru]
39 ?? DL 1:08.12 [syncer]
40 ?? DL 0:04.00 [softdepflush]
41 ?? DL 0:11.05 [schedcpu]
27182 ?? Ds 0:05.75 /usr/sbin/syslogd -l /var/run/log -l
/var/named/var/run/log -b 127.0.0.1 -a 10.128.0.0/10
27471 ?? Is 0:01.10 /usr/local/bin/postmaster -D
/usr/local/pgsql/data (postgres)
27594 ?? Is 0:00.04 /usr/libexec/ftpd -m -D -l -l
27602 ?? DL 0:00.28 [smbiod1]
96581 ?? D 0:00.00 cron: running job (cron)
96582 ?? D 0:00.00 cron: running job (cron)
96583 ?? D 0:00.00 cron: running job (cron)
96585 ?? D 0:00.00 cron: running job (cron)
96586 ?? D 0:00.00 cron: running job (cron)
96587 ?? D 0:00.00 cron: running job (cron)
96588 ?? D 0:00.00 cron: running job (cron)
96589 ?? D 0:00.00 cron: running job (cron)
96590 ?? D 0:00.00 cron: running job (cron)
96591 ?? D 0:00.00 cron: running job (cron)
96592 ?? D 0:00.00 cron: running job (cron)
96593 ?? D 0:00.00 cron: running job (cron)
96594 ?? D 0:00.00 cron: running job (cron)
96607 ?? D 0:00.00 cron: running job (cron)
96608 ?? D 0:00.00 cron: running job (cron)
96609 ?? D 0:00.00 cron: running job (cron)
96610 ?? D 0:00.00 cron: running job (cron)
96611 ?? D 0:00.00 cron: running job (cron)
96612 ?? D 0:00.00 cron: running job (cron)
96613 ?? D 0:00.00 cron: running job (cron)
96614 ?? D 0:00.00 cron: running job (cron)
96615 ?? D 0:00.00 cron: running job (cron)
96616 ?? D 0:00.00 cron: running job (cron)
96617 ?? D 0:00.00 cron: running job (cron)
96631 ?? D 0:00.00 cron: running job (cron)
96632 ?? D 0:00.00 cron: running job (cron)
96633 ?? D 0:00.00 cron: running job (cron)
96634 ?? D 0:00.00 cron: running job (cron)
96635 ?? D 0:00.00 cron: running job (cron)
96636 ?? D 0:00.00 cron: running job (cron)
96637 ?? D 0:00.00 cron: running job (cron)
96638 ?? D 0:00.00 cron: running job (cron)
96639 ?? D 0:00.00 cron: running job (cron)
96642 ?? D 0:00.00 cron: running job (cron)
96650 ?? D 0:00.00 cron: running job (cron)
29393 p0 D+ 22:04.58 /usr/local/bin/rsync
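
(The STAT column above only shows D; the wait channel makes the hang
easier to pin down. A quick way to list just the stuck processes with
their wait channels would be the command below; the keyword is wchan,
or mwchan on ps versions that also report lock names:

    # Print the header plus every process whose state starts with D,
    # together with what it is sleeping on ("ufs" is usually a vnode
    # lock, "getblk"/"biord"/"biowr" buffer or disk I/O, "wdrain" the
    # dirty-buffer limit).
    ps -axo pid,stat,wchan,command | awk 'NR == 1 || $2 ~ /^D/'
)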
real 0m0.012s
user 0m0.000s
sys 0m0.010s
/
real 0m0.019s
user 0m0.000s
sys 0m0.016s
/var
real 0m0.028s
user 0m0.008s
sys 0m0.018s
/diskless
real 0m0.017s
user 0m0.008s
sys 0m0.007s
/usr
real 0m0.016s
user 0m0.000s
sys 0m0.015s
/d2
real 0m0.024s
user 0m0.000s
sys 0m0.023s
/exports/home
real 0m2.559s
user 0m0.216s
sys 0m2.307s
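
(For context, the timings above read like the output of a loop along
these lines, run in a bash-style shell whose time keyword prints the
real/user/sys lines. The exact command, and therefore the pairing of
mount points to timings, is an assumption since the original command
line isn't shown:

    # Time an ls -l of each mount point, discarding the listing itself
    # so only the mount point name and the time summary are printed.
    for fs in / /var /diskless /usr /d2 /exports/home; do
        echo $fs
        time ls -l $fs > /dev/null
    done
)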
-----Original Message-----
From: owner-freebsd-fs at freebsd.org [mailto:owner-freebsd-fs at freebsd.org]
On Behalf Of Andrew Edwards
Sent: Friday, May 18, 2007 6:44 PM
To: freebsd-fs at freebsd.org; freebsd-performance at freebsd.org
Subject: RE: Ufs dead-locks on freebsd 6.2
Okay, I let memtest run for a full day and there have been no memory
errors. What do I do next? Just to be on the safe side I'll fsck all
of my filesystems and try to reproduce the problem again.
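
(A full check means a forced foreground fsck; a minimal sketch, run
from single-user mode with the filesystems unmounted or read-only:

    # -f forces a check even though the filesystems are marked clean,
    # which a background fsck of softupdates volumes would otherwise
    # skip; -y answers yes to all repair prompts.  With no arguments
    # fsck walks every filesystem listed in /etc/fstab.
    fsck -f -y
)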
I also don't know what zonelimit is; I see it on similarly configured
machines, though they are running 5.4. I know it's network-related,
as I periodically get network connections to work, i.e. ssh and ftp
(both server and client side), but eventually the box will deadlock.
Should I start a different thread on this? It happens about once
every 30 days on two servers, although I haven't checked the exact
timing.
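
(For what it's worth, "zonelimit" is the wait channel of a thread
blocked because a UMA zone has hit its limit; on network-heavy
machines that is most often the mbuf cluster zone. A rough way to
check, with the tunable value below only an example:

    # Show mbuf/cluster usage and denied requests, and the per-zone
    # usage against each zone's limit:
    netstat -m
    vmstat -z

    # If the cluster zone is the one at its limit, raise it via a
    # loader tunable and reboot:
    echo 'kern.ipc.nmbclusters="65536"' >> /boot/loader.conf
)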
-----Original Message-----
From: owner-freebsd-fs at freebsd.org [mailto:owner-freebsd-fs at freebsd.org]
On Behalf Of Eric Anderson
Sent: Friday, May 18, 2007 3:09 PM
To: Kris Kennaway
Cc: freebsd-fs at freebsd.org
Subject: Re: Ufs dead-locks on freebsd 6.2
On 05/18/07 14:00, Kris Kennaway wrote:
> On Thu, May 17, 2007 at 11:38:20PM -0500, Eric Anderson wrote:
>> On 05/17/07 12:47, Kostik Belousov wrote:
>>> On Thu, May 17, 2007 at 01:03:37PM -0400, Andrew Edwards wrote:
>>>> Here it is.
>>>>
>>>> db> show vnode 0xccd47984
>>>> vnode 0xccd47984: tag ufs, type VDIR
>>>> usecount 5135, writecount 0, refcount 5137 mountedhere 0
>>>> flags (VV_ROOT)
>>>> v_object 0xcd02518c ref 0 pages 1
>>>> #0 0xc0593f0d at lockmgr+0x4ed
>>>> #1 0xc06b8e0e at ffs_lock+0x76
>>>> #2 0xc0739787 at VOP_LOCK_APV+0x87
>>>> #3 0xc0601c28 at vn_lock+0xac
>>>> #4 0xc05ee832 at lookup+0xde
>>>> #5 0xc05ee4b2 at namei+0x39a
>>>> #6 0xc05e2ab0 at unp_connect+0xf0
>>>> #7 0xc05e1a6a at uipc_connect+0x66
>>>> #8 0xc05d9992 at soconnect+0x4e
>>>> #9 0xc05dec60 at kern_connect+0x74
>>>> #10 0xc05debdf at connect+0x2f
>>>> #11 0xc0723e2b at syscall+0x25b
>>>> #12 0xc070ee0f at Xint0x80_syscall+0x1f
>>>>
>>>> ino 2, on dev amrd0s1a
>>> It seems to be the sort of thing that cannot happen: VOP_LOCK()
>>> returned 0, but the vnode was not really locked.
>>>
>>> Although claiming that kernel code cannot have such a bug is too
>>> optimistic, I would first make sure that:
>>> 1. You have checked the memory of the machine.
>>> 2. Your kernel is built from pristine sources.
>>
>> This looks precisely like a lock I was seeing on one of my NFS
>> servers.
>> Only one of the filesystems would cause it, but it was the same one
>> each time, not necessarily under any kind of load. Things like
>> mountd would get wedged in state 'ufs', and other things would get
>> stuck in one of the lock states (I can't recall).
>
> ...so you cannot conclude that it looks "precisely like" this case.
>
> Please, don't confuse bug reports by this kind of claim unless you
> have made a detailed comparison of the debugging traces to yours.
Understood - my mistake.
Eric
_______________________________________________
freebsd-fs at freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"