panic: bufwrite: buffer is not busy???

Mon Jan 31 20:43:47 UTC 2011

On Monday, January 31, 2011 1:31:54 am Eugene Grosbein wrote:
> On 15.01.2011 01:37, John Baldwin wrote:
> > On Friday, January 14, 2011 1:44:19 pm Eugene Grosbein wrote:
> >> On 14.01.2011 18:46, Mike Tancsa wrote:
> >>
> >>>> I'm using mpd 5.5 on three PPPoE routers, each servicing about 300 PPPoE
> >>>> concurrent sessions. Routers are based on Intel SR1630GP hardware platforms and
> >>>> runs FreeBSD 7.3-RELEASE.
> >>>>
> >>>> I'm experiencing stability issues related to Netgraph. None of above routers can
> >>>> survive more than 20-30 days of uptime under typical load. There are different
> >>>> flavors of kernel panics, but all are somehow related to netgraph. Typical
> >>>> backtraces follow
> >>>
> >>> I also have stability issues on RELENG_8.
> >>>
> >>> http://www.freebsd.org/cgi/query-pr.cgi?pr=153497
> >>
> >> And for one of my servers (8.2-PRERELEASE/amd64 with 4GB RAM) I just cannot
> >> obtain crashdump, it cannot finish to write it. For example, it happened an
> >> hour ago:
> >>
> >> Fatal trap 12: page fault while in kernel mode
> >> cpuid = 2; apic id = 04
> >> fault virtual address   = 0x200000040
> >> fault code              = supervisor read data, page not present
> >> instruction pointer     = 0x20:0xffffffff803cc979
> > 
> > Assuming your kernel is built with debug symbols (which is the default), one
> > thing you can do to aid in debugging is this:
> > 
> > gdb /boot/kernel/kernel
> > (gdb) l *0xffffffff803cc979
> > 
> > Where the 0xfff<blah> bit is the part of the 'instruction pointer' value
> > above after the colon (:) and then send the output of that in your e-mail to
> > the list.  This allows us to see the source line at which the fault occurred.
> > 
> 
> Yesterday I've got another kernel panic of this kind with RELENG_8 updated 20 January
> and it still could not finish writing of crashdump:
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 1; apic id = 02
> fault virtual address   = 0x200000030
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff803c1315
> stack pointer           = 0x28:0xffffff8000130780
> frame pointer           = 0x28:0xffffff80001307a0
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 12 (irq259: em1:rx 0)
> trap number             = 12
> panic: page fault
> cpuid = 1
> Uptime: 19h41m8s
> Dumping 4087 MB (3 chunks)
>   chunk 0: 1MB (150 pages) ... ok
>   chunk 1: 3575MB (915088 pages) 3559 3543panic: bufwrite: buffer is not busy???
> cpuid = 1
> Uptime: 19h41m9s
> Automatic reboot in 15 seconds - press a key on the console to abort
> 
> This time I have all debug symbols handy:
> 
> 
> # gdb kernel
> GNU gdb 6.1.1 [FreeBSD]
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "amd64-marcel-freebsd"...
> (gdb) l *0xffffffff803c1315
> 0xffffffff803c1315 is in ng_address_hook (/home/src/sys/netgraph/ng_base.c:3504).
> 3499             * Quick sanity check..
> 3500             * Since a hook holds a reference on it's node, once we know
> 3501             * that the peer is still connected (even if invalid,) we know
> 3502             * that the peer node is present, though maybe invalid.
> 3503             */
> 3504            if ((hook == NULL) ||
> 3505                NG_HOOK_NOT_VALID(hook) ||
> 3506                NG_HOOK_NOT_VALID(peer = NG_HOOK_PEER(hook)) ||
> 3507                NG_NODE_NOT_VALID(peernode = NG_PEER_NODE(hook))) {
> 3508                    NG_FREE_ITEM(item);

Hmmm.  I think you might have a hardware problem.  Notice the fault address:
it is 0x200000030.  Can you do 'x/i <instruction pointer>' in gdb?  I suspect
it is doing a memory access that has a constant offset of 0x30, in which case
the original pointer was 0x200000000, meaning it would be NULL except that it
has a single-bit error.  That would likely be caused by a hardware issue such
as failing RAM, etc.
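
The arithmetic behind that suspicion can be checked with a short sketch. It
is illustrative only: the helper name is hypothetical, and the offsets (0x30
for this fault, 0x40 for the earlier 0x200000040 fault) are assumptions that
the 'x/i' output would still need to confirm.

```python
# Sketch: does a fault address look like "NULL plus a constant offset,
# with exactly one bit flipped in the pointer"?
# The offsets passed in below are assumptions, not confirmed by disassembly.

def single_bit_flip_from_null(fault_addr, offset):
    """Return the index of the flipped bit if (fault_addr - offset) has
    exactly one bit set, i.e. is one bit flip away from NULL.
    Return None otherwise."""
    base = fault_addr - offset
    # A positive integer with exactly one bit set satisfies x & (x - 1) == 0.
    if base > 0 and base & (base - 1) == 0:
        return base.bit_length() - 1
    return None

# Fault address from this panic, with the assumed structure offset 0x30.
print(hex(0x200000030 - 0x30), single_bit_flip_from_null(0x200000030, 0x30))

# Fault address from the earlier panic, with an assumed offset of 0x40.
print(hex(0x200000040 - 0x40), single_bit_flip_from_null(0x200000040, 0x40))
```

Under those assumed offsets both fault addresses reduce to the same base
pointer, 0x200000000, which has only bit 33 set.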

-- 
John Baldwin