Repeated msgs & kernel panic w/ r246437 (Revamp the CAM enclosure services driver)
Kenneth D. Merry
ken at freebsd.org
Tue Apr 23 14:18:48 UTC 2013
On Tue, Apr 23, 2013 at 11:09:42 +0300, Alexander Motin wrote:
> On 22.04.2013 06:00, John wrote:
> >Hi Folks,
> >
> > After updating one of our servers to the latest stable image,
> >it appears that commit r246437 is causing it to panic.
> >
> >The commit:
> >
> >http://svnweb.freebsd.org/base?view=revision&revision=246437
> >
> >What one of our servers looks like:
> >
> >http://people.freebsd.org/~jwd/zfsnfsserver.jpg
> >
> >The last known working commit:
> >
> >http://people.freebsd.org/~jwd/r246437/dmesg.r246431.clean.txt
> >
> >With commit r246437:
> >
> >http://people.freebsd.org/~jwd/r246437/dmesg.r246437.log.txt
> >
> >Note, most of the dmesg output is related to the ses devices. It
> >repeats itself multiple times before the panic.
> >
> >ses39: ses0,pass20: Element descriptor: ' '
> >ses39: ses0,pass20: SAS Expander: 24 Physses39: phy 0: connector 255 other 255
> >ses39: phy 1: connector 255 other 255
> >ses39: phy 2: connector 255 other 255
> >ses39: phy 3: connector 255 other 255
> >ses39: phy 4: connector 255 other 255
> >ses39: phy 5: connector 255 other 255
> >ses39: phy 6: connector 255 other 255
> >
> >etc, etc...
>
> That is not my part of the code, but I think these are just overly
> verbose debug messages that should be hidden.

Yes, it is probably too verbose, especially on such a large system.
> >After just a few minutes, the system panics. A pair of images
> >of the screen (sorry, no serial console at this time):
> >
> >Panic: http://people.freebsd.org/~jwd/r246437/20130419_160143.jpg
> >
> >bt: http://people.freebsd.org/~jwd/r246437/20130419_110158.jpg
>
> Although you mention the "latest stable image", I believe your
> kernel is not the latest 9-STABLE. Your backtrace reminds me of locking
> problems that should already have been fixed in several places. For example,
> on present 9-STABLE ses_path_iter_devid_callback() doesn't call
> xpt_create_path(), but calls xpt_create_path_unlocked() instead. If you
> can reproduce the issue with the latest 9-STABLE, please provide the
> relevant information.

I agree. I added the xpt_create_path_unlocked() call to fix a
panic with a stack trace just like the one above. It looks like the problem
comes from running exactly r246437, without the later fixes.
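
For readers unfamiliar with the locked/unlocked naming, here is a minimal
userland sketch of the pattern, not the CAM code itself. It assumes the
"locked" variant expects the caller to already hold the lock, while the
"_unlocked" variant takes and drops the lock on its own. Every demo_* name
is hypothetical, the lock-held flag is tracked by hand and is only valid
single-threaded, and where a kernel would typically assert, the sketch
returns -1 so that the failure can be observed.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a subsystem lock; not the real CAM SIM lock. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static bool demo_lock_held = false;	/* hand-tracked, single-threaded demo only */

static void
demo_lock_acquire(void)
{
	pthread_mutex_lock(&demo_lock);
	demo_lock_held = true;
}

static void
demo_lock_release(void)
{
	demo_lock_held = false;
	pthread_mutex_unlock(&demo_lock);
}

/*
 * "Locked" variant: the caller must already hold demo_lock.
 * Returns -1 when called without the lock, 0 on success.
 */
static int
demo_create_path(int *path_out, int target)
{
	if (!demo_lock_held)
		return (-1);
	*path_out = target * 10;	/* placeholder for the real work */
	return (0);
}

/*
 * "Unlocked" variant: safe to call without the lock, because it
 * acquires and releases demo_lock around the locked variant.
 */
static int
demo_create_path_unlocked(int *path_out, int target)
{
	int error;

	demo_lock_acquire();
	error = demo_create_path(path_out, target);
	demo_lock_release();
	return (error);
}
```

A callback that runs without the lock held would call
demo_create_path_unlocked(); calling demo_create_path() from that context
is the kind of mistake that shows up as a locking failure.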
> >We are currently running a test to see if the fact that all our
> >shelves are dual-attached, allowing us to use geom multipath is
> >related. ie: we have disabled the 2nd HBA thus cutting the total
> >number of da & ses devices in half and thus not executing the
> >code in the commit that tracks duplicate ses devices.
> >
> >Note, if we disable both HBA devices and boot the system up it
> >does not panic or print out the repeated messages, but of course
> >we have no disks :-)
> >
> >I am unclear on the "connector 255 other 255" messages and have not
> >taken the time to look into them yet.
> >
> >I would appreciate any insights folks can provide.
> >
> >Many Thanks,
> >John
> >
> >ps: We've had to seriously increase the console buffer size to
> >capture the complete dmesg output...
> >
> >options MSGBUF_SIZE=(32768*32)
> >
> >Can we delay starting the kernel daemon until after the system
> >is up and /var/log/messages is available? Just a thought...
>
> The goal of this code was to create persistent location-dependent names
> for devices. It may be better to have them earlier.

Yes, I agree.
Ken
--
Kenneth Merry
ken at FreeBSD.ORG