Panic during kernel booting on HP Proliant DL180G6 and latest STABLE

Thu Sep 22 04:19:27 UTC 2011

On Wed, Sep 21, 2011 at 08:26:46PM -0700, Craig Leres wrote:
> I have a lot of supermicro motherboards and the newest ones have igb
> chipsets; they've been quite a headache with respect to FreeBSD 8. I'm
> running 8.2-RELEASE but have upgraded parts of my kernel to 8-RELENG (as
> of a few months ago). Some of them work ok while others panic on bootup.
> Upgrading to newer versions of the intel igb code fixes some but breaks
> others. It's been frustrating.
> 
> While working on this today, I saw two different kernel panics:
> 
>     Could not setup receive structures
>     m_getzone: m_getjcl: invalid cluster type
> 
> I tried John Baldwin's patch but got the "invalid cluster type" panic so
> I backed it out.
> 
> Later I figured out that either turning off hw.igb.enable_msix
> (loader.conf) or raising kern.ipc.nmbclusters to 131072 (sysctl.conf)
> and setting hw.igb.num_queues to 4 (loader.conf) would avoid the
> "receive structures" panic but either way I was seeing the "invalid
> cluster type" panic.
> 
> Looking at m_getjcl(), I suspected the passed size to be 0; some debugging
> confirmed this. Looks like a race here where a receive interrupt comes
> in before adapter->rx_mbuf_sz has been initialized.
> 
> Attached is the hack I added to avoid the panic when booting. The idea
> is to pretend m_getjcl() failed to allocate a cluster rather than to go
> down in flames.
> 
>                 Craig

> Index: if_igb.c
> ===================================================================
> --- if_igb.c	(revision 31)
> +++ if_igb.c	(working copy)
> @@ -3695,6 +3695,11 @@
>  		    htole64(hseg[0].ds_addr);
>  no_split:
>  		if (rxbuf->m_pack == NULL) {
> +			if (adapter->rx_mbuf_sz == 0) {
> +				printf("igb_refresh_mbufs: "
> +				     "avoid m_getjcl() panic\n");
> +				goto update;
> +			}
>  			mp = m_getjcl(M_DONTWAIT, MT_DATA,
>  			    M_PKTHDR, adapter->rx_mbuf_sz);
>  			if (mp == NULL)
> @@ -3912,6 +3917,12 @@
>  
>  skip_head:
>  		/* Now the payload cluster */
> +		if (adapter->rx_mbuf_sz == 0) {
> +			printf("igb_setup_receive_ring: "
> +			    "avoid m_getjcl() panic\n");
> +			error = ENOBUFS;
> +			goto fail;
> +		}
>  		rxbuf->m_pack = m_getjcl(M_DONTWAIT, MT_DATA,
>  		    M_PKTHDR, adapter->rx_mbuf_sz);
>  		if (rxbuf->m_pack == NULL) {
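
For readers following along, here is a minimal userland sketch of the
failure mode and of the guard the patch above adds.  It is not kernel
code.  It assumes the size-to-zone lookup behind m_getjcl() is a switch
over a fixed set of cluster sizes that panics on anything else, which
is what the "invalid cluster type" message suggests.  The names
model_getzone(), guarded_alloc() and enum zone are hypothetical, and
the size constants are typical amd64 values assumed for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cluster sizes (typical for amd64); illustration only. */
#define MCLBYTES     2048         /* standard cluster */
#define MJUMPAGESIZE 4096         /* page-sized jumbo cluster */
#define MJUM9BYTES   (9 * 1024)   /* 9k jumbo cluster */
#define MJUM16BYTES  (16 * 1024)  /* 16k jumbo cluster */

/* Hypothetical stand-in for the kernel's allocator zones. */
enum zone { ZONE_INVALID, ZONE_CLUST, ZONE_JUMBOP, ZONE_JUMBO9, ZONE_JUMBO16 };

/*
 * Model of the size-to-zone lookup.  The kernel is assumed to panic in
 * the default case; this model returns ZONE_INVALID instead so that it
 * can be exercised safely.
 */
static enum zone
model_getzone(int size)
{
	switch (size) {
	case MCLBYTES:
		return (ZONE_CLUST);
	case MJUMPAGESIZE:
		return (ZONE_JUMBOP);
	case MJUM9BYTES:
		return (ZONE_JUMBO9);
	case MJUM16BYTES:
		return (ZONE_JUMBO16);
	default:
		return (ZONE_INVALID);	/* a size of 0 lands here */
	}
}

/*
 * Guard in the spirit of the patch: if the receive buffer size has not
 * been initialized yet (0), report failure instead of doing the lookup.
 * Returns 1 and fills *zonep on success, 0 if the allocation is skipped.
 */
static int
guarded_alloc(int rx_mbuf_sz, enum zone *zonep)
{
	if (rx_mbuf_sz == 0)
		return (0);	/* pretend the allocation failed */
	*zonep = model_getzone(rx_mbuf_sz);
	/* Stands in for the kernel panic on an unsupported size. */
	assert(*zonep != ZONE_INVALID);
	return (1);
}
```

With these assumed values, guarded_alloc(0, &z) returns 0 and leaves z
untouched, while guarded_alloc(MCLBYTES, &z) returns 1 and sets z to
ZONE_CLUST.  Note that the guard only hides the symptom; the ordering
problem that leaves the size at 0 is still there.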

The fact that you have this happening on multiple systems makes me
uncomfortable, because we use Supermicro hardware exclusively.

Your email contains no References or In-Reply-To headers, so it appears
as a new thread.  As such, I'll point readers to the existing thread,
which spans several months:

http://lists.freebsd.org/pipermail/freebsd-stable/2011-May/062596.html
http://lists.freebsd.org/pipermail/freebsd-stable/2011-June/062949.html
http://lists.freebsd.org/pipermail/freebsd-stable/2011-September/063867.html

Also re-CC'ing Jack Vogel.

Chris, I am under the impression that to get proper visibility and
attention in this matter, you're probably going to need to set up a
serial console (both BIOS-level and bootloader-level) for remote
debugging capability.  Jack, John, or someone familiar with kernel
debugging is probably going to need to get access to a machine which is
experiencing this problem so they can figure out what's going on.

The tricky part here is that you're going to need to have a custom
kernel built that includes numerous debugging options.  PXE booting is
probably the easiest method.  Remember you don't need filesystems on the
system, just a kernel that boots/loads and will drop to ddb> when the
panic happens.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |