mpr causing a boot hang sometime after r348368 - NUMA related?
Terry Kennedy
TERRY at glaver.org
Sun Sep 1 04:52:29 UTC 2019
TL;DR - As FreeBSD 12.0-STABLE moves forward, boot becomes increasingly
likely to hang when the mpr controller is in a slot on the 2nd CPU.
I have a Dell PowerEdge R730 (configuration details available if needed)
with a PERC H730 mini (mrsas driver) and a "12Gbps external HBA", Dell part
number T93GD (mpr driver). An external Dell LTO4 drive is attached to the
external HBA and is the only thing connected to it.
r348368 boots normally, and the HBA and tape are recognized as:
mpr0: <Avago Technologies (LSI) SAS3008> port 0x8000-0x80ff mem 0xc9100000-0xc910ffff,0xc8000000-0xc80fffff irq 64 at device 0.0 numa-domain 1 on pci17
mpr0: Firmware: 16.00.04.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>
mpr0: Found device <c01<SspTarg,Direct>,End Device> <6.0Gbps> handle<0x0009> enclosureHandle<0x0001> slot 7
mpr0: At enclosure level 0 and connector name (1 )
sa0 at mpr0 bus 0 scbus14 target 7 lun 0
The next revision I tried was r350268. That boots most of the time, but
sometimes hangs with various messages, not in any particular order, such
as (forgive any typos, I could only get these as screen grabs):
mpr_config_get_dpm_pg0: request for page completed with error 60
mpr0: Out of chain frames, consider increasing hw.mpr.max_chains
(probe0:mpr0:0:7:0): Down reving Protocol Version from 4 to 0?
mpr0: Calling Reinit from mpr_wait_command, timeout=60, elapsed=60)
mpr0: Reinit success
run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
This all happens whether or not the external tape drive is plugged into
the system (unplugged at the system end, so no dangling cables).
The problem goes away (with unacceptable loss of performance) if I boot
in safe mode. Setting hw.mpr.disable_msi=1 and hw.mpr.disable_msix=1 has
no effect.
r350970 behaves in much the same way: it boots sometimes, but only boots
reliably in safe mode.
r351637 never seems to boot normally. In safe mode it boots 100% of the
time.
Dell has replaced the controller and the problem persists. Since it still
happens with the tape drive disconnected, I didn't have them replace the
drive and cable.
The one thing I noted when Dell had the chassis open was that the slot
this card is in is labeled "CPU 2", which would seem to be confirmed by
the "numa-domain 1" in the working dmesg output. Unfortunately, all of the
low-profile slots in this chassis are on CPU 2, and my card (like the Dell
spare) is a low-profile-only part. I had the tech put
the card in one of the full-height CPU 1 slots (which involved removing
the card bracket and installing it "naked", which he wasn't comfortable
with). Lo and behold, it boots when the card is in numa-domain 0:
mpr0: <Avago Technologies (LSI) SAS3008> port 0x2000-0x20ff mem 0x93600000-0x9360ffff,0x92500000-0x925fffff irq 32 at device 0.0 numa-domain 0 on pci4
mpr0: Firmware: 16.00.04.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>
mpr0: Found device <c01<SspTarg,Direct>,End Device> <6.0Gbps> handle<0x0009> enclosureHandle<0x0001> slot 7
mpr0: At enclosure level 0 and connector name (1 )
sa0 at mpr0 bus 0 scbus2 target 7 lun 0
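To compare placement across boots, the numa-domain can be pulled out of
attach lines like the ones above. This is a minimal Python sketch, assuming
the attach-line layout seen in these two excerpts; the function name
numa_placement and the regular expression are illustrative and not part of
any FreeBSD tool.

```python
import re

# Matches a device attach line of the form seen in the excerpts, e.g.
# "mpr0: <...> port ... irq 64 at device 0.0 numa-domain 1 on pci17".
# Assumption: the description is in <...> and the line ends with
# "numa-domain N on <bus>".
ATTACH_RE = re.compile(
    r"^(?P<dev>[a-z]+\d+): <(?P<desc>.*)> .*?"
    r"numa-domain (?P<domain>\d+) on (?P<bus>\w+)\s*$"
)


def numa_placement(dmesg_text):
    """Return {device: (numa_domain, parent_bus)} for matching attach lines."""
    placement = {}
    for line in dmesg_text.splitlines():
        m = ATTACH_RE.match(line.strip())
        if m:
            placement[m.group("dev")] = (int(m.group("domain")), m.group("bus"))
    return placement


# Attach lines copied from the two dmesg excerpts in this message.
SLOT_ON_CPU2 = (
    "mpr0: <Avago Technologies (LSI) SAS3008> port 0x8000-0x80ff mem "
    "0xc9100000-0xc910ffff,0xc8000000-0xc80fffff irq 64 at device 0.0 "
    "numa-domain 1 on pci17\n"
    "mpr0: Firmware: 16.00.04.00, Driver: 18.03.00.00-fbsd\n"
)
SLOT_ON_CPU1 = (
    "mpr0: <Avago Technologies (LSI) SAS3008> port 0x2000-0x20ff mem "
    "0x93600000-0x9360ffff,0x92500000-0x925fffff irq 32 at device 0.0 "
    "numa-domain 0 on pci4\n"
)

if __name__ == "__main__":
    print(numa_placement(SLOT_ON_CPU2))  # prints {'mpr0': (1, 'pci17')}
    print(numa_placement(SLOT_ON_CPU1))  # prints {'mpr0': (0, 'pci4')}
```

Lines without a numa-domain field (the Firmware and IOCCapabilities lines,
for example) are skipped, so a whole saved dmesg can be passed in.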
I was able to do 4 consecutive working boots before the tech got antsy
and wanted to either put the card back in a low-profile slot or start the
meter for billable time.
Based on this, it seems to be a timing-related issue when the mpr card
is on the 2nd CPU (and when SMP is enabled).
Any suggestions for further diagnostic information, other things to try,
or (preferably) "here, try this patch"?
Terry Kennedy http://www.glaver.org New York, NY USA