mpr(4) SAS3008 Repeated Crashing

Wed Mar 2 07:23:18 UTC 2016

> On 01 Mar 2016, at 23:08, Steven Hartland <killing at multiplay.co.uk> wrote:
> 
> Initial ideas would be bad signalling.
> 
> If you have the option to drop the speeds down and that helps, then that's almost certainly the case.
> 
> The original mfi driver was very bad at recovering from issues like this too. I spent over a month fixing and patching it to get it working reliably when there were hardware-related issues. In my case it turned out to be a dodgy CPU causing memory corruption, but you'll get similar behaviour from badly designed installs, particularly with expanders in play for high-speed devices (6-12Gbps link speed).

I’ve suffered similar problems, although not as severe, on one of my storage servers. It’s an IBM X Series with an LSI 3008 HBA
connected to the backplane, using SATA SSDs. But mine are almost certainly hardware problems. An identical system is working
without issues.

The symptom: with high I/O activity, for example when running Bonnie++, some commands abort with the disks returning a
unit attention (power on/reset), asc 29,0.

This definitely looks like a hardware problem to me. Might be the backplane
(it doesn’t affect the same disk every time, it’s completely random) or maybe a power supply problem making the disks reset?
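One way to check the claim that the resets are spread across disks is to tally the unit attention events per device from the system log. The following is a minimal sketch, assuming the log lines have the same shape as the excerpt at the end of this message; the function name `count_resets` is hypothetical and not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches CAM sense lines such as:
#   "... kernel: (da14:mpr0:0:40:0): SCSI sense: UNIT ATTENTION asc:29,0 (...)"
# The pattern is an assumption based on the log excerpt in this message.
UNIT_ATTENTION = re.compile(
    r"\((da\d+):mpr\d+:[^)]*\): SCSI sense: UNIT ATTENTION asc:29,0"
)


def count_resets(lines):
    """Return a Counter mapping device name (e.g. 'da14') to reset count."""
    counts = Counter()
    for line in lines:
        match = UNIT_ATTENTION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of the system log (for example `count_resets(open("/var/log/messages"))`, where the path is an assumption about the syslog configuration) gives one count per disk. Counts spread over many devices would point at something shared, such as the backplane or the power supply, rather than at one bad disk.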

And it hasn’t caused serious data corruption. (It’s decommissioned for now, of course!) Now and then ZFS will complain of a checksum failure, but a scrub
fixes it.

Now I’m fighting with IBM (now Lenovo) because all the components were sourced from them and it’s their call to debug it. Maybe I’ll hook an oscilloscope
to the power rails to check for suspicious transients or something like that, though. So far their response has been absolutely unacceptable. They ask for the
“RAID vendor”, and they seem unable to understand that someone might want to run these things with an OS other than Windows, and without creating
RAID volumes with the built-in controller. Sigh.

Maybe I could bribe someone to pose as “RAID vendor” ;)

Feb 12 07:43:59 clientes-ssd8 kernel: (noperiph:mpr0:0:4294967295:0): SMID 33 Aborting command 0xfffffe0000c7baf0
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 length 16384 SMID 989 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 20 00 00 18 00 length 12288 SMID 953 terminated ioc 804b scsi 0 state c xfe(da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 40 00 00 20 00 
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): CAM status: Command timeout
Feb 12 07:43:59 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 00 00 00 20 00 length 16384 SMID 571 terminated ioc 804b scsi 0 state c xfe(da14:r 0
Feb 12 07:43:59 clientes-ssd8 kernel: mpr0:0:	(da14:mpr0:0:40:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 638 te40:rminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 length 16384 SMID 818 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 40 00 00 20 00 length 16384 SMID 952 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 20 00 00 18 00 length 12288 SMID 922 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): READ(10). CDB: 28 00 31 40 ea 00 00 00 20 00 length 16384 SMID 823 terminated ioc 804b scsi 0 state c xfer 0
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): WRITE(10). CDB: 2a 00 39 a1 fe f0 00 00 20 00 
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): CAM status: SCSI Status Error
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): SCSI status: Check Condition
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Feb 12 07:44:00 clientes-ssd8 kernel: (da14:mpr0:0:40:0): Retrying command (per sense data)
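
For readers decoding the excerpt above by hand, here is a minimal sketch of how the numbers fit together. It assumes the usual LSI log_info layout (bus type in the top 4 bits, then a 4-bit originator, an 8-bit code and a 16-bit sub_code), which agrees with the kernel's own decode of 0x31120440 as originator(PL), code(0x12), sub_code(0x0440). It also assumes 512-byte logical blocks, which agrees with the logged transfer lengths. The function names are hypothetical.

```python
# Originator values as commonly used in LSI log_info words (assumption).
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_log_info(value):
    """Split a 32-bit log_info word into its fields."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


def decode_rw10(cdb_hex, block_size=512):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if len(cdb) != 10 or cdb[0] not in opcodes:
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return {
        "op": opcodes[cdb[0]],
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,  # block_size of 512 is an assumption
    }
```

For example, `decode_rw10("2a 00 39 a1 fe f0 00 00 20 00")` describes a WRITE(10) of 32 blocks, which is 16384 bytes at 512 bytes per block, the same length the driver logged for that command.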

Borja.