mpr(4) SAS3008 Repeated Crashing

Wed Mar 2 18:46:13 UTC 2016

> On Mar 2, 2016, at 12:23 AM, Borja Marcos <borjam at sarenet.es> wrote:
> 
> 
>> On 01 Mar 2016, at 23:08, Steven Hartland <killing at multiplay.co.uk> wrote:
>> 
>> Initial ideas would be bad signalling.
>> 
>> If you have the option to drop the speeds down and that helps, then that is almost certainly the case.
>> 
>> The original mfi driver was very bad at recovering from issues like this too; I spent over a month fixing and patching it to get it working reliably when there were hardware-related issues. In my case it turned out to be a dodgy CPU causing memory corruption, but you'll get similar behaviour from badly designed installs, particularly with expanders in play for high-speed devices (6-12Gbps link speed).
> 
> I’ve suffered similar problems, although not as severe, on one of my storage servers. It’s an IBM X Series with an LSI 3008 HBA
> connected to the backplane, using SATA SSDs. But mine are almost certainly hardware problems. An identical system is working
> without issues.
> 
> The symptom: with high I/O activity, for example, running Bonnie++, some commands abort with the disks returning a
> unit attention (power on/reset) asc 0,29.
> 

In your case, the UA is actually a secondary effect.  What’s happening is that a command is timing out so the driver is resetting the disk.  That causes the disk to report a UA with an ASC of 29/0 on the next command it gets after it comes back up.  It’s not fatal and I’m not sure if it should actually cause a retry, but that’s an investigation for a different time.  It does produce a lot of noise on the console/log, though.
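
As a rough illustration of reading that log line, here is a minimal Python sketch that turns a sense key and ASC/ASCQ pair into text. The helper name describe_sense is hypothetical and not part of any driver, and the tables deliberately hold only the one code discussed here (sense key 0x06, ASC/ASCQ 0x29/0x00 as defined by the SCSI spec).

```python
# Hypothetical helper for reading sense data in a log; not taken from any driver.
# Only the codes discussed in this thread are listed.
SENSE_KEYS = {0x06: "UNIT ATTENTION"}
ASC_ASCQ = {(0x29, 0x00): "Power on, reset, or bus device reset occurred"}


def describe_sense(sense_key, asc, ascq):
    """Return a readable string for a sense key and ASC/ASCQ pair."""
    # Fall back to the raw hex values for anything not in the small tables.
    key = SENSE_KEYS.get(sense_key, f"sense key 0x{sense_key:02x}")
    detail = ASC_ASCQ.get((asc, ascq), f"asc/ascq 0x{asc:02x}/0x{ascq:02x}")
    return f"{key}: {detail}"


print(describe_sense(0x06, 0x29, 0x00))
```

Seeing this pair right after a command timeout fits the reset-then-UA sequence described above, rather than pointing to a separate fault.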

One thing I noticed in your log is that one of the commands was a passthrough ATA command of 0x06 with a feature of 0x01, which is DSM TRIM.  It’s not clear whether this command was at fault (I need to add better logging for this case), but it’s highly suspect.  It was only being asked to trim one sector, but given how unpredictable TRIM responses are from the drive, I don’t know if this matters.  What it might point to, though, is that the timeout for the command was too short, the drive doesn’t support DSM TRIM that well, or the LSI adapter doesn’t support it well (since it’s not an NCQ command, the LSI firmware would have to remember to flush out the pending NCQ reads and writes before doing the DSM command).  The default timeout is 60 seconds, which should be enough unless you changed it deliberately.  If this is a reproducible case, would you be willing to retry with a different delete method, i.e. fiddle with the kern.cam.da.X.delete_method sysctl?
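
For anyone trying that experiment, the sketch below builds the sysctl command line for a given da unit without running it. The function name delete_method_cmd is hypothetical, and the list of method names is an assumption based on what the da(4) driver is commonly reported to accept; check the current value with "sysctl kern.cam.da.X.delete_method" on the affected system before changing anything.

```python
# Assumed set of delete method names for the da(4) driver; verify on your system.
DELETE_METHODS = ("NONE", "DISABLE", "ATA_TRIM", "UNMAP", "WS16", "WS10", "ZERO")


def delete_method_cmd(unit, method):
    """Build (but do not run) the sysctl command that sets a da unit's delete method."""
    if not isinstance(unit, int) or unit < 0:
        raise ValueError(f"unit must be a non-negative integer, got {unit!r}")
    if method not in DELETE_METHODS:
        raise ValueError(f"unknown delete method: {method!r}")
    # Returned as an argument list so it can be handed to subprocess.run() as root.
    return ["sysctl", f"kern.cam.da.{unit}.delete_method={method}"]


print(" ".join(delete_method_cmd(0, "DISABLE")))
```

Setting the method to DISABLE for the affected disk and re-running the same Bonnie++ load would show whether the timeouts follow the TRIM commands.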

In any case, I doubt that the problem is with cabling.  Active backplanes have been known to cause problems with LSI controllers and SATA disks, but the problem reported in your log doesn’t match the typical pattern for that.

Scott