mpr(4) SAS3008 Repeated Crashing

Thu Mar 3 07:42:31 UTC 2016

> On 02 Mar 2016, at 19:43, Scott Long <scott4long at yahoo.com> wrote:
>> I’ve suffered similar problems, although not as severe, on one of my storage servers. It’s an IBM X Series with an LSI 3008 HBA
>> connected to the backplane, using SATA SSDs. But mine are almost certainly hardware problems. An identical system is working
>> without issues.
>> 
>> The symptom: with high I/O activity, for example, running Bonnie++, some commands abort with the disks returning a
>> unit attention (power on/reset), asc 29,0.
>> 
> 
> In your case, the UA is actually a secondary effect.  What’s happening is that a command is timing out so the driver is resetting the disk.  That causes the disk to report a UA with an ASC of 29/0 on the next command it gets after it comes back up.  It’s not fatal and I’m not sure if it should actually cause a retry, but that’s an investigation for a different time.  It does produce a lot of noise on the console/log, though.
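
For reference, a minimal Python sketch of how that sense data decodes. The two tables hold only the single entries relevant here (sense key 0x6 and ASC/ASCQ 29/00, as assigned in the SCSI specifications). describe_sense is a hypothetical helper, not part of any FreeBSD tool, and the output format is only meant to resemble a console line.

```python
# Illustrative subset of the SCSI sense tables; only the entries
# discussed in this thread are included.
SENSE_KEYS = {0x6: "UNIT ATTENTION"}
ASC_ASCQ = {(0x29, 0x00): "Power on, reset, or bus device reset occurred"}


def describe_sense(key, asc, ascq):
    """Return a human-readable description of a sense key and ASC/ASCQ pair.

    Hypothetical helper for illustration; unknown codes are reported
    as such rather than guessed.
    """
    key_name = SENSE_KEYS.get(key, f"sense key {key:#x}")
    detail = ASC_ASCQ.get((asc, ascq), "not in this table")
    return f"{key_name} asc:{asc:x},{ascq:x} ({detail})"


# The UA reported by the disk after the driver resets it.
print(describe_sense(0x6, 0x29, 0x00))
```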

Hmm. Interesting. It does indeed cause problems, although nothing that a ZFS scrub cannot fix. 

So it’s the driver that is resetting the disks? I was assuming that the disks were resetting themselves for some reason. 

> One thing I noticed in your log is that one of the commands was a passthrough ATA command of 0x06 and feature of 0x01, which is DSM TRIM.  It’s not clear if this command was at fault (I need to add better logging for this case), but it’s highly suspect.  It was only being asked to trim one sector, but given how unpredictable TRIM responses are from the drive, I don’t know if this matters.  What it might point to, though, is that either the timeout for the command was too short, the drive doesn’t support DSM TRIM that well, or the LSI adapter doesn’t support it well (since it’s not an NCQ command, the LSI firmware would have to remember to flush out the pending NCQ reads and writes before doing the DSM command).  The default timeout is 60 seconds, which should be enough unless you changed it deliberately.  If this is a reproducible case, would you be willing to retry with a different delete method, i.e. fiddle with the kern.cam.da.X.delete_method sysctl?

The server is not in production for now, so I can run experiments on it. I am trying with delete_method=DISABLE, although I guess using these disks without TRIM would have
a performance impact.
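
A minimal sketch of how such an experiment could be scripted, assuming Python is available on the box. The helper names and the unit numbers are hypothetical, and the list of accepted method names is an assumption that should be checked against da(4) on the release in use. By default it only prints the sysctl commands it would run.

```python
import subprocess

# Assumed set of values accepted by kern.cam.da.X.delete_method;
# verify against da(4) before relying on it.
KNOWN_METHODS = ("NONE", "DISABLE", "ATA_TRIM", "UNMAP", "WS16", "WS10", "ZERO")


def delete_method_oid(unit):
    """Build the sysctl name for a given da(4) unit number."""
    return f"kern.cam.da.{unit}.delete_method"


def set_delete_method_argv(unit, method):
    """Return the argv for a sysctl call that sets the delete method."""
    if method not in KNOWN_METHODS:
        raise ValueError(f"unknown delete method: {method}")
    return ["sysctl", f"{delete_method_oid(unit)}={method}"]


def apply(argv, dry_run=True):
    """Print the command in dry-run mode, otherwise execute it (needs root)."""
    if dry_run:
        print(" ".join(argv))
        return None
    done = subprocess.run(argv, check=True, capture_output=True, text=True)
    return done.stdout


# Hypothetical unit numbers; adjust to the disks behind the HBA.
for unit in (0, 1):
    apply(set_delete_method_argv(unit, "DISABLE"))
```

Running it as is prints one sysctl line per unit; passing dry_run=False to apply would execute them instead.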

What is puzzling is that the “twin” server is working like a charm. Same hardware, same software. We only updated the firmware on the ailing one when we noticed problems,
just in case.

Actually we’ve been poking the dealer and they are going to send a new one to test. Given how the twin works, the problem should go away.

Thanks!

Borja.