mpr(4) SAS3008 Repeated Crashing
Steven Hartland
killing at multiplay.co.uk
Thu Mar 3 09:37:53 UTC 2016
On 03/03/2016 07:42, Borja Marcos wrote:
>> On 02 Mar 2016, at 19:43, Scott Long <scott4long at yahoo.com> wrote:
>>> I’ve suffered similar problems, although not as severe, on one of my storage servers. It’s an IBM X Series with an LSI 3008 HBA
>>> connected to the backplane, using SATA SSDs. But mine are almost certainly hardware problems. An identical system is working
>>> without issues.
>>>
>>> The symptom: with high I/O activity, for example, running Bonnie++, some commands abort with the disks returning a
>>> unit attention (power on/reset) asc 0,29.
>>>
>> In your case, the UA is actually a secondary effect. What’s happening is that a command is timing out so the driver is resetting the disk. That causes the disk to report a UA with an ASC of 29/0 on the next command it gets after it comes back up. It’s not fatal and I’m not sure if it should actually cause a retry, but that’s an investigation for a different time. It does produce a lot of noise on the
>> console/log, though.
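For reference, the post-reset UA described above is sense key 0x6
(UNIT ATTENTION) with ASC/ASCQ 0x29/0x00, "power on, reset, or bus
device reset occurred". A minimal sketch of spotting it in
fixed-format sense data; the constant names here are illustrative,
not taken from the driver:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SSD_KEY_UNIT_ATTENTION  0x06
#define ASC_POWER_ON_RESET      0x29

/* Fixed-format sense: key in byte 2 (low nibble), ASC/ASCQ in bytes 12/13. */
static int
is_power_on_reset_ua(const uint8_t *sense, size_t len)
{
        if (len < 14)
                return (0);
        return ((sense[2] & 0x0f) == SSD_KEY_UNIT_ATTENTION &&
            sense[12] == ASC_POWER_ON_RESET && sense[13] == 0x00);
}

int
main(void)
{
        /* The sense bytes a drive typically returns after a reset. */
        uint8_t sense[18] = { 0x70, 0x00, 0x06, [12] = 0x29, [13] = 0x00 };

        printf("post-reset UA: %s\n",
            is_power_on_reset_ua(sense, sizeof(sense)) ? "yes" : "no");
        return (0);
}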
This sounds similar to what we saw in mfi; while the cause was
different, the real problem was the error paths in the driver, which
were untested and buggy, causing more problems and resulting in panics.
I was lucky, or unlucky depending on your point of view, that the HW
issue we had was very good at triggering pretty much every failure path
in the driver, which allowed me to fix them. Without that it's really
hard to truly test these code paths, which hardly ever get exercised.
> Hmm. Interesting. It does indeed cause problems, although nothing that a ZFS scrub cannot fix.
>
> So it’s the driver that is resetting the disks? I was assuming that the disks were resetting themselves for some reason.
>
>> One thing I noticed in your log is that one of the commands was a passthrough ATA command of 0x06 and feature of 0x01, which is DSM TRIM. It’s not clear if this command was at fault, I need to add better logging for this case, but it’s highly suspect. It was only being asked to trim one sector, but given how unpredictable TRIM responses are from the drive, I don’t know if this matters. What it might point to, though, is that either the timeout for the command was too short, the drive doesn’t support DSM TRIM that well, or the LSI adapter doesn’t support it well (since it’s not an NCQ command, the LSI firmware would have to remember to flush out the pending NCQ reads and writes first before doing the DSM command). The default timeout is 60 seconds, which should be enough unless you changed it deliberately. If this is a reproducible case, would you be willing to re-try with a different delete method, i.e. fiddle with the kern.cam.da.X.delete_method sysctl?
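To make the "one sector" detail concrete: per the ACS spec, the DSM
TRIM payload (command 0x06, feature 0x01, as above) is a 512-byte
block of 8-byte little-endian range entries, with bits 0-47 holding
the starting LBA and bits 48-63 the sector count. A sketch of packing
one such entry; the helper name is mine, for illustration only:

#include <stdint.h>
#include <stdio.h>

/*
 * One 8-byte, little-endian DSM TRIM range entry: bits 0-47 are the
 * starting LBA, bits 48-63 the number of sectors to trim.
 */
static void
dsm_trim_range(uint8_t buf[8], uint64_t lba, uint16_t nsectors)
{
        uint64_t entry;

        entry = (lba & 0x0000ffffffffffffULL) | ((uint64_t)nsectors << 48);
        for (int i = 0; i < 8; i++)
                buf[i] = (entry >> (8 * i)) & 0xff;
}

int
main(void)
{
        uint8_t buf[8];

        dsm_trim_range(buf, 123456, 1);         /* trim a single sector */
        for (int i = 0; i < 8; i++)
                printf("%02x ", buf[i]);
        printf("\n");
        return (0);
}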
> The server is not in production for now, so I can run experiments on it. I am trying with delete_method=DISABLE. Although using these disks without trim would have
> a performance impact I guess.
>
> What is puzzling is, the “twin” server is working like a charm. Same hardware, same software. We only updated firmwares on the ailing one when we noticed problems,
> just in case.
>
> Actually we’ve been poking the dealer and they are going to send a new one to test. Given how the twin works, the problem should go away.
>
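For anyone wanting to script that experiment: the delete method can be
flipped with sysctl(8), e.g. "sysctl kern.cam.da.0.delete_method=DISABLE",
or programmatically via sysctlbyname(3). A sketch, assuming the disk is
da0; note that setting the value requires root:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
        char old[32];
        size_t oldlen = sizeof(old);
        const char *new_method = "DISABLE";

        /* Read the old value and set the new one in a single call. */
        if (sysctlbyname("kern.cam.da.0.delete_method", old, &oldlen,
            new_method, strlen(new_method) + 1) == -1) {
                perror("sysctlbyname");         /* needs root to set */
                return (1);
        }
        printf("delete_method was %.*s, now %s\n", (int)oldlen, old,
            new_method);
        return (0);
}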
We've seen HW issues before where the first thing to start triggering
the problem was TRIM requests; TRIM seems to be an afterthought in most
FWs, unfortunately, so it's one of the first things to go bad. I'm not
saying this is your issue, but it's something to keep in mind.
Regards
Steve