mpr(4) SAS3008 Repeated Crashing

Sean Bruno sbruno at freebsd.org
Wed Mar 2 20:23:17 UTC 2016


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512



On 03/02/16 10:43, Scott Long via freebsd-scsi wrote:
> 
>> On Mar 2, 2016, at 12:23 AM, Borja Marcos <borjam at sarenet.es>
>> wrote:
>> 
>> 
>>> On 01 Mar 2016, at 23:08, Steven Hartland
>>> <killing at multiplay.co.uk> wrote:
>>> 
>>> Initial ideas would be bad signalling.
>>> 
>>> If you have the option to drop the speeds down and that helps,
>>> then that is almost certainly the case.
>>> 
>>> The original mfi driver was very bad at recovering from issues
>>> like this too, I spent over a month fixing and patching it to
>>> get it working reliably when there were hardware-related
>>> issues. In my case it turned out to be a dodgy CPU causing
>>> memory corruption, but you'll get similar behaviour from badly
>>> designed installs, particularly with expanders in play for
>>> devices running at high link speeds (6-12Gbps).
>> 
>> I’ve suffered similar problems, although not as severe, on one of
>> my storage servers. It’s an IBM X Series with an LSI 3008 HBA
>> connected to the backplane, using SATA SSDs. But mine are almost
>> certainly hardware problems. An identical system is working 
>> without issues.
>> 
>> The symptom: with high I/O activity, for example, running
>> Bonnie++, some commands abort with the disks returning a unit
>> attention (power on/reset), asc 29,0.
>> 
> 
> In your case, the UA is actually a secondary effect.  What’s
> happening is that a command is timing out so the driver is
> resetting the disk.  That causes the disk to report a UA with an
> ASC of 29/0 on the next command it gets after it comes back up.
> It’s not fatal and I’m not sure if it should actually cause a
> retry, but that’s an investigation for a different time.  It does
> produce a lot of noise on the console/log, though.
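
The sequence described above (command timeout, disk reset, then a UA on
the next command) means a 29/0 report is a marker of an earlier reset
rather than a fault of its own. A minimal Python sketch of recognizing
it follows. The tables are a small hand-picked subset of the SPC sense
codes, and the names describe_sense and is_reset_ua are made up for
illustration; this is not how the driver itself handles sense data.

```python
# Hand-picked subset of SCSI sense keys and ASC/ASCQ pairs (not complete).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x2: "NOT READY",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

ASC_ASCQ = {
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
}


def describe_sense(sense_key, asc, ascq):
    """Return a short human-readable string for one sense triple."""
    key = SENSE_KEYS.get(sense_key, "sense key 0x%x" % sense_key)
    text = ASC_ASCQ.get((asc, ascq), "not in this table")
    return "%s asc:%02x,%02x (%s)" % (key, asc, ascq, text)


def is_reset_ua(sense_key, asc, ascq):
    """True for the UA a disk reports on the first command after a reset."""
    return sense_key == 0x6 and (asc, ascq) == (0x29, 0x00)


if __name__ == "__main__":
    print(describe_sense(0x6, 0x29, 0x00))
```

When is_reset_ua() is true, the interesting log entry is the timeout
that preceded the reset, not the UA.
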
> 
> One thing I noticed in your log is that one of the commands was a
> passthrough ATA command of 0x06 and feature of 0x01, which is DSM
> TRIM.  It’s not clear if this command was at fault; I need to add
> better logging for this case, but it’s highly suspect.  It was only
> being asked to trim one sector, but given how unpredictable TRIM
> responses are from the drive, I don’t know if this matters.  What
> it might point to, though, is that either the timeout for the
> command was too short, the drive doesn’t support DSM TRIM that
> well, or the LSI adapter doesn’t support it well (since it’s not an
> NCQ command, the LSI firmware would have to remember to flush out
> the pending NCQ reads and writes first before doing the DSM
> command).  The default timeout is 60 seconds, which should be
> enough unless you changed it deliberately.  If this is a
> reproducible case, would you be willing to re-try with a different
> delete method, i.e. fiddle with the kern.cam.da.X.delete_method
> sysctl?
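
Trying a different delete method, as suggested above, could be scripted
roughly as follows. This is only a sketch: the method names are the ones
I believe da(4) accepts and should be checked against the man page on
the system in question, and the helper names (delete_method_oid,
set_delete_method_argv, apply_delete_method) are made up here. Setting
the sysctl requires root.

```python
import subprocess

# Assumed list of values accepted by kern.cam.da.X.delete_method;
# verify against da(4) on the target FreeBSD release.
KNOWN_METHODS = ("NONE", "DISABLE", "ATA_TRIM", "UNMAP", "WS16", "WS10", "ZERO")


def delete_method_oid(unit):
    """Build the sysctl name for da<unit>, e.g. kern.cam.da.0.delete_method."""
    if unit < 0:
        raise ValueError("unit must be non-negative")
    return "kern.cam.da.%d.delete_method" % unit


def set_delete_method_argv(unit, method):
    """Return the sysctl(8) argument vector that would set the method."""
    if method not in KNOWN_METHODS:
        raise ValueError("unexpected delete method: %r" % (method,))
    return ["sysctl", "%s=%s" % (delete_method_oid(unit), method)]


def apply_delete_method(unit, method):
    """Actually run sysctl(8); only meaningful on a FreeBSD host, as root."""
    subprocess.run(set_delete_method_argv(unit, method), check=True)
```

For example, apply_delete_method(0, "DISABLE") would turn off deletes on
da0, which helps tell a TRIM-related timeout from an unrelated one.
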
> 
> In any case, I doubt that the problem is with cabling.  Active
> backplanes have been known to cause problems with LSI controllers
> and SATA disks, but the problem reported in your log doesn’t
> match the typical pattern for that.
> 
> Scott
> 

The controller itself was running a quite "old" firmware revision,
6.00.00.00.

We've updated it to 10.00.03.00 and are restarting our tests to see
what, if anything, is different.

I've disabled the tool that was sending the ATA commands down with
TRIM and we've rewired the cabling inside the host a bit to give us a
better idea of what's going on.
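
For context on what such a tool sends: as Scott describes, the command
is ATA 0x06 (DATA SET MANAGEMENT) with feature 0x01 (TRIM). Its data
payload is a list of 8-byte range entries, each holding a 48-bit LBA and
a 16-bit block count, padded to 512-byte blocks. Below is a rough Python
sketch of building the payload for a one-sector trim like the one in the
log. The function names and the example LBA are made up, and this only
builds the buffer; it does not issue any command.

```python
import struct

ATA_DATA_SET_MANAGEMENT = 0x06  # command opcode quoted in the thread
DSM_FEATURE_TRIM = 0x01         # feature value quoted in the thread
PAYLOAD_BLOCK = 512             # payload is sent in 512-byte blocks


def trim_range_entry(lba, count):
    """Pack one range: bits 47:0 are the LBA, bits 63:48 the block count."""
    if not 0 <= lba < (1 << 48):
        raise ValueError("LBA does not fit in 48 bits")
    if not 1 <= count <= 0xFFFF:
        raise ValueError("count must be 1..65535 blocks")
    return struct.pack("<Q", (count << 48) | lba)


def trim_payload(ranges):
    """Concatenate range entries and zero-pad to a whole 512-byte block."""
    if not ranges:
        raise ValueError("need at least one range")
    entries = b"".join(trim_range_entry(lba, n) for lba, n in ranges)
    padding = -len(entries) % PAYLOAD_BLOCK
    return entries + b"\x00" * padding


if __name__ == "__main__":
    # Hypothetical example: trim a single sector at LBA 0x1000.
    buf = trim_payload([(0x1000, 1)])
    print(len(buf), buf[:8].hex())
```
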

sean
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQF8BAEBCgBmBQJW10uuXxSAAAAAAC4AKGlzc3Vlci1mcHJAbm90YXRpb25zLm9w
ZW5wZ3AuZmlmdGhob3JzZW1hbi5uZXRCQUFENDYzMkU3MTIxREU4RDIwOTk3REQx
MjAxRUZDQTFFNzI3RTY0AAoJEBIB78oecn5k8r4H/RFu+OT6sa0qaivncWPLzOqQ
Q5Mzv8znFWHYyxX5es9EPwodEPnfOg/9QXLzC7TNha9ukDu+a723nka/1WUVl2Wq
9G93AImGy4AxjA6W/0bB0TXYGU26x8hVQ71E/xZB6XaVywGczuBbAtQIEESPi2n7
ScpBpOd6ctXexO4bCHPfu3Hz7Sq6Tbr3F1IHOqXGhbYTLekwDtBRzCDs+LTAr/qN
tF9ou4vL2Hn3KjfFFSDIiTKT2vcod7aCHsNJMAkXnmHe9HdCbQBEElvN48Al0O6E
6iLhYsqeuIfQ2THsn4/T2/f8MuSxa9xTxGtG8WqWeEVUtXU1V9A2l84/EqhoCpI=
=tMOi
-----END PGP SIGNATURE-----