mpr(4) SAS3008 Repeated Crashing

Thu Mar 3 17:12:53 UTC 2016

> On Mar 3, 2016, at 3:26 AM, Borja Marcos <borjam at sarenet.es> wrote:
> 
> 
>> On 03 Mar 2016, at 10:38, Steven Hartland <killing at multiplay.co.uk> wrote:
>> 
>> We've seen HW issues before where the first thing to start triggering the problem was TRIM requests, it seems like its an afterthought in most FW's unfortunately, so one of the first things to go bad. I'm not saying this is you issue, but its something to keep in mind.
> 
> Thanks :)
> 
> Not trim related, it seems. I’ve ran the tests with kern.cam.X.delete_method set to DISABLE and I still see errors:
> 
> In paranormal cases like this it would be awesome to have access to a logic analyzer… I keep dreaming of course :)
> 
> 
> 
> Mar  3 11:12:53 clientes-ssd8 kernel: (noperiph:mpr0:0:4294967295:0): SMID 2 Aborting command 0xfffffe0000c7cab0
> Mar  3 11:12:54 clientes-ssd8 kernel: (da2:mpr0:0:26:0): READ(10). CDB: 28 00 26 d7 d0 98 00 00 20 00 length 16384 SMID 322 terminated ioc 804b scsi 0 state c xfer 0
> Mar  3 11:12:54 clientes-ssd8 kernel: (da2:mpr0:0:26:0): READ(10). CDB: 28 00 26 d7 d0 b8 00 00 18 00 length 12288 SMID 217 terminated ioc 804b scsi 0 state c xfer 0
> Mar  3 11:12:54 clientes-ssd8 kernel: (da2:mpr0:0:26:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 205 terminated ioc 804b scsi 0 sta(da2:mpr0:0:26:0): READ(10). CDB: 28 00 26 d7 d0 18 00 00 78 00 
> Mar  3 11:12:54 clientes-ssd8 kernel: te c xfer 0
> Mar  3 11:12:54 clientes-ssd8 kernel: (da2:mpr0:0:26:0): CAM status: Command timeout
> Mar  3 11:12:54 clientes-ssd8 kernel: (da2:mpr0:0:26:0): Retrying command
> Mar  3 11:12:54 clientes-ssd8 kernel: (da2:mpr0:0:26:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
> Mar  3 11:12:54 clientes-ssd8 kernel: (da2:mpr0:0:26:0): CAM status: SCSI Status Error
> Mar  3 11:12:54 clientes-ssd8 kernel: (da2:mpr0:0:26:0): SCSI status: Check Condition
> Mar  3 11:12:54 clientes-ssd8 kernel: (da2:mpr0:0:26:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Mar  3 11:12:54 clientes-ssd8 kernel: (da2:mpr0:0:26:0): Retrying command (per sense data)
> 

SYNC CACHE seems to have been involved this time, and while it’s sometimes a source of trouble with SATA disks, I’m very hesitant to blame it.  Given the seemingly random nature of your problems, I’m not as certain anymore to rule out a fault of the disk enclosure.  This looks to be a different disk than your last report, and your statement that a sibling system exhibits no problems is very interesting.  Maybe there’s an issue with the power supply, and the disks are getting under-voltage conditions periodically.  If you can run smartctl against the disks, the output might be useful.  Also, if you’re able, could you make sure that both this system and the one that is working well are being fed with sufficient and similar AC power?  And if the power supply modules in your enclosures are swappable, maybe swap them between systems and see if the problem follows the module?  If that doesn’t fix it then I’ll think of ways to provide more instrumentation.

Scott