mpr(4) SAS3008 Repeated Crashing

Steven Hartland killing at multiplay.co.uk
Fri Mar 4 09:16:25 UTC 2016


On 04/03/2016 08:02, Borja Marcos wrote:
>> On 03 Mar 2016, at 18:09, Scott Long <scott4long at yahoo.com> wrote:
>>
>>
>> SYNC CACHE seems to have been involved this time, and while it’s sometimes a source of trouble with SATA disks, I’m very hesitant to blame it.  Given the seemingly random nature of your problems, I’m not as certain anymore to rule out a fault of the disk enclosure.  This looks to be a different disk than your last report, and your statement that a sibling system exhibits no problems is very interesting.  Maybe there’s an issue with the power supply, and the disks are getting under-voltage conditions periodically.  If you can run smartctl against the disks, the output might be useful.  Also, if you’re able, could you make sure that both this system and the one that is working well are being fed with sufficient and similar AC power?  And if the power supply modules in your enclosures are swappable, maybe swap them between systems and see if the problem follows the module?  If that doesn’t fix it then I’ll think of ways to provide more instrumentation.
> The affected disks are completely random. I didn’t copy a lot of instances to avoid too much litter, but each time it’s a different disk.
>
> Both systems are in the same datacenter, and yes, the power infrastructure is working. Swapping modules can be done if
> the dealer sends us another one because I prefer not to mess with a working system.
>
> The fact that it’s a different disk each time, and that the other system works perfectly is what makes me quite certain that it’s a hardware problem. Either some trouble
> with the backplane or a power problem.
>
> I am tempted to go the oscilloscope route (monitoring the internal power rails). But if the problem is in the power distribution of the backplane itself
> I’ll need to destroy a broken disk to build a backplane power probe :)
>
It's very rare, but we've also seen this type of behaviour from a failing
Intel CPU. There was no other indication that the CPU had an issue, which
one might expect, so I just wanted to make you aware of the possibility.

That said, the most common cause we've seen, when it's not the same
disk or disks each time, is a bad backplane or bad cabling to the backplane.

     Regards
     Steve


