mpr(4) SAS3008 Repeated Crashing, LSI's spiritual advice would be appreciated

Steven Hartland killing at multiplay.co.uk
Mon Apr 25 09:37:28 UTC 2016



On 25/04/2016 09:29, Borja Marcos wrote:
>> On 04 Mar 2016, at 12:07, Steven Hartland <killing at multiplay.co.uk> wrote:
>>
>> On 04/03/2016 10:58, Borja Marcos wrote:
>>>> On 04 Mar 2016, at 10:16, Steven Hartland <killing at multiplay.co.uk> wrote:
>>>>
>>>> It's very rare, but we've also seen this type of behaviour from a failing Intel CPU. There was none of the other indications of a CPU issue that one might expect, so I just wanted to make you aware of the possibility.
>>>>
>>>> That said, the most common cause of this we've seen, when it's not a common disk or disks, is a bad backplane or bad cabling to the backplane.
>>> Now I’m really curious!
>>>
>>> How did you determine that it was the CPU? And what kind of issue was it causing? Noise in the power rails? Interference?
>> After a month or so of fixing mfi so it recovered from all bad events and prevented all the various kernel panics, the machine stayed running long enough to log an MCA which pointed to a failing CPU cache.
>>
>> We were lucky it was CPU #2, so we disabled all cores for said CPU in /boot/loader.conf and all the issues disappeared. We replaced the CPU and had no more issues.
>>
>> We were in the same situation as you: two machines with identical configs, one of which was constantly panicking in mfi while the other was rock solid.
> An update, long overdue. After complete inaction by IBM’s so-called “support”, who blamed us for using non-official operating systems, we complained
> quite loudly (and harshly) and they agreed to “replace a backplane for mere reasons of customer satisfaction”, despite my insisting that they also bring
> an HBA because we really didn’t know what was wrong.
>
> So they sent a technician with one of the three almost passive boards of the backplane, even though I told them that the issue was spread among the 24 disks, not
> just a group of 8. He changed one of them at random (I was on vacation when he came) and, as I imagined, the issue wasn’t solved at all.
>
> Tired of dealing with them, I pulled the SAS3 HBA and installed a classic LSI2008 card. That was a nightmare in itself, because the stupid firmware of the IBM hangs during
> boot (“connecting RAID adapters and boot devices” or something like that; I left it like that for 24 hours just to see if it eventually exited the loop). I had to erase the
> boot services flash from the HBA even though I had already disabled BIOS and UEFI services for the riser PCI card. Anyway, I digress.
>
> Repeating all of our tests, with the LSI2008 card everything works like a charm, although I’ve seen some surprising behavior. I spent a lot of time running
> benchmarks. I could repeat the error condition in less than an hour fairly reliably with the LSI3008 card, and I was unable to reproduce the error with the LSI2008.
> Of course, these days this is about as sure as you can be, unless someone presents you with a proper oscilloscope and SAS pod. I even suggested that to IBM,
> offering to do a serious diagnosis of the problem for them ;)
>
> The odd behavior, for which LSI’s spiritual advice would be welcome, is this: 6 minutes after booting the system, while doing a scrub in order to generate
> I/O load, and before beginning to run the error-triggering benchmarks, I saw some surprising messages in /var/log/messages:
>
> ------
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: Element descriptor: 'SLOT 000'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: SAS Device Slot Element: 1 Phys at Slot 0, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd99
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da16,pass17: Element descriptor: 'SLOT 001'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da16,pass17: SAS Device Slot Element: 1 Phys at Slot 1, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9a
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da17,pass18: Element descriptor: 'SLOT 002'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da17,pass18: SAS Device Slot Element: 1 Phys at Slot 2, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9b
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da18,pass19: Element descriptor: 'SLOT 003'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da18,pass19: SAS Device Slot Element: 1 Phys at Slot 3, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9c
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da19,pass20: Element descriptor: 'SLOT 004'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da19,pass20: SAS Device Slot Element: 1 Phys at Slot 4, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9d
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da20,pass21: Element descriptor: 'SLOT 005'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da20,pass21: SAS Device Slot Element: 1 Phys at Slot 5, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9e
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da21,pass22: Element descriptor: 'SLOT 006'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da21,pass22: SAS Device Slot Element: 1 Phys at Slot 6, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9f
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da22,pass23: Element descriptor: 'SLOT 007'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da22,pass23: SAS Device Slot Element: 1 Phys at Slot 7, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fda0
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da7,pass8: Element descriptor: 'SLOT 008'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da7,pass8: SAS Device Slot Element: 1 Phys at Slot 8, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd91
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da8,pass9: Element descriptor: 'SLOT 009'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da8,pass9: SAS Device Slot Element: 1 Phys at Slot 9, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd92
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da9,pass10: Element descriptor: 'SLOT 010'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da9,pass10: SAS Device Slot Element: 1 Phys at Slot 10, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd93
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da10,pass11: Element descriptor: 'SLOT 011'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da10,pass11: SAS Device Slot Element: 1 Phys at Slot 11, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd94
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da11,pass12: Element descriptor: 'SLOT 012'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da11,pass12: SAS Device Slot Element: 1 Phys at Slot 12, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd95
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da12,pass13: Element descriptor: 'SLOT 013'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da12,pass13: SAS Device Slot Element: 1 Phys at Slot 13, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd96
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da13,pass14: Element descriptor: 'SLOT 014'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da13,pass14: SAS Device Slot Element: 1 Phys at Slot 14, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd97
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da14,pass15: Element descriptor: 'SLOT 015'
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da14,pass15: SAS Device Slot Element: 1 Phys at Slot 15, Not All Phys
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: SATA device
> Apr 20 11:06:38 clientes-ssd8 kernel: ses1:  phy 0: parent 500507603ea6fd90 addr 500507603ea6fd98
>
> ------
>
> And at 17:41, something similar:
>
> ------
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da0,pass0: Element descriptor: 'SLOT 016'
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da0,pass0: SAS Device Slot Element: 1 Phys at Slot 16, Not All Phys
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d721
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da1,pass1: Element descriptor: 'SLOT 017'
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da1,pass1: SAS Device Slot Element: 1 Phys at Slot 17, Not All Phys
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d722
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da2,pass2: Element descriptor: 'SLOT 018'
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da2,pass2: SAS Device Slot Element: 1 Phys at Slot 18, Not All Phys
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d723
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da3,pass3: Element descriptor: 'SLOT 019'
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da3,pass3: SAS Device Slot Element: 1 Phys at Slot 19, Not All Phys
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d724
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da4,pass4: Element descriptor: 'SLOT 020'
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da4,pass4: SAS Device Slot Element: 1 Phys at Slot 20, Not All Phys
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d725
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da5,pass5: Element descriptor: 'SLOT 021'
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da5,pass5: SAS Device Slot Element: 1 Phys at Slot 21, Not All Phys
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d726
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da6,pass6: Element descriptor: 'SLOT 022'
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da6,pass6: SAS Device Slot Element: 1 Phys at Slot 22, Not All Phys
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: SATA device
> Apr 20 17:41:41 clientes-ssd8 kernel: ses0:  phy 0: parent 500507603ea6d720 addr 500507603ea6d727
>
> ------
>
> After those events I did a scrub just in case, and no errors were found. Could it be some expander oddity that somehow
> confused the LSI3008 but not the LSI2008?
>
> The system is working like a charm anyway, but I wonder if there’s some non-obvious problem waiting to become a time bomb.
>
> Regarding IBM, well, unless we can fix this, the expensive piece of hardware will be scrapped. And I really doubt
> any piece of kit from IBM/Lenovo (seems that Lenovo is in charge of support for these servers now) will be purchased here on
> my watch, ever.
>
The 2008 is a 6Gbps component and the 3008 is a 12Gbps one, so if you have
12Gbps-capable devices it's quite possible that where a 2008 works fine,
negotiating at 6Gbps, the 3008 could fail at 12Gbps due to the tighter
tolerances required from all components.

We had similar issues when chassis first started moving from 3Gbps to
6Gbps; in fact we found that Dell shipped drives with amended firmware
that limited their negotiation speed to 3Gbps specifically to work
around signalling issues in their chassis, even though they advertised
them as 6Gbps compatible.
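
As an aside, to see at a glance whether those ses messages arrive one
enclosure at a time, the quoted log can be grouped by timestamp and
enclosure. A minimal sketch in Python, assuming the syslog format shown
in the quoted messages; the function name and the regular expression are
illustrative only and not part of any FreeBSD tool:

```python
import re
from collections import defaultdict

# Matches the "Element descriptor" lines quoted above, e.g.
#   Apr 20 11:06:38 host kernel: ses1: da15,pass16: Element descriptor: 'SLOT 000'
# The pattern is an assumption based only on those quoted lines.
SLOT_RE = re.compile(
    r"(?P<when>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"(?P<ses>ses\d+): (?P<disk>da\d+),pass\d+: "
    r"Element descriptor: 'SLOT (?P<slot>\d+)'"
)


def group_ses_events(lines):
    """Map (timestamp, enclosure) -> list of (slot number, disk name)."""
    events = defaultdict(list)
    for line in lines:
        # search() rather than match(), so a leading "> " quote prefix is tolerated
        m = SLOT_RE.search(line)
        if m:
            key = (m["when"], m["ses"])
            events[key].append((int(m["slot"]), m["disk"]))
    return dict(events)
```

Fed the two quoted excerpts, this should give one group for ses1 at
11:06:38 covering slots 0 to 15 and one for ses0 at 17:41:41 covering
slots 16 to 22, which makes it easy to see whether a burst covers a whole
enclosure or only a few slots.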

     Regards
     Steve


More information about the freebsd-scsi mailing list