mpr(4) SAS3008 Repeated Crashing, LSI's spiritual advice would be appreciated
Borja Marcos
borjam at sarenet.es
Mon Apr 25 08:36:25 UTC 2016
> On 04 Mar 2016, at 12:07, Steven Hartland <killing at multiplay.co.uk> wrote:
>
> On 04/03/2016 10:58, Borja Marcos wrote:
>>> On 04 Mar 2016, at 10:16, Steven Hartland <killing at multiplay.co.uk> wrote:
>>>
>>> It's very rare, but we've also seen this type of behaviour from a failing Intel CPU. There was no other indication the CPU had an issue, which one might expect, so I just wanted to make you aware of the possibility.
>>>
>>> That said, the most common cause of this we've seen, when it's not a common disk or disks, is a bad backplane or cabling to the backplane.
>> Now I’m really curious!
>>
>> How did you determine that it was the CPU? And what kind of issue was it causing? Noise in the power rails? Interference?
> After a month or so of fixing mfi so it recovered from all bad events and prevented all the various kernel panics, the machine stayed running long enough to log an MCA which pointed to a failing CPU cache.
>
> We were lucky it was CPU #2, so we disabled all cores for said CPU in /boot/loader.conf and all the issues disappeared. We replaced the CPU and no more issues.
>
> We were in the same situation as you: two machines with identical configs, one of which was constantly panicking in mfi while the other was rock solid.
An update, long overdue. After the complete inaction of IBM’s so-called “support”, who blamed us for using unofficial operating systems, we complained
quite loudly (and harshly) and they agreed to “replace a backplane for mere reasons of customer satisfaction”, despite my insisting that they also bring
an HBA, because we really didn’t know what was wrong.
So they sent a technician with one of the three almost passive boards of the backplane, even though I told them that the issue was spread across all 24 disks, not
just a group of 8. He changed one of them at random (I was on vacation when he came) and, as I expected, the issue wasn’t solved at all.
Tired of dealing with them, I pulled the SAS3 HBA and installed a classic LSI2008 card. That was a nightmare in itself, because the stupid IBM firmware hung during
boot (“connecting RAID adapters and boot devices” or something like that; I left it like that for 24 hours just to see if it would eventually exit the loop). I had to erase the
boot services flash from the HBA even though I had already disabled BIOS and UEFI services for the riser PCI card. Anyway, I digress.
Repeating all of our tests with the LSI2008 card, everything works like a charm, although I’ve seen some surprising behavior. I spent a lot of time running
benchmarks. I could reproduce the error condition in less than an hour fairly reliably with the LSI3008 card, and I was unable to reproduce it with the LSI2008.
Of course, these days this is as sure as you can be, unless someone presents you with a proper oscilloscope and SAS pod. I even suggested that to IBM,
offering to do a serious diagnosis of the problem for them ;)
The odd behavior, for which LSI’s spiritual advice would be welcome, is this: 6 minutes after booting the system, while doing a scrub in order to generate
I/O load, and before beginning to run the error-triggering benchmarks, I saw some surprising messages in /var/log/messages:
-----
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: Element descriptor: 'SLOT 000'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: SAS Device Slot Element: 1 Phys at Slot 0, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd99
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da16,pass17: Element descriptor: 'SLOT 001'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da16,pass17: SAS Device Slot Element: 1 Phys at Slot 1, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9a
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da17,pass18: Element descriptor: 'SLOT 002'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da17,pass18: SAS Device Slot Element: 1 Phys at Slot 2, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9b
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da18,pass19: Element descriptor: 'SLOT 003'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da18,pass19: SAS Device Slot Element: 1 Phys at Slot 3, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9c
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da19,pass20: Element descriptor: 'SLOT 004'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da19,pass20: SAS Device Slot Element: 1 Phys at Slot 4, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9d
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da20,pass21: Element descriptor: 'SLOT 005'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da20,pass21: SAS Device Slot Element: 1 Phys at Slot 5, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9e
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da21,pass22: Element descriptor: 'SLOT 006'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da21,pass22: SAS Device Slot Element: 1 Phys at Slot 6, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd9f
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da22,pass23: Element descriptor: 'SLOT 007'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da22,pass23: SAS Device Slot Element: 1 Phys at Slot 7, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fda0
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da7,pass8: Element descriptor: 'SLOT 008'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da7,pass8: SAS Device Slot Element: 1 Phys at Slot 8, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd91
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da8,pass9: Element descriptor: 'SLOT 009'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da8,pass9: SAS Device Slot Element: 1 Phys at Slot 9, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd92
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da9,pass10: Element descriptor: 'SLOT 010'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da9,pass10: SAS Device Slot Element: 1 Phys at Slot 10, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd93
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da10,pass11: Element descriptor: 'SLOT 011'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da10,pass11: SAS Device Slot Element: 1 Phys at Slot 11, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd94
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da11,pass12: Element descriptor: 'SLOT 012'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da11,pass12: SAS Device Slot Element: 1 Phys at Slot 12, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd95
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da12,pass13: Element descriptor: 'SLOT 013'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da12,pass13: SAS Device Slot Element: 1 Phys at Slot 13, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd96
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da13,pass14: Element descriptor: 'SLOT 014'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da13,pass14: SAS Device Slot Element: 1 Phys at Slot 14, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd97
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da14,pass15: Element descriptor: 'SLOT 015'
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da14,pass15: SAS Device Slot Element: 1 Phys at Slot 15, Not All Phys
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device
Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd98
-----
And at 17:41, something similar:
-----
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da0,pass0: Element descriptor: 'SLOT 016'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da0,pass0: SAS Device Slot Element: 1 Phys at Slot 16, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d721
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da1,pass1: Element descriptor: 'SLOT 017'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da1,pass1: SAS Device Slot Element: 1 Phys at Slot 17, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d722
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da2,pass2: Element descriptor: 'SLOT 018'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da2,pass2: SAS Device Slot Element: 1 Phys at Slot 18, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d723
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da3,pass3: Element descriptor: 'SLOT 019'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da3,pass3: SAS Device Slot Element: 1 Phys at Slot 19, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d724
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da4,pass4: Element descriptor: 'SLOT 020'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da4,pass4: SAS Device Slot Element: 1 Phys at Slot 20, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d725
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da5,pass5: Element descriptor: 'SLOT 021'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da5,pass5: SAS Device Slot Element: 1 Phys at Slot 21, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d726
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da6,pass6: Element descriptor: 'SLOT 022'
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da6,pass6: SAS Device Slot Element: 1 Phys at Slot 22, Not All Phys
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: SATA device
Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d727
-----
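For anyone who wants to compare the two bursts, here is a minimal sketch (not something I ran on the affected machine) that condenses these ses(4) messages into one row per slot. The regular expressions only assume the message layout shown in the excerpts above; the function name `summarize` and the sample list are made up for illustration.

```python
import re
from collections import defaultdict

# Matches e.g. "ses1: da15,pass16: Element descriptor: 'SLOT 000'"
DESC_RE = re.compile(
    r"(ses\d+): (da\d+),pass\d+: Element descriptor: 'SLOT (\d+)'")
# Matches e.g. "ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd99"
ADDR_RE = re.compile(
    r"(ses\d+): phy \d+: parent ([0-9a-fA-F]+) addr ([0-9a-fA-F]+)")


def summarize(lines):
    """Return {enclosure: [(slot, disk, parent, addr), ...]}, sorted by slot."""
    result = defaultdict(list)
    pending = {}  # enclosure -> (slot, disk) still waiting for its address line
    for line in lines:
        m = DESC_RE.search(line)
        if m:
            ses, disk, slot = m.groups()
            pending[ses] = (int(slot), disk)
            continue
        m = ADDR_RE.search(line)
        if m and m.group(1) in pending:
            ses, parent, addr = m.groups()
            slot, disk = pending.pop(ses)
            result[ses].append((slot, disk, parent, addr))
    return {ses: sorted(rows) for ses, rows in result.items()}


# Two entries copied from the excerpts above, used only as sample input.
SAMPLE = [
    "Apr 20 11:06:38 clientes-ssd8 kernel: ses1: da15,pass16: Element descriptor: 'SLOT 000'",
    "Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: SATA device",
    "Apr 20 11:06:38 clientes-ssd8 kernel: ses1: phy 0: parent 500507603ea6fd90 addr 500507603ea6fd99",
    "Apr 20 17:41:41 clientes-ssd8 kernel: ses0: da0,pass0: Element descriptor: 'SLOT 016'",
    "Apr 20 17:41:41 clientes-ssd8 kernel: ses0: phy 0: parent 500507603ea6d720 addr 500507603ea6d721",
]

for ses, rows in sorted(summarize(SAMPLE).items()):
    print(f"{ses}: {len(rows)} slot(s) reported")
    for slot, disk, parent, addr in rows:
        # Distance between the slot address and the expander (parent) address.
        offset = int(addr, 16) - int(parent, 16)
        print(f"  slot {slot:3d}  {disk:<5} addr {addr} (parent + {offset})")
```

To run it over a real log, pass an open file instead of the sample, for example `summarize(open("/var/log/messages"))`. A summary like this makes it easier to see whether a whole enclosure reported its slots at once, as both excerpts suggest, or only some of them did.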
After those events I did a scrub just in case, and no errors were found. Could it be some expander oddity that somehow
confused the LSI3008 and not the LSI2008?
The system is working like a charm anyway, but I wonder if there’s some non-obvious problem waiting to become a time bomb.
Regarding IBM, well, unless we can fix this, the expensive piece of hardware will be scrapped. And I really doubt
any piece of kit from IBM/Lenovo (it seems that Lenovo is in charge of support for these servers now) will be purchased here on
my watch, ever.
Thanks,
Borja.
More information about the freebsd-scsi mailing list