problems with SAS JBODs 2
Oliver Sech
crimsonthunder at gmx.net
Tue Jul 24 18:22:49 UTC 2018
Update 2: I continued testing with more and different hardware.
Tested with an LSI SAS9207-8e HBA:
* After disconnect, all devices (/dev/daX, /dev/sesX) disappear properly;
  no rescans or writing necessary
* No more targets in mpsutil (not mprutil, as this SAS2 HBA uses the mps driver)
* After reconnect, all disks and all ses devices appear!
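The disconnect/reconnect check above can be scripted roughly as follows. This is only a sketch: count_periph is a made-up helper name, and it assumes the usual "camcontrol devlist" line format, where each line ends in a peripheral list such as (pass0,da0).

```sh
#!/bin/sh
# count_periph PREFIX (hypothetical helper, not part of camcontrol):
# reads `camcontrol devlist` output on stdin and prints how many lines
# list a peripheral named PREFIX<unit>, e.g. da0 or ses1.
# Assumes each line ends in a peripheral list like "(pass0,da0)".
count_periph() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that.
    grep -E -c "[(,]$1[0-9]+[,)]" || true
}

# Example on the FreeBSD host, run before and after replugging the shelf:
#   camcontrol devlist | count_periph da
#   camcontrol devlist | count_periph ses
```

Comparing the two counts before the disconnect and after the reconnect shows quickly whether any disks or ses devices failed to come back.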
Tested with a hardware RAID controller, LSI SAS 9286CV-8e:
* No problems with the shelf/SAS in different configurations
* Switching the controller and importing the configuration works reliably
So far I think there is a problem with the mpr driver, and I'm quite confident that it affects other people.
With a simple configuration it is probably not immediately noticeable, as everything seems to work after the first connect/boot.
It probably gets scarier for people with multipathing and big SAS chains, I guess...
I will downgrade to SAS2 HBAs shortly as I'm running out of space. If there is anything I can help with while I still have the hardware in the lab, let me know.
Oliver
On 07/23/2018 04:14 PM, Oliver Sech wrote:
> Sorry for the delay. I moved to a different office and could not focus on this issue last week.
>
> I tested all of the hardware with different drivers and firmware on Linux to make sure this is not a hardware problem:
> * Firmware 09.00.101.00 + Driver 26.000.00.00 (compiled) -> GOOD
> * Firmware 09.00.101.00 + Driver 12.100.00.00 (default kernel) -> GOOD
> * Firmware 16.00.01.00 + Driver 26.000.00.00 -> BAD (42 out of 44 disks after reconnect)
> * Firmware 16.00.01.00 + Driver 12.100.00.00 -> BAD (42 out of 44 disks after reconnect)
>
> I tested a different HBA with an old firmware as well and there were no issues. Only with the latest firmware are disks missing after a reconnect, with the error "mpt3sas_cm0: device is not present handle"
> I don't know yet how the firmware versions between 09.00.000.00 and 16 behave...
>
> Additional Info/Changes:
> * Upgraded the test system to 11.2 as suggested on the mailing list. -> No change
> * "camcontrol rescan all" removes the devices that are still present after the cable has been removed. "camcontrol devlist -v" does not show them anymore
>
>
> Setting the driver tunable "use_phy_num" to 0 and using the clearDPM script between connects does not help. In fact, I do not see any difference in behavior at all.
> I reflashed the controller multiple times and erased everything except the "manufacturing" area to make sure that no previous settings are kept.
> The only thing I know that "fixes" the missing drives is to reboot the server.
>
> A (similar?) problem also occurs once I start the server with all 6 disk shelves (11 backplanes, 17 expanders, 200+ disks). Everything comes up properly with 5 shelves; once I connect the 6th shelf offline, some random disks are missing and I can no longer import the ZFS pool.
>
> The following logs were collected with the very old FW 09.00.101.00 that worked on Linux.
> Logs: https://www.dropbox.com/s/6nw88rt6ajh713s/freebsd_sas3.zip?dl=0
>
> best regards,
> Oliver
>
> On 07/12/2018 03:38 PM, Ken Merry wrote:
>>
>>> On Jul 12, 2018, at 6:00 AM, Oliver Sech <crimsonthunder at gmx.net> wrote:
>>>
>>> On 07/11/2018 10:35 PM, Ken Merry wrote:
>>>> Oliver, what happens when you try to do I/O to the devices that don’t go away after you pull the cable? Does that cause the devices to go away?
>>>
>>> I tried to 'dd if=/dev/daX of=/dev/null bs=1k count=1' and at least the "da" device disappears.
>>
>> Ok, that’s good. Can you send the dmesg output and check with ‘camcontrol devlist -v’ to make sure the device has fully gone away?
>>
>> The reason I ask is that I have spent lots of time over the years debugging device arrival and departure problems in CAM, GEOM and devfs, and I want to make sure we aren’t running into any non-SAS related problems.
>>
>>>
>>>> Looking at the mprutil output, it also shows the devices sticking around from the adapter’s standpoint.
>>>>
>>>> You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan N’ (where N is the scbus number shown by ‘camcontrol devlist -v’). That will do some basic probes for each of the devices and should in theory cause them to go away if they aren’t accessible.
>>>>
>>>> It seems like the adapter may not be recognizing that the devices in question have gone.
>>>
>>>
>>> I'm pretty sure that I tried this 'camcontrol rescan all' a few times. While I'm not sure anymore whether that cleans up the non-working devices, I'm sure that no new devices were added.
>>
>> If doing a read from the device with dd makes it go away, ‘camcontrol rescan all’ should make it go away as well. It sends a command to every device, and if the mpr(4) driver tells CAM the drive is no longer there, it’ll get removed.
>>
>> If it doesn’t cause the device to get removed (and the rescan doesn’t hang), it means that you’re getting a response from a device that is no longer physically connected to the machine, which is impossible with SAS.
>>
>>>
>>> Unfortunately I haven't yet gotten to Steve's 'clear controller mapping' script, but I did a few other things:
>>
>> Steve’s email made it sound like he was going to send it. I just sent it to you separately.
>>
>>> * The last time I tried to upgrade the firmware I had all sorts of problems. "sas3flash" reported bad checksums while flashing some of the files.
>>> So I reflashed both controllers with the DOS version of sas3flash. This was basically a challenge in itself because the DOS version of this utility does not seem to run on computers of this decade. (ERROR: Failed to initialize PAL. Exiting program.)
>>> The equivalent sas3flash.EFI version seems to be out of date and caused the checksum problems described before.
>>> (This time I wiped them before flashing with "sas3flash -o -e 6".)
>>
>> That is unfortunate…perhaps Steve has some insight.
>>
>>>
>>> * I tried to change the mpr tunable "use_phy_num" after that, but this has not improved the situation. I will retry and collect logs with Steve's script.
>>
>> Changed it to what? I think it defaults to 1. Did you try 0?
>>
>>> * I retried with the latest "mpr.ko" from the Broadcom download page. (Same problems, no "use_phy_num" tunable.)
>>>
>>> * I retested this hardware with Linux (4.15 and 4.17)
>>> ** Some shelves could be replugged reliably (ie: 45 disks show up, 45 disks disappear, 45 disks reappear)
>>> ** With the newest shelf, 2 disks were missing after replugging (ie: 44 disks show up, 44 disks disappear, 42 disks reappear) (kernel log: "mpt3sas_cm0: device is not present handle")
>>>
>>> * I tried a different controller
>>> ** So far I used a Broadcom LSI SAS 9305-16e (Controller: SAS3216) (Firmware 16.00.01.00 or 15.00.00.00)
>>> ** Yesterday I switched to a new fresh out-of-the-box Broadcom LSI 9305-24i (Controller: SAS3224) (Firmware 09.00.00.00 (or something similar with 09*))
>>> With the new controller everything seems to work on Linux. It might be the old firmware?...
>>> It is better with the new controller on FreeBSD in the sense that I at least get one out of two /dev/sesX devices back. But disks are still missing and are not getting completely cleaned up…
>>
>> It does sound a bit like a mapping table problem. Clearing it might help, we’ll see.
>>
>>> This whole thing is a bit frustrating, especially since up until now I thought that HBAs were kind of "connect and forget" devices. The next step is to set up a separate test environment and try to get it to work there. I will keep you updated and try to provide logs for all FreeBSD-related problems.
>>
>> Thanks for debugging this. Unfortunately there are a number of ways it can go wrong. The mapping code has been the source of some problems, sometimes enclosure vendors do the wrong thing, and sometimes there are other bugs.
>>
>> Ken
>>