problems with SAS JBODs 2

Ken Merry ken at freebsd.org
Thu Jul 12 13:38:42 UTC 2018


> On Jul 12, 2018, at 6:00 AM, Oliver Sech <crimsonthunder at gmx.net> wrote:
> 
> On 07/11/2018 10:35 PM, Ken Merry wrote:
>> Oliver, what happens when you try to do I/O to the devices that don’t go away after you pull the cable?  Does that cause the devices to go away?
> 
> I tried to 'dd if=/dev/daX of=/dev/null bs=1k count=1' and at least the "da" device disappears.

Ok, that’s good.  Can you send the dmesg output and check with ‘camcontrol devlist -v’ to make sure the device has fully gone away?

The reason I ask is that I have spent lots of time over the years debugging device arrival and departure problems in CAM, GEOM and devfs, and I want to make sure we aren’t running into any non-SAS related problems.

> 
>> Looking at the mprutil output, it also shows the devices sticking around from the adapter’s standpoint.
>> 
>> You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan N’ (where N is the scbus number shown by ‘camcontrol devlist -v’).  That will do some basic probes for each of the devices and should in theory cause them to go away if they aren’t accessible.
>> 
>> It seems like the adapter may not be recognizing that the devices in question have gone.
> 
> 
> I'm pretty sure that I tried this 'camcontrol rescan all' a few times. While I'm not sure anymore whether that cleans up the non-working devices, I'm sure that no new devices were added.

If doing a read from the device with dd makes it go away, ‘camcontrol rescan all’ should make it go away as well.  It sends a command to every device, and if the mpr(4) driver tells CAM the drive is no longer there, it’ll get removed.

If it doesn’t cause the device to get removed (and the rescan doesn’t hang), it means that you’re getting a response from a device that is no longer physically connected to the machine, which is impossible with SAS.
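
One way to check that is to compare the peripheral lists from ‘camcontrol devlist -v’ taken before and after the rescan.  Here is a rough Python sketch of the comparison.  It assumes each device line starts with ‘<’ and ends with a parenthesized peripheral list such as (pass0,da0); the helper names peripherals() and departed() are made up for illustration and are not part of any FreeBSD tool.

```python
import re

# Assumed line shape: "<VENDOR PRODUCT REV>  at scbus0 target 8 lun 0 (pass0,da0)"
# The regex captures the comma-separated peripheral list at the end of the line.
PERIPH_RE = re.compile(r"\(([^()]*)\)\s*$")


def peripherals(devlist_text):
    """Return the set of peripheral names (da0, pass0, ses0, ...) found in the text."""
    names = set()
    for line in devlist_text.splitlines():
        line = line.strip()
        if not line.startswith("<"):
            continue  # skip bus header lines such as "scbus0 on mpr0 bus 0:"
        match = PERIPH_RE.search(line)
        if match:
            for name in match.group(1).split(","):
                name = name.strip()
                if name:
                    names.add(name)
    return names


def departed(before_text, after_text):
    """Return a sorted list of peripherals present before the rescan but not after."""
    return sorted(peripherals(before_text) - peripherals(after_text))
```

Save the devlist output to a file before and after the rescan, read both files in, and pass the two strings to departed().  An empty list means the rescan removed nothing.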

> 
> Unfortunately I haven't yet gotten to Steve's 'clear controller mapping' script but I did a few other things:

Steve’s email made it sound like he was going to send it.  I just sent it to you separately.

> * The last time I tried to upgrade the firmware I had all sorts of problems. "sas3flash" reported bad checksums while flashing some of the files.
> So I reflashed both controllers with the DOS version of sas3flash. This was basically a challenge in itself because the DOS version of this utility does not seem to run on computers of this decade. (ERROR:  Failed to initialize PAL.  Exiting program.)
> The equivalent sas3flash.EFI version seems to be out of date and caused the checksum problems described before.
> (This time I wiped them before flashing with "sas3flash -o -e 6”.)

That is unfortunate…perhaps Steve has some insight.

> 
> * I tried to change the mpr tunable "use_phy_num" after that but this has not improved the situation. I will retry and collect logs with Steve's script.

Changed it to what?  I think it defaults to 1.  Did you try 0?

> * I retried with the latest "mpr.ko" from the Broadcom download page. (Same problems, no "use_phy_num" tunable.)
> 
> * I retested this hardware with Linux (4.15 and 4.17)
> ** Some shelves could be replugged reliably (i.e. 45 disks show up, 45 disks disappear, 45 disks reappear)
> ** On the newest shelf, 2 disks were missing after the replugging (i.e. 44 disks show up, 44 disks disappear, 42 disks reappear) (kernel log mpt3sas_cm0: "device is not present handle")
> 
> * I tried a different controller
> ** So far I used a Broadcom LSI SAS 9305-16e (Controller: SAS3216) (Firmware 16.00.01.00 or 15.00.00.00)
> ** Yesterday I switched to a brand-new, out-of-the-box Broadcom LSI 9305-24i (Controller: SAS3224) (Firmware 09.00.00.00 (or something similar with 09*))
> With the new controller everything seems to work on Linux. It might be the old firmware?...
> It is better with the new controller on FreeBSD in the sense that I at least get one out of two /dev/sesX devices back. But disks are still missing and are not getting completely cleaned up…

It does sound a bit like a mapping table problem.  Clearing it might help, we’ll see.

> This whole thing is a bit frustrating, especially since up until now I thought that HBAs were kind of "connect and forget" devices. The next step is to set up a separate test environment and try to get it to work there. I will keep you updated and try to provide logs for all FreeBSD related problems.

Thanks for debugging this.  Unfortunately there are a number of ways it can go wrong.  The mapping code has been the source of some problems, sometimes enclosure vendors do the wrong thing, and sometimes there are other bugs.

Ken  


