problems with SAS JBODs 2
Stephen Mcconnell
stephen.mcconnell at broadcom.com
Wed Jul 25 19:09:00 UTC 2018
Can you enable Mapping Debugging, then do these steps again and send the
logs. If I don't see anything interesting in the logs I might have you turn
more debug bits on. So, first set the debug_level to 0x203. What I'm looking
for is some indication that the driver is dropping a device or not adding
it. If that's not happening at the driver level, something else is causing
the problem. You can try setting the Event Debug flag as well, but that
might be too overwhelming to capture (debug_level = 0x207).
Steve
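
For readers following along, here is a small sketch of how the two debug_level values above break down into individual flag bits. The bit names are an assumption taken from the flag list in the mpr(4) manual page as I remember it, so check the man page on your release. On FreeBSD the value is normally set through the dev.mpr.X.debug_level sysctl or the hw.mpr.X.debug_level loader tunable, again per mpr(4).

```python
# Debug flag bits as listed in the mpr(4) manual page. This table is an
# assumption for illustration; verify it against the man page on your release.
MPR_DEBUG_FLAGS = {
    0x0001: "info",
    0x0002: "fault",
    0x0004: "event",
    0x0008: "log",
    0x0010: "recovery",
    0x0020: "error",
    0x0040: "init",
    0x0080: "xinfo",
    0x0100: "user",
    0x0200: "mapping",
    0x0400: "trace",
}


def decode_debug_level(level):
    """Return the names of the flag bits set in a debug_level value."""
    return [name for bit, name in sorted(MPR_DEBUG_FLAGS.items()) if level & bit]


if __name__ == "__main__":
    # 0x203 is the first value requested; 0x207 adds the Event Debug flag.
    for level in (0x203, 0x207):
        print(hex(level), decode_debug_level(level))
```

The only difference between the two values is bit 0x004, which is why 0x207 is described as the one that might produce too much output to capture.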
> -----Original Message-----
> From: Oliver Sech [mailto:crimsonthunder at gmx.net]
> Sent: Wednesday, July 25, 2018 4:24 AM
> To: Stephen Mcconnell; FreeBSD-scsi
> Subject: Re: problems with SAS JBODs 2
>
> I ran the clear_dpm.sh script and changed the value you suggested.
> Rebooted and retested. As far as I can tell there is no difference.
>
> I tried the menu option (99. Reset port) in lsiutil and this helps with
> missing devices. After resetting the port I get all my disks and ses devs
> again.
>
> Read NVRAM or current values? [0=NVRAM, 1=Current, default is 0]
>
> 0000 : 21080600
> 0004 : 00000001
> 0008 : 00180080
> 000c : 00000001
> 0010 : 00000000
> 0014 : 00000000
>
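
As a reading aid for the page dump above, the following sketch parses the "offset : value" lines and looks at the dword at offset 0x0C. The interpretation of the value uses only what Steve writes in this thread (2 means Enclosure/Slot Mapping, 1 is the value he asked for to get Device Persistent Mapping). The real field may encode more than these two values, and the function names are hypothetical.

```python
# Meaning of the dword at offset 0x0C, taken only from Steve's description
# in this thread. Any other value is reported as unknown.
MAPPING_MODES = {
    0x1: "Device Persistent Mapping",
    0x2: "Enclosure/Slot Mapping",
}

# The dump as posted by Oliver (quote prefixes removed).
DUMP = """\
0000 : 21080600
0004 : 00000001
0008 : 00180080
000c : 00000001
0010 : 00000000
0014 : 00000000
"""


def parse_page_dump(text):
    """Parse 'offset : value' hex lines into a {offset: value} dict."""
    page = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()  # tolerate mail quote prefixes
        if ":" not in line:
            continue
        offset, value = (part.strip() for part in line.split(":", 1))
        try:
            page[int(offset, 16)] = int(value, 16)
        except ValueError:
            continue  # not a hex line, e.g. a prompt from the tool
    return page


def mapping_mode(page):
    """Name the mapping mode stored at offset 0x0C, if it is a known value."""
    return MAPPING_MODES.get(page.get(0x0C), "unknown")


if __name__ == "__main__":
    print(mapping_mode(parse_page_dump(DUMP)))
```

For the dump in this message the dword at 0x0C is 00000001, so the change Steve asked for is in place.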
> On 07/24/2018 10:22 PM, Stephen Mcconnell wrote:
> > Oliver, can you try changing the mapping mode on the controller? I think
> > you're using Enclosure/Slot Mapping and I want to see what happens with
> > Device Persistent Mapping. To do that, follow these steps:
> > 1. Run Ken’s script to clear the DPM entries
> > 2. Use LSIUtil to change the mapping mode in IOC Page 8: Command 9, Page
> > Type 1, Page Number 8. If you see 00000002 at offset 0x0C you're using
> > Enclosure/Slot Mapping and I'd like you to change this. You will be
> > asked if you want to make changes. Select ‘yes’ and then change offset
> > 0x0C to 00000001 (you might have to type C instead of 0x0C for the
> > offset). Just use the default setting to change NVRAM.
> > 3. Reboot and see what happens and let me know how it goes.
> >
> >
> > Steve
> >
> >> -----Original Message-----
> >> From: owner-freebsd-scsi at freebsd.org [mailto:owner-freebsd-scsi at freebsd.org] On Behalf Of Oliver Sech
> >> Sent: Tuesday, July 24, 2018 12:23 PM
> >> To: FreeBSD-scsi
> >> Subject: Re: problems with SAS JBODs 2
> >>
> >> update 2: I continued to test with more and different hardware.
> >>
> >> tested with a LSI SAS9207-8e HBA:
> >> * after disconnect all devices properly disappear /dev/daX /dev/ses
> >> no rescans or writing necessary
> >> * no more targets in mpsutil (not mprutil)
> >> * after reconnect all disks and all ses devs appear!
> >>
> >> tested with hardware raid LSI SAS 9286CV-8e
> >> * no problems with the shelf/sas in different configurations
> >> * switching the controller and importing configuration works reliably
> >>
> >> So far I think there is a problem with the mpr driver and I'm quite
> >> confident that it does affect other people.
> >> With a simple configuration it is probably not immediately noticeable,
> >> as everything seems to work after the first connect/boot.
> >> It probably gets scarier for people with multipathing and big SAS
> >> chains, I guess...
> >>
> >> I will downgrade to SAS2 HBAs shortly as I'm running out of space. If
> >> there is anything I can help with while I still have hardware in the
> >> lab, let me know.
> >>
> >> Oliver
> >>
> >> On 07/23/2018 04:14 PM, Oliver Sech wrote:
> >>> Sorry for the delay. I moved to a different office and could not focus
> >>> on
> >> this issue last week.
> >>>
> >>> I tested all of the hardware with different drivers and firmware on
> >>> Linux to make sure this is not a hardware problem:
> >>> * Firmware 09.00.101.00 + Driver 26.000.00.00 (compiled) -> GOOD
> >>> * Firmware 09.00.101.00 + Driver 12.100.00.00 (default kernel) -> GOOD
> >>> * Firmware 16.00.01.00 + Driver 26.000.00.00 -> BAD (42 out of 44 disks
> >>>   after reconnect)
> >>> * Firmware 16.00.01.00 + Driver 12.100.00.00 -> BAD (42 out of 44 disks
> >>>   after reconnect)
> >>>
> >>> I tested a different HBA with an old firmware as well and there were
> >>> no issues. Only with the latest FW are disks missing after a reconnect,
> >>> with the error mpt3sas_cm0: "device is not present handle"
> >>> I don't know yet how the firmware behaves for versions between
> >>> 09.00.000.00 and 16...
> >>>
> >>> Additional Info/Changes:
> >>> * Upgraded test system to 11.2 as suggested in the mailing list. -> No
> >>>   Change
> >>> * "camcontrol rescan all" removes the devices that are still present
> >>>   after the cable has been removed. "camcontrol devlist -v" does not
> >>>   show them anymore
> >>>
> >>>
> >>> Setting the driver "use_phy_num" to 0 and using the clearDPM script
> >>> between connects does not help. In fact I do not see any different
> >>> behavior at all?
> >>> I reflashed the controller multiple times and erased everything except
> >>> the "manufacturing" area to make sure that no previous settings are
> >>> kept.
> >>> The only thing I know that "fixes" the missing drives is to reboot the
> >>> server.
> >>>
> >>> A (similar?) problem also occurs once I start the server with all 6
> >>> disk shelves (11 backplanes, 17 expanders, 200+ disks). Everything
> >>> comes up properly with 5 shelves, but once I offline connect the 6th
> >>> shelf, some random disks are missing and I can no longer import the
> >>> ZFS pool.
> >>>
> >>> The following logs were collected with the very old FW 09.00.101.00
> >>> that worked on Linux.
> >>> Logs: https://www.dropbox.com/s/6nw88rt6ajh713s/freebsd_sas3.zip?dl=0
> >>>
> >>> best regards,
> >>> Oliver
> >>>
> >>> On 07/12/2018 03:38 PM, Ken Merry wrote:
> >>>>
> >>>>> On Jul 12, 2018, at 6:00 AM, Oliver Sech <crimsonthunder at gmx.net> wrote:
> >>>>>
> >>>>> On 07/11/2018 10:35 PM, Ken Merry wrote:
> >>>>>> Oliver, what happens when you try to do I/O to the devices that
> >>>>>> don’t go away after you pull the cable? Does that cause the devices
> >>>>>> to go away?
> >>>>>
> >>>>> I tried 'dd if=/dev/daX of=/dev/null bs=1k count=1' and at least
> >>>>> the "da" device disappears.
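
The dd probe above amounts to a single small read from the device node. A rough Python equivalent, as a sketch only: probe_device is a hypothetical helper, and the device path you pass in (for example a /dev/daX node) is up to you.

```python
import os


def probe_device(path, nbytes=1024):
    """Try to read nbytes from path; return True if the read succeeds."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        # The node is missing or cannot be opened.
        return False
    try:
        os.read(fd, nbytes)
        return True
    except OSError:
        # The node exists but the I/O failed, as with a pulled cable.
        return False
    finally:
        os.close(fd)


if __name__ == "__main__":
    # os.devnull is used here only so the example runs anywhere.
    print(probe_device(os.devnull))
```

A False result only says the read failed. Whether CAM has actually removed the device still needs to be checked with 'camcontrol devlist -v', as Ken asks below.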
> >>>>
> >>>> Ok, that’s good. Can you send the dmesg output and check with
> >>>> ‘camcontrol devlist -v’ to make sure the device has fully gone away?
> >>>>
> >>>> The reason I ask is that I have spent lots of time over the years
> >>>> debugging device arrival and departure problems in CAM, GEOM and
> >>>> devfs, and I want to make sure we aren’t running into any non-SAS
> >>>> related problems.
> >>>>
> >>>>>
> >>>>>> Looking at the mprutil output, it also shows the devices sticking
> >>>>>> around from the adapter’s standpoint.
> >>>>>>
> >>>>>> You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan
> >>>>>> N’ (where N is the scbus number shown by ‘camcontrol devlist -v’).
> >>>>>> That will do some basic probes for each of the devices and should in
> >>>>>> theory cause them to go away if they aren’t accessible.
> >>>>>>
> >>>>>> It seems like the adapter may not be recognizing that the devices
> >>>>>> in question have gone.
> >>>>>
> >>>>>
> >>>>> I'm pretty sure that I tried this 'camcontrol rescan all' a few
> >>>>> times. While I'm not sure anymore if that cleans up the non-working
> >>>>> devices, I'm sure that no new devices were added.
> >>>>
> >>>> If doing a read from the device with dd makes it go away, ‘camcontrol
> >>>> rescan all’ should make it go away as well. It sends a command to
> >>>> every device, and if the mpr(4) driver tells CAM the drive is no
> >>>> longer there, it’ll get removed.
> >>>>
> >>>> If it doesn’t cause the device to get removed (and the rescan doesn’t
> >>>> hang), it means that you’re getting a response from a device that is
> >>>> no longer physically connected to the machine, which is impossible
> >>>> with SAS.
> >>>>
> >>>>>
> >>>>> Unfortunately I haven't gotten to Steve's 'clear controller mapping'
> >>>>> script yet but I did a few other things:
> >>>>
> >>>> Steve’s email made it sound like he was going to send it. I just
> >>>> sent it to you separately.
> >>>>
> >>>>> * The last time I tried to upgrade the firmware I had all sorts of
> >>>>> problems. "sas3flash" reported bad checksums while flashing some of
> >>>>> the files.
> >>>>> So I reflashed both controllers with the DOS version of sas3flash.
> >>>>> This was basically a challenge in itself because the DOS version of
> >>>>> this utility does not seem to run on computers of this decade.
> >>>>> (ERROR: Failed to initialize PAL. Exiting program.)
> >>>>> The equivalent sas3flash.EFI version seems to be out of date and
> >>>>> caused the checksum problems described before.
> >>>>> (This time I wiped them before flashing with "sas3flash -o -e 6".)
> >>>>
> >>>> That is unfortunate…perhaps Steve has some insight.
> >>>>
> >>>>>
> >>>>> * I tried to change the mpr tunable "use_phy_num" after that but this
> >>>>> has not improved the situation. I will retry and collect logs with
> >>>>> Steve's script.
> >>>>
> >>>> Changed it to what? I think it defaults to 1. Did you try 0?
> >>>>
> >>>>> * I retried with the latest "mpr.ko" from the Broadcom download
> >>>>> page. (Same problems, no "use_phy_num" tunable.)
> >>>>>
> >>>>> * I retested this hardware with Linux (4.15 and 4.17)
> >>>>> ** Some shelves could be replugged reliably (ie: 45 disks show up,
> >>>>> 45 disks disappear, 45 disks reappear)
> >>>>> ** On the newest shelf 2 disks were missing after the replugging
> >>>>> (ie: 44 disks show up, 44 disks disappear, 42 disks reappear)
> >>>>> (kernel log mpt3sas_cm0: "device is not present handle")
> >>>>>
> >>>>> * I tried a different controller
> >>>>> ** So far I used a Broadcom LSI SAS 9305-16e (Controller: SAS3216)
> >>>>> (Firmware 16.00.01.00 or 15.00.00.00)
> >>>>> ** Yesterday I switched to a new fresh out-of-the-box Broadcom LSI
> >>>>> 9305-24i (Controller: SAS3224) (Firmware 09.00.00.00 (or something
> >>>>> similar with 09*))
> >>>>> With the new controller everything seems to work on Linux. It might
> >>>>> be the old Firmware?...
> >>>>> It is better with the new controller on FreeBSD in the sense that I
> >>>>> at least get one out of two /dev/sesX devices back. But disks are
> >>>>> still missing and are not getting completely cleaned up…
> >>>>
> >>>> It does sound a bit like a mapping table problem. Clearing it might
> >>>> help, we’ll see.
> >>>>
> >>>>> This whole thing is a bit frustrating, especially since up until now
> >>>>> I thought that HBAs are kind of "connect and forget" devices. Next
> >>>>> step is to set up a separate test environment and try to get it to
> >>>>> work there. I will keep you updated and try to provide logs for all
> >>>>> FreeBSD related problems.
> >>>>
> >>>> Thanks for debugging this. Unfortunately there are a number of ways
> >>>> it can go wrong. The mapping code has been the source of some
> >>>> problems, sometimes enclosure vendors do the wrong thing, and
> >>>> sometimes there are other bugs.
> >>>>
> >>>> Ken
> >>>>
> >>> _______________________________________________
> >>> freebsd-scsi at freebsd.org mailing list
> >>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> >>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"
> >>>