Avago LSI SAS 3008 & Intel SSD Timeouts

Steven Hartland killing at multiplay.co.uk
Mon Oct 24 20:46:40 UTC 2016


Thanks for the update, good to know!

On 24/10/2016 19:39, list-news wrote:
> Someone on the list contacted me directly and asked if I had ever come 
> to a conclusion on the SAS 3008 issues I was finding back in June.  
> Firstly, my apologies to everyone, I had forgotten to get back to the 
> list on my findings.
>
> Eventually I ended up cross-flashing the card with LSI 9300-8i 
> firmware version 12.  Cross-flashing obviously wasn't supported by 
> SuperMicro, but I was out of options and they didn't have a v12 
> firmware available directly.
>
> This v12 firmware ran stably, and the drives stopped 
> disconnecting (and, subsequently, zfs no longer showed errors 
> after 24+ hours at 100% io).  I posted my findings to SuperMicro, 
> and they quickly released a v12 firmware for the SAS 3008 HBA, 
> which as I understand it is now publicly available on their site.
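For anyone auditing their own systems, the running firmware level can be
read back through the driver.  A minimal sketch, assuming the controller
attached as mpr0 and reports a dotted version string like the ones seen
elsewhere in this thread:

```shell
#!/bin/sh
# Hedged sketch: warn if a SAS3008 HBA reports firmware older than v12.
# Assumes the controller attached as mpr0 and that the driver reports a
# dotted version string such as "12.00.02.00"; adjust the index if your
# card attached as a different unit.

check_sas3008_fw() {
    fw="$1"
    major=${fw%%.*}        # first dotted field = major firmware version
    if [ "$major" -lt 12 ]; then
        echo "WARNING: firmware $fw predates v12 - drives may time out"
        return 1
    fi
    echo "firmware $fw is v12 or later"
}

# On a live system, feed it the driver's report:
#   check_sas3008_fw "$(sysctl -n dev.mpr.0.firmware_version)"
check_sas3008_fw "12.00.02.00"
```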
>
> The story is long and painful, but in the end the answer is to 
> make absolutely sure you are running v12 firmware on the SAS 3008 
> cards.  If you are not running v12 (or later) firmware on the SAS 
> 3008, your data is at risk! (at least with Intel drives, and 
> probably others as well)
>
> Thanks again for all the help/ideas from everyone.  It's been a few 
> months now with zero errors reported!
>
> -Kyle
>
>
> On 6/7/16 6:39 PM, David Gwynne wrote:
>>> On 8 Jun 2016, at 09:30, Steven Hartland <killing at multiplay.co.uk> 
>>> wrote:
>>>
>>> Oh, another thing to test: IIRC the 3008 is supported by mrsas, so 
>>> you might want to try adding the following to loader.conf to switch 
>>> drivers:
>>> hw.mfi.mrsas_enable="1"
>> i believe the 3008s can run two different firmwares: one that 
>> provides the mpt2 interface and another that provides the megaraid 
>> sas fusion interface. you have to flash them to switch though, you 
>> can't just point a driver at it and hope for the best.
>>
>> each fw presents different pci ids. eg, in 
>> http://pciids.sourceforge.net/v2.2/pci.ids you can see:
>>
>>     005f  MegaRAID SAS-3 3008 [Fury]
>>     0097  SAS3008 PCI-Express Fusion-MPT SAS-3
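Checking which personality a given card is running follows directly from
those IDs.  A minimal sketch, assuming you pull the device ID out of
`pciconf -lv` output by hand; the id-to-driver mapping below is my
reading of the pci.ids entries above, not an official table:

```shell
#!/bin/sh
# Hedged sketch: map a 3008's PCI device ID to the firmware personality
# and the FreeBSD driver that attaches to it, per the pci.ids entries
# quoted above.  Illustrative helper only; on a live system you would
# read the ID from "pciconf -lv" output.

sas3008_personality() {
    case "$1" in
        0x005f) echo "MegaRAID SAS-3 3008 [Fury]: IR/MegaRAID firmware, mrsas(4) attaches" ;;
        0x0097) echo "SAS3008 Fusion-MPT SAS-3: IT firmware, mpr(4) attaches" ;;
        *)      echo "unknown device id $1"; return 1 ;;
    esac
}

sas3008_personality 0x0097
```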
>>
>> dlg
>>
>>> On 07/06/2016 23:43, list-news wrote:
>>>> No, it threw errors on both da6 and da7 and then I stopped it.
>>>>
>>>> Your last e-mail gave me thoughts though.  I have a server with 
>>>> 2008 controllers (entirely different backplane design, cpu, memory, 
>>>> etc).  I've moved the 4 drives to that and I'm running the test now.
>>>>
>>>> # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215
>>>> # sysctl dev.mps.0
>>>> dev.mps.0.spinup_wait_time: 3
>>>> dev.mps.0.chain_alloc_fail: 0
>>>> dev.mps.0.enable_ssu: 1
>>>> dev.mps.0.max_chains: 2048
>>>> dev.mps.0.chain_free_lowwater: 1176
>>>> dev.mps.0.chain_free: 2048
>>>> dev.mps.0.io_cmds_highwater: 510
>>>> dev.mps.0.io_cmds_active: 0
>>>> dev.mps.0.driver_version: 20.00.00.00-fbsd
>>>> dev.mps.0.firmware_version: 17.00.01.00
>>>> dev.mps.0.disable_msi: 0
>>>> dev.mps.0.disable_msix: 0
>>>> dev.mps.0.debug_level: 3
>>>> dev.mps.0.%parent: pci5
>>>> dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 
>>>> subdevice=0x3020 class=0x010700
>>>> dev.mps.0.%location: slot=0 function=0
>>>> dev.mps.0.%driver: mps
>>>> dev.mps.0.%desc: Avago Technologies (LSI) SAS2008
>>>>
>>>> About 1.5 hours has passed at full load, no errors.
>>>>
>>>> gstat drive busy% seems to hang out around 30-40 instead of 
>>>> ~60-70.  Overall throughput seems to be 20-30% less with my rough 
>>>> benchmarks.
>>>>
>>>> I'm not sure if this gets us closer to the answer, but if this 
>>>> doesn't time out on the 2008 controller, it looks like one of these:
>>>> 1) The Intel drive firmware is being overloaded somehow when 
>>>> connected to the 3008.
>>>> or
>>>> 2) The 3008 firmware or driver has an issue reading drive 
>>>> responses, sporadically thinking a command timed out (when maybe 
>>>> it really didn't).
>>>>
>>>> Puzzle pieces:
>>>> A) Why does setting tags to 1 on drives connected to the 3008 fix 
>>>> the problem?
>>>> B) With tags at 255, why does postgres (presumably with a large 
>>>> fsync count) seem to cause the problem within minutes, while other 
>>>> heavy i/o commands (zpool scrub, bonnie++, fio), all of which show 
>>>> similarly high or higher iops, take hours to cause the problem 
>>>> (if ever)?
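Puzzle piece A is at least easy to reproduce on demand, since the tag
depth is runtime-tunable with camcontrol(8).  A minimal sketch, assuming
the affected drives are da6 and da7 (the devices named earlier in the
thread); the dry-run wrapper is mine, added so the commands can be
previewed before touching live disks:

```shell
#!/bin/sh
# Hedged sketch: the tags-of-1 workaround from the thread, expressed as
# a loop over the affected drives.  camcontrol(8) is FreeBSD-specific,
# so the calls sit behind a dry-run switch; set DRYRUN=0 to apply.

DRYRUN=${DRYRUN:-1}

run() {
    if [ "$DRYRUN" -eq 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

for dev in da6 da7; do
    run camcontrol tags "$dev" -v       # show the current tag depth
    run camcontrol tags "$dev" -N 1     # clamp to one outstanding command
done
```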
>>>>
>>>> I'll let this continue to run to further test.
>>>>
>>>> Thanks again for all the help.
>>>>
>>>> -Kyle
>>>>
>>>> On 6/7/16 4:22 PM, Steven Hartland wrote:
>>>>> Always da6?
>>>>>
>>>>> On 07/06/2016 21:19, list-news wrote:
>>>>>> Sure Steve:
>>>>>>
>>>>>> # cat /boot/loader.conf | grep trim
>>>>>> vfs.zfs.trim.enabled=0
>>>>>>
>>>>>> # sysctl vfs.zfs.trim.enabled
>>>>>> vfs.zfs.trim.enabled: 0
>>>>>>
>>>>>> # uptime
>>>>>> 3:14PM  up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07
>>>>>>
>>>>>> # tail -f /var/log/messages:
>>>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 
>>>>>> 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout 
>>>>>> cm 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, 
>>>>>> handle(0x0010)
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
>>>>>> connector name (    )
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 
>>>>>> allocated tm 0xfffffe0001322150
>>>>>> Jun  7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 
>>>>>> 1 Aborting command 0xfffffe0001375580
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: Sending reset from 
>>>>>> mprsas_send_abort for target ID 16
>>>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE 
>>>>>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 
>>>>>> command timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 
>>>>>> target 16, handle(0x0010)
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
>>>>>> connector name (    )
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: queued timedout cm 
>>>>>> 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>>>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>>>>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>>>>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>>>>> Jun  7 15:13:50 s18 kernel: Flags: 1<InProgress>
>>>>>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Started
>>>>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>>>>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 
>>>>>> 00 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm 
>>>>>> 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 
>>>>>> 804b scsi 0 state c xfer 0
>>>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 
>>>>>> 00 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 
>>>>>> 804b scsi 0 state c xfer 0
>>>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 
>>>>>> 00 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm 
>>>>>> 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 
>>>>>> 804b scsi 0 state c xfer 0
>>>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 
>>>>>> 00 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 
>>>>>> 804b scsi 0 state c xfer 0
>>>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 
>>>>>> 00 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm 
>>>>>> 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 
>>>>>> 804b scsi 0 state c xfer 0
>>>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 
>>>>>> 00 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 
>>>>>> 804b scsi 0 state c xfer 0
>>>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 
>>>>>> 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed 
>>>>>> timedout cm 0xfffffe0001375580 ccb 0xfffff8039895f800 during 
>>>>>> recovery ioc 8048 scsi 0 state c    (da6:mpr0:0:16:0): 
>>>>>> SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 
>>>>>> 0 SMID 786 completed timedout cm 0xfffffe(da6:mpr0:0:16:0): 
>>>>>> WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00
>>>>>> Jun  7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 
>>>>>> during recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: 
>>>>>> Command timeout
>>>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE 
>>>>>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 
>>>>>> terminated ioc 804b scsi 0 sta(da6:te c xfer 0
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 
>>>>>> abort TaskMID 1016 status 0x0 code 0x0 count 5
>>>>>> Jun  7 15:13:50 s18 kernel: 16:    (xpt0:mpr0:0:16:0): SMID 1 
>>>>>> finished recovery after aborting TaskMID 1016
>>>>>> Jun  7 15:13:50 s18 kernel: 0): mpr0: Retrying command
>>>>>> Jun  7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>>>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 4
>>>>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>>>>> Jun  7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
>>>>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>>>>> Jun  7 15:13:50 s18 kernel: EnclosureHandle: 0x2
>>>>>> Jun  7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
>>>>>> Jun  7 15:13:50 s18 kernel: NumPhys: 31
>>>>>> Jun  7 15:13:50 s18 kernel: NumEntries: 1
>>>>>> Jun  7 15:13:50 s18 kernel: StartPhyNum: 8
>>>>>> Jun  7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
>>>>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>>>>> Jun  7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
>>>>>> Jun  7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
>>>>>> Jun  7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working 
>>>>>> on  Event: [16]
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event 
>>>>>> Free: [16]
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working 
>>>>>> on  Event: [1c]
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event 
>>>>>> Free: [1c]
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>>>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>>>>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>>>>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>>>>> Jun  7 15:13:50 s18 kernel: Flags: 0
>>>>>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
>>>>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>>>>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working 
>>>>>> on  Event: [16]
>>>>>> Jun  7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event 
>>>>>> Free: [16]
>>>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE 
>>>>>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
>>>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI 
>>>>>> Status Error
>>>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check 
>>>>>> Condition
>>>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT 
>>>>>> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command 
>>>>>> (per sense data)
>>>>>>
>>>>>> -Kyle
>>>>>>
>>>>>> On 6/7/16 2:53 PM, Steven Hartland wrote:
>>>>>>> CDB: 85 is a TRIM command IIRC.  I know you tried it before 
>>>>>>> using BIO delete, but assuming you're running ZFS, can you set 
>>>>>>> the following in loader.conf and see how you get on?
>>>>>>> vfs.zfs.trim.enabled=0
>>>>>>>
>>>>>>>     Regards
>>>>>>>     Steve
>>>>>>
>>>>>> _______________________________________________
>>>>>> freebsd-scsi at freebsd.org mailing list
>>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>>>>> To unsubscribe, send any mail to 
>>>>>> "freebsd-scsi-unsubscribe at freebsd.org"
>>>>
>
>


