Write Timeouts with MPS

John Hickey jjh at deterlab.net
Wed Apr 18 03:41:55 UTC 2012


I have updated all the drives with the firmware provided by Seagate.  Performance is up and I don't see any timeouts when doing a zpool scrub.  I'm going to give the system more of a workout, but so far I think the drive firmware did the trick.  

SeaTools for Windows is a pain.  It lets you select a firmware file from anywhere on your system, but it silently fails if you don't put the firmware update in its program directory.  It also seems to have a hard display limit of ~13 drives.  Has anyone had success using camcontrol fwdownload with Seagate .LOD firmware files?
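
In case anyone wants to try it, the invocation I'd expect (untested on these SAS drives; da3 and the .LOD name are just placeholders, and -s should only simulate the download without writing anything) looks like:

  $ sudo camcontrol fwdownload da3 -f <firmware>.LOD -s
  $ sudo camcontrol fwdownload da3 -f <firmware>.LOD -y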

John

On Apr 12, 2012, at 1:16 PM, John Hickey wrote:

> I have a firmware update in hand for the drives.  I am going to update my drives and see if I can still reproduce this.
> 
> John
> 
> On Apr 12, 2012, at 5:26 AM, Desai, Kashyap wrote:
> 
>> We never see this issue on our test machines.
>> Adding Sreekanth; he will try to reproduce this issue locally for further analysis.
>> 
>> Please help Sreekanth to reproduce it locally.
>> 
>> 
>> ~ Kashyap
>> 
>>> -----Original Message-----
>>> From: owner-freebsd-scsi at freebsd.org [mailto:owner-freebsd-
>>> scsi at freebsd.org] On Behalf Of John Hickey
>>> Sent: Wednesday, April 11, 2012 1:06 PM
>>> To: freebsd-scsi at freebsd.org
>>> Subject: Re: Write Timeouts with MPS
>>> 
>>> I pretty much did this and filed a ticket with Seagate this afternoon.
>>> They told me the latest firmware is 0006 (I am at 0001) and wanted
>>> the serial numbers of the other drives in the array (probably to
>>> confirm firmware compatibility).  I suspect I'll have the update in
>>> hand tomorrow and will see how that works.  Running FreeBSD didn't seem
>>> to be an issue to them, aside from concern about reading the serial
>>> numbers without SeaTools.  The only issue with that was that I initially
>>> gave them the whole inquiry serial string, but only the first 8
>>> characters (the Xs below) are the serial number:
>>> 
>>>   $ sudo camcontrol inquiry da3
>>>   pass3: <SEAGATE ST2000NM0001 0001> Fixed Direct Access SCSI-6 device
>>>   pass3: Serial Number XXXXXXXX0000YYYYYYYY
>>>   pass3: 600.000MB/s transfers, Command Queueing Enabled
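>>>
>>> If all you need is the short serial for a ticket, something like this
>>> ought to do it (assuming -S really does print just the bare serial
>>> string):
>>>
>>>   $ sudo camcontrol inquiry da3 -S | cut -c 1-8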
>>> 
>>> John
>>> 
>>> On Wed, Apr 11, 2012 at 07:35:09AM +0200, Peter Maloney wrote:
>>>> Well, when I emailed some Seagate people, they just told me to use
>>>> supported ones. So I suggest you email them about it, telling them it
>>>> is on the compatibility list, and asking for an explanation and a fix
>>>> (e.g. a firmware bug fix). You could also say it is fairly common on
>>>> Seagate (and Samsung) disks, and very uncommon with other brands.
>>>> 
>>>> Peter
>>>> 
>>>> On 11.04.2012 00:26, John Hickey wrote:
>>>>> I have 19 drives in my array, so changing them isn't that easy. ;-)
>>>>> They are Seagate Constellation ES 2TB SAS drives (SEAGATE ST2000NM0001
>>>>> 0001) and according to LSI documents my whole setup should be
>>>>> supported.  The drives at least aren't being marked as failed.  I
>>>>> believe a change was made a while back to make FreeBSD less sensitive
>>>>> to these sorts of timeouts.  I have had a panic or two on the system,
>>>>> but haven't tracked down the exact cause yet.
>>>>> 
>>>>> John
>>>>> 
>>>>> On Apr 10, 2012, at 12:35 PM, Peter Maloney wrote:
>>>>> 
>>>>>> I found this only happens with specific disks / disk firmware... but
>>>>>> nobody seems to listen to me about it. They all seem to blame the
>>>>>> driver. (I blame both, but changing disks is a simple fix.)
>>>>>> 
>>>>>> And looking around, most reports are with various Seagates (including
>>>>>> one that can cause this type of error with smartctl -a with a SAS
>>>>>> Seagate, but cannot reproduce with the binary LSI driver) or Samsung
>>>>>> Spinpoints. The only other disk I know of that does this is a Crucial
>>>>>> SSD with old firmware. One guy said he can do a camcontrol rescan to
>>>>>> get it back; I tried that and got either panics, hangs, or nothing.
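>>>>>>
>>>>>> (For the record, the rescan in question is just the stock camcontrol
>>>>>> one, either for everything or for a single bus:target:lun, e.g. the
>>>>>> scbus1 target 8 disk from your devlist below:)
>>>>>>
>>>>>>   # camcontrol rescan all
>>>>>>   # camcontrol rescan 1:8:0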
>>>>>> 
>>>>>> What HBA are you using? With my LSI 9211-8i HBAs, the new 3TB Seagate
>>>>>> greens don't seem to have this problem. I have no idea if different
>>>>>> disks behave differently with different controllers. I asked Seagate
>>>>>> about it and they replied with marketing nonsense about buying
>>>>>> enterprise disks instead, saying I should buy disks that are on the
>>>>>> specific compatibility list for the HBA.
>>>>>> 
>>>>>> I found that with the few disks I have that fail randomly (and
>>>>>> others), I can reproduce the issue (though not with the exact same
>>>>>> symptoms) by hot-pulling the disk while writing something, putting it
>>>>>> back, waiting a few seconds (<10; less than enough for the SCSI
>>>>>> controller to rescan), then pulling and replacing it again. The old
>>>>>> 2TB Seagate greens fail this test, but the 3TB ones pass. All the 2
>>>>>> and 3 TB Hitachis I tried pass this test, as do the 3TB WD greens.
>>>>>> (All the enterprise disks I tried pass this test except the Toshiba
>>>>>> 2TB ones.)
>>>>>> 
>>>>>> If I put a "failed" disk back in, it does not work. If I put it in a
>>>>>> different slot, same thing. But if I put any other disk in, it works
>>>>>> fine. So it is the disk, but it is also FreeBSD not being able to
>>>>>> reset/rescan it. It is simple enough to blame both, though, and since
>>>>>> you can't get rid of the driver, get different disks (e.g. swap them
>>>>>> with some different same-sized ones in a different machine).
>>>>>> 
>>>>>> Here is my forum thread about it, including disk product IDs for the
>>>>>> ones I tested, and a huge list of things that don't fix it:
>>>>>> http://forums.freebsd.org/showthread.php?t=28252
>>>>>> 
>>>>>> Peter
>>>>>> 
>>>>>> 
>>>>>> On 10.04.2012 03:52, John Hickey wrote:
>>>>>>> I've seen people having this problem before, but I don't think anyone
>>>>>>> has figured it out.  I am running:
>>>>>>> 
>>>>>>> FreeBSD zfs 10.0-CURRENT FreeBSD 10.0-CURRENT #5: Sat Apr  7 18:05:57 PDT 2012     root at zfs:/usr/obj/usr/src/sys/GENERIC  amd64
>>>>>>> 
>>>>>>> I have the latest LSI IT firmware 13 loaded:
>>>>>>> 
>>>>>>> mps1: <LSI SAS2008> port 0xc000-0xc0ff mem 0xfe93c000-0xfe93ffff,0xfe940000-0xfe97ffff irq 16 at device 0.0 on pci5
>>>>>>> mps1: Firmware: 13.00.01.00, Driver: 13.00.00.00-fbsd
>>>>>>> mps1: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
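>>>>>>>
>>>>>>> (If the mps driver in this build exposes them, the same versions
>>>>>>> should also be readable at runtime via sysctl; I haven't verified
>>>>>>> that on 10-CURRENT:)
>>>>>>>
>>>>>>>   # sysctl dev.mps.1.firmware_version dev.mps.1.driver_version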
>>>>>>> 
>>>>>>> All disks are on a SuperMicro SAS II backplane:
>>>>>>> 
>>>>>>> root at zfs:/usr/ports/sysutils/dmidecode# camcontrol devlist
>>>>>>> <SEAGATE ST3300657SS 0008>         at scbus0 target 0 lun 0 (da0,pass0)
>>>>>>> <SEAGATE ST3300657SS 0008>         at scbus0 target 1 lun 0 (da1,pass1)
>>>>>>> <SEAGATE ST2000NM0001 0001>        at scbus1 target 8 lun 0 (da2,pass2)
>>>>>>> .... x16 more of the same
>>>>>>> <SEAGATE ST2000NM0001 0001>        at scbus1 target 46 lun 0 (da20,pass20)
>>>>>>> <LSI CORP SAS2X36 0717>            at scbus1 target 47 lun 0 (ses0,pass21)
>>>>>>> 
>>>>>>> Essentially when putting the ZFS filesystem under load, I am getting
>>>>>>> these sorts of errors:
>>>>>>> 
>>>>>>> (da13:mps1:0:21:0): WRITE(10). CDB: 2a 0 19 29 32 f2 0 1 0 0 length 131072 SMID 213 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da7:mps1:0:13:0): WRITE(10). CDB: 2a 0 19 3d fa ae 0 1 0 0 length 131072 SMID 386 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da11:mps1:0:18:0): WRITE(10). CDB: 2a 0 18 a 24 ee 0 1 0 0 length 131072 SMID 542 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da14:mps1:0:22:0): WRITE(10). CDB: 2a 0 19 2a c6 b1 0 1 0 0 length 131072 SMID 214 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da16:mps1:0:25:0): WRITE(10). CDB: 2a 0 19 2b 83 aa 0 1 0 0 length 131072 SMID 879 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da7:mps1:0:13:0): WRITE(10). CDB: 2a 0 19 40 d f9 0 1 0 0 length 131072 SMID 474 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da9:mps1:0:15:0): WRITE(10). CDB: 2a 0 18 c 3 31 0 1 0 0 length 131072 SMID 578 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da4:mps1:0:10:0): WRITE(10). CDB: 2a 0 19 41 6f ff 0 1 0 0 length 131072 SMID 703 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da12:mps1:0:19:0): WRITE(10). CDB: 2a 0 18 c e5 2e 0 1 0 0 length 131072 SMID 684 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da3:mps1:0:9:0): WRITE(10). CDB: 2a 0 19 41 b1 4b 0 1 0 0 length 131072 SMID 212 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da9:mps1:0:15:0): WRITE(10). CDB: 2a 0 18 d 1e 5c 0 1 0 0 length 131072 SMID 63 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da11:mps1:0:18:0): WRITE(10). CDB: 2a 0 18 d 56 1c 0 1 0 0 length 131072 SMID 412 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da4:mps1:0:10:0): WRITE(10). CDB: 2a 0 19 42 2c f1 0 1 0 0 length 131072 SMID 1019 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da11:mps1:0:18:0): WRITE(10). CDB: 2a 0 18 d 6d 22 0 1 0 0 length 131072 SMID 175 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da7:mps1:0:13:0): WRITE(10). CDB: 2a 0 19 42 62 bc 0 1 0 0 length 131072 SMID 458 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da10:mps1:0:16:0): WRITE(10). CDB: 2a 0 18 f 4b d2 0 1 0 0 length 131072 SMID 986 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da3:mps1:0:9:0): WRITE(10). CDB: 2a 0 19 43 f4 50 0 1 0 0 length 131072 SMID 809 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da2:mps1:0:8:0): WRITE(10). CDB: 2a 0 19 45 4 18 0 1 0 0 length 131072 SMID 998 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da13:mps1:0:21:0): WRITE(10). CDB: 2a 0 19 30 e4 73 0 1 0 0 length 131072 SMID 489 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da12:mps1:0:19:0): WRITE(10). CDB: 2a 0 18 10 8d 19 0 1 0 0 length 131072 SMID 275 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da14:mps1:0:22:0): WRITE(10). CDB: 2a 0 19 32 e7 0 0 1 0 0 length 131072 SMID 666 terminated ioc 804b scsi 0 state c xfer 0
>>>>>>> (da8:mps1:0:14:0): WRITE(10). CDB: 2a 0 18 13 2b 68 0 1 0 0 length 131072 SMID 463 terminated ioc 804b scsi 0 state c xfer 0