Avago LSI SAS 3008 & Intel SSD Timeouts
Steven Hartland
killing at multiplay.co.uk
Fri Jun 3 21:49:33 UTC 2016
The first thing would be to run gstat with -d to see if you're actually
stacking up deletes; a symptom of that can be r/s and w/s dropping to zero.
If you are seeing significant delete activity it could be a FW issue on the drives.
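For example:

# gstat -d

adds d/s and ms/d columns for BIO_DELETE activity; if those climb while
r/s and w/s sit at zero, the drives are stalling on deletes (TRIM).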
If you can eliminate that, you could still be looking at a cabling /
backplane issue. SSDs stress the signalling much more than HDDs, and on
several occasions we've had to replace a backplane more than once before
the problem was fixed.
The other thing to check is the firmware version on the controller and on
the drives: are you up to date?
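Something along these lines should show the versions (assuming the mpr
driver exposes firmware_version the same way mps does, and with
sysutils/smartmontools installed for the drive side):

# sysctl dev.mpr.0.firmware_version
# camcontrol inquiry da3
# smartctl -i /dev/da3

camcontrol inquiry prints the revision the drive reports; smartctl -i
(possibly with -d sat) shows the full firmware version string.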
On 03/06/2016 21:36, list-news wrote:
> Greetings!
>
> For the last month I've been fighting with SAS 3008 issues on a
> SuperMicro 2028TP-DECR (2-node Twin server). The system is running
> 10.3-RELEASE #1 r298749
>
> The errors we see on this system look very similar to the errors
> posted here back in March:
> https://lists.freebsd.org/pipermail/freebsd-scsi/2016-March/007000.html
>
> The system hosts a large postgresql database and an application which
> spends most of its days calculating log entries on a couple hundred GB
> of tables. Initially this system had 2 mirrored WD HDD boot drives
> and 8 Intel 730 240GB SSDs in a ZFS pool with 4 mirrored vdevs.
> Shortly after install/setup, a couple of those Intel 730s began
> showing READ and WRITE errors in zpool status; additionally, mpr0
> messages were reporting drive timeouts on READ(10), WRITE(10) and
> SYNCHRONIZE CACHE(10) commands, such as this:
>
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
> 5a 18 c0 00 00 08 00 length 4096 SMID 240 command timeout cm
> 0xfffffe0000cefb00 ccb 0xfffff805dfeca800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: timedout cm 0xfffffe0000cefb00
> allocated tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 15
> Aborting command 0xfffffe0000cefb00
> Jun 2 19:48:42 s18 kernel: mpr0: Sending reset from mprsas_send_abort
> for target ID 16
> [...snipped about 50 more lines for brevity]
>
> The immediate thought was that the drives were faulty, so we
> decided to replace some of the drives with Intel 480GB DC3610's.
> After replacing 4 of the drives we noticed the 3610's were also
> throwing errors. Read/write error counts shown by zpool status
> would climb into the thousands in a night of calculations, with
> dozens of mpr errors in the logs. This was happening to all of the
> drives in the array; some were more likely than others, but given
> enough time all drives would display the errors. This raised the
> suspicion that the problem was backplane and/or controller related
> and the drives were fine (or possibly the drive firmware is the same
> in the 730's and 3610's and both exhibit this problem, but the
> backplane did seem more likely at the time).
>
> We moved the Node A system sled (cpu/mem/sas/mb) into the Node B
> chassis slot to see if the errors would change at all (thinking maybe
> some SATA ports were bad on the backplane). We hit a bit of luck and
> noticed that the boot drives, when inserted into Node B drive sleds
> 1 & 2, wouldn't fire up and were undetectable by the SAS controller.
> SuperMicro concluded this was likely a backplane issue, and the
> backplane was replaced. With the new backplane in, the drives in Node
> B sleds 1 and 2 are now visible to the controller during boot,
> confirming the backplane did have a problem. (This is worth
> mentioning, as a commonly suggested fix on the internet for similar
> timeout errors is to replace the backplane and/or cables.)
>
> Unfortunately, the luck ran out: once we put the system back under
> load, the errors came back. I rebuilt the pool using just four 3610's
> as 2x mirrored vdevs and copied all the data over. Here is the pool as
> of yesterday:
>
> pool: zp1
> state: ONLINE
> scan: resilvered 107M in 0h0m with 0 errors on Thu Jun 2 14:52:43 2016
> config:
>
> NAME          STATE     READ WRITE CKSUM
> zp1           ONLINE       0     0     0
>   mirror-0    ONLINE       0     0     0
>     da3       ONLINE       0     0     0
>     da5       ONLINE       0     0     0
>   mirror-1    ONLINE       0     0     0
>     da7       ONLINE       0     0     0
>     da8       ONLINE       0     0     0
>
> When performing logging calculations, the system looks like this
> pretty much continuously (via gstat):
>
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
> 7 8175 1374 7526 0.5 3000 48931 0.5 65.2| da3
> 6 8337 1542 8025 0.5 2995 48947 0.5 68.1| da5
> 2 8740 1529 8584 0.5 3138 52564 0.4 72.6| da7
> 3 8637 1427 7961 0.5 3136 52564 0.4 65.6| da8
>
> And zpool iostat:
> zp1          414G   474G  10.9K  8.94K  50.2M  51.8M
>   mirror     207G   237G  5.44K  4.56K  25.0M  26.2M
>     da3         -      -  2.50K  2.37K  12.3M  26.2M
>     da5         -      -  2.65K  2.37K  13.3M  26.1M
>   mirror     207G   237G  5.42K  4.39K  25.3M  25.6M
>     da7         -      -  2.62K  2.40K  13.2M  25.6M
>     da8         -      -  2.57K  2.40K  12.5M  25.6M
>
> After a couple of minutes (usually between 2 and 10) with drive %busy
> and CPU load around 70%, the system hangs.
>
> During the hang, one of two scenarios plays out (possibly more, but
> these are the ones officially witnessed):
>
> Scenario 1) Processes continue to function as long as they don't touch
> the storage. One drive immediately drops to 0 r/s and 0 w/s, the rest
> of the drives quickly fall to 0 as well (maybe within 1 - 5 seconds),
> and the drive lights on the chassis go dark. Any commands I enter in
> the terminal hang after the newline.
>
> During Scenario 1 - gstat output (notice the L(q) column for da7):
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
> 0 0 0 0 0.0 0 0 0.0 0.0| da3
> 0 0 0 0 0.0 0 0 0.0 0.0| da5
> 8 0 0 0 0.0 0 0 0.0 0.0| da7
> 0 0 0 0 0.0 0 0 0.0 0.0| da8
>
>
> Scenario 2) Some processes continue to function in a read-only state,
> but writes to all drives fall to 0.
>
> During Scenario 2 - gstat output (again notice the L(q) column for da3):
> dT: 1.002s w: 1.000s
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
> 69 0 0 0 0.0 0 0 0.0 0.0| da3
> 0 3207 3207 45727 0.2 0 0 0.0 37.8| da5
> 0 2753 2753 24673 0.2 0 0 0.0 23.1| da7
> 0 3027 3027 26425 0.2 0 0 0.0 28.6| da8
>
> About 60 seconds later the system becomes responsive again (I assume
> kern.cam.da.default_timeout is the 60 second timer, although I haven't
> changed it to confirm), and a bunch of error logs appear for the
> drive that had a non-zero L(q) column (da7 in scenario 1 last
> night, da3 in scenario 2 today). Any commands I typed in the terminal
> finally fire off, and CPU and disk usage quickly climb back to
> 70-80%+. If the errors are bad enough (or possibly if two drives hang
> simultaneously?) read and/or write errors show up in zpool status.
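> (The current value can be checked with:
> # sysctl kern.cam.da.default_timeout
> which should report 60 if it hasn't been changed.)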
>
> Failed attempts to repair: replaced the backplane. Ran full memory
> tests. Upgraded the 3610's to the latest firmware. Used Node B in
> place of Node A. Flashed the controller with the latest IT firmware
> from SuperMicro. Swapped power supplies. Pulled one CPU and half the
> memory off Node A, put that into Node B and tried the tests again.
> Set kern.cam.da.X.delete_method=DISABLE (see below). Probably tuned a
> couple dozen other CAM and ZFS sysctl/loader variables in testing.
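> (For reference, the delete method was disabled per drive with
> commands along the lines of:
> # sysctl kern.cam.da.3.delete_method=DISABLE
> and the same for da5, da7 and da8.)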
>
> Nothing stopped these errors.
>
> Until I ran this:
> # camcontrol tags daX -N 1
> (where X is 3,5,7 and 8)
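> That is, something like:
> # for d in da3 da5 da7 da8; do camcontrol tags $d -N 1; done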
>
> With tags set to 1 for each device, all errors have ceased for the
> last 18 hours (the system ran all night at load). I've not yet spent
> the time to determine how much that affects performance, but it is
> definitely working.
>
> To reproduce this bug, today I ran:
> # camcontrol tags daX -N 8
>
> on all drives. A few minutes later, one drive hung (see the gstat
> output for da3 above in scenario 2, which I had copied just seconds
> before the errors that follow):
>
> Jun 3 13:34:10 s18 kernel: (da3:mpr0:0:12:0): READ(10). CDB: 28 00 09
> cc 97 30 00 00 08 00 length 4096 SMID 961 command timeout cm
> 0xfffffe0000d2ad50 ccb 0xfffff807bfcef800 target 12, handle(0x000d)
> Jun 3 13:34:10 s18 kernel: mpr0: At enclosure level 0, slot 4,
> connector name ( )
> Jun 3 13:34:10 s18 kernel: mpr0: timedout cm 0xfffffe0000d2ad50
> allocated tm 0xfffffe0000cdc150
> Jun 3 13:34:10 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1
> Aborting command 0xfffffe0000d2ad50
> Jun 3 13:34:10 s18 kernel: mpr0: Sending reset from mprsas_send_abort
> for target ID 12
> [...snipped about 50 more lines for brevity]
>
> With 1 tag, gstat still shows L(q) values that jump into the 70's at
> times, but from what I understand of the source, devstat counts
> queued operations separately from camcontrol's simultaneous
> dev_openings. So I assume that's nothing to be concerned about, but I
> figured I'd mention it.
>
> # camcontrol tags da7 -v
>
> (pass7:mpr0:0:16:0): dev_openings 0
> (pass7:mpr0:0:16:0): dev_active 1
> (pass7:mpr0:0:16:0): allocated 1
> (pass7:mpr0:0:16:0): queued 0
> (pass7:mpr0:0:16:0): held 0
> (pass7:mpr0:0:16:0): mintags 2
> (pass7:mpr0:0:16:0): maxtags 255
>
> Here is a full error message from last night on da7 (the gstat output
> for scenario 1 above was copied right before these lines). Apologies
> for the verbosity, but it may be helpful to some; this was logged with
> dev.mpr.0.debug_level=0x037F:
>
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
> 5a 18 c0 00 00 08 00 length 4096 SMID 240 command timeout cm
> 0xfffffe0000cefb00 ccb 0xfffff805dfeca800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: timedout cm 0xfffffe0000cefb00
> allocated tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 15
> Aborting command 0xfffffe0000cefb00
> Jun 2 19:48:42 s18 kernel: mpr0: Sending reset from mprsas_send_abort
> for target ID 16
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14
> 24 35 e0 00 00 08 00 length 4096 SMID 562 command timeout cm
> 0xfffffe0000d0a1a0 ccb 0xfffff80daeb84000 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm
> 0xfffffe0000d0a1a0 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 command timeout
> cm 0xfffffe0000d1e720 ccb 0xfffff80628c3d800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm
> 0xfffffe0000d1e720 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
> cd fc 98 00 00 08 00 length 4096 SMID 571 command timeout cm
> 0xfffffe0000d0ad70 ccb 0xfffff8018a54b800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm
> 0xfffffe0000d0ad70 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
> cd fc 90 00 00 08 00 length 4096 SMID 540 command timeout cm
> 0xfffffe0000d084c0 ccb 0xfffff80288e58000 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm
> 0xfffffe0000d084c0 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
> cd fc 88 00 00 08 00 length 4096 SMID 817 command timeout cm
> 0xfffffe0000d1f050 ccb 0xfffff8004c3ef800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm
> 0xfffffe0000d1f050 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: mpr0: EventReply :
> Jun 2 19:48:42 s18 kernel: EventDataLength: 2
> Jun 2 19:48:42 s18 kernel: AckRequired: 0
> Jun 2 19:48:42 s18 kernel: Event: SasDiscovery (0x16)
> Jun 2 19:48:42 s18 kernel: EventContext: 0x0
> Jun 2 19:48:42 s18 kernel: Flags: 1<InProgress>
> Jun 2 19:48:42 s18 kernel: ReasonCode: Discovery Started
> Jun 2 19:48:42 s18 kernel: PhysicalPort: 0
> Jun 2 19:48:42 s18 kernel: DiscoveryStatus: 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00
> 11 8d fe 50 00 00 10 00 length 8192 SMID 186 completed cm
> 0xfffffe0000ceb420 ccb 0xfffff80cd1d88800 during recovery ioc 804b
> scsi 0 state c xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00
> 11 8d fe 50 00 00 10 00 length 8192 SMID 186 terminated ioc 804b scsi
> 0 state c xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
> cd fc 98 00 00 08 00 length 4096 SMID 571 completed timedout cm
> 0xfffffe0000d0ad70 ccb 0xfffff8018a54b800 during recovery ioc 804b
> scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc
> 98 00 00 08 00 length 4096 SMID 571 terminated ioc 804b scsi 0 state c
> xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
> cd fc 90 00 00 08 00 length 4096 SMID 540 completed timedout cm
> 0xfffffe0000d084c0 ccb 0xfffff80288e58000 during recovery ioc 804b
> scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc
> 90 00 00 08 00 length 4096 SMID 540 terminated ioc 804b scsi 0 state c
> xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
> cd fc 88 00 00 08 00 length 4096 SMID 817 completed timedout cm
> 0xfffffe0000d1f050 ccb 0xfffff8004c3ef800 during recovery ioc 804b
> scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc
> 88 00 00 08 00 length 4096 SMID 817 terminated ioc 804b scsi 0 state c
> xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
> 5a 18 c0 00 00 08 00 length 4096 SMID 240 completed timedout cm
> 0xfffffe0000cefb00 ccb 0xfffff805dfeca800 during recovery ioc 8048
> scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35
> e0 00 00 08 00 length 4096 SMID 562 completed timedout cm
> 0xfffffe0000d0a1a0 ccb 0xfffff80daeb84000 during recovery ioc 804b
> scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35
> e0 00 00 08 00 length 4096 SMID 562 terminated ioc 804b scsi 0 state c
> xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 completed
> timedout cm 0xfffffe0000d1e720 ccb 0xfffff80628c3d800 during recovery
> ioc 804b scsi 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 terminated ioc
> 804b scsi 0 state c xfer 0
> Jun 2 19:48:42 s18 kernel: (xpt0:mpr0:0:16:0): SMID 15 abort TaskMID
> 240 status 0x0 code 0x0 count 7
> Jun 2 19:48:42 s18 kernel: (xpt0:mpr0:0:16:0): SMID 15 finished
> recovery after aborting TaskMID 240
> Jun 2 19:48:42 s18 kernel: mpr0: Unfreezing devq for target ID 16
> Jun 2 19:48:42 s18 kernel: mpr0: EventReply :
> Jun 2 19:48:42 s18 kernel: EventDataLength: 4
> Jun 2 19:48:42 s18 kernel: AckRequired: 0
> Jun 2 19:48:42 s18 kernel: Event: SasTopologyChangeList (0x1c)
> Jun 2 19:48:42 s18 kernel: EventContext: 0x0
> Jun 2 19:48:42 s18 kernel: EnclosureHandle: 0x2
> Jun 2 19:48:42 s18 kernel: ExpanderDevHandle: 0x9
> Jun 2 19:48:42 s18 kernel: NumPhys: 31
> Jun 2 19:48:42 s18 kernel: NumEntries: 1
> Jun 2 19:48:42 s18 kernel: StartPhyNum: 8
> Jun 2 19:48:42 s18 kernel: ExpStatus: Responding (0x3)
> Jun 2 19:48:42 s18 kernel: PhysicalPort: 0
> Jun 2 19:48:42 s18 kernel: PHY[8].AttachedDevHandle: 0x0011
> Jun 2 19:48:42 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
> Jun 2 19:48:42 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
> Jun 2 19:48:42 s18 kernel: mpr0: (73)->(mprsas_fw_work) Working on
> Event: [16]
> Jun 2 19:48:42 s18 kernel: mpr0: (74)->(mprsas_fw_work) Event Free: [16]
> Jun 2 19:48:42 s18 kernel: mpr0: (74)->(mprsas_fw_work) Working on
> Event: [1c]
> Jun 2 19:48:42 s18 kernel: mpr0: (75)->(mprsas_fw_work) Event Free: [1c]
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
> 5a 18 c0 00 00 08 00
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): CAM status: Command
> timeout
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): Retrying command
> Jun 2 19:48:43 s18 kernel: mpr0: EventReply :
> Jun 2 19:48:43 s18 kernel: EventDataLength: 2
> Jun 2 19:48:43 s18 kernel: AckRequired: 0
> Jun 2 19:48:43 s18 kernel: Event: SasDiscovery (0x16)
> Jun 2 19:48:43 s18 kernel: EventContext: 0x0
> Jun 2 19:48:43 s18 kernel: Flags: 0
> Jun 2 19:48:43 s18 kernel: ReasonCode: Discovery Complete
> Jun 2 19:48:43 s18 kernel: PhysicalPort: 0
> Jun 2 19:48:43 s18 kernel: DiscoveryStatus: 0
> Jun 2 19:48:43 s18 kernel: mpr0: (75)->(mprsas_fw_work) Working on
> Event: [16]
> Jun 2 19:48:43 s18 kernel: mpr0: (76)->(mprsas_fw_work) Event Free: [16]
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00
> 11 8d fe 50 00 00 10 00
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): CAM status: SCSI Status
> Error
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): SCSI status: Check
> Condition
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): SCSI sense: UNIT
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): Retrying command (per
> sense data)
>
> I'm not sure if this is an Intel, CAM, or Avago issue (or something
> else entirely), but I'd be happy to help figure it out if anyone has
> any suggestions for me to test.
>
> Thanks!
>
> -Kyle
>