Avago LSI SAS 3008 & Intel SSD Timeouts
Steven Hartland
killing at multiplay.co.uk
Fri Jun 3 21:49:33 UTC 2016
The first thing would be to run gstat with -d to see if you're actually
stacking up deletes; a symptom of that can be r/s and w/s dropping to zero.
If you are seeing significant delete activity it could be a FW issue on the drives.
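For example:

# gstat -d

adds d/s and ms/d columns for BIO_DELETE activity; if those climb while
r/s and w/s sit at zero, the drives are stalling on deletes (TRIM).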
If you can eliminate that, you could still be looking at a cabling /
backplane issue. SSDs stress the signalling much more than HDDs, and on
several occasions we've had to replace a backplane more than once before
the problem was fixed.
The other thing to check is the firmware version on the controller and on
the drives: are you up to date?
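Something along these lines should show the versions (assuming the mpr
driver exposes firmware_version the same way mps does, and with
sysutils/smartmontools installed for the drive side):

# sysctl dev.mpr.0.firmware_version
# camcontrol inquiry da3
# smartctl -i /dev/da3

camcontrol inquiry prints the revision the drive reports; smartctl -i
(possibly with -d sat) shows the full firmware version string.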
On 03/06/2016 21:36, list-news wrote:
> Greetings!
>
> For the last month I've been fighting with SAS 3008 issues on a
> SuperMicro 2028TP-DECR (2-node Twin server). The system is running
> 10.3-RELEASE #1 r298749
>
> The errors we see on this system look very similar to the errors
> posted here back in March:
> https://lists.freebsd.org/pipermail/freebsd-scsi/2016-March/007000.html
>
> The system hosts a large postgresql database and an application which
> spends most of its days calculating log entries on a couple hundred GB
> of tables. Initially this system had 2 mirrored WD HDD boot drives
> and 8 Intel 730 240GB SSDs in a ZFS pool with 4 mirrored vdevs.
> Shortly after install/setup, a couple of those Intel 730s began
> showing READ and WRITE errors in zpool status; additionally, mpr0
> messages were reporting drive timeouts on READ(10), WRITE(10) and
> SYNCHRONIZE CACHE(10) commands, such as this:
>
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
> 5a 18 c0 00 00 08 00 length 4096 SMID 240 command timeout cm
> 0xfffffe0000cefb00 ccb 0xfffff805dfeca800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: timedout cm 0xfffffe0000cefb00
> allocated tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 15
> Aborting command 0xfffffe0000cefb00
> Jun 2 19:48:42 s18 kernel: mpr0: Sending reset from mprsas_send_abort
> for target ID 16
> [...snipped about 50 more lines for brevity]
>
> The immediate thought was that the drives were faulty, so we
> decided to replace some of the drives with Intel 480GB DC3610's.
> After replacing 4 of the drives we noticed the 3610's were also
> throwing errors. Read/write error counts shown by zpool status
> would climb into the thousands in a night of calculations, with
> dozens of mpr errors in the logs. This was happening to all of the
> drives in the array; some were more likely than others, but given
> enough time all drives would display the errors. This raised the
> suspicion that the problem was backplane and/or controller related
> and the drives were fine (or possibly the drive firmware is the same
> in the 730's and 3610's and both exhibit this problem, but the
> backplane did seem more likely at the time).
>
> We moved the Node A system sled (cpu/mem/sas/mb) into the Node B
> chassis slot to see if the errors would change at all (thinking maybe
> some SATA ports were bad on the backplane). We hit a bit of luck and
> noticed that the boot drives, when inserted into Node B drive sleds
> 1 & 2, wouldn't fire up and were undetectable by the SAS controller.
> SuperMicro concluded this was likely a backplane issue, and the
> backplane was replaced. With the new backplane in, the drives in Node
> B sleds 1 and 2 are now visible to the controller during boot,
> confirming the backplane did have a problem. (This is worth
> mentioning, as a commonly suggested fix on the internet for similar
> timeout errors is to replace the backplane and/or cables.)
>
> Unfortunately, the luck ran out: once we put the system back under
> load, the errors came back. I rebuilt the pool using just four 3610's
> as 2x mirrored vdevs and copied all the data over. Here is the pool as
> of yesterday:
>
> pool: zp1
> state: ONLINE
> scan: resilvered 107M in 0h0m with 0 errors on Thu Jun 2 14:52:43 2016
> config:
>
> NAME          STATE     READ WRITE CKSUM
> zp1           ONLINE       0     0     0
>   mirror-0    ONLINE       0     0     0
>     da3       ONLINE       0     0     0
>     da5       ONLINE       0     0     0
>   mirror-1    ONLINE       0     0     0
>     da7       ONLINE       0     0     0
>     da8       ONLINE       0     0     0
>
> When performing logging calculations, the system looks like this
> pretty much continuously (via gstat):
>
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
> 7 8175 1374 7526 0.5 3000 48931 0.5 65.2| da3
> 6 8337 1542 8025 0.5 2995 48947 0.5 68.1| da5
> 2 8740 1529 8584 0.5 3138 52564 0.4 72.6| da7
> 3 8637 1427 7961 0.5 3136 52564 0.4 65.6| da8
>
> And zpool iostat:
> zp1          414G   474G  10.9K  8.94K  50.2M  51.8M
>   mirror     207G   237G  5.44K  4.56K  25.0M  26.2M
>     da3         -      -  2.50K  2.37K  12.3M  26.2M
>     da5         -      -  2.65K  2.37K  13.3M  26.1M
>   mirror     207G   237G  5.42K  4.39K  25.3M  25.6M
>     da7         -      -  2.62K  2.40K  13.2M  25.6M
>     da8         -      -  2.57K  2.40K  12.5M  25.6M
>
> After a couple of minutes (usually between 2 and 10) with drive %busy
> and CPU load around 70%, the system hangs.
>
> During the hang, one of two scenarios plays out (possibly more, but
> these are the ones officially witnessed):
>
> Scenario 1) Processes continue to function as long as they don't touch
> the storage. One drive immediately drops to 0 r/s and 0 w/s, the rest
> of the drives quickly fall to 0 as well (maybe within 1 - 5 seconds),
> and the drive lights on the chassis go dark. Any commands I enter in
> the terminal hang after the newline.
>
> During Scenario 1 - gstat output (notice the L(q) column for da7):
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
> 0 0 0 0 0.0 0 0 0.0 0.0| da3
> 0 0 0 0 0.0 0 0 0.0 0.0| da5
> 8 0 0 0 0.0 0 0 0.0 0.0| da7
> 0 0 0 0 0.0 0 0 0.0 0.0| da8
>
>
> Scenario 2) Some processes continue to function in a read-only state,
> but writes to all drives fall to 0.
>
> During Scenario 2 - gstat output (again notice the L(q) column for da3):
> dT: 1.002s w: 1.000s
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
> 69 0 0 0 0.0 0 0 0.0 0.0| da3
> 0 3207 3207 45727 0.2 0 0 0.0 37.8| da5
> 0 2753 2753 24673 0.2 0 0 0.0 23.1| da7
> 0 3027 3027 26425 0.2 0 0 0.0 28.6| da8
>
> About 60 seconds later the system becomes responsive again (I assume
> kern.cam.da.default_timeout is the 60 second timer, although I haven't
> changed it to confirm), and a bunch of error logs appear for the
> drive that had a non-zero L(q) column (da7 in scenario 1 last
> night, da3 in scenario 2 today). Any commands I typed in the terminal
> finally fire off, and CPU and disk usage quickly climb back to
> 70-80%+. If the errors are bad enough (or possibly if two drives hang
> simultaneously?) read and/or write errors show up in zpool status.
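> (The current value can be checked with:
> # sysctl kern.cam.da.default_timeout
> which should report 60 if it hasn't been changed.)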
>
> Failed attempts to repair: replaced the backplane. Ran full memory
> tests. Upgraded the 3610's to the latest firmware. Used Node B in
> place of Node A. Flashed the controller with the latest IT firmware
> from SuperMicro. Swapped power supplies. Pulled one CPU and half the
> memory off Node A, put that into Node B and tried the tests again.
> Set kern.cam.da.X.delete_method=DISABLE (see below). Probably tuned a
> couple dozen other CAM and ZFS sysctl/loader variables in testing.
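> (For reference, the delete method was disabled per drive with
> commands along the lines of:
> # sysctl kern.cam.da.3.delete_method=DISABLE
> and the same for da5, da7 and da8.)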
>
> Nothing stopped these errors.
>
> Until I ran this:
> # camcontrol tags daX -N 1
> (where X is 3,5,7 and 8)
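> That is, something like:
> # for d in da3 da5 da7 da8; do camcontrol tags $d -N 1; done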
>
> With tags set to 1 for each device, all errors have ceased for the
> last 18 hours (the system ran all night at load). I've not yet spent
> the time to determine how much that affects performance, but it is
> definitely working.
>
> To reproduce this bug, today I ran:
> # camcontrol tags daX -N 8
>
> on all drives. A few minutes later, one drive hung (see the gstat
> output for da3 above in scenario 2, which I had copied just seconds
> before the errors that follow):
>
> Jun 3 13:34:10 s18 kernel: (da3:mpr0:0:12:0): READ(10). CDB: 28 00 09
> cc 97 30 00 00 08 00 length 4096 SMID 961 command timeout cm
> 0xfffffe0000d2ad50 ccb 0xfffff807bfcef800 target 12, handle(0x000d)
> Jun 3 13:34:10 s18 kernel: mpr0: At enclosure level 0, slot 4,
> connector name ( )
> Jun 3 13:34:10 s18 kernel: mpr0: timedout cm 0xfffffe0000d2ad50
> allocated tm 0xfffffe0000cdc150
> Jun 3 13:34:10 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1
> Aborting command 0xfffffe0000d2ad50
> Jun 3 13:34:10 s18 kernel: mpr0: Sending reset from mprsas_send_abort
> for target ID 12
> [...snipped about 50 more lines for brevity]
>
> With 1 tag, gstat still shows L(q) values that jump into the 70's at
> times, but from what I understand of the source, devstat counts
> queued operations separately from camcontrol's simultaneous
> dev_openings. So I assume that's nothing to be concerned about, but I
> figured I'd mention it.
>
> # camcontrol tags da7 -v
>
> (pass7:mpr0:0:16:0): dev_openings 0
> (pass7:mpr0:0:16:0): dev_active 1
> (pass7:mpr0:0:16:0): allocated 1
> (pass7:mpr0:0:16:0): queued 0
> (pass7:mpr0:0:16:0): held 0
> (pass7:mpr0:0:16:0): mintags 2
> (pass7:mpr0:0:16:0): maxtags 255
>
> Here is a full error message from last night on da7 (the gstat output
> for scenario 1 above was copied right before these lines). Apologies
> for the verbosity, but it may be helpful to some; this was logged with
> dev.mpr.0.debug_level=0x037F:
>
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
> 5a 18 c0 00 00 08 00 length 4096 SMID 240 command timeout cm
> 0xfffffe0000cefb00 ccb 0xfffff805dfeca800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: timedout cm 0xfffffe0000cefb00
> allocated tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 15
> Aborting command 0xfffffe0000cefb00
> Jun 2 19:48:42 s18 kernel: mpr0: Sending reset from mprsas_send_abort
> for target ID 16
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14
> 24 35 e0 00 00 08 00 length 4096 SMID 562 command timeout cm
> 0xfffffe0000d0a1a0 ccb 0xfffff80daeb84000 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm
> 0xfffffe0000d0a1a0 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 command timeout
> cm 0xfffffe0000d1e720 ccb 0xfffff80628c3d800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm
> 0xfffffe0000d1e720 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
> cd fc 98 00 00 08 00 length 4096 SMID 571 command timeout cm
> 0xfffffe0000d0ad70 ccb 0xfffff8018a54b800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm
> 0xfffffe0000d0ad70 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
> cd fc 90 00 00 08 00 length 4096 SMID 540 command timeout cm
> 0xfffffe0000d084c0 ccb 0xfffff80288e58000 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm
> 0xfffffe0000d084c0 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
> cd fc 88 00 00 08 00 length 4096 SMID 817 command timeout cm
> 0xfffffe0000d1f050 ccb 0xfffff8004c3ef800 target 16, handle(0x0011)
> Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
> connector name ( )
> Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm
> 0xfffffe0000d1f050 for processing by tm 0xfffffe0000cdd3b0
> Jun 2 19:48:42 s18 kernel: mpr0: EventReply :
> Jun 2 19:48:42 s18 kernel: EventDataLength: 2
> Jun 2 19:48:42 s18 kernel: AckRequired: 0
> Jun 2 19:48:42 s18 kernel: Event: SasDiscovery (0x16)
> Jun 2 19:48:42 s18 kernel: EventContext: 0x0
> Jun 2 19:48:42 s18 kernel: Flags: 1<InProgress>
> Jun 2 19:48:42 s18 kernel: ReasonCode: Discovery Started
> Jun 2 19:48:42 s18 kernel: PhysicalPort: 0
> Jun 2 19:48:42 s18 kernel: DiscoveryStatus: 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00
> 11 8d fe 50 00 00 10 00 length 8192 SMID 186 completed cm
> 0xfffffe0000ceb420 ccb 0xfffff80cd1d88800 during recovery ioc 804b
> scsi 0 state c xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00
> 11 8d fe 50 00 00 10 00 length 8192 SMID 186 terminated ioc 804b scsi
> 0 state c xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
> cd fc 98 00 00 08 00 length 4096 SMID 571 completed timedout cm
> 0xfffffe0000d0ad70 ccb 0xfffff8018a54b800 during recovery ioc 804b
> scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc
> 98 00 00 08 00 length 4096 SMID 571 terminated ioc 804b scsi 0 state c
> xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
> cd fc 90 00 00 08 00 length 4096 SMID 540 completed timedout cm
> 0xfffffe0000d084c0 ccb 0xfffff80288e58000 during recovery ioc 804b
> scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc
> 90 00 00 08 00 length 4096 SMID 540 terminated ioc 804b scsi 0 state c
> xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
> cd fc 88 00 00 08 00 length 4096 SMID 817 completed timedout cm
> 0xfffffe0000d1f050 ccb 0xfffff8004c3ef800 during recovery ioc 804b
> scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc
> 88 00 00 08 00 length 4096 SMID 817 terminated ioc 804b scsi 0 state c
> xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
> 5a 18 c0 00 00 08 00 length 4096 SMID 240 completed timedout cm
> 0xfffffe0000cefb00 ccb 0xfffff805dfeca800 during recovery ioc 8048
> scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35
> e0 00 00 08 00 length 4096 SMID 562 completed timedout cm
> 0xfffffe0000d0a1a0 ccb 0xfffff80daeb84000 during recovery ioc 804b
> scsi 0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35
> e0 00 00 08 00 length 4096 SMID 562 terminated ioc 804b scsi 0 state c
> xfer 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 completed
> timedout cm 0xfffffe0000d1e720 ccb 0xfffff80628c3d800 during recovery
> ioc 804b scsi 0
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 terminated ioc
> 804b scsi 0 state c xfer 0
> Jun 2 19:48:42 s18 kernel: (xpt0:mpr0:0:16:0): SMID 15 abort TaskMID
> 240 status 0x0 code 0x0 count 7
> Jun 2 19:48:42 s18 kernel: (xpt0:mpr0:0:16:0): SMID 15 finished
> recovery after aborting TaskMID 240
> Jun 2 19:48:42 s18 kernel: mpr0: Unfreezing devq for target ID 16
> Jun 2 19:48:42 s18 kernel: mpr0: EventReply :
> Jun 2 19:48:42 s18 kernel: EventDataLength: 4
> Jun 2 19:48:42 s18 kernel: AckRequired: 0
> Jun 2 19:48:42 s18 kernel: Event: SasTopologyChangeList (0x1c)
> Jun 2 19:48:42 s18 kernel: EventContext: 0x0
> Jun 2 19:48:42 s18 kernel: EnclosureHandle: 0x2
> Jun 2 19:48:42 s18 kernel: ExpanderDevHandle: 0x9
> Jun 2 19:48:42 s18 kernel: NumPhys: 31
> Jun 2 19:48:42 s18 kernel: NumEntries: 1
> Jun 2 19:48:42 s18 kernel: StartPhyNum: 8
> Jun 2 19:48:42 s18 kernel: ExpStatus: Responding (0x3)
> Jun 2 19:48:42 s18 kernel: PhysicalPort: 0
> Jun 2 19:48:42 s18 kernel: PHY[8].AttachedDevHandle: 0x0011
> Jun 2 19:48:42 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
> Jun 2 19:48:42 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
> Jun 2 19:48:42 s18 kernel: mpr0: (73)->(mprsas_fw_work) Working on
> Event: [16]
> Jun 2 19:48:42 s18 kernel: mpr0: (74)->(mprsas_fw_work) Event Free: [16]
> Jun 2 19:48:42 s18 kernel: mpr0: (74)->(mprsas_fw_work) Working on
> Event: [1c]
> Jun 2 19:48:42 s18 kernel: mpr0: (75)->(mprsas_fw_work) Event Free: [1c]
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
> 5a 18 c0 00 00 08 00
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): CAM status: Command
> timeout
> Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): Retrying command
> Jun 2 19:48:43 s18 kernel: mpr0: EventReply :
> Jun 2 19:48:43 s18 kernel: EventDataLength: 2
> Jun 2 19:48:43 s18 kernel: AckRequired: 0
> Jun 2 19:48:43 s18 kernel: Event: SasDiscovery (0x16)
> Jun 2 19:48:43 s18 kernel: EventContext: 0x0
> Jun 2 19:48:43 s18 kernel: Flags: 0
> Jun 2 19:48:43 s18 kernel: ReasonCode: Discovery Complete
> Jun 2 19:48:43 s18 kernel: PhysicalPort: 0
> Jun 2 19:48:43 s18 kernel: DiscoveryStatus: 0
> Jun 2 19:48:43 s18 kernel: mpr0: (75)->(mprsas_fw_work) Working on
> Event: [16]
> Jun 2 19:48:43 s18 kernel: mpr0: (76)->(mprsas_fw_work) Event Free: [16]
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00
> 11 8d fe 50 00 00 10 00
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): CAM status: SCSI Status
> Error
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): SCSI status: Check
> Condition
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): SCSI sense: UNIT
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): Retrying command (per
> sense data)
>
> I'm not sure if this is an Intel, CAM, or Avago issue (or something
> else entirely), but I'd be happy to help figure it out if anyone has
> any suggestions for me to test.
>
> Thanks!
>
> -Kyle
>