Avago LSI SAS 3008 & Intel SSD Timeouts
list-news
list-news at mindpackstudios.com
Fri Jun 3 20:37:03 UTC 2016
Greetings!
For the last month I've been fighting SAS 3008 issues on a SuperMicro
2028TP-DECR (2-node Twin server). The system is running FreeBSD
10.3-RELEASE #1 r298749.
The errors we see on this system look very similar to the errors posted
here back in March:
https://lists.freebsd.org/pipermail/freebsd-scsi/2016-March/007000.html
The system hosts a large PostgreSQL database and an application that
spends most of its days calculating log entries on a couple hundred GB
of tables. Initially this system had two mirrored WD HDD boot drives and
eight Intel 730 240GB SSDs in a ZFS pool with four mirrored vdevs.
Shortly after install/setup, a couple of those Intel 730s began showing
READ and WRITE errors in zpool status; additionally, mpr0 messages were
reporting drive timeouts on READ(10), WRITE(10), and SYNCHRONIZE
CACHE(10) commands, such as this:
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
5a 18 c0 00 00 08 00 length 4096 SMID 240 command timeout cm
0xfffffe0000cefb00 ccb 0xfffff805dfeca800 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: timedout cm 0xfffffe0000cefb00
allocated tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 15
Aborting command 0xfffffe0000cefb00
Jun 2 19:48:42 s18 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 16
[...snipped about 50 more lines for brevity]
Our immediate thought was that the drives were faulty, so we decided to
replace some of them with Intel 480GB DC3610s. After replacing four of
the drives we noticed the 3610s were also throwing errors. Read/write
error counts in zpool status would climb into the thousands over a night
of calculations, with dozens of mpr errors in the logs. This was
happening to all of the drives in the array; some were more prone than
others, but given enough time, every drive would display the errors.
This raised the suspicion that the problem was backplane and/or
controller related and the drives were fine (or possibly the drive
firmware is the same in the 730s and 3610s and both exhibit the problem,
but the backplane did seem more likely at the time).
We moved the Node A system sled (cpu/mem/sas/mb) into the Node B chassis
slot to see if the errors would change at all (thinking some SATA ports
on the backplane might be bad). We hit a bit of luck: the boot drives,
when inserted into Node B drive sleds 1 & 2, wouldn't fire up and were
undetectable by the SAS controller. SuperMicro concluded this was likely
a backplane issue and the backplane was replaced. With the new backplane
in, the drives in Node B sleds 1/2 are now visible to the controller
during boot, confirming the backplane did have a problem. (This is worth
mentioning because a commonly suggested fix on the internet for similar
timeout errors is to replace the backplane and/or cables.)
Unfortunately, the luck ran out: once we put the system back under load,
the errors came back. I rebuilt the pool using just four 3610s as two
mirrored vdevs and copied all the data over. Here is the pool as of
yesterday:
  pool: zp1
 state: ONLINE
  scan: resilvered 107M in 0h0m with 0 errors on Thu Jun 2 14:52:43 2016
config:

        NAME        STATE     READ WRITE CKSUM
        zp1         ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da5     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da8     ONLINE       0     0     0
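For reference, a pool with this layout would be created along these
lines (not necessarily the exact command I used):
# zpool create zp1 mirror da3 da5 mirror da7 da8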
When performing logging calculations, the system looks like this pretty
much continuously (via gstat):
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
7 8175 1374 7526 0.5 3000 48931 0.5 65.2| da3
6 8337 1542 8025 0.5 2995 48947 0.5 68.1| da5
2 8740 1529 8584 0.5 3138 52564 0.4 72.6| da7
3 8637 1427 7961 0.5 3136 52564 0.4 65.6| da8
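For anyone wanting to watch the same thing, gstat can be narrowed to
just the pool members with something like this (the filter regex is only
an example):
# gstat -I 1s -f 'da[3578]$'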
And zpool iostat:
              capacity     operations    bandwidth
pool       alloc   free   read  write   read  write
zp1         414G   474G  10.9K  8.94K  50.2M  51.8M
  mirror    207G   237G  5.44K  4.56K  25.0M  26.2M
    da3        -      -  2.50K  2.37K  12.3M  26.2M
    da5        -      -  2.65K  2.37K  13.3M  26.1M
  mirror    207G   237G  5.42K  4.39K  25.3M  25.6M
    da7        -      -  2.62K  2.40K  13.2M  25.6M
    da8        -      -  2.57K  2.40K  12.5M  25.6M
After a couple of minutes (usually between 2 and 10) with drive %busy
and CPU load around 70%, the system hangs.
During the hang, one of two scenarios plays out (possibly more, but
these are the ones officially witnessed):
Scenario 1) Processes continue to function as long as they don't touch
the storage. One drive immediately drops to 0 r/s and 0 w/s, the rest of
the drives quickly follow (within 1-5 seconds), and the drive lights on
the chassis go dark. Any commands I enter in the terminal hang after the
newline.
During Scenario 1 - gstat output (notice the L(q) column for da7):
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 0 0 0 0.0 0 0 0.0 0.0| da3
0 0 0 0 0.0 0 0 0.0 0.0| da5
8 0 0 0 0.0 0 0 0.0 0.0| da7
0 0 0 0 0.0 0 0 0.0 0.0| da8
Scenario 2) Some processes continue to function in a read-only state,
but all writes for all drives fall to 0.
During Scenario 2 - gstat output (again notice the L(q) column for da3):
dT: 1.002s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
69 0 0 0 0.0 0 0 0.0 0.0| da3
0 3207 3207 45727 0.2 0 0 0.0 37.8| da5
0 2753 2753 24673 0.2 0 0 0.0 23.1| da7
0 3027 3027 26425 0.2 0 0 0.0 28.6| da8
About 60 seconds later the system becomes responsive again (I assume
kern.cam.da.default_timeout is the 60-second timer, although I haven't
changed it to confirm), and a bunch of error logs appear for the drive
that had a non-zero L(q) column (da7 in scenario 1 last night, da3 in
scenario 2 today). Any commands I typed in the terminal finally fire
off, and the CPU and disk usage quickly climb back to 70-80%+. If the
errors are bad enough (or possibly if two drives hang simultaneously?),
read and/or write errors show up in zpool status.
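If anyone wants to double-check that assumption, I believe the timer in
question can be inspected (and lowered at runtime) with sysctl; a sketch
I haven't actually tried yet:
# sysctl kern.cam.da.default_timeout
# sysctl kern.cam.da.default_timeout=30
If that tunable really is the 60-second window, lowering it should
shrink the hang accordingly.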
Failed attempts to repair:
- Replaced the backplane.
- Ran full memory tests.
- Upgraded the 3610s to the latest firmware.
- Used Node B in place of Node A.
- Flashed the controller with the latest IT firmware from SuperMicro.
- Swapped power supplies.
- Pulled one proc and half the memory off Node A, put that into Node B,
  and tried the tests again.
- Set kern.cam.da.X.delete_method=DISABLE (see the sketch after this
  list).
- Probably tuned a couple dozen other cam and zfs sysctl/loader
  variables in testing.
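For reference, the delete_method change was applied per device, roughly
like this (the device numbers are just the current pool members):
# sysctl kern.cam.da.3.delete_method=DISABLE
# sysctl kern.cam.da.5.delete_method=DISABLE
# sysctl kern.cam.da.7.delete_method=DISABLE
# sysctl kern.cam.da.8.delete_method=DISABLE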
Nothing stopped these errors.
Until I ran this:
# camcontrol tags daX -N 1
(where X is 3,5,7 and 8)
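To apply it to all four pool members at once, a simple sh loop works,
something like:
# for d in da3 da5 da7 da8; do camcontrol tags $d -N 1; done
If this workaround holds up, I'll probably drop that into /etc/rc.local
so it survives a reboot.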
With tags set to 1 for each device, all errors have ceased for the last
18 hours (the system ran all night at load). I haven't yet spent the
time to determine how much that affects performance, but it is
definitely working.
To reproduce this bug, today I ran:
# camcontrol tags daX -N 8
on all drives. A few minutes later, a drive hung (see the gstat output
for da3 above in scenario 2, which I had copied just seconds before the
errors that follow):
Jun 3 13:34:10 s18 kernel: (da3:mpr0:0:12:0): READ(10). CDB: 28 00 09
cc 97 30 00 00 08 00 length 4096 SMID 961 command timeout cm
0xfffffe0000d2ad50 ccb 0xfffff807bfcef800 target 12, handle(0x000d)
Jun 3 13:34:10 s18 kernel: mpr0: At enclosure level 0, slot 4,
connector name ( )
Jun 3 13:34:10 s18 kernel: mpr0: timedout cm 0xfffffe0000d2ad50
allocated tm 0xfffffe0000cdc150
Jun 3 13:34:10 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1
Aborting command 0xfffffe0000d2ad50
Jun 3 13:34:10 s18 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 12
[...snipped about 50 more lines for brevity]
With 1 tag, gstat still shows L(q) values that jump into the 70s at
times, but from what I understand of the source, devstat counts queued
operations separately from camcontrol's simultaneous dev_openings, so I
assume that's nothing to be concerned about, but figured I'd mention it.
# camcontrol tags da7 -v
(pass7:mpr0:0:16:0): dev_openings 0
(pass7:mpr0:0:16:0): dev_active 1
(pass7:mpr0:0:16:0): allocated 1
(pass7:mpr0:0:16:0): queued 0
(pass7:mpr0:0:16:0): held 0
(pass7:mpr0:0:16:0): mintags 2
(pass7:mpr0:0:16:0): maxtags 255
Here is a full error message from last night on da7 (the gstat scenario
1 output above was copied right before these lines). Apologies for the
verbosity, but it may be helpful to some; logged with
dev.mpr.0.debug_level=0x037F:
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
5a 18 c0 00 00 08 00 length 4096 SMID 240 command timeout cm
0xfffffe0000cefb00 ccb 0xfffff805dfeca800 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: timedout cm 0xfffffe0000cefb00
allocated tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 15
Aborting command 0xfffffe0000cefb00
Jun 2 19:48:42 s18 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 16
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14
24 35 e0 00 00 08 00 length 4096 SMID 562 command timeout cm
0xfffffe0000d0a1a0 ccb 0xfffff80daeb84000 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d0a1a0
for processing by tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 command timeout cm
0xfffffe0000d1e720 ccb 0xfffff80628c3d800 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d1e720
for processing by tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
cd fc 98 00 00 08 00 length 4096 SMID 571 command timeout cm
0xfffffe0000d0ad70 ccb 0xfffff8018a54b800 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d0ad70
for processing by tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
cd fc 90 00 00 08 00 length 4096 SMID 540 command timeout cm
0xfffffe0000d084c0 ccb 0xfffff80288e58000 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d084c0
for processing by tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
cd fc 88 00 00 08 00 length 4096 SMID 817 command timeout cm
0xfffffe0000d1f050 ccb 0xfffff8004c3ef800 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d1f050
for processing by tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: mpr0: EventReply :
Jun 2 19:48:42 s18 kernel: EventDataLength: 2
Jun 2 19:48:42 s18 kernel: AckRequired: 0
Jun 2 19:48:42 s18 kernel: Event: SasDiscovery (0x16)
Jun 2 19:48:42 s18 kernel: EventContext: 0x0
Jun 2 19:48:42 s18 kernel: Flags: 1<InProgress>
Jun 2 19:48:42 s18 kernel: ReasonCode: Discovery Started
Jun 2 19:48:42 s18 kernel: PhysicalPort: 0
Jun 2 19:48:42 s18 kernel: DiscoveryStatus: 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00 11
8d fe 50 00 00 10 00 length 8192 SMID 186 completed cm
0xfffffe0000ceb420 ccb 0xfffff80cd1d88800 during recovery ioc 804b scsi
0 state c xfer 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00 11
8d fe 50 00 00 10 00 length 8192 SMID 186 terminated ioc 804b scsi 0
state c xfer 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
cd fc 98 00 00 08 00 length 4096 SMID 571 completed timedout cm
0xfffffe0000d0ad70 ccb 0xfffff8018a54b800 during recovery ioc 804b scsi
0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 98 00
00 08 00 length 4096 SMID 571 terminated ioc 804b scsi 0 state c xfer 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
cd fc 90 00 00 08 00 length 4096 SMID 540 completed timedout cm
0xfffffe0000d084c0 ccb 0xfffff80288e58000 during recovery ioc 804b scsi
0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 90 00
00 08 00 length 4096 SMID 540 terminated ioc 804b scsi 0 state c xfer 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
cd fc 88 00 00 08 00 length 4096 SMID 817 completed timedout cm
0xfffffe0000d1f050 ccb 0xfffff8004c3ef800 during recovery ioc 804b scsi
0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 88 00
00 08 00 length 4096 SMID 817 terminated ioc 804b scsi 0 state c xfer 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
5a 18 c0 00 00 08 00 length 4096 SMID 240 completed timedout cm
0xfffffe0000cefb00 ccb 0xfffff805dfeca800 during recovery ioc 8048 scsi
0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35 e0 00
00 08 00 length 4096 SMID 562 completed timedout cm 0xfffffe0000d0a1a0
ccb 0xfffff80daeb84000 during recovery ioc 804b scsi 0 state c xf
(da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35 e0 00 00 08 00 length
4096 SMID 562 terminated ioc 804b scsi 0 state c xfer 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 completed timedout
cm 0xfffffe0000d1e720 ccb 0xfffff80628c3d800 during recovery ioc 804b
scsi 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 terminated ioc 804b
scsi 0 state c xfer 0
Jun 2 19:48:42 s18 kernel: (xpt0:mpr0:0:16:0): SMID 15 abort TaskMID
240 status 0x0 code 0x0 count 7
Jun 2 19:48:42 s18 kernel: (xpt0:mpr0:0:16:0): SMID 15 finished
recovery after aborting TaskMID 240
Jun 2 19:48:42 s18 kernel: mpr0: Unfreezing devq for target ID 16
Jun 2 19:48:42 s18 kernel: mpr0: EventReply :
Jun 2 19:48:42 s18 kernel: EventDataLength: 4
Jun 2 19:48:42 s18 kernel: AckRequired: 0
Jun 2 19:48:42 s18 kernel: Event: SasTopologyChangeList (0x1c)
Jun 2 19:48:42 s18 kernel: EventContext: 0x0
Jun 2 19:48:42 s18 kernel: EnclosureHandle: 0x2
Jun 2 19:48:42 s18 kernel: ExpanderDevHandle: 0x9
Jun 2 19:48:42 s18 kernel: NumPhys: 31
Jun 2 19:48:42 s18 kernel: NumEntries: 1
Jun 2 19:48:42 s18 kernel: StartPhyNum: 8
Jun 2 19:48:42 s18 kernel: ExpStatus: Responding (0x3)
Jun 2 19:48:42 s18 kernel: PhysicalPort: 0
Jun 2 19:48:42 s18 kernel: PHY[8].AttachedDevHandle: 0x0011
Jun 2 19:48:42 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
Jun 2 19:48:42 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
Jun 2 19:48:42 s18 kernel: mpr0: (73)->(mprsas_fw_work) Working on
Event: [16]
Jun 2 19:48:42 s18 kernel: mpr0: (74)->(mprsas_fw_work) Event Free: [16]
Jun 2 19:48:42 s18 kernel: mpr0: (74)->(mprsas_fw_work) Working on
Event: [1c]
Jun 2 19:48:42 s18 kernel: mpr0: (75)->(mprsas_fw_work) Event Free: [1c]
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
5a 18 c0 00 00 08 00
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): CAM status: Command timeout
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): Retrying command
Jun 2 19:48:43 s18 kernel: mpr0: EventReply :
Jun 2 19:48:43 s18 kernel: EventDataLength: 2
Jun 2 19:48:43 s18 kernel: AckRequired: 0
Jun 2 19:48:43 s18 kernel: Event: SasDiscovery (0x16)
Jun 2 19:48:43 s18 kernel: EventContext: 0x0
Jun 2 19:48:43 s18 kernel: Flags: 0
Jun 2 19:48:43 s18 kernel: ReasonCode: Discovery Complete
Jun 2 19:48:43 s18 kernel: PhysicalPort: 0
Jun 2 19:48:43 s18 kernel: DiscoveryStatus: 0
Jun 2 19:48:43 s18 kernel: mpr0: (75)->(mprsas_fw_work) Working on
Event: [16]
Jun 2 19:48:43 s18 kernel: mpr0: (76)->(mprsas_fw_work) Event Free: [16]
Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00 11
8d fe 50 00 00 10 00
Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): CAM status: SCSI Status Error
Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): SCSI status: Check Condition
Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): Retrying command (per
sense data)
I'm not sure if this is an Intel, CAM, or Avago issue (or something
else?), but I'd be happy to help figure it out if anyone has suggestions
for me to test.
Thanks!
-Kyle