Avago LSI SAS 3008 & Intel SSD Timeouts
list-news
list-news at mindpackstudios.com
Tue Jun 7 19:24:40 UTC 2016
I have additional confirmation that it's not faulty hardware.
I moved the 4 disks that carry the postgresql database over to another
server (same model - TWIN 2028-DECR). Mounted the zpool and fired up my
application.
This server is using a much earlier firmware on the SAS controller.
Different CPU / Memory / etc.
Errors happen within the first couple minutes, and continue every few
minutes (notice time-stamps for each drive timeout every few minutes):
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e
74 79 e0 00 00 08 00 length 4096 SMID 582 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e
74 79 e8 00 00 08 00 length 4096 SMID 1009 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): ATA COMMAND PASS
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length
512 SMID 315 terminated ioc 804b scsi 0 state c xfer 0
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 33
91 5c 68 00 00 08 00 length 4096 SMID 183 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 36
f2 39 40 00 00 10 00 length 8192 SMID 446 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 715 terminated ioc 804b
scsi 0 state c xfer 0
Jun 7 13:08:32 s17 kernel: mpr0: Unfreezing devq for target ID 14
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 36
ea dc 60 00 00 08 00
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: Command timeout
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e
74 79 e0 00 00 08 00
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: SCSI Status
Error
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI status: Check Condition
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command (per
sense data)
Jun 7 13:11:08 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 4
Aborting command 0xfffffe0000be0140
Jun 7 13:11:08 s17 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 10
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f6 ee f0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f6 ee d8 00 00 10 00 length 8192 SMID 262 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length
512 SMID 692 terminated ioc 804b scsi 0 state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 19
be 13 a0 00 00 10 00 length 8192 SMID 509 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21
3c 00 d8 00 00 08 00 length 4096 SMID 911 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21
3c 00 d0 00 00 08 00 length 4096 SMID 918 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21
3c 00 c8 00 00 08 00 length 4096 SMID 585 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 297 terminated ioc 804b
scsi 0 state c xfer 0
Jun 7 13:11:08 s17 kernel: mpr0: Unfreezing devq for target ID 10
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 35
26 ca f0 00 00 08 00
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command timeout
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f6 ee f0 00 00 08 00
Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check Condition
Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): Retrying command (per
sense data)
Jun 7 13:13:04 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 5
Aborting command 0xfffffe0000bfcca0
Jun 7 13:13:04 s17 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 10
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length
512 SMID 504 terminated ioc 804b scsi 0 state c xfer 0
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b
8d 99 48 00 00 08 00 length 4096 SMID 677 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13
6b df b8 00 00 10 00 length 8192 SMID 563 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f7 cd a8 00 00 08 00 length 4096 SMID 723 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f7 cd b0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 478 terminated ioc 804b
scsi 0 state c xfer 0
Jun 7 13:13:04 s17 kernel: mpr0: Unfreezing devq for target ID 10
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e
d6 de f0 00 00 08 00
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command timeout
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:13:05 s17 kernel: mpr0: log_info(0x31120440): originator(PL),
code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): ATA COMMAND PASS
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL),
code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
completed with an error
Jun 7 13:13:05 s17 kernel: mpr0: (da2:log_info(0x31120440):
originator(PL), code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: mpr0:0:mpr0: 10:log_info(0x31120440):
originator(PL), code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: 0): mpr0: Retrying command
Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL),
code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00
Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): CAM status: CCB
request completed with an error
Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL),
code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: (da2:mpr0: mpr0:0:log_info(0x31120440):
originator(PL), code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: 10:0): Retrying command
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b
8d 99 48 00 00 08 00
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
completed with an error
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13
6b df b8 00 00 10 00
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
completed with an error
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f7 cd a8 00 00 08 00
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
completed with an error
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f7 cd b0 00 00 08 00
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
completed with an error
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e
d6 de f0 00 00 08 00
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
completed with an error
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00
Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check Condition
Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Error 6, Retries exhausted
Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Invalidating pack
Jun 7 13:15:11 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 6
Aborting command 0xfffffe0000c1e960
Jun 7 13:15:11 s17 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 11
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): ATA COMMAND PASS
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length
512 SMID 942 terminated ioc 804b scsi 0 state c xfer 0
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 23
7f 21 c0 00 00 08 00 length 4096 SMID 359 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 31
bb 68 30 00 00 08 00 length 4096 SMID 597 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19
80 02 68 00 00 50 00 length 40960 SMID 786 terminated ioc 804b scsi 0
state c xfer(da3:mpr0:0:11:0): READ(10). CDB: 28 00 22 02 ea 38 00 00 10 00
Jun 7 13:15:12 s17 kernel: 0
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: Command timeout
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19
7e 0d 30 00 00 10 00 length 8192 SMID 602 terminated ioc 804b scsi 0
state c xfer (da3:0
Jun 7 13:15:12 s17 kernel: mpr0:0: (da3:mpr0:0:11:0): SYNCHRONIZE
CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 441
terminated ioc 804b scsi 0 sta11:te c xfer 0
Jun 7 13:15:12 s17 kernel: 0): mpr0: Retrying command
Jun 7 13:15:12 s17 kernel: Unfreezing devq for target ID 11
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: SCSI Status Error
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI status: Check Condition
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): Retrying command (per
sense data)
gstat output:
(I'm guessing I caught this during the da2 error)
#gstat -do
L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps
ms/d o/s ms/o %busy Name
70 0 0 0 0.0 0 0 0.0 0 0
0.0 0 0.0 0.0| da2
0 0 0 0 0.0 0 0 0.0 0 0
0.0 0 0.0 0.0| da3
0 0 0 0 0.0 0 0 0.0 0 0
0.0 0 0.0 0.0| da10
0 0 0 0 0.0 0 0 0.0 0 0
0.0 0 0.0 0.0| da11
I then set the tags down to 1 for each device:
#camcontrol tags da2 -N 1
#camcontrol tags da3 -N 1
#camcontrol tags da10 -N 1
#camcontrol tags da11 -N 1
And, no errors for the last hour, system still running at full load.
Everything is feeling like an NCQ firmware issue. Intel s3610 says it
supports NCQ in it's SSDs with 32 tags. But I've pulled the errors with
tags set to 8 plenty of times.
(See NCQ line below.)
# camcontrol identify da2
pass2: <INTEL SSDSC2BX480G4 G2010150> ACS-2 ATA SATA 3.x device
pass2: 1200.000MB/s transfers, Command Queueing Enabled
protocol ATA/ATAPI-9 SATA 3.x
device model INTEL SSDSC2BX480G4
firmware revision G2010150
serial number [redacted]
WWN [redacted]
cylinders 16383
heads 16
sectors/track 63
sector size logical 512, physical 4096, offset 0
LBA supported 268435455 sectors
LBA48 supported 937703088 sectors
PIO supported PIO4
DMA supported WDMA2 UDMA6
media RPM non-rotating
Feature Support Enabled Value Vendor
read ahead yes yes
write cache yes yes
flush cache yes yes
overlap no
Tagged Command Queuing (TCQ) no no
Native Command Queuing (NCQ) yes 32 tags
NCQ Queue Management no
NCQ Streaming no
Receive & Send FPDMA Queued no
SMART yes yes
microcode download yes yes
security yes no
power management yes yes
advanced power management no no
automatic acoustic management no no
media status notification no no
power-up in Standby no no
write-read-verify no no
unload yes yes
general purpose logging yes yes
free-fall no no
Data Set Management (DSM/TRIM) yes
DSM - max 512byte blocks yes 4
DSM - deterministic read yes zeroed
Host Protected Area (HPA) yes no 937703088/937703088
HPA - Security no
And it doesn't appear I have any way to deactivate it in firmware.
Which would be a nice test. I did attempt this with no luck:
# camcontrol negotiate da2 -T disable
(pass2:mpr0:0:10:0): transfer speed: 1200.000MB/s
(pass2:mpr0:0:10:0): tagged queueing: enabled
camcontrol: XPT_SET_TRANS_SETTINGS CCB failed
-Kyle
On 6/7/16 12:09 PM, list-news wrote:
> The system is a Twin. In the first post I mentioned this but I
> probably wasn't clear.
>
> The twin unit is this one:
> https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm
>
> I've used all components from twin node A and B (cpu / memory /
> mainboard / controller). I still get the errors. The backplane was
> the original thought of concern, and that has been RMA'd and replaced
> - errors continue. I've even swapped out power supplies with another
> identical unit I have here.
>
> In every case the errors continue, until I do this:
> #camcontrol daX -N 1
> (for each drive in the zpool)
>
> Then the errors stop.
>
> The system errors every few minutes while my application is running.
> Set tags to -N 1, and everything goes quiet. 16 cores at 100% cpu and
> drives 80% busy @ ~15k IO p/s, for about 5 hours solid before it
> finishes a batch, no errors are reported with -N set to 1. If I set
> tags with -N 255 for each device, errors start again within 5 minutes,
> and continue every 2-5 minutes, until the batch is finished.
>
> -Kyle
>
>> I would try, if possible, to swap the controller.
>>
>>
>>
>>
>>
>>
>> Borja.
>>
>>
>
> _______________________________________________
> freebsd-scsi at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"
More information about the freebsd-scsi
mailing list