Avago LSI SAS 3008 & Intel SSD Timeouts
Steven Hartland
killing at multiplay.co.uk
Tue Jun 7 19:53:08 UTC 2016
CDB: 85 is a TRIM command IIRC, I know you tried it before using BIO
delete but assuming your running ZFS can you set the following in
loader.conf and see how you get on.
vfs.zfs.trim.enabled=0
Regards
Steve
On 07/06/2016 20:24, list-news wrote:
> I have additional confirmation that it's not faulty hardware.
>
> I moved the 4 disks that carry the postgresql database over to another
> server (same model - TWIN 2028-DECR). Mounted the zpool and fired up
> my application.
>
> This server is using a much earlier firmware on the SAS controller.
> Different CPU / Memory / etc.
>
> Errors happen within the first couple minutes, and continue every few
> minutes (notice time-stamps for each drive timeout every few minutes):
>
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00
> 0e 74 79 e0 00 00 08 00 length 4096 SMID 582 terminated ioc 804b scsi
> 0 state c xfer 0
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00
> 0e 74 79 e8 00 00 08 00 length 4096 SMID 1009 terminated ioc 804b scsi
> 0 state c xfer 0
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): ATA COMMAND PASS
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
> length 512 SMID 315 terminated ioc 804b scsi 0 state c xfer 0
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00
> 33 91 5c 68 00 00 08 00 length 4096 SMID 183 terminated ioc 804b scsi
> 0 state c xfer 0
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00
> 36 f2 39 40 00 00 10 00 length 8192 SMID 446 terminated ioc 804b scsi
> 0 state c xfer 0
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SYNCHRONIZE CACHE(10).
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 715 terminated ioc
> 804b scsi 0 state c xfer 0
> Jun 7 13:08:32 s17 kernel: mpr0: Unfreezing devq for target ID 14
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00
> 36 ea dc 60 00 00 08 00
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: Command
> timeout
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00
> 0e 74 79 e0 00 00 08 00
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: SCSI
> Status Error
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI status: Check
> Condition
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI sense: UNIT
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command (per
> sense data)
> Jun 7 13:11:08 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 4
> Aborting command 0xfffffe0000be0140
> Jun 7 13:11:08 s17 kernel: mpr0: Sending reset from mprsas_send_abort
> for target ID 10
> Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
> f6 ee f0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0
> state c xfer 0
> Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
> f6 ee d8 00 00 10 00 length 8192 SMID 262 terminated ioc 804b scsi 0
> state c xfer 0
> Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
> length 512 SMID 692 terminated ioc 804b scsi 0 state c xfer 0
> Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 19
> be 13 a0 00 00 10 00 length 8192 SMID 509 terminated ioc 804b scsi 0
> state c xfer 0
> Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21
> 3c 00 d8 00 00 08 00 length 4096 SMID 911 terminated ioc 804b scsi 0
> state c xfer 0
> Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21
> 3c 00 d0 00 00 08 00 length 4096 SMID 918 terminated ioc 804b scsi 0
> state c xfer 0
> Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21
> 3c 00 c8 00 00 08 00 length 4096 SMID 585 terminated ioc 804b scsi 0
> state c xfer 0
> Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10).
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 297 terminated ioc
> 804b scsi 0 state c xfer 0
> Jun 7 13:11:08 s17 kernel: mpr0: Unfreezing devq for target ID 10
> Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 35
> 26 ca f0 00 00 08 00
> Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command
> timeout
> Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
> f6 ee f0 00 00 08 00
> Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status
> Error
> Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check
> Condition
> Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): Retrying command (per
> sense data)
> Jun 7 13:13:04 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 5
> Aborting command 0xfffffe0000bfcca0
> Jun 7 13:13:04 s17 kernel: mpr0: Sending reset from mprsas_send_abort
> for target ID 10
> Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
> length 512 SMID 504 terminated ioc 804b scsi 0 state c xfer 0
> Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b
> 8d 99 48 00 00 08 00 length 4096 SMID 677 terminated ioc 804b scsi 0
> state c xfer 0
> Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13
> 6b df b8 00 00 10 00 length 8192 SMID 563 terminated ioc 804b scsi 0
> state c xfer 0
> Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
> f7 cd a8 00 00 08 00 length 4096 SMID 723 terminated ioc 804b scsi 0
> state c xfer 0
> Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
> f7 cd b0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0
> state c xfer 0
> Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10).
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 478 terminated ioc
> 804b scsi 0 state c xfer 0
> Jun 7 13:13:04 s17 kernel: mpr0: Unfreezing devq for target ID 10
> Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e
> d6 de f0 00 00 08 00
> Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command
> timeout
> Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun 7 13:13:05 s17 kernel: mpr0: log_info(0x31120440):
> originator(PL), code(0x12), sub_code(0x0440)
> Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): ATA COMMAND PASS
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
> Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL),
> code(0x12), sub_code(0x0440)
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
> completed with an error
> Jun 7 13:13:05 s17 kernel: mpr0: (da2:log_info(0x31120440):
> originator(PL), code(0x12), sub_code(0x0440)
> Jun 7 13:13:05 s17 kernel: mpr0:0:mpr0: 10:log_info(0x31120440):
> originator(PL), code(0x12), sub_code(0x0440)
> Jun 7 13:13:05 s17 kernel: 0): mpr0: Retrying command
> Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL),
> code(0x12), sub_code(0x0440)
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10).
> CDB: 35 00 00 00 00 00 00 00 00 00
> Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): CAM status: CCB
> request completed with an error
> Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL),
> code(0x12), sub_code(0x0440)
> Jun 7 13:13:05 s17 kernel: (da2:mpr0: mpr0:0:log_info(0x31120440):
> originator(PL), code(0x12), sub_code(0x0440)
> Jun 7 13:13:05 s17 kernel: 10:0): Retrying command
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b
> 8d 99 48 00 00 08 00
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
> completed with an error
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13
> 6b df b8 00 00 10 00
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
> completed with an error
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
> f7 cd a8 00 00 08 00
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
> completed with an error
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
> f7 cd b0 00 00 08 00
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
> completed with an error
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e
> d6 de f0 00 00 08 00
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
> completed with an error
> Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10).
> CDB: 35 00 00 00 00 00 00 00 00 00
> Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status
> Error
> Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check
> Condition
> Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Error 6, Retries exhausted
> Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Invalidating pack
> Jun 7 13:15:11 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 6
> Aborting command 0xfffffe0000c1e960
> Jun 7 13:15:11 s17 kernel: mpr0: Sending reset from mprsas_send_abort
> for target ID 11
> Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): ATA COMMAND PASS
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
> length 512 SMID 942 terminated ioc 804b scsi 0 state c xfer 0
> Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 23
> 7f 21 c0 00 00 08 00 length 4096 SMID 359 terminated ioc 804b scsi 0
> state c xfer 0
> Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 31
> bb 68 30 00 00 08 00 length 4096 SMID 597 terminated ioc 804b scsi 0
> state c xfer 0
> Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19
> 80 02 68 00 00 50 00 length 40960 SMID 786 terminated ioc 804b scsi 0
> state c xfer(da3:mpr0:0:11:0): READ(10). CDB: 28 00 22 02 ea 38 00 00
> 10 00
> Jun 7 13:15:12 s17 kernel: 0
> Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: Command
> timeout
> Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19
> 7e 0d 30 00 00 10 00 length 8192 SMID 602 terminated ioc 804b scsi 0
> state c xfer (da3:0
> Jun 7 13:15:12 s17 kernel: mpr0:0: (da3:mpr0:0:11:0): SYNCHRONIZE
> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 441
> terminated ioc 804b scsi 0 sta11:te c xfer 0
> Jun 7 13:15:12 s17 kernel: 0): mpr0: Retrying command
> Jun 7 13:15:12 s17 kernel: Unfreezing devq for target ID 11
> Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10).
> CDB: 35 00 00 00 00 00 00 00 00 00
> Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: SCSI Status
> Error
> Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI status: Check
> Condition
> Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI sense: UNIT
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): Retrying command (per
> sense data)
>
> gstat output:
> (I'm guessing I caught this during the da2 error)
>
> #gstat -do
> L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps
> ms/d o/s ms/o %busy Name
> 70 0 0 0 0.0 0 0 0.0 0 0
> 0.0 0 0.0 0.0| da2
> 0 0 0 0 0.0 0 0 0.0 0 0
> 0.0 0 0.0 0.0| da3
> 0 0 0 0 0.0 0 0 0.0 0 0
> 0.0 0 0.0 0.0| da10
> 0 0 0 0 0.0 0 0 0.0 0 0
> 0.0 0 0.0 0.0| da11
>
>
> I then set the tags down to 1 for each device:
>
> #camcontrol tags da2 -N 1
> #camcontrol tags da3 -N 1
> #camcontrol tags da10 -N 1
> #camcontrol tags da11 -N 1
>
> And, no errors for the last hour, system still running at full load.
>
> Everything is feeling like an NCQ firmware issue. Intel s3610 says it
> supports NCQ in it's SSDs with 32 tags. But I've pulled the errors
> with tags set to 8 plenty of times.
>
> (See NCQ line below.)
>
> # camcontrol identify da2
>
> pass2: <INTEL SSDSC2BX480G4 G2010150> ACS-2 ATA SATA 3.x device
> pass2: 1200.000MB/s transfers, Command Queueing Enabled
> protocol ATA/ATAPI-9 SATA 3.x
> device model INTEL SSDSC2BX480G4
> firmware revision G2010150
> serial number [redacted]
> WWN [redacted]
> cylinders 16383
> heads 16
> sectors/track 63
> sector size logical 512, physical 4096, offset 0
> LBA supported 268435455 sectors
> LBA48 supported 937703088 sectors
> PIO supported PIO4
> DMA supported WDMA2 UDMA6
> media RPM non-rotating
>
> Feature Support Enabled Value Vendor
> read ahead yes yes
> write cache yes yes
> flush cache yes yes
> overlap no
> Tagged Command Queuing (TCQ) no no
> Native Command Queuing (NCQ) yes 32 tags
> NCQ Queue Management no
> NCQ Streaming no
> Receive & Send FPDMA Queued no
> SMART yes yes
> microcode download yes yes
> security yes no
> power management yes yes
> advanced power management no no
> automatic acoustic management no no
> media status notification no no
> power-up in Standby no no
> write-read-verify no no
> unload yes yes
> general purpose logging yes yes
> free-fall no no
> Data Set Management (DSM/TRIM) yes
> DSM - max 512byte blocks yes 4
> DSM - deterministic read yes zeroed
> Host Protected Area (HPA) yes no 937703088/937703088
> HPA - Security no
>
> And it doesn't appear I have any way to deactivate it in firmware.
> Which would be a nice test. I did attempt this with no luck:
> # camcontrol negotiate da2 -T disable
> (pass2:mpr0:0:10:0): transfer speed: 1200.000MB/s
> (pass2:mpr0:0:10:0): tagged queueing: enabled
> camcontrol: XPT_SET_TRANS_SETTINGS CCB failed
>
> -Kyle
>
>
> On 6/7/16 12:09 PM, list-news wrote:
>> The system is a Twin. In the first post I mentioned this but I
>> probably wasn't clear.
>>
>> The twin unit is this one:
>> https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm
>>
>> I've used all components from twin node A and B (cpu / memory /
>> mainboard / controller). I still get the errors. The backplane was
>> the original thought of concern, and that has been RMA'd and replaced
>> - errors continue. I've even swapped out power supplies with another
>> identical unit I have here.
>>
>> In every case the errors continue, until I do this:
>> #camcontrol daX -N 1
>> (for each drive in the zpool)
>>
>> Then the errors stop.
>>
>> The system errors every few minutes while my application is running.
>> Set tags to -N 1, and everything goes quiet. 16 cores at 100% cpu
>> and drives 80% busy @ ~15k IO p/s, for about 5 hours solid before it
>> finishes a batch, no errors are reported with -N set to 1. If I set
>> tags with -N 255 for each device, errors start again within 5
>> minutes, and continue every 2-5 minutes, until the batch is finished.
>>
>> -Kyle
>>
>>> I would try, if possible, to swap the controller.
>>>
>>>
>>>
>>>
>>>
>>>
>>> Borja.
>>>
>>>
>>
>> _______________________________________________
>> freebsd-scsi at freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"
>
>
> _______________________________________________
> freebsd-scsi at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe at freebsd.org"
More information about the freebsd-scsi
mailing list