Avago LSI SAS 3008 & Intel SSD Timeouts
list-news
list-news at mindpackstudios.com
Mon Jun 6 22:19:13 UTC 2016
System was running solid all weekend with camcontrol tags set to 1 for
each device, zero errors.
Last week I did try
# sysctl kern.cam.da.X.delete_method=DISABLE
for each drive, but it still threw errors.
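(If anyone wants to repeat that, a loop along these lines should cover every
drive. It's only a sketch; the da numbers are a guess based on the devices
that show up in the gstat output below.)
# for d in 2 3 4 5 6 7; do sysctl kern.cam.da.${d}.delete_method=DISABLE; done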
Also, I did try out bio_delete_disable earlier today:
# camcontrol tags daX -N 255
(first resetting tags back to 255 for each device, as they are currently at 1)
# sysctl vfs.zfs.vdev.bio_delete_disable=1
(a few minutes later)
Jun 6 12:28:36 s18 kernel: (da2:mpr0:0:12:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 577 command timeout cm
0xfffffe0001351550 ccb 0xfffff804e78e3800 target 12, handle(0x000c)
Jun 6 12:28:36 s18 kernel: mpr0: At enclosure level 0, slot 4,
connector name ( )
Jun 6 12:28:36 s18 kernel: mpr0: timedout cm 0xfffffe0001351550
allocated tm 0xfffffe0001322150
Jun 6 12:28:36 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1
Aborting command 0xfffffe0001351550
Jun 6 12:28:36 s18 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 12
Jun 6 12:28:36 s18 kernel: (da2:mpr0:0:12:0): READ(10). CDB: 28 00 18
45 1c c0 00 00 08 00 length 4096 SMID 583 command timeout cm
0xfffffe0001351d30 ccb 0xfffff806b9556800 target 12, handle(0x000c)
Jun 6 12:28:36 s18 kernel: mpr0: At enclosure level 0, slot 4,
connector name ( )
Jun 6 12:28:36 s18 kernel: mpr0: queued timedout cm 0xfffffe0001351d30
for processing by tm 0xfffffe0001322150
...
During the 60-second hang:
# gstat -do
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d    o/s   ms/o  %busy Name
   70      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da4
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da6
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da7
Also during the 60-second hang:
# camcontrol tags da3 -v
(pass2:mpr0:0:12:0): dev_openings 248
(pass2:mpr0:0:12:0): dev_active 7
(pass2:mpr0:0:12:0): allocated 7
(pass2:mpr0:0:12:0): queued 0
(pass2:mpr0:0:12:0): held 0
(pass2:mpr0:0:12:0): mintags 2
(pass2:mpr0:0:12:0): maxtags 255
Also during the 60-second hang:
# sysctl dev.mpr
dev.mpr.0.spinup_wait_time: 3
dev.mpr.0.chain_alloc_fail: 0
dev.mpr.0.enable_ssu: 1
dev.mpr.0.max_chains: 2048
dev.mpr.0.chain_free_lowwater: 2022
dev.mpr.0.chain_free: 2048
dev.mpr.0.io_cmds_highwater: 71
dev.mpr.0.io_cmds_active: 4
dev.mpr.0.driver_version: 09.255.01.00-fbsd
dev.mpr.0.firmware_version: 10.00.03.00
dev.mpr.0.disable_msi: 0
dev.mpr.0.disable_msix: 0
dev.mpr.0.debug_level: 895
dev.mpr.0.%parent: pci1
dev.mpr.0.%pnpinfo: vendor=0x1000 device=0x0097 subvendor=0x15d9
subdevice=0x0808 class=0x010700
dev.mpr.0.%location: pci0:1:0:0 handle=\_SB_.PCI0.BR1A.H000
dev.mpr.0.%driver: mpr
dev.mpr.0.%desc: Avago Technologies (LSI) SAS3008
dev.mpr.%parent:
Something else that may be worth considering: I ran fio & bonnie++ for
about an hour of heavy I/O (with tags still set to 255, drive busy showing
90-100%). No errors. Then I fired up my application (a threaded
Java/Postgres application), and within minutes:
# gstat -do
dT: 1.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d    o/s   ms/o  %busy Name
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da4
   71      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da6
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da7
Error:
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 30
65 13 90 00 00 10 00
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per
sense data)
...
And again 2 minutes later:
Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): WRITE(10). CDB: 2a 00 21
66 63 58 00 00 10 00
Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): CAM status: SCSI Status Error
Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI status: Check Condition
Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): Retrying command (per
sense data)
...
And again 3 minutes later:
Jun 6 13:41:29 s18 kernel: (da7:mpr0:0:18:0): WRITE(10). CDB: 2a 00 33
44 b5 b8 00 00 10 00
...
# camcontrol tags daX -N 1
(And now, after 15 minutes, zero errors.)
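(As a one-liner sketch, with the same guessed drive numbers as above:)
# for d in 2 3 4 5 6 7; do camcontrol tags da${d} -N 1; done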
Putting some thoughts together on this, which may or may not be off base
(please feel free to correct me, btw), I've noticed the following:
1) There doesn't seem to be any indication of what causes the drive to
time out. The command that fails in the error log is one of the
following: READ(10), WRITE(10), ATA COMMAND PASS THROUGH(16), or
SYNCHRONIZE CACHE(10). As I understand it, that is simply the command that
was being executed, timed out, and retried, not necessarily what caused
the drive lock-up.
2) When my application is run, it hammers postgres pretty hard, and it's
while postgres is under that load that I get the errors. fio & bonnie++
don't give me errors; daily use of the system doesn't give me errors. I'm
assuming postgresql is sending far more of a certain type of command to
the I/O subsystem than those other applications do, and the first command
that comes to mind is fsync.
3) I turned fsync off in postgresql.conf (I'm brave for science!!; the
exact line is shown below), then ran my application again with tags at
255: 100% CPU load, 70-80% drive busy.
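(For anyone following along, that change is just this one line in
postgresql.conf; the default is on, and turning it off is only sane for
throwaway test data:)
fsync = off        # stops postgres from issuing fsync() at all; unsafe outside testing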
1.5 hours later at full load - finally, a single timeout:
Jun 6 16:31:33 s18 kernel: (da2:mpr0:0:12:0): READ(10). CDB: 28 00 2d
50 1b 78 00 00 08 00 length 4096 SMID 556 command timeout cm
0xfffffe000134f9c0 ccb 0xfffff83aa5b25000 target 12, handle(0x000c)
I ran it for another 20 minutes with no additional timeouts.
I assume the fsync call turns into a zfs -> cam -> SYNCHRONIZE CACHE
command for each device, and that postgres is sending it considerably more
often than a typical application (at least with fsync turned on in
postgresql.conf). That would explain why the error is rare when fsync is
turned off or only occasional fsyncs are being sent (i.e. typical system
usage), yet when fsync is being issued repeatedly the errors start
happening every few minutes. The only explanation I can think of for why
setting tags to 1 eliminates the errors entirely is that the Intel drives
don't handle parallel commands from CAM well when one (or more) of those
commands is SYNCHRONIZE CACHE. Thoughts?
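If anyone wants to try reproducing this without postgres, a crude stand-in
for that workload might be a few parallel loops doing small overwrites,
each followed by an explicit sync to force the cache-flush traffic. A
sketch only: the /tank path and the counts are made up, and sync(8)
flushes everything rather than a single file the way fsync(2) does, so
treat it as an approximation:
#!/bin/sh
# rough fsync-heavy load generator (sketch): four parallel loops, each
# doing an 8k overwrite and then forcing a sync, to provoke frequent
# cache-flush traffic to the vdevs
for i in 1 2 3 4; do
    ( while :; do
        dd if=/dev/zero of=/tank/fsync-test.$i bs=8k count=1 conv=notrunc >/dev/null 2>&1
        sync
      done ) &
done
wait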
Thanks,
-Kyle
On 6/6/16 3:42 AM, Borja Marcos wrote:
>> On 03 Jun 2016, at 23:49, Steven Hartland <killing at multiplay.co.uk> wrote:
>>
>> First thing would be to run gstat with -d to see if you're actually stacking up deletes, a symptom of which can be r/w dropping to zero.
>>
>> If you are seeing significant deletes it could be a FW issue on the drives.
> Hmm. I’ve suffered that badly with Intel P3500 NVMe drives, which suffer at least from a driver problem: trims are not coalesced.
> However I didn’t experience command timeouts. Reads and, especially, writes, stalled badly.
>
> A quick test for trim related trouble is setting the sysctl variable vfs.zfs.vdev.bio_delete_disable to 1. It doesn't require
> a reboot and you can quickly compare results.
>
> In my case, a somewhat similar problem in an IBM server was caused by a faulty LSI3008 card it seems. As I didn't have spare LSI3008 cards
> at the time I replaced it by a LSI2008 and everything works perfectly. Before anyone chimes in suggesting card incompatibility of some sort,
> I have a twin system with a LSI3008 working like a charm. ;)
>
> Borja.