Device timeouts(?) with LSI SAS3008 on mpr(4)
Yamagi Burmeister
lists at yamagi.org
Mon Jul 27 10:02:42 UTC 2015
Hello,
let me apologise for my late answer. My colleagues were on vacation
and I had no time to pursue this problem. But now another round:
- da0 and da1 are 80G Intel DC S3500 SSDs
- All other devices are 800G Intel DC S3700 SSDs
kern.cam.da.X.delete_max seems very high for all devices. Forcing the
tunable to a very conservative value of 65536 helps: no timeouts in 72
hours and no measurable performance impact. So my guess is:
- ZFS tries to TRIM too many blocks in one operation
- The SSD blocks for some time while processing the TRIM command
- The controller thinks that the SSD crashed and sends a reset
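
For scale, the reported defaults can be broken down with a little arithmetic. This is only a sketch under two assumptions that are not stated in this thread: the drives use 512-byte logical sectors, and a single ATA TRIM range entry covers at most 65535 sectors (its count field is 16 bits wide). The function name split_delete_max is made up for illustration.

```python
SECTOR_BYTES = 512             # assumption: 512-byte logical sectors
MAX_SECTORS_PER_RANGE = 65535  # 16-bit sector count in one ATA TRIM range entry


def split_delete_max(delete_max_bytes):
    """Return (sectors, full_ranges, leftover_sectors) for a delete_max value."""
    sectors = delete_max_bytes // SECTOR_BYTES
    full_ranges, leftover = divmod(sectors, MAX_SECTORS_PER_RANGE)
    return sectors, full_ranges, leftover


# Values taken from the sysctl output quoted further down, plus the workaround.
for label, value in [
    ("da0/da1 default", 8589803520),
    ("da2..da12 default", 12884705280),
    ("workaround", 65536),
]:
    sectors, ranges, leftover = split_delete_max(value)
    print(f"{label}: {value / 2**30:.4f} GiB, {sectors} sectors, "
          f"{ranges} full ranges + {leftover} sectors")
```

Under those assumptions the two defaults come out as exactly 256 and 384 full ranges, so a single request may cover roughly 8 or 12 GiB, while 65536 bytes is only 128 sectors.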
I did some tests with the attached tool. I'm able to reproduce the
timeouts when "enough" data was written to the device. The question is
what's "enough" data. Sometimes 50G are enough and sometimes 75G can be
trimmed without any problem.
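
To get a feeling for the numbers, here is a rough sketch of how many delete requests a 50G or 75G range needs when each request may carry at most delete_max bytes. It assumes "G" means GiB and that every request is filled up to the limit. The real request sizes depend on how ZFS, GEOM and CAM split and recombine them, so treat the result as a lower bound on the request count. The helper trim_requests is hypothetical.

```python
GIB = 2**30  # assumption: "50G" means 50 GiB


def trim_requests(total_bytes, delete_max_bytes):
    """Minimum number of requests if each carries at most delete_max_bytes."""
    return -(-total_bytes // delete_max_bytes)  # ceiling division


for total_gib in (50, 75):
    for limit in (12884705280, 65536):
        count = trim_requests(total_gib * GIB, limit)
        print(f"{total_gib} GiB with delete_max={limit}: {count} requests")
```

With the default limit a 50 GiB range fits into 5 requests, with 65536 it takes 819200 much smaller ones.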
Nevertheless, a lower kern.cam.da.X.delete_max value helps to work
around the problem, and everything else is just speculation. So:
Problem solved. Thank you for your help and input.
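
For anyone who wants to apply the same workaround, the lines for /boot/loader.conf can be generated instead of typed by hand. This is a minimal sketch that assumes the disks are da0 through da12, as in the sysctl output quoted below, and that all of them get the same value. The function name loader_conf_lines is made up for illustration.

```python
def loader_conf_lines(units, delete_max_bytes):
    """Build one kern.cam.da.X.delete_max line per da(4) unit number."""
    return [
        f"kern.cam.da.{unit}.delete_max={delete_max_bytes}"
        for unit in units
    ]


# da0..da12 with the conservative value of 65536 bytes
print("\n".join(loader_conf_lines(range(13), 65536)))
```

The output can be appended to /boot/loader.conf. The same names work with sysctl to change the values on a running system.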
Regards,
Yamagi
On Mon, 13 Jul 2015 10:54:41 +0100
Steven Hartland <killing at multiplay.co.uk> wrote:
> I assume da0 and da1 are a different disk then?
>
> With regards to your disk setup: are all of your disks SSDs? If so, why do
> you have separate log and cache devices?
>
> One thing you could try is to limit the delete size.
>
> kern.geom.dev.delete_max_sectors limits the single request size allowed
> by geom but then individual requests can be built back up in cam so I
> don't think this will help you too much.
>
> Instead I would try limiting the individual device delete_max, so add
> one line per disk into /boot/loader.conf of the form:
> kern.cam.da.X.delete_max=1073741824
>
> You can actually change these on the fly using sysctl, but in order to
> catch any cleanup done on boot, loader.conf is the best place to tune them
> permanently.
>
> I've attached a little C util which you can use to do direct disk
> deletes if you have a spare disk you can play with.
>
> Be aware that most controllers optimise deletes out if they know the
> cells are empty, hence you do need to have written data to the sectors
> each time you test a delete.
>
> As the requests go through geom anything over
> kern.geom.dev.delete_max_sectors will be split but then may well be
> recombined in CAM.
>
> Another relevant setting is vfs.zfs.vdev.trim_max_active which can be
> used to limit the number of outstanding geom delete requests to each
> device.
>
> Oh one other thing, it would be interesting to see the output from
> camcontrol identify <device> e.g.
> camcontrol identify da8
> camcontrol identify da0
>
> Regards
> Steve
>
> On 13/07/2015 10:25, Yamagi Burmeister wrote:
> > On Mon, 13 Jul 2015 10:13:32 +0100
> > Steven Hartland <killing at multiplay.co.uk> wrote:
> >
> >> What do you see from:
> >> sysctl -a | grep -E '(delete|trim)'
> > % sysctl -a | grep -E '(delete|trim)'
> > kern.geom.dev.delete_max_sectors: 262144
> > kern.cam.da.1.delete_max: 8589803520
> > kern.cam.da.1.delete_method: ATA_TRIM
> > kern.cam.da.8.delete_max: 12884705280
> > kern.cam.da.8.delete_method: ATA_TRIM
> > kern.cam.da.9.delete_max: 12884705280
> > kern.cam.da.9.delete_method: ATA_TRIM
> > kern.cam.da.3.delete_max: 12884705280
> > kern.cam.da.3.delete_method: ATA_TRIM
> > kern.cam.da.12.delete_max: 12884705280
> > kern.cam.da.12.delete_method: ATA_TRIM
> > kern.cam.da.7.delete_max: 12884705280
> > kern.cam.da.7.delete_method: ATA_TRIM
> > kern.cam.da.2.delete_max: 12884705280
> > kern.cam.da.2.delete_method: ATA_TRIM
> > kern.cam.da.11.delete_max: 12884705280
> > kern.cam.da.11.delete_method: ATA_TRIM
> > kern.cam.da.6.delete_max: 12884705280
> > kern.cam.da.6.delete_method: ATA_TRIM
> > kern.cam.da.10.delete_max: 12884705280
> > kern.cam.da.10.delete_method: ATA_TRIM
> > kern.cam.da.5.delete_max: 12884705280
> > kern.cam.da.5.delete_method: ATA_TRIM
> > kern.cam.da.0.delete_max: 8589803520
> > kern.cam.da.0.delete_method: ATA_TRIM
> > kern.cam.da.4.delete_max: 12884705280
> > kern.cam.da.4.delete_method: ATA_TRIM
> > vfs.zfs.trim.max_interval: 1
> > vfs.zfs.trim.timeout: 30
> > vfs.zfs.trim.txg_delay: 32
> > vfs.zfs.trim.enabled: 1
> > vfs.zfs.vdev.trim_max_pending: 10000
> > vfs.zfs.vdev.bio_delete_disable: 0
> > vfs.zfs.vdev.trim_max_active: 64
> > vfs.zfs.vdev.trim_min_active: 1
> > vfs.zfs.vdev.trim_on_init: 1
> > kstat.zfs.misc.arcstats.deleted: 289783817
> > kstat.zfs.misc.zio_trim.failed: 431
> > kstat.zfs.misc.zio_trim.unsupported: 0
> > kstat.zfs.misc.zio_trim.success: 6457142235
> > kstat.zfs.misc.zio_trim.bytes: 88207753330688
> >
> >
> > >> Also while you're seeing time-outs, what does the output from gstat -d -p
> > >> look like?
> > I'll try to get that data but it may take a while.
> >
> > Thank you,
> > Yamagi
> >
>
--
Homepage: www.yamagi.org
XMPP: yamagi at yamagi.org
GnuPG/GPG: 0xEFBCCBCB