ZFS and NVMe, trim caused stalling
Borja Marcos
borjam at sarenet.es
Tue May 17 09:35:04 UTC 2016
> On 17 May 2016, at 11:09, Steven Hartland <killing at multiplay.co.uk> wrote:
>
>> I understand that, but I don’t think it’s a good thing that ZFS depends blindly on a driver feature such
>> as that. Of course, it’s great to exploit it.
>>
>> I have also noticed that ZFS has a good throttling mechanism for write operations. A similar
>> mechanism should throttle trim requests so that trim requests don’t clog the whole system.
> It already does.
I see that there’s a limit to the number of active TRIM requests, but not an explicit delay such
as the one applied to write requests. So, even with the maximum number of active TRIM requests
set to one, it seems that TRIM still wins.
>>
>>> I’d be extremely hesitant to toss away TRIMs. They are actually quite important for
>>> the FTL in the drive’s firmware to properly manage the NAND wear. More free space always
>>> reduces write amplification. It tends to go as 1 / freespace, so simply dropping them on
>>> the floor should be done with great reluctance.
>> I understand. I was wondering about choosing the lesser of two evils: a 15-minute
>> I/O stall (I deleted 2 TB of data, which is a lot, but not so unrealistic) or setting TRIMs aside
>> during the peak activity.
>>
>> I see that I was wrong on that, as a throttling mechanism would probably be more than enough,
>> unless the system is close to running out of space.
>>
>> I’ve filed a bug report anyway. And copying to -stable.
>>
>>
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209571
>>
> TBH it sounds like you may have badly behaved HW. We've used ZFS + TRIM for years on large production boxes, and while we've seen slowdowns, we haven't experienced the total lockups you're describing.
I have been using ZFS+TRIM on SATA SSDs for a very long time. Actually, a single SSD I tried at home can TRIM at around 2 GB/s.
Warner Losh told me that the nvd driver does not currently coalesce TRIMs, which is a disadvantage compared to the ada driver, which does.
> The graphs on your ticket seem to indicate a peak throughput of 250 MB/s, which is extremely slow for standard SSDs, let alone NVMe ones; when you add in the fact that you have 10 of them, it seems like something is VERY wrong.
The pool is a raidz2 vdev with 10 P3500 NVMe disks. That graph is the throughput of just one of the disks (the other 9 graphs are identical). Bonnie++
reports around 1.7 GB/s writing “intelligently”, 1 GB/s “rewriting” and almost 2 GB/s “reading intelligently”, which, as far as I know, is more or less
reasonable.
The really slow part is the TRIM requests, issued when destroying the files (four concurrent bonnie++ tasks writing a total of 2 TB).
> I just did a quick test on our DB box here, creating and then deleting a 2G file as you describe, and I couldn't even spot the delete in the general noise, it was so quick to process; and that's a 6-disk machine with P3700s.
Totalling 2 TB? In my case it was FOUR files, 512 GB each.
I’m really puzzled,
Borja.
More information about the freebsd-stable
mailing list