Right-sizing the geli thread pool
Alan Somers
asomers at freebsd.org
Tue Jul 14 16:37:12 UTC 2020
On Fri, Jul 10, 2020 at 5:13 PM Alan Somers <asomers at freebsd.org> wrote:
> On Fri, Jul 10, 2020 at 11:55 AM Mark Johnston <markj at freebsd.org> wrote:
>
>> On Fri, Jul 10, 2020 at 05:55:50AM +0200, Mateusz Guzik wrote:
>> > On 7/9/20, Alan Somers <asomers at freebsd.org> wrote:
>> > > Currently, geli creates a separate thread pool for each provider, and by
>> > > default each thread pool contains one thread per CPU.  On a large server
>> > > with many encrypted disks, that can balloon into a very large number of
>> > > threads!  I have a patch in progress that switches from per-provider
>> > > thread pools to a single thread pool for the entire module.  Happily, I
>> > > see read IOPs increase by up to 60%.  But to my surprise, write IOPs
>> > > _decrease_ by up to 25%.  dtrace suggests that the CPU usage is dominated
>> > > by the vmem_free call in biodone, as in the stack below.
>> > >
>> > > kernel`lock_delay+0x32
>> > > kernel`biodone+0x88
>> > > kernel`g_io_deliver+0x214
>> > > geom_eli.ko`g_eli_write_done+0xf6
>> > > kernel`g_io_deliver+0x214
>> > > kernel`md_kthread+0x275
>> > > kernel`fork_exit+0x7e
>> > > kernel`0xffffffff8104784e
>> > >
>> > > I only have one idea for how to improve things from here.  The geli
>> > > thread pool is still fed by a single global bio queue.  That could cause
>> > > cache thrashing, if bios get moved between cores too often.  I think a
>> > > superior design would be to use a separate bio queue for each geli
>> > > thread, and use work-stealing to balance them.  However,
>> > >
>> > > 1) That doesn't explain why this change benefits reads more than writes,
>> > > and
>> > > 2) work-stealing is hard to get right, and I can't find any examples in
>> > > the kernel.
>> > >
>> > > Can anybody offer tips or code for implementing work stealing?  Or any
>> > > other suggestions about why my write performance is suffering?  I would
>> > > like to get this change committed, but not without resolving that issue.
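[A minimal sketch of the per-thread-queue-plus-work-stealing idea described
above might look like the following.  This is an illustration only, not
existing geli code: the names (g_eli_worker_queue, g_eli_enqueue,
g_eli_dequeue, g_eli_nworkers) are hypothetical, and a real version would
still need wakeups, per-provider ordering, and draining on shutdown.  Each
worker services its own bioq first and scans the other workers' queues only
when its own is empty.]

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/pcpu.h>

/*
 * One queue per worker thread (hypothetical layout).  wq_lock must be
 * mtx_init()ed and wq_bios bioq_init()ed at module setup time.
 */
struct g_eli_worker_queue {
        struct mtx              wq_lock;
        struct bio_queue_head   wq_bios;
};

static struct g_eli_worker_queue *g_eli_wq;     /* array, one per worker */
static int g_eli_nworkers;

/* Producer: queue the bio near the CPU that issued it. */
static void
g_eli_enqueue(struct bio *bp)
{
        struct g_eli_worker_queue *wq;

        wq = &g_eli_wq[curcpu % g_eli_nworkers];
        mtx_lock(&wq->wq_lock);
        bioq_insert_tail(&wq->wq_bios, bp);
        mtx_unlock(&wq->wq_lock);
}

/*
 * Consumer: worker "me" takes from its own queue when possible and
 * steals from a sibling queue otherwise.
 */
static struct bio *
g_eli_dequeue(int me)
{
        struct g_eli_worker_queue *wq;
        struct bio *bp;
        int i;

        for (i = 0; i < g_eli_nworkers; i++) {
                wq = &g_eli_wq[(me + i) % g_eli_nworkers];
                mtx_lock(&wq->wq_lock);
                bp = bioq_takefirst(&wq->wq_bios);
                mtx_unlock(&wq->wq_lock);
                if (bp != NULL)
                        return (bp);
        }
        return (NULL);          /* caller sleeps until more work arrives */
}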
>> > >
>> >
>> > I can't comment on revamping the design, but:
>> >
>> > void
>> > vmem_free(vmem_t *vm, vmem_addr_t addr, vmem_size_t size)
>> > {
>> >         qcache_t *qc;
>> >         MPASS(size > 0);
>> >
>> >         if (size <= vm->vm_qcache_max &&
>> >             __predict_true(addr >= VMEM_ADDR_QCACHE_MIN)) {
>> >                 qc = &vm->vm_qcache[(size - 1) >> vm->vm_quantum_shift];
>> >                 uma_zfree(qc->qc_cache, (void *)addr);
>> >         } else
>> >                 vmem_xfree(vm, addr, size);
>> > }
>> >
>> > What sizes are being passed here? Or more to the point, is it feasible
>> > to bump qcache to stick to uma in this call? If lock contention is
>> > indeed coming from vmem_xfree this change would get rid of the problem
>> > without having to rework anything.
>>
>> We would have to enable the quantum cache in the transient KVA arena.
>> This itself should not have many downsides on platforms with plenty of
>> KVA, but it only solves the immediate problem: before freeing the KVA
>> biodone() has to perform a global TLB shootdown, and the quantum cache
>> doesn't help at all with that.
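[For context, the quantum cache is something an arena opts into when it is
created.  The sketch below is illustrative only and is not the actual
transient-arena setup code; the arena name, sizes, and init function are
made up.  "Enabling" the quantum cache roughly means passing a nonzero
qcache_max to vmem_create(9) so that small vmem_alloc()/vmem_free() calls
go through per-CPU UMA caches instead of taking the arena lock.]

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/vmem.h>

/*
 * Illustration only: with a nonzero qcache_max (here an arbitrary 8
 * pages), frees of sizes up to that limit hit uma_zfree() via the quantum
 * cache; with qcache_max == 0 they always fall through to vmem_xfree().
 */
static vmem_t *example_arena;

static void
example_arena_init(vmem_addr_t base, vmem_size_t size)
{
        example_arena = vmem_create("example arena", base, size,
            PAGE_SIZE,          /* quantum */
            8 * PAGE_SIZE,      /* qcache_max: enables the quantum cache */
            M_WAITOK);
}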
>>
>> kib's suggestion of using sf_buf(9) to transiently map crypto(9)
>> payloads in software crypto drivers that require a mapping, and using
>> unmapped cryptop requests for hardware drivers that do not, sounds like
>> the right solution. cryptop structures can already handle multiple data
>> container types, like uios and mbufs, so it should be possible to also
>> support vm_page arrays in OCF like we do for unmapped BIOs, and let
>> crypto(9) drivers create transient mappings when necessary.
>>
>> > For read performance, while it is nice that there is a win, it may
>> > still be less than it should be.  I think it is prudent to get a
>> > flamegraph from both cases.
>>
>
> And, it works! Using sf_buf was a good suggestion, because the write path
> only needs to access the user data for a short amount of time, in a
> CPU-pinned thread. So I can create my sf_buf and immediately free it. I
> didn't even need to change crypto(9). Writes are much faster now. I
> haven't incorporated sf_buf into the read path yet, though.
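[A rough sketch of that write-path pattern, with hypothetical names
(copyin_unmapped_bio, dst) and details simplified: each page backing the
unmapped bio is mapped with sf_buf(9), copied out, and unmapped again
before moving on, so the mapping only exists briefly inside the pinned
worker thread.]

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>
#include <sys/sf_buf.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

/*
 * Hypothetical sketch of the write-path idea: map the pages backing an
 * unmapped bio one at a time with sf_buf(9), copy the data out (here into
 * "dst", e.g. the buffer later handed to crypto(9)), and drop the mapping
 * immediately.  The pages of an in-flight bio are already wired, and the
 * mapping only lives for the memcpy(), so no long-lived transient KVA has
 * to be allocated and later freed in biodone().
 */
static void
copyin_unmapped_bio(struct bio *bp, void *dst)
{
        struct sf_buf *sf;
        char *d, *src;
        off_t resid, len;
        int i, off;

        d = dst;
        off = bp->bio_ma_offset;
        resid = bp->bio_length;
        for (i = 0; i < bp->bio_ma_n && resid > 0; i++) {
                len = MIN(PAGE_SIZE - off, resid);

                sf = sf_buf_alloc(bp->bio_ma[i], 0);    /* transient mapping */
                src = (char *)sf_buf_kva(sf) + off;
                memcpy(d, src, len);
                sf_buf_free(sf);                        /* unmap right away */

                d += len;
                resid -= len;
                off = 0;        /* only the first page has an offset */
        }
}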
>
> Preliminary benchmarks:
>
>                                              Write IOPs  Read IOPs
> baseline                                          66908      60351
> kern.geom.eli.threads=1                          156719     144117
> kern.geom.eli.threads=1, direct dispatch         205326     201867
> direct dispatch, shared thread pool              156631     226874
> direct dispatch, shared thread pool, sf_buf      432573     247275
>
> -Alan
>
Ok, I got ahead of myself. Actually, it _doesn't_ work. I was initially
elated by how much extra speed I got by using sf_bufs. But even so, writes
are still slower (and reads faster) when I use a shared thread pool. I'll
work on committing the sf_buf changes, then try to debug the shared thread
pool performance degradation again.
-Alan