Right-sizing the geli thread pool
Alan Somers
asomers at freebsd.org
Thu Jul 9 21:26:54 UTC 2020
Currently, geli creates a separate thread pool for each provider, and by
default each thread pool contains one thread per CPU. On a large server
with many encrypted disks, that can balloon into a very large number of
threads! I have a patch in progress that switches from per-provider thread
pools to a single thread pool for the entire module. Happily, I see read
IOPs increase by up to 60%. But to my surprise, write IOPs _decrease_ by
up to 25%. dtrace suggests that the CPU usage is dominated by the
vmem_free call in biodone, as shown in the stack below.
kernel`lock_delay+0x32
kernel`biodone+0x88
kernel`g_io_deliver+0x214
geom_eli.ko`g_eli_write_done+0xf6
kernel`g_io_deliver+0x214
kernel`md_kthread+0x275
kernel`fork_exit+0x7e
kernel`0xffffffff8104784e
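To make the post-patch structure concrete, here is a rough userland model
of it: one global queue of work items feeding a fixed pool of worker
threads. This is only a sketch; the names and the generic work-item type
are made up, and the real code queues struct bio in the kernel rather than
using pthreads.

/*
 * Userland model of the patched design (hypothetical names): one global
 * queue of work items feeding a fixed pool of worker threads.
 */
#include <pthread.h>
#include <stdlib.h>
#include <sys/queue.h>

struct work {
    TAILQ_ENTRY(work) link;
    void (*fn)(void *);     /* e.g. encrypt or decrypt one bio */
    void *arg;
};

static TAILQ_HEAD(, work) workq = TAILQ_HEAD_INITIALIZER(workq);
static pthread_mutex_t workq_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t workq_cv = PTHREAD_COND_INITIALIZER;

/* Producer side: what the start routine would do for each incoming bio. */
static void
work_enqueue(struct work *w)
{
    pthread_mutex_lock(&workq_mtx);
    TAILQ_INSERT_TAIL(&workq, w, link);
    pthread_cond_signal(&workq_cv);
    pthread_mutex_unlock(&workq_mtx);
}

/* Consumer side: one of N identical workers, N typically the CPU count. */
static void *
worker(void *arg)
{
    struct work *w;

    (void)arg;
    for (;;) {
        pthread_mutex_lock(&workq_mtx);
        while ((w = TAILQ_FIRST(&workq)) == NULL)
            pthread_cond_wait(&workq_cv, &workq_mtx);
        TAILQ_REMOVE(&workq, w, link);
        pthread_mutex_unlock(&workq_mtx);
        w->fn(w->arg);      /* do the crypto, then deliver */
        free(w);
    }
    return (NULL);
}

In this model every worker contends on the one queue lock and may pick up
a bio that was submitted on a different core, which is exactly the
cache-thrashing concern described next.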
I only have one idea for how to improve things from here. The geli thread
pool is still fed by a single global bio queue, which could cause cache
thrashing if bios get moved between cores too often. I think a superior
design would be to use a separate bio queue for each geli thread and use
work stealing to balance them (see the sketch after this list). However,
1) that doesn't explain why this change benefits reads more than writes, and
2) work stealing is hard to get right, and I can't find any examples in the
kernel.
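Here is the kind of thing I have in mind, again as a simplified userland
model with made-up names: each worker drains its own queue first and only
probes its siblings' queues when it runs dry. A kernel version would queue
struct bio, use mtx(9)/cv(9), and pin workers to CPUs; the sleep/wakeup
handling below is deliberately naive and shows one place where work
stealing gets tricky.

/*
 * Per-worker queues with naive work stealing (userland model; all names
 * are hypothetical).  Each worker prefers its own queue and only probes
 * the other queues when its own is empty.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

#define NWORKERS 8              /* assume one worker per CPU */

struct work {
    TAILQ_ENTRY(work) link;
    void (*fn)(void *);         /* e.g. encrypt or decrypt one bio */
    void *arg;
};

static struct wqueue {
    pthread_mutex_t mtx;
    pthread_cond_t cv;
    TAILQ_HEAD(, work) head;
} queues[NWORKERS];

static void
queues_init(void)
{
    for (int i = 0; i < NWORKERS; i++) {
        pthread_mutex_init(&queues[i].mtx, NULL);
        pthread_cond_init(&queues[i].cv, NULL);
        TAILQ_INIT(&queues[i].head);
    }
}

/* Producer: ideally qi is the submitting CPU, so bios stay local. */
static void
work_enqueue(struct work *w, int qi)
{
    struct wqueue *q = &queues[qi % NWORKERS];

    pthread_mutex_lock(&q->mtx);
    TAILQ_INSERT_TAIL(&q->head, w, link);
    pthread_cond_signal(&q->cv);
    pthread_mutex_unlock(&q->mtx);
}

/* Non-blocking pop from one queue; returns NULL if it is empty. */
static struct work *
try_pop(struct wqueue *q)
{
    struct work *w;

    pthread_mutex_lock(&q->mtx);
    if ((w = TAILQ_FIRST(&q->head)) != NULL)
        TAILQ_REMOVE(&q->head, w, link);
    pthread_mutex_unlock(&q->mtx);
    return (w);
}

static void *
worker(void *arg)
{
    int self = (int)(intptr_t)arg;
    struct wqueue *mine = &queues[self];
    struct work *w;

    for (;;) {
        /* Fast path: our own queue, which stays cache-warm. */
        w = try_pop(mine);
        /* Slow path: steal from the other workers' queues. */
        for (int i = 1; w == NULL && i < NWORKERS; i++)
            w = try_pop(&queues[(self + i) % NWORKERS]);
        if (w == NULL) {
            /*
             * Nothing found anywhere, so sleep on our own queue.
             * Work landing on a sibling's queue will not wake us;
             * this is one of the policy holes a real implementation
             * has to close.
             */
            pthread_mutex_lock(&mine->mtx);
            while (TAILQ_EMPTY(&mine->head))
                pthread_cond_wait(&mine->cv, &mine->mtx);
            pthread_mutex_unlock(&mine->mtx);
            continue;
        }
        w->fn(w->arg);          /* do the crypto, then deliver */
        free(w);
    }
    return (NULL);
}

In particular, a worker that goes to sleep on its own queue will not
notice work that later lands on a sibling's queue, so a real version needs
a smarter wakeup or stealing policy; that is part of why I'm asking for
prior art.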
Can anybody offer tips or code for implementing work stealing? Or any
other suggestions about why my write performance is suffering? I would
like to get this change committed, but not without resolving that issue.
-Alan