Re: Stressing malloc(9)
- Reply: Alan Somers: "Re: Stressing malloc(9)"
- In reply to: Mark Johnston: "Re: Stressing malloc(9)"
Date: Sun, 21 Apr 2024 23:47:41 UTC
On Sun, Apr 21, 2024 at 10:09 AM Mark Johnston <markj@freebsd.org> wrote:
>
> On Sat, Apr 20, 2024 at 11:23:41AM -0600, Alan Somers wrote:
> > On Sat, Apr 20, 2024 at 9:07 AM Mark Johnston <markj@freebsd.org> wrote:
> > >
> > > On Fri, Apr 19, 2024 at 04:23:51PM -0600, Alan Somers wrote:
> > > > TLDR;
> > > > How can I create a workload that causes malloc(9)'s performance to
> > > > plummet?
> > > >
> > > > Background:
> > > > I recently witnessed a performance problem on a production server.
> > > > Overall throughput dropped by over 30x.  dtrace showed that 60% of
> > > > the CPU time was dominated by lock_delay as called by three
> > > > functions: printf (via ctl_worker_thread), g_eli_alloc_data, and
> > > > g_eli_write_done.  One thing those three have in common is that
> > > > they all use malloc(9).  Fixing the problem was as simple as
> > > > telling CTL to stop printing so many warnings, by tuning
> > > > kern.cam.ctl.time_io_secs=100000.
> > > >
> > > > But even with CTL quieted, dtrace still reports ~6% of the CPU
> > > > cycles in lock_delay via g_eli_alloc_data.  So I believe that
> > > > malloc is limiting geli's performance.  I would like to try
> > > > replacing it with uma(9).
> > >
> > > What is the size of the allocations that g_eli_alloc_data() is doing?
> > > malloc() is a pretty thin layer over UMA for allocations <= 64KB.
> > > Larger allocations are handled by a different path (malloc_large())
> > > which goes directly to the kmem_* allocator functions.  Those
> > > functions are very expensive: they're serialized by global locks and
> > > need to update the pmap (and perform TLB shootdowns when memory is
> > > freed).  They're not meant to be used at a high rate.
> >
> > In my benchmarks so far, 512B.  In the real application the size is
> > mostly between 4k and 16k, and it's always a multiple of 4k.  But it's
> > sometimes large enough to use malloc_large, and it's those
> > malloc_large calls that account for the majority of the time spent in
> > g_eli_alloc_data.  lockstat shows that malloc_large, as called by
> > g_eli_alloc_data, sometimes blocks for multiple ms.
> >
> > But oddly, if I change the parameters so that g_eli_alloc_data
> > allocates 128kB, I still don't see malloc_large getting called.  And
> > both dtrace and vmstat show that malloc is mostly operating on 512B
> > allocations.  But dtrace does confirm that g_eli_alloc_data is being
> > called with 128kB arguments.  Maybe something is getting inlined?
>
> malloc_large() is annotated __noinline, for what it's worth.
>
> > I don't understand how this is happening.  I could probably figure it
> > out if I recompile with some extra SDT probes, though.
>
> What is g_eli_alloc_sz on your system?

33kiB.  That's larger than I expected.  When I use a larger blocksize in
my benchmark, then I do indeed see malloc_large activity, and 11% of the
CPU is spent in g_eli_alloc_data.  I guess I'll add some UMA zones for
this purpose.  I'll try 256k and 512k zones, rounding up allocations as
necessary.  Thanks for the tip.
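Roughly what I have in mind is something like this untested sketch (the
zone names, helper functions, and the M_ELI_DATA malloc type are all
invented for illustration; the real change would live next to
g_eli_alloc_data()):

/*
 * Untested sketch: fixed-size UMA zones for geli data buffers, rounding
 * each request up to the nearest zone.  All names here are illustrative.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <vm/uma.h>

static MALLOC_DEFINE(M_ELI_DATA, "eli_data", "geli data buffers");

static uma_zone_t g_eli_zone_256k, g_eli_zone_512k;

static void
g_eli_data_zones_init(void)
{
        g_eli_zone_256k = uma_zcreate("g_eli_data_256k", 256 * 1024,
            NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
        g_eli_zone_512k = uma_zcreate("g_eli_data_512k", 512 * 1024,
            NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
}

static void *
g_eli_data_alloc(size_t sz)
{
        if (sz <= 256 * 1024)
                return (uma_zalloc(g_eli_zone_256k, M_WAITOK));
        if (sz <= 512 * 1024)
                return (uma_zalloc(g_eli_zone_512k, M_WAITOK));
        /* Fall back to malloc(9) for anything larger. */
        return (malloc(sz, M_ELI_DATA, M_WAITOK));
}

static void
g_eli_data_free(void *p, size_t sz)
{
        if (sz <= 256 * 1024)
                uma_zfree(g_eli_zone_256k, p);
        else if (sz <= 512 * 1024)
                uma_zfree(g_eli_zone_512k, p);
        else
                free(p, M_ELI_DATA);
}

Rounding everything up wastes some memory, but since these buffers are
short-lived the per-CPU caching should matter more than the waste.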
> > > My first guess would be that your production workload was hitting
> > > this path, and your benchmarks are not.  If you have stack traces or
> > > lock names from DTrace, that would help validate this theory, in
> > > which case using UMA to cache buffers would be a reasonable solution.
> >
> > Would that require creating an extra UMA zone for every possible geli
> > allocation size above 64kB?
>
> Something like that.  Or have a zone of maxphys-sized buffers (actually
> I think it needs to be slightly larger than that?) and accept the
> corresponding waste, given that these allocations are short-lived.  This
> is basically what g_eli_alloc_data() already does.
>
> > > > But on a non-production server, none of my benchmark workloads
> > > > causes g_eli_alloc_data to break a sweat.  I can't get its CPU
> > > > consumption to rise higher than 0.5%.  And that's using the
> > > > smallest sector size and block size that I can.
> > > >
> > > > So my question is: does anybody have a program that can really
> > > > stress malloc(9)?  I'd like to run it in parallel with my geli
> > > > benchmarks to see how much it interferes.
> > > >
> > > > -Alan
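For concreteness, the kind of stressor I'm imagining is a throwaway
kernel module along these lines (untested sketch; all names are
invented, and the thread teardown is hand-waved):

/*
 * Untested sketch of a malloc(9) stressor: a few kthreads doing nothing
 * but malloc/free of sizes on both sides of the 64KB malloc_large()
 * cutoff.  All names are invented; teardown is simplistic.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/module.h>
#include <sys/malloc.h>
#include <sys/kthread.h>

static MALLOC_DEFINE(M_MSTRESS, "mstress", "malloc(9) stress buffers");

#define MSTRESS_NTHREADS 8

static volatile bool mstress_stop = false;

static void
mstress_loop(void *arg __unused)
{
        static const size_t sizes[] = { 512, 4096, 16384, 65536, 131072 };
        void *p;
        u_int i;

        for (i = 0; !mstress_stop; i++) {
                p = malloc(sizes[i % nitems(sizes)], M_MSTRESS, M_WAITOK);
                free(p, M_MSTRESS);
        }
        kthread_exit();
}

static int
mstress_modevent(module_t mod __unused, int type, void *arg __unused)
{
        int error, i;

        switch (type) {
        case MOD_LOAD:
                for (i = 0; i < MSTRESS_NTHREADS; i++) {
                        error = kthread_add(mstress_loop, NULL, NULL, NULL,
                            0, 0, "mstress%d", i);
                        if (error != 0)
                                return (error);
                }
                return (0);
        case MOD_UNLOAD:
                mstress_stop = true;
                /* A real module would wait for the threads to exit. */
                pause("mstress", hz);
                return (0);
        default:
                return (EOPNOTSUPP);
        }
}

static moduledata_t mstress_mod = {
        "mstress",
        mstress_modevent,
        NULL
};
DECLARE_MODULE(mstress, mstress_mod, SI_SUB_KLD, SI_ORDER_ANY);

The idea would be to kldload something like that while the geli
benchmark runs and compare lock_delay's share in dtrace/lockstat with
and without it.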