Re: CFT: snmalloc as libc malloc
- In reply to: Mateusz Guzik : "Re: CFT: snmalloc as libc malloc"
Date: Thu, 09 Feb 2023 21:34:38 UTC
The memcpy debacle aside, I can confirm that single-threaded the new
malloc does appear faster in naive tests from will-it-scale:

$ cpuset -l 10,80,82 -- ./malloc2_threads -n -t 2
testcase:malloc/free of 1kB

before:
min:97812514 max:97849385 total:195661899
min:97819901 max:97857131 total:195677032
min:97789741 max:97833562 total:195623303

after:
min:115613762 max:124855002 total:240468764
min:115636562 max:124807148 total:240443710
min:115778776 max:124784220 total:240562996

That said, if anyone is to perform a serious test, the stock memcpy
needs to be used to rule it out as a factor. The one shipped with
snmalloc happens to be faster for certain sizes and that may skew the
evaluation -- that speed increase (and in fact a higher one) is
achievable without snmalloc.

On 2/9/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 2/9/23, David Chisnall <theraven@freebsd.org> wrote:
>> On 9 Feb 2023, at 19:15, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>>
>>> it fails to build for me:
>>>
>>> /usr/src/lib/libc/stdlib/snmalloc/malloc.cc:35:10: fatal error:
>>> 'override/jemalloc_compat.cc' file not found
>>> #include "override/jemalloc_compat.cc"
>>>          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> 1 error generated.
>>> --- malloc.o ---
>>> *** [malloc.o] Error code 1
>>>
>>> make[4]: stopped in /usr/src/lib/libc
>>> /usr/src/lib/libc/stdlib/snmalloc/memcpy.cc:25:10: fatal error:
>>> 'global/memcpy.h' file not found
>>> #include <global/memcpy.h>
>>>          ^~~~~~~~~~~~~~~~~
>>> 1 error generated.
>>> --- memcpy.o ---
>>> *** [memcpy.o] Error code 1
>>
>> This looks as if you haven’t got the submodule? Is there anything in
>> contrib/snmalloc?
>>
>
> indeed, a pilot error
>
>>> anyway, I wanted to say I find the memcpy thing incredibly suspicious.
>>> I found one article in
>>> https://github.com/microsoft/snmalloc/blob/main/docs/security/GuardedMemcpy.md
>>> which benches it and that made it even more suspicious. What did the
>>> benched memcpy look like inside?
>>
>> Perhaps you could share what you are suspicious about? I don’t really
>> know how to respond to something so vague. The document you linked to
>> has the benchmark that we used (though the graphs in it appear to be
>> based on an older version of the memcpy). The PR that added PowerPC
>> tuning has some additional graphs of measurements.
>>
>> If you compile the memcpy file, you can see the assembly. The C++
>> provides a set of building blocks for producing efficient memcpy
>> implementations.
>
> First and foremost, perhaps I should clear up that I have no opinion
> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
> thing.
>
> Inspecting the assembly is what I intended to do, but I ran into the
> compilation problem.
>
> So, as someone who has worked on memcpy previously, I note the variant
> currently implemented in libc is pessimal for sizes > 16 bytes because
> it does not use SIMD. I do have plans to rectify that, but ENOTIME.
>
> I also note CPUs are incredibly picky when it comes to the starting
> point of the routine. The officially recommended alignment of 16 bytes
> is basically a tradeoff between total binary size and performance.
> Should you move the routine around at different 16-byte intervals, you
> will easily find big variations in performance, depending on what
> alignment you ended up with.
>
> I tried to compile the benchmark but got bested by C++. I don't know
> the language and I don't want to fight it.
>
> $ pwd
> /usr/src/contrib/snmalloc/src
> $ c++ -I. test/perf/memcpy/memcpy.cc
> [snip]
> ./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:57:26:
> error: no template named 'is_integral_v' in namespace 'std'; did you
> mean 'is_integral'?
>         static_assert(std::is_integral_v<T>, "Type must be integral");
>                       ~~~~~^~~~~~~~~~~~~
>                            is_integral
>
> and tons of other errors. I did buildworld + installworld.
>
> I'm trying to say that if you end up with different functions
> depending on bounds checking and you only align them to 16 bytes, the
> benchmark is most likely inaccurate if only for this reason.
>
>> The fastest on x86 is roughly:
>>
>> - A jump table for small sizes that does power-of-two-sized small
>> copies (double-word, word, half-word, and byte) to perform the copy.
>
> Last I checked this was not true. The least slow thing to do was to
> branch on a few size ranges and handle them with overlapping stores,
> roughly what memcpy in libc is doing now.
>
> Jump table aside, you get all sizes "spelled out", which is just
> avoidable impact on the icache. While overlapping stores do come with
> some penalty, they are cheaper than the above combined.
>
> I don't have numbers nor bench code handy, but if you insist on
> contesting the above I'll be glad to provide them.
>
>> - A vectorised copy for medium-sized copies using a loop of SSE copies.
>
> Depends on what you mean by medium and which SIMD instructions, but
> yes, they should be used here.
>
>> - rep movsb for large copies.
>
> There are still old CPUs here and there which don't support ERMS. They
> are very negatively impacted by the above and should roll with rep
> movsq instead.
>
> I see the code takes some care to align the target buffer. That's
> good, but not necessary on CPUs with FSRM.
>
> All that said, I have a hard time believing the impact of bounds
> checking on short copies is around 20% or so. The code to do it looks
> super slow (not that I know how to do better).
>
> People like to claim short sizes are inlined by the compiler, but
> that's only true if the size is known at compilation time, which it
> often is not. Then they claim short sizes are rare.
>
> To illustrate why that's bogus, here is clang 15 compiling vfs_cache.c:
>
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@                                        19758
>                1 |@@@@@@@@                                 218420
>                2 |@@                                       67670
>                4 |@@@@                                     103914
>                8 |@@@@@@@@@@@                              301157
>               16 |@@@@@@@@@@                               293812
>               32 |@@                                       57954
>               64 |@                                        23818
>              128 |                                         13344
>              256 |@                                        18507
>              512 |                                         6342
>             1024 |                                         1710
>             2048 |                                         627
>             4096 |                                         398
>             8192 |                                         34
>            16384 |                                         10
>            32768 |                                         6
>            65536 |                                         7
>           131072 |                                         4
>           262144 |                                         1
>           524288 |                                         1
>          1048576 |                                         0
>
> obtained with this bad boy:
>
> dtrace -n 'pid$target::memcpy:entry { @ = quantize(arg2); }' -c "cc
> -target x86_64-unknown-freebsd14.0
> --sysroot=/usr/obj/usr/src/amd64.amd64/tmp
> -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe
> -fno-strict-aliasing -g -nostdinc -I. -I/usr/src/sys
> -I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt
> -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h
> -fno-common -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
> -MD -MF.depend.vfs_cache.o -MTvfs_cache.o
> -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include
> -fdebug-prefix-map=./x86=/usr/src/sys/x86/include
> -fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel
> -mno-red-zone -mno-mmx -mno-sse -msoft-float
> -fno-asynchronous-unwind-tables -ffreestanding -fwrapv
> -fstack-protector -Wall -Wnested-externs -Wstrict-prototypes
> -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef
> -Wno-pointer-sign -D__printf__=__freebsd_kprintf__
> -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas
> -Wno-error=tautological-compare -Wno-error=empty-body
> -Wno-error=parentheses-equality -Wno-error=unused-function
> -Wno-error=pointer-sign -Wno-error=shift-negative-value
> -Wno-address-of-packed-member -Wno-error=array-parameter
> -Wno-error=deprecated-non-prototype -Wno-error=strict-prototypes
> -Wno-error=unused-but-set-variable -Wno-format-zero-length -mno-aes
> -mno-avx -std=iso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c"
>
> tl;dr if this goes in,
> the fate of the memcpy thing will need to be handled separately
>
> --
> Mateusz Guzik <mjguzik gmail.com>

--
Mateusz Guzik <mjguzik gmail.com>