Re: CFT: snmalloc as libc malloc
- In reply to: Mateusz Guzik : "Re: CFT: snmalloc as libc malloc"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Mon, 13 Feb 2023 12:48:03 UTC
Thanks, that's good feedback. I didn't separate these originally because I didn't want a huge testing matrix, but I've now done so. I have split snmalloc into two options in the latest commit in that branch: - WITH_SNMALLOC (defaults on) enables snmalloc - WITH_SNMALLOC_MEMCPY (defaults on if snmalloc is enabled) - WITH_MALLOC_PRODUCTION (I believe this is off by default for -CURRENT, on for other branches) sets both snmalloc and jemalloc into the recommended mode for release builds It's good to compare snmalloc vs jemalloc in WITH_MALLOC_PRODUCTION modes. I'd be interested in seeing snmalloc *without* the production knobs against jemalloc *with*, because if we have a bit of performance headroom we could use it for better security features. I'd also like to see comparisons of WITH_SNMALLOC_MEMCPY with different settings of the LIBC_BOUNDS_CHECKS environment variable for any workloads that people care about. David On 09/02/2023 22:07, Mateusz Guzik wrote: > I'm going to write a proper reply some time this week. > > For now what matters is that this CFT mixes up different things: > - snmalloc > - memcpy as shipped with snmalloc > - bound checks for said memcpy > > Seeing as stock memcpy is deficient and the one provided by snmalloc > is not a factor when deciding whether to switch to snmalloc, imo this > should be reposted. I also note that at least on head stock jemalloc > ships with debug, which again people run with by default unless they > explicitly disabled it. > > So a better CFT would include instructions: > - how to make sure jemalloc does not have debug > - how to build snmalloc without affecting memcpy (in fact, it would be > best if the branch for now shipped *without* custom memcpy so that one > does not have to check what hapepned) > > frankly i don't know how to check the jemalloc thing, but I can spend > some time to find out if that helps > > On 2/9/23, David Chisnall <theraven@freebsd.org> wrote: >> On 9 Feb 2023, at 20:53, Mateusz Guzik <mjguzik@gmail.com> wrote: >>> >>> First and foremost, perhaps I should clear up that I have no opinion >>> on snmalloc or it replacing jemalloc. I'm here only about the memcpy >>> thing. >>> >>> Inspecting assembly is what I intended to do, but got the compilation >>> problem. >> >> If you just want to look at the memcpy, you might do better using the >> upstream (CMake) build system, which builds a shim library that you can >> disassemble and look at. >> >>> So, as someone who worked on memcpy previously, I note the variant >>> currently implemented in libc is pessimal for sizes > 16 bytes because >>> it does not use SIMD. I do have plans to rectify that, but ENOTIME. >> >> That’s one of the benefits of the C++ implementation. We were able to try a >> few dozen variations in a morning. Writing a single one of those in >> assembly would take a similar amount of time. >> >> We were inspired by the automemcpy paper, which demonstrated that you can >> write memcpy implementations tuned to specific workloads in higher-level >> languages and get better performance. >> >>> I also note CPUs are incredibly picky when it comes to starting point >>> of the routine. The officialy recommended alignment of 16 bytes is >>> basically a tradeoff between total binary size and performance. Should >>> you move the routine at different 16 bytes intervals you will easily >>> find big variation in performance, depending on how big of an >>> alignment you ended up with. >> >> Yes, that’s what our test does. It does a large number of different copies >> with different sizes and alignments. >> >>> I tried to compile the benchmark but got bested by c++. I don't know >>> the lang and I don't want to fight it. >>> >>> $ pwd >>> /usr/src/contrib/snmalloc/src >>> $ c++ -I. test/perf/memcpy/memcpy.cc >>> [snip] >>> ./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:57:26: >>> error: no template named 'is_integral_v' in namespace 'std'; did you >>> mean 'is_integral'? >>> static_assert(std::is_integral_v<T>, "Type must be integral"); >>> ~~~~~^~~~~~~~~~~~~ >>> is_integral >> >> I think the only thing missing is -std=c++17. >> >>> I'm trying to say that if you end up with different funcs depending on >>> bounds checking and you only align them to 16 bytes, the benchmark is >>> most likely inaccurate if only for this reason. >> >> I’m not sure I follow this, sorry. >> >>>> The fastest on x86 is roughly: >>>> >>>> - A jump table of power for small sizes that do power-of-two-sized small >>>> copies (double-word, word, half-word, and byte) to perform the copy. >>> >>> Last I checked this was not true. The last slow thing to do was to >>> branch on few sizes and handle them with overlapping stores. Roughly >>> what memcpy in libc is doing now. >>> >>> Jump table aside, you got all sizes "spelled out", that is just >>> avoidable impact on icache. While overlapping stores do come with some >>> penalty, it is cheaper than the above combined. >> >> They’re not all spelled out, because a lot of them are subsets of the >> others. The compiler does a *very* good job of optimising this. The >> generated assembly was a lot more complex than anything I could write by >> hand. >> >>> I don't have numbers nor bench code handy, but if you insist on >>> contesting the above I'll be glad to provide them. >>> >>>> - A vectorised copy for medium-sized copies using a loop of SSE copies. >>> >>> Depends on what you mean by medium and which simd instructions, but >>> yes, they should be used here. >> >> This could probably do with more tuning, but all of this is configurable >> with a couple of template parameters. If it’s useful to have more variants >> for different microarchitectures then it’s a trivial tweak to generate >> them. >> >>>> - rep movsb for large copies. >>> >>> There are still old cpus here and there which don't support ERMS. They >>> are very negatively impacted the above and should roll with rep stosq >>> instead. >> >> We tested on some microarchitectures without FRMS but didn’t compare with >> rep stosq as an alternative. The rep movsb is inline assembly >> (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162bf1ee81196/src/snmalloc/global/memcpy.h#L351) >> so it would be quite easy to compare. Very happy to take patches that make >> things faster! >> >> PowerPC doesn’t have an equivalent of rep movsb, so the PowerPC version >> doesn’t have an equivalent codepath. >> >> Each tuned version is a dozen lines of code (plus a lot of comments to >> explain *why* it does things a particular way), so it’s very easy to add >> different versions with different tuning. >> >>> I see the code takes some care to align the target buffer. That's >>> good, but not necessary on CPUs with FSRM. >> >> It doesn’t hurt though, on the sizes where it’s actually beneficial to use >> the rep sequence the cost of alignment is negligible. >> >>> All that said, I have hard time believing the impact of bounds >>> checking on short copies is around 20% or so. The code to do it looks >>> super slow (not that I know to do better). >> >> The last time I looked at the disassembly, the code for the bounds check >> touched three cache lines and has two branches. It fits well in a >> superscalar pipeline so adds a handful of cycles. This is basically in the >> noise for large copies but is double-digit percentage overhead for small >> ones. >> >>> People like to claim short sizes are inlined by the compiler, but >>> that's only true if the size is known at compilation time, which it >>> often is not. Then they claim they are rare. >> >> I agree. >> >>> To illustrate why that's bogus, here is clang 15 compiling vfs_cache.c: >>> value ------------- Distribution ------------- count >>> -1 | 0 >>> 0 |@ 19758 >>> 1 |@@@@@@@@ 218420 >>> 2 |@@ 67670 >>> 4 |@@@@ 103914 >>> 8 |@@@@@@@@@@@ 301157 >>> 16 |@@@@@@@@@@ 293812 >>> 32 |@@ 57954 >>> 64 |@ 23818 >>> 128 | 13344 >>> 256 |@ 18507 >>> 512 | 6342 >>> 1024 | 1710 >>> 2048 | 627 >>> 4096 | 398 >>> 8192 | 34 >>> 16384 | 10 >>> 32768 | 6 >>> 65536 | 7 >>> 131072 | 4 >>> 262144 | 1 >>> 524288 | 1 >>> 1048576 | 0 >>> >>> obtained with this bad boy: >>> dtrace -n 'pid$target::memcpy:entry { @ = quantize(arg2); }' -c "cc >>> -target x86_64-unknown-freebsd14.0 >>> --sysroot=/usr/obj/usr/src/amd64.amd64/tmp >>> -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe >>> -fno-strict-aliasing -g -nostdinc -I. -I/usr/src/sys >>> -I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt >>> -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h >>> -fno-common -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer >>> -MD -MF.depend.vfs_cache.o -MTvfs_cache.o >>> -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include >>> -fdebug-prefix-map=./x86=/usr/src/sys/x86/include >>> -fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel >>> -mno-red-zone -mno-mmx -mno-sse -msoft-float >>> -fno-asynchronous-unwind-tables -ffreestanding -fwrapv >>> -fstack-protector -Wall -Wnested-externs -Wstrict-prototypes >>> -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef >>> -Wno-pointer-sign -D__printf__=__freebsd_kprintf__ >>> -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas >>> -Wno-error=tautological-compare -Wno-error=empty-body >>> -Wno-error=parentheses-equality -Wno-error=unused-function >>> -Wno-error=pointer-sign -Wno-error=shift-negative-value >>> -Wno-address-of-packed-member -Wno-error=array-parameter >>> -Wno-error=deprecated-non-prototype -Wno-error=strict-prototypes >>> -Wno-error=unused-but-set-variable -Wno-format-zero-length -mno-aes >>> -mno-avx -std=iso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c” >> >> That’s a really helpful datapoint, thanks! That distribution matches roughly >> what we’ve seen - the number of zero-byte memcpy calls surprised me the >> first time I saw it, but the automemcpy paper reported something similar. >> >> If you can also extract the alignments from these then it would be fairly >> easy to modify the benchmark program to use those distributions. >> >> My original version extended the FreeBSD assembly memcpy with a call to a >> function that did the bounds checks, but that had *much* worse performance. >> We could insert assembly to do the bounds checks, but that’s hard to update >> if we change the metadata layout in malloc. >> >> We could modify the ifunc resolvers to use the current memcpy if you don’t >> care about bounds checks. I did a tiny amount of testing (mostly clang and >> lld) in this configuration and it was slower than using the snmalloc memcpy, >> but I didn’t do enough benchmarking to be confident of that. Hence the >> CFT. >> >> David >> >> > >