Re: CFT: snmalloc as libc malloc
- Reply: David Chisnall : "Re: CFT: snmalloc as libc malloc"
- In reply to: David Chisnall : "Re: CFT: snmalloc as libc malloc"
Date: Thu, 09 Feb 2023 22:07:45 UTC
I'm going to write a proper reply some time this week. For now what matters is that this CFT mixes up different things:
- snmalloc
- memcpy as shipped with snmalloc
- bounds checks for said memcpy

Seeing as stock memcpy being deficient compared to the one provided by snmalloc is not a factor when deciding whether to switch to snmalloc, imo this should be reposted.

I also note that at least on head stock jemalloc ships with debug enabled, which again people run with by default unless they explicitly disabled it. So a better CFT would include instructions:
- how to make sure jemalloc does not have debug enabled
- how to build snmalloc without affecting memcpy (in fact, it would be best if the branch for now shipped *without* the custom memcpy so that one does not have to check what happened)

Frankly I don't know how to check the jemalloc thing, but I can spend some time to find out if that helps.

On 2/9/23, David Chisnall <theraven@freebsd.org> wrote:
> On 9 Feb 2023, at 20:53, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> First and foremost, perhaps I should clear up that I have no opinion
>> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
>> thing.
>>
>> Inspecting assembly is what I intended to do, but got the compilation
>> problem.
>
> If you just want to look at the memcpy, you might do better using the
> upstream (CMake) build system, which builds a shim library that you can
> disassemble and look at.
>
>> So, as someone who worked on memcpy previously, I note the variant
>> currently implemented in libc is pessimal for sizes > 16 bytes because
>> it does not use SIMD. I do have plans to rectify that, but ENOTIME.
>
> That’s one of the benefits of the C++ implementation. We were able to try a
> few dozen variations in a morning. Writing a single one of those in
> assembly would take a similar amount of time.
>
> We were inspired by the automemcpy paper, which demonstrated that you can
> write memcpy implementations tuned to specific workloads in higher-level
> languages and get better performance.
>
>> I also note CPUs are incredibly picky when it comes to the starting point
>> of the routine. The officially recommended alignment of 16 bytes is
>> basically a tradeoff between total binary size and performance. Should
>> you move the routine to different 16-byte intervals you will easily
>> find big variation in performance, depending on how big of an
>> alignment you ended up with.
>
> Yes, that’s what our test does. It does a large number of different copies
> with different sizes and alignments.
>
>> I tried to compile the benchmark but got bested by c++. I don't know
>> the lang and I don't want to fight it.
>>
>> $ pwd
>> /usr/src/contrib/snmalloc/src
>> $ c++ -I. test/perf/memcpy/memcpy.cc
>> [snip]
>> ./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:57:26:
>> error: no template named 'is_integral_v' in namespace 'std'; did you
>> mean 'is_integral'?
>>     static_assert(std::is_integral_v<T>, "Type must be integral");
>>                   ~~~~~^~~~~~~~~~~~~
>>                        is_integral
>
> I think the only thing missing is -std=c++17.
>
>> I'm trying to say that if you end up with different funcs depending on
>> bounds checking and you only align them to 16 bytes, the benchmark is
>> most likely inaccurate if only for this reason.
>
> I’m not sure I follow this, sorry.
>
>>> The fastest on x86 is roughly:
>>>
>>> - A jump table for small sizes that dispatches to power-of-two-sized small
>>> copies (double-word, word, half-word, and byte) to perform the copy.
>>
>> Last I checked this was not true. The least slow thing to do was to
>> branch on a few sizes and handle them with overlapping stores. Roughly
>> what memcpy in libc is doing now.
>>
>> Jump table aside, you got all sizes "spelled out", that is just
>> avoidable impact on icache.
>> While overlapping stores do come with some
>> penalty, it is cheaper than the above combined.
>
> They’re not all spelled out, because a lot of them are subsets of the
> others. The compiler does a *very* good job of optimising this. The
> generated assembly was a lot more complex than anything I could write by
> hand.
>
>> I don't have numbers nor bench code handy, but if you insist on
>> contesting the above I'll be glad to provide them.
>
>>> - A vectorised copy for medium-sized copies using a loop of SSE copies.
>>
>> Depends on what you mean by medium and which simd instructions, but
>> yes, they should be used here.
>
> This could probably do with more tuning, but all of this is configurable
> with a couple of template parameters. If it’s useful to have more variants
> for different microarchitectures then it’s a trivial tweak to generate
> them.
>
>>> - rep movsb for large copies.
>>
>> There are still old cpus here and there which don't support ERMS. They
>> are very negatively impacted by the above and should roll with rep movsq
>> instead.
>
> We tested on some microarchitectures without FSRM but didn’t compare with
> rep movsq as an alternative. The rep movsb is inline assembly
> (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162bf1ee81196/src/snmalloc/global/memcpy.h#L351)
> so it would be quite easy to compare. Very happy to take patches that make
> things faster!
>
> PowerPC doesn’t have an equivalent of rep movsb, so the PowerPC version
> doesn’t have an equivalent codepath.
>
> Each tuned version is a dozen lines of code (plus a lot of comments to
> explain *why* it does things a particular way), so it’s very easy to add
> different versions with different tuning.
>
>> I see the code takes some care to align the target buffer. That's
>> good, but not necessary on CPUs with FSRM.
>
> It doesn’t hurt though; on the sizes where it’s actually beneficial to use
> the rep sequence the cost of alignment is negligible.
>
>> All that said, I have a hard time believing the impact of bounds
>> checking on short copies is around 20% or so. The code to do it looks
>> super slow (not that I know how to do better).
>
> The last time I looked at the disassembly, the code for the bounds check
> touched three cache lines and had two branches. It fits well in a
> superscalar pipeline so adds a handful of cycles. This is basically in the
> noise for large copies but is double-digit percentage overhead for small
> ones.
>
>> People like to claim short sizes are inlined by the compiler, but
>> that's only true if the size is known at compilation time, which it
>> often is not. Then they claim they are rare.
>
> I agree.
>
>> To illustrate why that's bogus, here is clang 15 compiling vfs_cache.c:
>>
>>     value  ------------- Distribution ------------- count
>>        -1 |                                         0
>>         0 |@                                        19758
>>         1 |@@@@@@@@                                 218420
>>         2 |@@                                       67670
>>         4 |@@@@                                     103914
>>         8 |@@@@@@@@@@@                              301157
>>        16 |@@@@@@@@@@                               293812
>>        32 |@@                                       57954
>>        64 |@                                        23818
>>       128 |                                         13344
>>       256 |@                                        18507
>>       512 |                                         6342
>>      1024 |                                         1710
>>      2048 |                                         627
>>      4096 |                                         398
>>      8192 |                                         34
>>     16384 |                                         10
>>     32768 |                                         6
>>     65536 |                                         7
>>    131072 |                                         4
>>    262144 |                                         1
>>    524288 |                                         1
>>   1048576 |                                         0
>>
>> obtained with this bad boy:
>> dtrace -n 'pid$target::memcpy:entry { @ = quantize(arg2); }' -c "cc
>>   -target x86_64-unknown-freebsd14.0
>>   --sysroot=/usr/obj/usr/src/amd64.amd64/tmp
>>   -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe
>>   -fno-strict-aliasing -g -nostdinc -I. -I/usr/src/sys
>>   -I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt
>>   -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h
>>   -fno-common -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
>>   -MD -MF.depend.vfs_cache.o -MTvfs_cache.o
>>   -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include
>>   -fdebug-prefix-map=./x86=/usr/src/sys/x86/include
>>   -fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel
>>   -mno-red-zone -mno-mmx -mno-sse -msoft-float
>>   -fno-asynchronous-unwind-tables -ffreestanding -fwrapv
>>   -fstack-protector -Wall -Wnested-externs -Wstrict-prototypes
>>   -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef
>>   -Wno-pointer-sign -D__printf__=__freebsd_kprintf__
>>   -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas
>>   -Wno-error=tautological-compare -Wno-error=empty-body
>>   -Wno-error=parentheses-equality -Wno-error=unused-function
>>   -Wno-error=pointer-sign -Wno-error=shift-negative-value
>>   -Wno-address-of-packed-member -Wno-error=array-parameter
>>   -Wno-error=deprecated-non-prototype -Wno-error=strict-prototypes
>>   -Wno-error=unused-but-set-variable -Wno-format-zero-length -mno-aes
>>   -mno-avx -std=iso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c"
>
> That’s a really helpful datapoint, thanks! That distribution matches roughly
> what we’ve seen - the number of zero-byte memcpy calls surprised me the
> first time I saw it, but the automemcpy paper reported something similar.
>
> If you can also extract the alignments from these then it would be fairly
> easy to modify the benchmark program to use those distributions.
>
> My original version extended the FreeBSD assembly memcpy with a call to a
> function that did the bounds checks, but that had *much* worse performance.
> We could insert assembly to do the bounds checks, but that’s hard to update
> if we change the metadata layout in malloc.
>
> We could modify the ifunc resolvers to use the current memcpy if you don’t
> care about bounds checks. I did a tiny amount of testing (mostly clang and
> lld) in this configuration and it was slower than using the snmalloc memcpy,
> but I didn’t do enough benchmarking to be confident of that. Hence the
> CFT.
>
> David

-- 
Mateusz Guzik <mjguzik gmail.com>