Re: CFT: snmalloc as libc malloc
- Reply: Mateusz Guzik : "Re: CFT: snmalloc as libc malloc"
- Reply: David Chisnall : "Re: CFT: snmalloc as libc malloc"
- Reply: Konstantin Belousov : "Re: CFT: snmalloc as libc malloc"
- In reply to: David Chisnall : "Re: CFT: snmalloc as libc malloc"
Date: Thu, 09 Feb 2023 20:53:34 UTC
On 2/9/23, David Chisnall <theraven@freebsd.org> wrote:
> On 9 Feb 2023, at 19:15, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> it fails to build for me:
>>
>> /usr/src/lib/libc/stdlib/snmalloc/malloc.cc:35:10: fatal error:
>> 'override/jemalloc_compat.cc' file not found
>> #include "override/jemalloc_compat.cc"
>>          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> 1 error generated.
>> --- malloc.o ---
>> *** [malloc.o] Error code 1
>>
>> make[4]: stopped in /usr/src/lib/libc
>> /usr/src/lib/libc/stdlib/snmalloc/memcpy.cc:25:10: fatal error:
>> 'global/memcpy.h' file not found
>> #include <global/memcpy.h>
>>          ^~~~~~~~~~~~~~~~~
>> 1 error generated.
>> --- memcpy.o ---
>> *** [memcpy.o] Error code 1
>
> This looks as if you haven’t got the submodule? Is there anything in
> contrib/snmalloc?
>

Indeed, a pilot error.

>> anyway, I wanted to say I find the memcpy thing incredibly suspicious.
>> I found one article in
>> https://github.com/microsoft/snmalloc/blob/main/docs/security/GuardedMemcpy.md
>> which benches it and that made it even more suspicious. How did the
>> benched memcpy look like inside?
>
> Perhaps you could share what you are suspicious about? I don’t really know
> how to respond to something so vague. The document you linked to has the
> benchmark that we used (though the graphs in it appear to be based on an
> older version of the memcpy). The PR that added PowerPC tuning has some
> additional graphs of measurements.
>
> If you compile the memcpy file, you can see the assembly. The C++ provides
> a set of building blocks for producing efficient memcpy implementations.

First and foremost, perhaps I should clear up that I have no opinion on
snmalloc or on it replacing jemalloc. I'm here only about the memcpy part.

Inspecting the assembly is what I intended to do, but I ran into the
compilation problem above.

So, as someone who has worked on memcpy before, I note that the variant
currently implemented in libc is pessimal for sizes > 16 bytes because it
does not use SIMD. I do have plans to rectify that, but ENOTIME.

I also note that CPUs are incredibly picky about the starting alignment of
the routine. The officially recommended alignment of 16 bytes is basically
a tradeoff between total binary size and performance. If you move the
routine around at different 16-byte offsets, you will easily find big
variation in performance depending on what alignment it ends up with.

I tried to compile the benchmark but got bested by C++. I don't know the
language and I don't want to fight it.

$ pwd
/usr/src/contrib/snmalloc/src
$ c++ -I. test/perf/memcpy/memcpy.cc
[snip]
./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:57:26: error: no template named 'is_integral_v' in namespace 'std'; did you mean 'is_integral'?
    static_assert(std::is_integral_v<T>, "Type must be integral");
                  ~~~~~^~~~~~~~~~~~~
                       is_integral

and tons of other errors. I did buildworld + installworld.

What I'm trying to say is that if you end up with different functions
depending on whether bounds checking is enabled, and you only align them
to 16 bytes, the benchmark is most likely inaccurate for this reason alone.

> The fastest on x86 is roughly:
>
> - A jump table of power for small sizes that do power-of-two-sized small
> copies (double-word, word, half-word, and byte) to perform the copy.

Last I checked this was not true. The least slow thing to do was to branch
on a few sizes and handle them with overlapping stores, which is roughly
what memcpy in libc is doing now (see the sketch below). Jump table aside,
spelling out every size like that is just avoidable icache impact.
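To show what I mean, here is roughly the shape of the small-size path,
written in C rather than assembly. This is only a sketch of the technique,
not the actual libc routine, and the name copy_small is made up:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch: copy 0..16 bytes by branching on a few size ranges and using
 * potentially overlapping loads and stores, instead of a per-size jump
 * table.  The fixed-size memcpy calls are just the portable way to spell
 * an unaligned load/store; compilers lower each one to a single mov.
 */
static void
copy_small(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (len >= 8) {
		uint64_t lo, hi;

		memcpy(&lo, s, 8);
		memcpy(&hi, s + len - 8, 8);	/* overlaps lo when len < 16 */
		memcpy(d, &lo, 8);
		memcpy(d + len - 8, &hi, 8);	/* overlapping store */
	} else if (len >= 4) {
		uint32_t lo, hi;

		memcpy(&lo, s, 4);
		memcpy(&hi, s + len - 4, 4);
		memcpy(d, &lo, 4);
		memcpy(d + len - 4, &hi, 4);
	} else if (len >= 1) {
		/* 1..3 bytes: at most 3 byte stores, some redundant. */
		d[0] = s[0];
		d[len / 2] = s[len / 2];
		d[len - 1] = s[len - 1];
	}
}

The point is that a handful of branches covers the entire 0..16 range with
no table and no per-size code.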
While overlapping stores do come with some penalty, they are still cheaper
than the jump table approach combined with its icache cost. I don't have
numbers or bench code handy, but if you insist on contesting this I'll be
glad to provide them.

> - A vectorised copy for medium-sized copies using a loop of SSE copies.

Depends on what you mean by medium and which SIMD instructions, but yes,
they should be used here.

> - rep movsb for large copies.

There are still old CPUs here and there which don't support ERMS. They are
very negatively impacted by rep movsb and should roll with rep movsq
instead. I see the code takes some care to align the target buffer. That's
good, but not necessary on CPUs with FSRM.

All that said, I have a hard time believing the impact of bounds checking
on short copies is around 20% or so. The code to do it looks super slow
(not that I know how to do better). People like to claim short sizes are
inlined by the compiler, but that's only true if the size is known at
compilation time, which it often is not. Then they claim such sizes are
rare. To illustrate why that's bogus, here is clang 15 compiling
vfs_cache.c:

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@                                        19758
               1 |@@@@@@@@                                 218420
               2 |@@                                       67670
               4 |@@@@                                     103914
               8 |@@@@@@@@@@@                              301157
              16 |@@@@@@@@@@                               293812
              32 |@@                                       57954
              64 |@                                        23818
             128 |                                         13344
             256 |@                                        18507
             512 |                                         6342
            1024 |                                         1710
            2048 |                                         627
            4096 |                                         398
            8192 |                                         34
           16384 |                                         10
           32768 |                                         6
           65536 |                                         7
          131072 |                                         4
          262144 |                                         1
          524288 |                                         1
         1048576 |                                         0

obtained with this bad boy:

dtrace -n 'pid$target::memcpy:entry { @ = quantize(arg2); }' -c "cc -target x86_64-unknown-freebsd14.0 --sysroot=/usr/obj/usr/src/amd64.amd64/tmp -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe -fno-strict-aliasing -g -nostdinc -I. -I/usr/src/sys -I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h -fno-common -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -MD -MF.depend.vfs_cache.o -MTvfs_cache.o -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include -fdebug-prefix-map=./x86=/usr/src/sys/x86/include -fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel -mno-red-zone -mno-mmx -mno-sse -msoft-float -fno-asynchronous-unwind-tables -ffreestanding -fwrapv -fstack-protector -Wall -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef -Wno-pointer-sign -D__printf__=__freebsd_kprintf__ -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas -Wno-error=tautological-compare -Wno-error=empty-body -Wno-error=parentheses-equality -Wno-error=unused-function -Wno-error=pointer-sign -Wno-error=shift-negative-value -Wno-address-of-packed-member -Wno-error=array-parameter -Wno-error=deprecated-non-prototype -Wno-error=strict-prototypes -Wno-error=unused-but-set-variable -Wno-format-zero-length -mno-aes -mno-avx -std=iso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c"

tl;dr: if this goes in, the fate of the memcpy part will need to be handled
separately.

--
Mateusz Guzik <mjguzik gmail.com>