Re: CFT: snmalloc as libc malloc

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Thu, 09 Feb 2023 22:07:45 UTC
I'm going to write a proper reply some time this week.

For now what matters is that this CFT mixes up different things:
- snmalloc
- memcpy as shipped with snmalloc
- bound checks for said memcpy

Seeing as stock memcpy is deficient and the one provided by snmalloc
is not a factor when deciding whether to switch to snmalloc, imo this
should be reposted. I also note that at least on head stock jemalloc
ships with debug, which again people run with by default unless they
explicitly disabled it.

So a better CFT would include instructions:
- how to make sure jemalloc does not have debug
- how to build snmalloc without affecting memcpy (in fact, it would be
best if the branch for now shipped *without* custom memcpy so that one
does not have to check what hapepned)

frankly i don't know how to check the jemalloc thing, but I can spend
some time to find out if that helps

On 2/9/23, David Chisnall <theraven@freebsd.org> wrote:
> On 9 Feb 2023, at 20:53, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> First and foremost, perhaps I should clear up that I have no opinion
>> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
>> thing.
>>
>> Inspecting assembly is what I intended to do, but got the compilation
>> problem.
>
> If you just want to look at the memcpy, you might do better using the
> upstream (CMake) build system, which builds a shim library that you can
> disassemble and look at.
>
>> So, as someone who worked on memcpy previously, I note the variant
>> currently implemented in libc is pessimal for sizes > 16 bytes because
>> it does not use SIMD. I do have plans to rectify that, but ENOTIME.
>
> That’s one of the benefits of the C++ implementation.  We were able to try a
> few dozen variations in a morning.  Writing a single one of those in
> assembly would take a similar amount of time.
>
> We were inspired by the automemcpy paper, which demonstrated that you can
> write memcpy implementations tuned to specific workloads in higher-level
> languages and get better performance.
>
>> I also note CPUs are incredibly picky when it comes to starting point
>> of the routine. The officialy recommended alignment of 16 bytes is
>> basically a tradeoff between total binary size and performance. Should
>> you move the routine at different 16 bytes intervals you will easily
>> find big variation in performance, depending on how big of an
>> alignment you ended up with.
>
> Yes, that’s what our test does.  It does a large number of different copies
> with different sizes and alignments.
>
>> I tried to compile the benchmark but got bested by c++. I don't know
>> the lang and I don't want to fight it.
>>
>> $ pwd
>> /usr/src/contrib/snmalloc/src
>> $ c++ -I. test/perf/memcpy/memcpy.cc
>> [snip]
>> ./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:57:26:
>> error: no template named 'is_integral_v' in namespace 'std'; did you
>> mean 'is_integral'?
>>      static_assert(std::is_integral_v<T>, "Type must be integral");
>>                    ~~~~~^~~~~~~~~~~~~
>>                         is_integral
>
> I think the only thing missing is -std=c++17.
>
>> I'm trying to say that if you end up with different funcs depending on
>> bounds checking and you only align them to 16 bytes, the benchmark is
>> most likely inaccurate if only for this reason.
>
> I’m not sure I follow this, sorry.
>
>>> The fastest on x86 is roughly:
>>>
>>> - A jump table of power for small sizes that do power-of-two-sized small
>>> copies (double-word, word, half-word, and byte) to perform the copy.
>>
>> Last I checked this was not true. The last slow thing to do was to
>> branch on few sizes and handle them with overlapping stores. Roughly
>> what memcpy in libc is doing now.
>>
>> Jump table aside, you got all sizes "spelled out", that is just
>> avoidable impact on icache. While overlapping stores do come with some
>> penalty, it is cheaper than the above combined.
>
> They’re not all spelled out, because a lot of them are subsets of the
> others.  The compiler does a *very* good job of optimising this.  The
> generated assembly was a lot more complex than anything I could write by
> hand.
>
>> I don't have numbers nor bench code handy, but if you insist on
>> contesting the above I'll be glad to provide them.
>>
>>> - A vectorised copy for medium-sized copies using a loop of SSE copies.
>>
>> Depends on what you mean by medium and which simd instructions, but
>> yes, they should be used here.
>
> This could probably do with more tuning, but all of this is configurable
> with a couple of template parameters.  If it’s useful to have more variants
> for different microarchitectures then it’s a trivial tweak to generate
> them.
>
>>> - rep movsb for large copies.
>>
>> There are still old cpus here and there which don't support ERMS. They
>> are very negatively impacted the above and should roll with rep stosq
>> instead.
>
> We tested on some microarchitectures without FRMS but didn’t compare with
> rep stosq as an alternative.  The rep movsb is inline assembly
> (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162bf1ee81196/src/snmalloc/global/memcpy.h#L351)
> so it would be quite easy to compare.  Very happy to take patches that make
> things faster!
>
> PowerPC doesn’t have an equivalent of rep movsb, so the PowerPC version
> doesn’t have an equivalent codepath.
>
> Each tuned version is a dozen lines of code (plus a lot of comments to
> explain *why* it does things a particular way), so it’s very easy to add
> different versions with different tuning.
>
>> I see the code takes some care to align the target buffer. That's
>> good, but not necessary on CPUs with FSRM.
>
> It doesn’t hurt though, on the sizes where it’s actually beneficial to use
> the rep sequence the cost of alignment is negligible.
>
>> All that said, I have hard time believing the impact of bounds
>> checking on short copies is around 20% or so. The code to do it looks
>> super slow (not that I know to do better).
>
> The last time I looked at the disassembly, the code for the bounds check
> touched three cache lines and has two branches.  It fits well in a
> superscalar pipeline so adds a handful of cycles.  This is basically in the
> noise for large copies but is double-digit percentage overhead for small
> ones.
>
>> People like to claim short sizes are inlined by the compiler, but
>> that's only true if the size is known at compilation time, which it
>> often is not. Then they claim they are rare.
>
> I agree.
>
>> To illustrate why that's bogus, here is clang 15 compiling vfs_cache.c:
>>           value  ------------- Distribution ------------- count
>>              -1 |                                         0
>>               0 |@                                        19758
>>               1 |@@@@@@@@                                 218420
>>               2 |@@                                       67670
>>               4 |@@@@                                     103914
>>               8 |@@@@@@@@@@@                              301157
>>              16 |@@@@@@@@@@                               293812
>>              32 |@@                                       57954
>>              64 |@                                        23818
>>             128 |                                         13344
>>             256 |@                                        18507
>>             512 |                                         6342
>>            1024 |                                         1710
>>            2048 |                                         627
>>            4096 |                                         398
>>            8192 |                                         34
>>           16384 |                                         10
>>           32768 |                                         6
>>           65536 |                                         7
>>          131072 |                                         4
>>          262144 |                                         1
>>          524288 |                                         1
>>         1048576 |                                         0
>>
>> obtained with this bad boy:
>> dtrace -n 'pid$target::memcpy:entry { @ = quantize(arg2); }' -c "cc
>> -target x86_64-unknown-freebsd14.0
>> --sysroot=/usr/obj/usr/src/amd64.amd64/tmp
>> -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe
>> -fno-strict-aliasing  -g -nostdinc  -I. -I/usr/src/sys
>> -I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt
>> -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h
>> -fno-common    -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
>> -MD  -MF.depend.vfs_cache.o -MTvfs_cache.o
>> -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include
>> -fdebug-prefix-map=./x86=/usr/src/sys/x86/include
>> -fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel
>> -mno-red-zone -mno-mmx -mno-sse -msoft-float
>> -fno-asynchronous-unwind-tables -ffreestanding -fwrapv
>> -fstack-protector -Wall -Wnested-externs -Wstrict-prototypes
>> -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef
>> -Wno-pointer-sign -D__printf__=__freebsd_kprintf__
>> -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas
>> -Wno-error=tautological-compare -Wno-error=empty-body
>> -Wno-error=parentheses-equality -Wno-error=unused-function
>> -Wno-error=pointer-sign -Wno-error=shift-negative-value
>> -Wno-address-of-packed-member -Wno-error=array-parameter
>> -Wno-error=deprecated-non-prototype -Wno-error=strict-prototypes
>> -Wno-error=unused-but-set-variable -Wno-format-zero-length   -mno-aes
>> -mno-avx  -std=iso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c”
>
> That’s a really helpful datapoint, thanks! That distribution matches roughly
> what we’ve seen - the number of zero-byte memcpy calls surprised me the
> first time I saw it, but the automemcpy paper reported something similar.
>
> If you can also extract the alignments from these then it would be fairly
> easy to modify the benchmark program to use those distributions.
>
> My original version extended the FreeBSD assembly memcpy with a call to a
> function that did the bounds checks, but that had *much* worse performance.
> We could insert assembly to do the bounds checks, but that’s hard to update
> if we change the metadata layout in malloc.
>
> We could modify the ifunc resolvers to use the current memcpy if you don’t
> care about bounds checks.  I did a tiny amount of testing (mostly clang and
> lld) in this configuration and it was slower than using the snmalloc memcpy,
> but I didn’t do enough benchmarking to be confident of that.  Hence the
> CFT.
>
> David
>
>


-- 
Mateusz Guzik <mjguzik gmail.com>