From nobody Thu Feb 09 22:07:45 2023
From: Mateusz Guzik
To: David Chisnall
Cc: freebsd-hackers
Date: Thu, 9 Feb 2023 23:07:45 +0100
Subject: Re: CFT: snmalloc as libc malloc
In-Reply-To: <2836EF0C-8B4B-441C-927B-3796457A3865@FreeBSD.org>
References: <2f3dcda0-5135-290a-2dff-683b2e9fe271@FreeBSD.org>
 <2836EF0C-8B4B-441C-927B-3796457A3865@FreeBSD.org>
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

I'm going to write a proper reply some time this week. For now what
matters is that this CFT mixes up different things:
- snmalloc
- memcpy as shipped with snmalloc
- bounds checks for said memcpy

Seeing as the stock memcpy is deficient, and the one provided by
snmalloc should not be a factor when deciding whether to switch to
snmalloc, imo this should be reposted.

I also note that, at least on head, stock jemalloc ships with debug
enabled, which again people run with by default unless they explicitly
disable it.

So a better CFT would include instructions:
- how to make sure jemalloc does not have debug
- how to build snmalloc without affecting memcpy

(in fact, it would be best if the branch for now shipped *without* the
custom memcpy, so that one does not have to check what happened)

Frankly, I don't know how to check the jemalloc thing, but I can spend
some time to find out if that helps.

On 2/9/23, David Chisnall wrote:
> On 9 Feb 2023, at 20:53, Mateusz Guzik wrote:
>>
>> First and foremost, perhaps I should clear up that I have no opinion
>> on snmalloc or it replacing jemalloc. I'm here only about the memcpy
>> thing.
>>
>> Inspecting assembly is what I intended to do, but got the compilation
>> problem.
>
> If you just want to look at the memcpy, you might do better using the
> upstream (CMake) build system, which builds a shim library that you can
> disassemble and look at.
>
>> So, as someone who worked on memcpy previously, I note the variant
>> currently implemented in libc is pessimal for sizes > 16 bytes because
>> it does not use SIMD. I do have plans to rectify that, but ENOTIME.
>
> That's one of the benefits of the C++ implementation. We were able to try a
> few dozen variations in a morning. Writing a single one of those in
> assembly would take a similar amount of time.
>
> We were inspired by the automemcpy paper, which demonstrated that you can
> write memcpy implementations tuned to specific workloads in higher-level
> languages and get better performance.
>
>> I also note CPUs are incredibly picky when it comes to the starting point
>> of the routine. The officially recommended alignment of 16 bytes is
>> basically a tradeoff between total binary size and performance. Should
>> you move the routine to different 16-byte offsets you will easily
>> find big variation in performance, depending on how big of an
>> alignment you ended up with.
>
> Yes, that's what our test does. It does a large number of different copies
> with different sizes and alignments.
>
>> I tried to compile the benchmark but got bested by c++. I don't know
>> the lang and I don't want to fight it.
>>
>> $ pwd
>> /usr/src/contrib/snmalloc/src
>> $ c++ -I. test/perf/memcpy/memcpy.cc
>> [snip]
>> ./snmalloc/global/../backend/../backend_helpers/../mem/../ds_core/bits.h:57:26:
>> error: no template named 'is_integral_v' in namespace 'std'; did you
>> mean 'is_integral'?
>> static_assert(std::is_integral_v, "Type must be integral");
>>               ~~~~~^~~~~~~~~~~~~
>>                    is_integral
>
> I think the only thing missing is -std=c++17.
>
>> I'm trying to say that if you end up with different funcs depending on
>> bounds checking and you only align them to 16 bytes, the benchmark is
>> most likely inaccurate if only for this reason.
>
> I'm not sure I follow this, sorry.
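
To make the above concrete: when comparing a bounds-checked and a plain
variant in a micro-benchmark, the code-placement effect can be taken out
of the picture by pinning both entry points to the same, larger boundary.
A rough, untested sketch in C (assuming GCC/Clang support for the aligned
function attribute; the function names are made up for illustration):

    #include <stddef.h>
    #include <string.h>

    /* Two stand-in variants.  Forcing both to start on the same 64-byte
     * boundary removes code placement as a variable, so any measured
     * difference is more plausibly due to the bounds check itself. */
    __attribute__((aligned(64), noinline))
    void *copy_checked(void *dst, const void *src, size_t len)
    {
            return memcpy(dst, src, len); /* imagine the bounds-checked one */
    }

    __attribute__((aligned(64), noinline))
    void *copy_unchecked(void *dst, const void *src, size_t len)
    {
            return memcpy(dst, src, len); /* imagine the plain one */
    }

With both functions pinned to a 64-byte boundary, repeated runs are much
less sensitive to where the linker happens to place each routine.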
>
>>> The fastest on x86 is roughly:
>>>
>>> - A jump table for small sizes that does power-of-two-sized small
>>> copies (double-word, word, half-word, and byte) to perform the copy.
>>
>> Last I checked this was not true. The least slow thing to do was to
>> branch on a few sizes and handle them with overlapping stores. Roughly
>> what memcpy in libc is doing now.
>>
>> Jump table aside, you've got all sizes "spelled out", which is just
>> avoidable impact on the icache. While overlapping stores do come with some
>> penalty, it is cheaper than the above combined.
>
> They're not all spelled out, because a lot of them are subsets of the
> others. The compiler does a *very* good job of optimising this. The
> generated assembly was a lot more complex than anything I could write by
> hand.
>
>> I don't have numbers nor bench code handy, but if you insist on
>> contesting the above I'll be glad to provide them.
>>
>>> - A vectorised copy for medium-sized copies using a loop of SSE copies.
>>
>> Depends on what you mean by medium and which SIMD instructions, but
>> yes, they should be used here.
>
> This could probably do with more tuning, but all of this is configurable
> with a couple of template parameters. If it's useful to have more variants
> for different microarchitectures then it's a trivial tweak to generate
> them.
>
>>> - rep movsb for large copies.
>>
>> There are still old CPUs here and there which don't support ERMS. They
>> are very negatively impacted by the above and should roll with rep movsq
>> instead.
>
> We tested on some microarchitectures without FSRM but didn't compare with
> rep movsq as an alternative. The rep movsb is inline assembly
> (https://github.com/microsoft/snmalloc/blob/4370a23f3e5f1d1d06967b1e0d6162bf1ee81196/src/snmalloc/global/memcpy.h#L351)
> so it would be quite easy to compare. Very happy to take patches that make
> things faster!
>
> PowerPC doesn't have an equivalent of rep movsb, so the PowerPC version
> doesn't have an equivalent codepath.
>
> Each tuned version is a dozen lines of code (plus a lot of comments to
> explain *why* it does things a particular way), so it's very easy to add
> different versions with different tuning.
>
>> I see the code takes some care to align the target buffer. That's
>> good, but not necessary on CPUs with FSRM.
>
> It doesn't hurt, though; on the sizes where it's actually beneficial to use
> the rep sequence, the cost of alignment is negligible.
>
>> All that said, I have a hard time believing the impact of bounds
>> checking on short copies is around 20% or so. The code to do it looks
>> super slow (not that I know how to do better).
>
> The last time I looked at the disassembly, the code for the bounds check
> touched three cache lines and had two branches. It fits well in a
> superscalar pipeline so adds a handful of cycles. This is basically in the
> noise for large copies but is double-digit percentage overhead for small
> ones.
>
>> People like to claim short sizes are inlined by the compiler, but
>> that's only true if the size is known at compilation time, which it
>> often is not. Then they claim they are rare.
>
> I agree.
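
A tiny illustration of that point (hypothetical C, nothing from the
benchmark suite): the first call below has a compile-time-constant size
and is normally expanded inline by the compiler, while the second one
only learns its size at run time and stays an actual call into the libc
memcpy, which is where a bounds check would apply:

    #include <string.h>

    struct header { char bytes[16]; };

    void copy_fixed(struct header *dst, const struct header *src)
    {
            /* sizeof(*src) is a compile-time constant; compilers typically
             * lower this to a couple of plain 8-byte loads and stores. */
            memcpy(dst, src, sizeof(*src));
    }

    void copy_variable(char *dst, const char *src, size_t len)
    {
            /* len is only known at run time, so this normally remains a
             * real call to the libc memcpy. */
            memcpy(dst, src, len);
    }

The size distribution below is about calls of the second kind, i.e. the
ones that actually reach libc.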
>
>> To illustrate why that's bogus, here is clang 15 compiling vfs_cache.c:
>>    value  ------------- Distribution ------------- count
>>       -1 |                                         0
>>        0 |@                                        19758
>>        1 |@@@@@@@@                                 218420
>>        2 |@@                                       67670
>>        4 |@@@@                                     103914
>>        8 |@@@@@@@@@@@                              301157
>>       16 |@@@@@@@@@@                               293812
>>       32 |@@                                       57954
>>       64 |@                                        23818
>>      128 |                                         13344
>>      256 |@                                        18507
>>      512 |                                         6342
>>     1024 |                                         1710
>>     2048 |                                         627
>>     4096 |                                         398
>>     8192 |                                         34
>>    16384 |                                         10
>>    32768 |                                         6
>>    65536 |                                         7
>>   131072 |                                         4
>>   262144 |                                         1
>>   524288 |                                         1
>>  1048576 |                                         0
>>
>> obtained with this bad boy:
>> dtrace -n 'pid$target::memcpy:entry { @ = quantize(arg2); }' -c "cc
>> -target x86_64-unknown-freebsd14.0
>> --sysroot=/usr/obj/usr/src/amd64.amd64/tmp
>> -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -c -O2 -pipe
>> -fno-strict-aliasing -g -nostdinc -I. -I/usr/src/sys
>> -I/usr/src/sys/contrib/ck/include -I/usr/src/sys/contrib/libfdt
>> -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h
>> -fno-common -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer
>> -MD -MF.depend.vfs_cache.o -MTvfs_cache.o
>> -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include
>> -fdebug-prefix-map=./x86=/usr/src/sys/x86/include
>> -fdebug-prefix-map=./i386=/usr/src/sys/i386/include -mcmodel=kernel
>> -mno-red-zone -mno-mmx -mno-sse -msoft-float
>> -fno-asynchronous-unwind-tables -ffreestanding -fwrapv
>> -fstack-protector -Wall -Wnested-externs -Wstrict-prototypes
>> -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef
>> -Wno-pointer-sign -D__printf__=__freebsd_kprintf__
>> -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas
>> -Wno-error=tautological-compare -Wno-error=empty-body
>> -Wno-error=parentheses-equality -Wno-error=unused-function
>> -Wno-error=pointer-sign -Wno-error=shift-negative-value
>> -Wno-address-of-packed-member -Wno-error=array-parameter
>> -Wno-error=deprecated-non-prototype -Wno-error=strict-prototypes
>> -Wno-error=unused-but-set-variable -Wno-format-zero-length -mno-aes
>> -mno-avx -std=iso9899:1999 -Werror /usr/src/sys/kern/vfs_cache.c"
>
> That's a really helpful datapoint, thanks! That distribution matches roughly
> what we've seen - the number of zero-byte memcpy calls surprised me the
> first time I saw it, but the automemcpy paper reported something similar.
>
> If you can also extract the alignments from these then it would be fairly
> easy to modify the benchmark program to use those distributions.
>
> My original version extended the FreeBSD assembly memcpy with a call to a
> function that did the bounds checks, but that had *much* worse performance.
> We could insert assembly to do the bounds checks, but that's hard to update
> if we change the metadata layout in malloc.
>
> We could modify the ifunc resolvers to use the current memcpy if you don't
> care about bounds checks. I did a tiny amount of testing (mostly clang and
> lld) in this configuration and it was slower than using the snmalloc memcpy,
> but I didn't do enough benchmarking to be confident of that. Hence the
> CFT.
>
> David
>

-- 
Mateusz Guzik