head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Mark Johnston
markj at freebsd.org
Wed Sep 25 17:03:02 UTC 2019
On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 wrote:
> Note: I have access to only one FreeBSD amd64 context, and
> it is also my only access to a NUMA context: 2 memory
> domains. A Threadripper 1950X context. Also: I have only
> a head FreeBSD context on any architecture, not 12.x or
> before. So I have limited compare/contrast material.
>
> I present the below basically to ask if the NUMA handling
> has been validated, or if it is going to be, at least for
> contexts that might apply to ThreadRipper 1950X and
> analogous contexts. My results suggest it has not been (or
> libc++'s now() times get messed up such that it looks like
> NUMA mishandling, since this is based on odd benchmark
> results that involve mean time for laps, using a median
> of such across multiple trials).
>
> I ran a benchmark on both Fedora 30 and FreeBSD 13 on this
> 1950X and got expected results on Fedora but odd ones on
> FreeBSD. The benchmark is a variation on the old HINT
> benchmark, spanning the old multi-threading variation. I
> later tried Fedora because the FreeBSD results looked odd.
> The other architectures I tried FreeBSD benchmarking with
> did not look odd like this. (powerpc64 on an old PowerMac with 2
> sockets and 2 cores per socket, aarch64 Cortex-A57 Overdrive
> 1000, Cortex-A53 Pine64+ 2GB, armv7 Cortex-A7 Orange Pi+ 2nd
> Ed. For these I used 4 threads, not more.)
>
> I tend to write in terms of plots made from the data instead
> of the raw benchmark data.
>
> FreeBSD testing based on:
> cpuset -l0-15 -n prefer:1
> cpuset -l16-31 -n prefer:1
>
> Fedora 30 testing based on:
> numactl --preferred 1 --cpunodebind 0
> numactl --preferred 1 --cpunodebind 1
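>
> (Each of those lines is the prefix for an actual benchmark run,
> i.e. something like "cpuset -l0-15 -n prefer:1 ./benchmark" or
> "numactl --preferred 1 --cpunodebind 0 ./benchmark", where
> ./benchmark stands in for the benchmark binary's name.)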
>
> While I have more results, I primarily reference DSIZE
> and ISIZE both being unsigned long long, and also both
> being unsigned long, as examples. Variations in results
> are not from the type differences for any LP64
> architectures. (But they give an idea of benchmark
> variability in the test context.)
>
> The Fedora results solidly show the bandwidth limitation
> of using one memory controller. They also show the latency
> consequences for the remote memory domain case vs. the
> local memory domain case. There is not a lot of
> variability between the examples of the 2 type-pairs used
> for Fedora.
>
> Not true for FreeBSD on the 1950X:
>
> A) The latency-constrained part of the graph appears to
> normally be using the local memory domain when
> -l0-15 is in use for 8 threads.
>
> B) Both the -l0-15 and the -l16-31 parts of the
> graph for 8 threads that should be bandwidth
> limited mostly show examples that, as far as I can
> tell, would have to involve both memory controllers
> to reach the bandwidth results shown.
> There is also wide variability, ranging between the
> expected 1-controller result and, say, what a
> 2-controller round-robin would be expected to produce.
>
> C) Even the single-threaded case shows a higher
> result for larger total bytes for the kernel
> vectors. Fedora does not.
>
> I think that (B) is the most solid evidence for
> something being odd.
The implication seems to be that your benchmark program is using pages
from both domains despite a policy which preferentially allocates pages
from domain 1, so you would first want to determine if this is actually
what's happening. As far as I know we currently don't have a good way
of characterizing per-domain memory usage within a process.
If your benchmark uses a large fraction of the system's memory, you
could use the vm.phys_free sysctl to get a sense of how much memory from
each domain is free.
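For instance, you could snapshot that sysctl before starting the
benchmark and again while its working set is resident, then compare the
per-domain free counts (./hint below is just a placeholder name for the
benchmark binary):

# sysctl vm.phys_free > /tmp/phys_free_before
# cpuset -l 0-15 -n prefer:1 ./hint &
# sleep 30
# sysctl vm.phys_free > /tmp/phys_free_during
# diff -u /tmp/phys_free_before /tmp/phys_free_during

If the prefer:1 policy is being honored, most of the decrease in free
pages should show up under domain 1.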
Another possibility is to use DTrace to trace the requested domain in
vm_page_alloc_domain_after(). For example, the
following DTrace one-liner counts the number of pages allocated per
domain by ls(1):
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
...
        0               71
        1               72
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
...
        1              143
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
...
        0              143
This approach might not work for various reasons depending on how
exactly your benchmark program works.
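If the benchmark runs long enough to attach to, the same aggregation
can also be pointed at an already-running process with -p instead of
-c (the pid below is just a placeholder); only allocations made after
dtrace attaches will be counted:

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -p 12345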