head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Mark Johnston
markj at freebsd.org
Wed Sep 25 17:03:02 UTC 2019
On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 wrote:
> Note: I have access to only one FreeBSD amd64 context, and
> it is also my only access to a NUMA context: 2 memory
> domains. A Threadripper 1950X context. Also: I have only
> a head FreeBSD context on any architecture, not 12.x or
> before. So I have limited compare/contrast material.
>
> I present the below basically to ask if the NUMA handling
> has been validated, or if it is going to be, at least for
> contexts that might apply to ThreadRipper 1950X and
> analogous contexts. My results suggest it has not been (or
> libc++'s now() times get messed up such that it looks like
> NUMA mishandling, since this is based on odd benchmark
> results that involve mean time for laps, using a median
> of such across multiple trials).
>
> I ran a benchmark on both Fedora 30 and FreeBSD 13 on this
> 1950X and got expected results on Fedora but odd ones on
> FreeBSD. The benchmark is a variation on the old HINT
> benchmark, spanning the old multi-threading variation. I
> later tried Fedora because the FreeBSD results looked odd.
> The other architectures I tried FreeBSD benchmarking with
> did not look odd like this. (powerpc64 on an old PowerMac with 2
> sockets and 2 cores per socket, aarch64 Cortex-A57 Overdrive
> 1000, Cortex-A53 Pine64+ 2GB, armv7 Cortex-A7 Orange Pi+ 2nd
> Ed. For these I used 4 threads, not more.)
>
> I tend to write in terms of plots made from the data instead
> of the raw benchmark data.
>
> FreeBSD testing based on:
> cpuset -l0-15 -n prefer:1
> cpuset -l16-31 -n prefer:1
>
> Fedora 30 testing based on:
> numactl --preferred 1 --cpunodebind 0
> numactl --preferred 1 --cpunodebind 1
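>
> (Each of those lines is the prefix for an actual benchmark run,
> i.e. something like "cpuset -l0-15 -n prefer:1 ./benchmark" or
> "numactl --preferred 1 --cpunodebind 0 ./benchmark", where
> ./benchmark stands in for the benchmark binary's name.)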
>
> While I have more results, I primarily reference DSIZE
> and ISIZE both being unsigned long long, and also both
> being unsigned long, as examples. Variations in results
> are not from the type differences for any LP64
> architectures. (But they give an idea of benchmark
> variability in the test context.)
>
> The Fedora results solidly show the bandwidth limitation
> of using one memory controller. They also show the latency
> consequences for the remote memory domain case vs. the
> local memory domain case. There is not a lot of
> variability between the examples of the 2 type-pairs used
> for Fedora.
>
> Not true for FreeBSD on the 1950X:
>
> A) The latency-constrained part of the graph appears to
> normally be using the local memory domain when
> -l0-15 is in use for 8 threads.
>
> B) Both the -l0-15 and the -l16-31 parts of the
> graph for 8 threads that should be bandwidth
> limited mostly show examples that, as far as I can
> tell, would have to involve both memory controllers
> to reach the bandwidth results shown.
> There is also wide variability, ranging between the
> expected 1-controller result and, say, what a
> 2-controller round-robin would be expected to produce.
>
> C) Even the single-threaded case shows a higher
> result for larger total bytes for the kernel
> vectors. Fedora does not.
>
> I think that (B) is the most solid evidence for
> something being odd.
The implication seems to be that your benchmark program is using pages
from both domains despite a policy which preferentially allocates pages
from domain 1, so you would first want to determine if this is actually
what's happening. As far as I know we currently don't have a good way
of characterizing per-domain memory usage within a process.
If your benchmark uses a large fraction of the system's memory, you
could use the vm.phys_free sysctl to get a sense of how much memory from
each domain is free.
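For instance, you could snapshot that sysctl before starting the
benchmark and again while its working set is resident, then compare the
per-domain free counts (./hint below is just a placeholder name for the
benchmark binary):

# sysctl vm.phys_free > /tmp/phys_free_before
# cpuset -l 0-15 -n prefer:1 ./hint &
# sleep 30
# sysctl vm.phys_free > /tmp/phys_free_during
# diff -u /tmp/phys_free_before /tmp/phys_free_during

If the prefer:1 policy is being honored, most of the decrease in free
pages should show up under domain 1.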
Another possibility is to use DTrace to trace the requested domain in
vm_page_alloc_domain_after(). For example, the
following DTrace one-liner counts the number of pages allocated per
domain by ls(1):
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
...
        0               71
        1               72
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
...
        1              143
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
...
        0              143
This approach might not work for various reasons depending on how
exactly your benchmark program works.
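If the benchmark runs long enough to attach to, the same aggregation
can also be pointed at an already-running process with -p instead of
-c (the pid below is just a placeholder); only allocations made after
dtrace attaches will be counted:

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -p 12345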