[Bug 279901] glibc-2.39-2 and above on the host segfault

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 18 Dec 2024 11:24:45 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=279901

--- Comment #39 from Florian Weimer <fweimer@redhat.com> ---
(In reply to Konstantin Belousov from comment #37)
> Do you see which CPUID leaf causes the trouble?

Let me try based on attachment 255708. The maximum leaf is 0x80000023 according
to this:

x86.processor[0x0].cpuid.eax[0x80000000].eax=0x80000023

Ordinarily, handle_amd in sysdeps/x86/dl-cacheinfo.h would use the modern way
of obtaining cache details, leaf 0x8000001D:

x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x0].eax=0x121
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x0].ebx=0x3f
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x0].ecx=0x0
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x0].edx=0x0
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x1].eax=0x143
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x1].ebx=0x3f
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x1].ecx=0x0
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x1].edx=0x0
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x2].eax=0x163
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x2].ebx=0x3f
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x2].ecx=0x0
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x2].edx=0x0
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x3].eax=0x3ffc100
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x3].ebx=0x0
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x3].ecx=0x0
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x3].edx=0x0
x86.processor[0x0].cpuid.subleaf_eax[0x8000001d].ecx[0x3].until_ecx=0x1ff

L3 cache data is in subleaf 3. We have a safety check that requires ECX != 0,
in case hypervisors do not fill in this information. That is what is happening
here: ECX is 0x0 for subleaf 3. So we fall back to the legacy way of obtaining
the cache size, which uses leaf 0x80000006 for L3 cache information:

x86.processor[0x0].cpuid.eax[0x80000006].eax=0x48002200
x86.processor[0x0].cpuid.eax[0x80000006].ebx=0x68004200
x86.processor[0x0].cpuid.eax[0x80000006].ecx=0x2006140
x86.processor[0x0].cpuid.eax[0x80000006].edx=0x8009140

The base L3 cache size is 2 * (EDX & 0x3ffc0000), so 256 MiB. This is not
unreasonable for an EPYC system, and it's probably right.
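
As a quick check of that arithmetic, here is a minimal sketch in C. The
helper name legacy_l3_size_bytes is hypothetical and is not a glibc function;
it only reproduces the 2 * (EDX & 0x3ffc0000) expression quoted above, widened
to 64 bits so the result cannot overflow.

```c
#include <stdint.h>

/* Hypothetical helper, not part of glibc: legacy L3 size in bytes from
   the EDX value of CPUID leaf 0x80000006, using the expression
   2 * (EDX & 0x3ffc0000) from the text above.  */
static uint64_t
legacy_l3_size_bytes (uint32_t edx)
{
  /* Widen before doubling so the multiplication is done in 64 bits.  */
  return 2 * (uint64_t) (edx & 0x3ffc0000);
}
```

With the EDX value from the dump (0x8009140), the masked value is 0x8000000
(128 MiB), and doubling it gives 256 MiB.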

However, that number could be a per-socket number, and for the way we use it
in tuning, we need a per-thread amount. We adjust it based on leaf
0x80000008. The thread count is (ECX & 0xff) + 1:

x86.processor[0x0].cpuid.eax[0x80000008].eax=0x3030
x86.processor[0x0].cpuid.eax[0x80000008].ebx=0x7
x86.processor[0x0].cpuid.eax[0x80000008].ecx=0x0
x86.processor[0x0].cpuid.eax[0x80000008].edx=0x10007

So we get 1, and there is no per-thread scale-down. (I think the hypervisor
should expose a more realistic count here?)
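
The per-thread adjustment can be sketched as follows. Both function names are
hypothetical, chosen for illustration only; the expressions are the ones given
in the text, and the division is an assumption about how the scale-down is
applied.

```c
#include <stdint.h>

/* Hypothetical helper, not part of glibc: thread count derived from the
   ECX value of CPUID leaf 0x80000008, as (ECX & 0xff) + 1.  */
static unsigned int
legacy_thread_count (uint32_t ecx)
{
  return (ecx & 0xff) + 1;
}

/* Hypothetical helper: scale a total cache size down to a per-thread
   amount.  The thread count is always at least 1 with the formula
   above, so the division is safe.  */
static uint64_t
l3_bytes_per_thread (uint64_t total_bytes, unsigned int threads)
{
  return total_bytes / threads;
}
```

With ECX = 0x0 from the dump, the thread count is 1, so a 256 MiB total stays
at 256 MiB per thread.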

If the CPU family is at least 0x17, we assume that the number is measured per
core complex. That count again comes from leaf 0x8000001d, subleaf 3, but this
time from register EAX. It's computed as ((EAX >> 14) & 0xfff) + 1. This
evaluates to 4096 here, and I think this is the bug. This CCX count is just
way too high. Based on the available information, the glibc code assumes that
there are 4096 instances of 256 MiB caches, which translates to 1 TiB of L3
cache (per thread, but the thread count is 1).
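
To make the whole fallback computation easy to reproduce, here is a
self-contained sketch in C that strings the steps together. It is not the
glibc source: all function names are hypothetical, the CPUID register values
are passed in as plain arguments instead of being read from the CPU, and the
family value in the usage note below is an assumption (any value of at least
0x17 takes the same branch).

```c
#include <stdint.h>

/* Hypothetical: nonzero if leaf 0x8000001d looks populated, i.e. the
   ECX != 0 safety check described in this comment.  */
static int
leaf_8000001d_usable (uint32_t subleaf3_ecx)
{
  return subleaf3_ecx != 0;
}

/* Hypothetical: multiplier taken from leaf 0x8000001d, subleaf 3, EAX,
   computed as ((EAX >> 14) & 0xfff) + 1.  */
static unsigned int
ccx_multiplier (uint32_t subleaf3_eax)
{
  return ((subleaf3_eax >> 14) & 0xfff) + 1;
}

/* Hypothetical: legacy-path L3 estimate in bytes.
   l3_edx       - EDX of leaf 0x80000006
   threads_ecx  - ECX of leaf 0x80000008
   subleaf3_eax - EAX of leaf 0x8000001d, subleaf 3
   family       - CPU family  */
static uint64_t
legacy_l3_estimate (uint32_t l3_edx, uint32_t threads_ecx,
                    uint32_t subleaf3_eax, unsigned int family)
{
  uint64_t total = 2 * (uint64_t) (l3_edx & 0x3ffc0000);
  unsigned int threads = (threads_ecx & 0xff) + 1;
  uint64_t per_thread = total / threads;

  /* Family 0x17 and later: scale by the per-core-complex value.  */
  if (family >= 0x17)
    return per_thread * ccx_multiplier (subleaf3_eax);
  return per_thread;
}
```

Feeding in the values from attachment 255708 (subleaf 3 ECX = 0x0, EDX =
0x8009140, leaf 0x80000008 ECX = 0x0, subleaf 3 EAX = 0x3ffc100) with an
assumed family of 0x17, the usability check fails, the multiplier is 4096,
and the estimate is 4096 * 256 MiB = 1 TiB.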
