Page fault in _mca_init during startup

Alan Somers asomers at freebsd.org
Fri Feb 5 16:01:40 UTC 2021


On Fri, Feb 5, 2021 at 7:41 AM Konstantin Belousov <kostikbel at gmail.com>
wrote:

> On Thu, Feb 04, 2021 at 07:53:09PM -0700, Alan Somers wrote:
> > On Thu, Feb 4, 2021 at 7:40 PM Konstantin Belousov <kostikbel at gmail.com>
> > wrote:
> >
> > > On Thu, Feb 04, 2021 at 07:01:30PM -0700, Alan Somers wrote:
> > > > On Thu, Feb 4, 2021 at 5:59 PM Konstantin Belousov <kostikbel at gmail.com>
> > > > wrote:
> > > > > Do you have INVARIANTS enabled?  If not, I am curious if enabling them
> > > > > would convert that rare page fault into rare "CPU %d has more MC banks"
> > > > > assert.
> > > > >
> > > > > Also might be the output of the
> > > > > # for x in $(jot $(sysctl -n hw.ncpu) 0) ; do cpucontrol -m 0x179 /dev/cpuctl$x; done
> > > > > command will show the issue (0x179 is the MCG_CAP MSR).
> > > > > You need to load cpuctl(4) if it is not loaded yet.
> > > > >
> > > >
> > > > I don't have INVARIANTS enabled, and I can't enable it on the production
> > > > servers.  However, I can turn those three KASSERTs into VERIFYs and see
> > > > what happens.  Here is what your command shows on the server that
> > > > panicked:
> > > > $ for x in $(jot $(sysctl -n hw.ncpu) 0) ; do sudo cpucontrol -m 0x179 /dev/cpuctl$x; done | uniq -c
> > > >   16 MSR 0x179: 0x00000000 0x0f000c14
> > > >   16 MSR 0x179: 0x00000000 0x0f000814
> > >
> > > It probably explains it, but it would be more telling if you left the
> > > output as is, so that we can see which CPUs have MCG_CMCI_P (10) bit set.
> > >
> >
> > I didn't sort them, so the first 16 have bit 10 set and the second 16
> > don't.
> >
> >
> > >
> > > I suspect that your machine has two sockets, and processor in one socket
> > > has CPUs reporting MCG_CMCI_P, while other processor does not. Your SMP
> > > is not quite symmetric, perhaps processors were from different bins?
>

I found 2 other servers that exhibit the same problem: the first 16 cores
have bit 10 set and the second 16 don't.  All 3 have dual Xeon Gold 6142
CPUs and SuperMicro X11DPU motherboards with BIOS revision 5.12.  I have
other examples of X11DPU motherboards that don't exhibit the problem, but
they all have both different CPUs and different BIOS revisions.  So I can't
be sure whether the bug follows the CPU model or the BIOS version.
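
For anyone who wants to check the same thing on their own hardware, here is
a rough sketch of how the two values above decode.  It assumes the second
word printed by cpucontrol -m is the low 32 bits of MCG_CAP, and that the
bank count sits in bits 7:0 with MCG_CMCI_P at bit 10.  The function names
are made up for illustration.

```shell
# Decode the low 32 bits of MCG_CAP (MSR 0x179) as printed by cpucontrol -m.
# Assumed layout: bits 7:0 = bank count, bit 10 = MCG_CMCI_P.
# Function names are illustrative only.

mcg_bank_count() {
    # Bits 7:0 hold the number of machine-check banks.
    echo $(( $1 & 0xff ))
}

mcg_cmci_p() {
    # Bit 10 is 1 when the CPU reports CMCI support, 0 otherwise.
    echo $(( ($1 >> 10) & 1 ))
}

# The two distinct values seen on the server that panicked.
for low in 0x0f000c14 0x0f000814; do
    echo "$low: banks=$(mcg_bank_count "$low") cmci_p=$(mcg_cmci_p "$low")"
done
```

Both values report the same bank count and differ only in bit 10, which
matches what the uniq -c output suggested.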


> > >
> >
> > Could be.  Is there some MSR that reports a more specific version number?
> There are CPUID %eax=1 values returned in %eax, but then it requires
> some interpretation.
>         # cpucontrol -i 1 /dev/cpuctl$x
> for $x iterating over the cpus.
>

Apart from the Local APIC ID field, that returns the same value for all
processors.
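
To make that comparison less error-prone, something like the following
could be used to blank out the APIC ID before comparing.  This is only a
sketch: it assumes the initial Local APIC ID is in bits 31:24 of %ebx for
CPUID leaf 1, and that %ebx is the second register value cpucontrol -i
prints.  The sample values are hypothetical, not taken from my servers.

```shell
# Clear the initial Local APIC ID (assumed to be EBX bits 31:24 of CPUID
# leaf 1) so that values from different CPUs can be compared directly.
# strip_apic_id is an illustrative name, not an existing tool.

strip_apic_id() {
    # Keep bits 23:0, zero bits 31:24, and print as 8 hex digits.
    printf '0x%08x\n' $(( $1 & 0x00ffffff ))
}

# Hypothetical EBX values from two CPUs that differ only in APIC ID.
for ebx in 0x00200800 0x02200800; do
    strip_apic_id "$ebx"
done
```

If the masked values are identical across all CPUs, the remaining CPUID
leaf 1 fields don't distinguish the two sockets.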

Your second patch doesn't cause any obvious problems on my dev system.


More information about the freebsd-stable mailing list