kernel MCA messages
Ronald Klop
ronald-freebsd8 at klop.yi.org
Tue Aug 24 06:28:25 UTC 2010
On Mon, 23 Aug 2010 14:20:35 +0200, John Baldwin <jhb at freebsd.org> wrote:
> On Monday, August 23, 2010 2:44:38 am Andriy Gapon wrote:
>> on 23/08/2010 05:05 Dan Langille said the following:
>> > On 8/22/2010 9:18 PM, Dan Langille wrote:
>> >> What does this mean?
>> >>
>> >> kernel: MCA: Bank 4, Status 0x940c4001fe080813
>> >> kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
>> >> kernel: MCA: Vendor "AuthenticAMD", ID 0xf5a, APIC ID 0
>> >> kernel: MCA: CPU 0 COR BUSLG Source RD Memory
>> >> kernel: MCA: Address 0x7ff6b0
>> >>
>> >> FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
>> >
>> > And another one:
>> >
>> > kernel: MCA: Bank 4, Status 0x9459c0014a080813
>> > kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
>> > kernel: MCA: Vendor "AuthenticAMD", ID 0xf5a, APIC ID 0
>> > kernel: MCA: CPU 0 COR BUSLG Source RD Memory
>> > kernel: MCA: Address 0x7ff670
>>
>> I believe that you get correctable RAM ECC errors, but not entirely
>> sure.
>> There is mcelog utility that decodes such messages into human-friendly
>> descriptions.
>> The utility is available on Linux-based systems.
>> John Baldwin has a port of it to FreeBSD, but it seems to be WIP and is
>> private so far. Wait and watch John posting decoded text in this thread :-)
>
> It is not private, it is in //depot/projects/mcelog/... in p4. It is
> not a complete port yet though (doesn't support the daemon and client
> modes for example).
>
> Details for these errors:
>
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 0 4 northbridge
> ADDR 7ff6b0
> Northbridge RAM Chipkill ECC error
> Chipkill ECC syndrome = fe18
> bit32 = err cpu0
> bit46 = corrected ecc error
> bus error 'local node origin, request didn't time out
> generic read mem transaction
> memory access, level generic'
> STATUS 940c4001fe080813 MCGSTATUS 0
> MCGCAP 105 APICID 0 SOCKETID 0
> CPUID Vendor AMD Family 15 Model 5
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 0 4 northbridge
> ADDR 7ff670
> Northbridge RAM Chipkill ECC error
> Chipkill ECC syndrome = 4ab3
> bit32 = err cpu0
> bit46 = corrected ecc error
> bus error 'local node origin, request didn't time out
> generic read mem transaction
> memory access, level generic'
> STATUS 9459c0014a080813 MCGSTATUS 0
> MCGCAP 105 APICID 0 SOCKETID 0
> CPUID Vendor AMD Family 15 Model 5
>
> As Andriy guessed, I believe both of these are corrected ECC errors. You
> can likely ignore them as a low rate of corrected ECC errors is not
> unexpected.
>
Hi,
A little off topic, but what counts as 'a low rate of corrected ECC
errors'? At work one machine logs them about once per day, but runs OK.
Is once per day a lot?
Ronald.
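
For anyone who wants to check the decoded text quoted above against the raw
Status values, most of the fields can be recovered with a few shifts and
masks. The following is only a minimal Python sketch, not mcelog's code. The
bit positions are an assumption based on the AMD K8 northbridge (bank 4)
status layout and on the bit numbers mcelog printed (bit32, bit46); the
function and table names are made up for illustration.

```python
# Sketch: decode selected fields of an MCA bank status value.
# Assumption: AMD K8 northbridge (bank 4) layout; all names are illustrative.

# Bus error code format (low 16 bits): 0000 1PPT RRRR IILL
PP = {0: "local node origin", 1: "local node response",
      2: "observed by third party", 3: "generic"}
RRRR = {0: "generic error", 1: "generic read", 2: "generic write",
        3: "data read", 4: "data write", 5: "instruction fetch",
        6: "prefetch", 7: "evict", 8: "snoop"}
II = {0: "memory access", 2: "I/O", 3: "generic"}
LL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "level generic"}


def bit(value, n):
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_mca_status(status):
    """Return a dict with a few decoded fields of a 64-bit status value."""
    code = status & 0xFFFF
    out = {
        "valid": bit(status, 63),
        "overflow": bit(status, 62),
        "uncorrected": bit(status, 61),
        "address_valid": bit(status, 58),
        "corrected_ecc": bit(status, 46),
        "uncorrected_ecc": bit(status, 45),
        "err_cpu0": bit(status, 32),
        # Chipkill syndrome: high byte from bits 31:24, low byte from bits 54:47
        "syndrome": (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF),
        "error_code": code,
    }
    # Only decode the low 16 bits if they match the bus error pattern.
    if (code & 0xF800) == 0x0800:
        out["bus_error"] = {
            "participation": PP[(code >> 9) & 0x3],
            "timed_out": bit(code, 8),
            "request": RRRR.get((code >> 4) & 0xF, "reserved"),
            "transaction": II.get((code >> 2) & 0x3, "reserved"),
            "level": LL[code & 0x3],
        }
    return out


if __name__ == "__main__":
    # The two Status values from the kernel messages in this thread.
    for raw in (0x940C4001FE080813, 0x9459C0014A080813):
        d = decode_mca_status(raw)
        print("status %016x syndrome %04x corrected_ecc %s"
              % (raw, d["syndrome"], d["corrected_ecc"]))
        print("  bus error:", d["bus_error"])
```

Run against the two Status values from Dan's logs, this gives syndromes fe18
and 4ab3 with the corrected ECC bit set, which agrees with the mcelog output
John posted.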