kernel MCA messages
Andriy Gapon
avg at icyb.net.ua
Wed Aug 25 07:11:39 UTC 2010
on 25/08/2010 02:38 Jeremy Chadwick said the following:
> On Tue, Aug 24, 2010 at 07:13:23PM -0400, Dan Langille wrote:
>> On 8/22/2010 9:18 PM, Dan Langille wrote:
>>> What does this mean?
>>>
>>> kernel: MCA: Bank 4, Status 0x940c4001fe080813
>>> kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
>>> kernel: MCA: Vendor "AuthenticAMD", ID 0xf5a, APIC ID 0
>>> kernel: MCA: CPU 0 COR BUSLG Source RD Memory
>>> kernel: MCA: Address 0x7ff6b0
>>>
>>> FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
>>
>> FYI, these are occurring every hour, almost to the second. e.g.
>> xx:56:yy, where yy is 09, 10, or 11.
>>
>> Checking logs, I don't see anything that correlates with this point
>> in the hour (i.e 56 minutes past) that doesn't also occur at other
>> times.
>>
>> It seems very odd to occur so regularly.
I still think that everything of essence has already been said in this thread.
> 1) Why haven't you replaced the DIMM in Bank 4 -- or better yet, all
Bank 4 here is an MCA reporting bank; it has nothing to do with RAM slots.
Currently FreeBSD has no standard tool to map a physical address to a
DRAM module, but I am sure that there are some ways to do it.
> the DIMMs just to be sure? Do this and see if the problem goes
> away. If not, no harm done, and you've narrowed it down.
>
> 2) What exact manufacturer and model of motherboard is this? If
> you can provide a link to a User Manual that would be great.
>
> 3) Please go into your system BIOS and find where "ECC ChipKill"
> options are available (likely under a Memory, Chipset, or
> Northbridge section). Please write down and provide here all
> of the options and what their currently selected values are.
>
> 4) Please make sure you're running the latest system BIOS. I've seen
> on certain Rackable AMD-based systems where Northbridge-related
> features don't work quite right (at least with Solaris), resulting
> in atrocious memory performance on the system. A BIOS upgrade
> solved the problem.
>
> There's a ChipKill feature called "ECC BG Scrubbing" that's vague in
> definition, given that it's a "background memory scrub" that happens at
> intervals which are unknown to me. Maybe 60 minutes? I don't know.
> This is why I ask question #3.
>
> For John and other devs: I assume the decoded MCA messages indicate with
> absolute certainty that the ECC error is coming from external DRAM and
> not, say, bad L1 or L2 cache?
Have you read the decoded message?
Please re-read it.
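For reference, the decoded line can be reproduced by hand from the raw status
word. The following is only a rough sketch in Python, assuming the architectural
MCi_STATUS layout (flag bits in 63..57, MCA error code in the low 16 bits) and
the generic bus/interconnect error code format 0000 1PPT RRRR IILL. All function
and table names are made up for illustration; this is not the kernel's code.

```python
# Sketch: decode the flag bits and the bus error code of an MCi_STATUS value.
# Bit positions and code tables are assumed from the architectural MCA
# definition; every name below is hypothetical.

STATUS = 0x940C4001FE080813  # "Bank 4, Status" value from the report

FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}
LEVELS = ["L0", "L1", "L2", "LG"]                 # LL: cache level / generic
PARTICIPATION = ["Source", "Responder", "Observer", "Generic"]  # PP
REQUESTS = ["ERR", "RD", "WR", "DRD", "DWR",
            "IRD", "PREFETCH", "EVICT", "SNOOP"]  # RRRR
TARGETS = ["Memory", "Reserved", "I/O", "Other"]  # II


def decode_flags(status):
    """Return the status flag bits as a dict of booleans."""
    return {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}


def decode_bus_error(code):
    """Decode a 16-bit MCA error code if it has the bus error format."""
    if (code & 0xF800) != 0x0800:
        return None  # not a bus/interconnect error code
    req = (code >> 4) & 0xF
    return {
        "level": LEVELS[code & 0x3],
        "participation": PARTICIPATION[(code >> 9) & 0x3],
        "request": REQUESTS[req] if req < len(REQUESTS) else "???",
        "target": TARGETS[(code >> 2) & 0x3],
        "timeout": bool(code & 0x0100),
    }


def summarize(status):
    """Build a one-line summary similar in shape to the kernel's output."""
    flags = decode_flags(status)
    parts = ["UNCOR" if flags["UC"] else "COR"]
    bus = decode_bus_error(status & 0xFFFF)
    if bus is not None:
        parts += ["BUS" + bus["level"], bus["participation"],
                  bus["request"], bus["target"]]
    return " ".join(parts)


print(summarize(STATUS))  # COR BUSLG Source RD Memory
```

With the reported value, UC is clear (a corrected error), ADDRV is set (hence
the Address line), and the error code 0x0813 decodes to a generic-level bus
error, source, read, memory.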
I still recommend reading at least the summary of the RAM ECC research article
to make your own judgment about the need to replace the DRAM.
--
Andriy Gapon