kernel MCA messages
John Baldwin
jhb at freebsd.org
Wed Aug 25 12:27:45 UTC 2010
On Wednesday, August 25, 2010 7:01:19 am Andriy Gapon wrote:
> on 25/08/2010 13:41 Dan Langille said the following:
> > On 8/25/2010 3:11 AM, Andriy Gapon wrote:
> >
> >> Have you read the decoded message?
> >> Please re-read it.
> >>
> >> I still recommend reading at least the summary of the RAM ECC research article
> >> to make your own judgment about the need to replace the DRAM.
> >
> > Andriy: What is your interpretation of the decoded message? What is your view on
> > replacing DRAM? What do you conclude from the summary?
>
> Most likely you have a small defect in one of your memory modules, perhaps a
> "stuck" bit. You will be getting correctable ECC errors for that module.
> Eventually you might get an uncorrectable error. That may happen soon or it may
> never happen during the lifetime of that module.
>
> As that study has demonstrated, a significant percentage of systems and modules
> report at least one correctable ECC error. Correctable ECC errors at present
> correlate with correctable ECC errors in the future. They also correlate with
> uncorrectable errors in the future. But the percentage of systems developing
> uncorrectable errors is significantly smaller, so the chances of false positives
> are substantial.
>
> You should decide whether you want to replace the module (if you can pinpoint it)
> or all modules depending on your resources (money, etc.), the importance of the
> service that the server in question provides (allowable downtime and its cost), and
> the fault tolerance of a larger system of which the server may be a part (e.g. it
> may have a standby server for failover).
>
> I think that most of what I've just said was kind of obvious from the start.
> The important bit from that study is that ECC errors are not as random or as rare
> as was previously thought, and they can be attributed to a number of factors like
> manufacturing defects, layout of memory lanes on the motherboard, etc.

A while back I found a slide deck from some Intel presentation that claimed
that a modern 4GB DIMM should average 18 corrected errors a month. Your
rate is a bit higher than that, but corrected ECC errors are not that
unexpected.
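
As a rough illustration of that comparison, here is a minimal Python sketch
that converts a count of corrected errors seen over some logging window into a
per-month rate and compares it with the 18-per-month figure from the slide
deck. The function names, the 30-day month, and the sample numbers (12 errors
in 10 days) are assumptions made up for the example, not values from this
thread.

```python
# Figure quoted from the slide deck: about 18 corrected errors per month
# for a 4GB DIMM.
BASELINE_PER_MONTH = 18.0

# Assumption: treat a month as 30 days when scaling a shorter window.
DAYS_PER_MONTH = 30.0


def monthly_rate(error_count, window_days):
    """Scale an error count observed over window_days to a per-month rate."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return error_count * DAYS_PER_MONTH / window_days


def ratio_to_baseline(error_count, window_days, baseline=BASELINE_PER_MONTH):
    """Return the observed monthly rate as a multiple of the baseline rate."""
    return monthly_rate(error_count, window_days) / baseline


if __name__ == "__main__":
    # Hypothetical sample: 12 corrected errors logged over 10 days.
    rate = monthly_rate(12, 10)
    ratio = ratio_to_baseline(12, 10)
    print(f"{rate:.1f} corrected errors/month, {ratio:.1f}x the baseline")
```

A ratio near 1 would match the quoted average; a ratio well above it only says
the module is noisier than that average, not that an uncorrectable error is
imminent.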
--
John Baldwin