ECC support

Dieter BSD dieterbsd at gmail.com
Wed Sep 16 17:56:53 UTC 2015


Andriy:
>> Assuming that a board does have the necessary connections but
>> the firmware does not have ECC support, is there some reason that
>> ECC support could not be added to the OS instead of the firmware?
>
> Yes, there is.  The memory controller is programmed by the code that
> runs from ROM and uses no RAM (or the CPU cache is used as the RAM).
> Once the real RAM gets used it's too late to reprogram the DRAM controller.

Perhaps one of the several bootloader stages could get itself into
CPU cache, program the memory controller, then load and execute the
next stage or the OS?

Jim:
> Replacing the data in memory would require processing overhead
> that could accumulate and significantly diminish system performance.

If it only replaces data when there is a correctable error,
and the errors are occasional soft errors, the effect on
performance should be minimal.  If there is a hard error,
you would want to replace the defective memory before you get
an additional error and it becomes uncorrectable.

> If the error occurred because of random events and isn't a defect in
> the memory, the memory address will be cleaned of the error when the
> data is overwritten with other data.

If and when new data gets written to that location.  If that location
contains info that never changes, such as kernel text, the bad bit will
never get fixed.
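
The usual remedy for that case is scrubbing: read each word so the ECC
logic can correct it, then write it back so corrected data and fresh check
bits land in DRAM.  A minimal sketch in C, assuming ECC is already enabled
and the region is mapped writable (kernel text normally isn't).  The name
scrub_region is made up for illustration, and the plain read/write pair is
not atomic, so it would race with any other writer:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical software scrub pass (illustration only).
 * volatile forces the compiler to emit both the load and the store.
 * Not atomic: another CPU or a DMA engine writing the same word
 * between the load and the store would have its update overwritten.
 * Returns the number of words touched.
 */
size_t
scrub_region(volatile uint64_t *base, size_t nwords)
{
	size_t i;

	for (i = 0; i < nwords; i++) {
		uint64_t v = base[i];	/* load: ECC logic can correct a single-bit error */
		base[i] = v;		/* store: data and new check bits are written back */
	}
	return (nwords);
}
```

Many memory controllers can do this in hardware as a background scrub,
which avoids both the race and the CPU overhead.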

> memory, without the extra complexity of the controller, is 12.5% more
> expensive.   This isn't a huge impact at 8GB, (you'll need
> another 1GB of RAM), but at 1024GB you'll need another 128GB,
> and that much ram still costs enough that your wallet won't be happy.

It is 12.5% in both cases.  How much does it cost to have undetected
errors in your data?  How much does it cost when an Interstate
bridge collapses?  How much does it cost when one of NASA's missions
fails?  How much does it cost when your pharmacy receives a
prescription with an error in the dose?
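
The 12.5% figure falls out of the common ECC DIMM layout of 8 check bits
per 64 data bits (72-bit wide modules).  A quick sketch of the arithmetic
in C; the function name is made up for illustration, and the 64/8 split is
an assumption about the usual arrangement, not taken from any particular
board:

```c
#include <stdint.h>

#define ECC_DATA_BITS	64	/* assumed: data bits per memory word */
#define ECC_CHECK_BITS	8	/* assumed: check bits stored alongside them */

/* Extra raw DRAM (in GB) needed to protect data_gb of usable memory. */
uint64_t
ecc_extra_gb(uint64_t data_gb)
{
	return (data_gb * ECC_CHECK_BITS / ECC_DATA_BITS);
}
```

ecc_extra_gb(8) gives 1 and ecc_extra_gb(1024) gives 128: the same
8/64 = 12.5% ratio at either size.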

> the MRC setup on Intel and AMD is both complex and proprietary

One wonders why the secrecy.  AMD has been much more open than many
(most?) chipmakers.  They even forced the ATI people to document
how to program their chips.  I don't see a lot of companies popping up
making competing chips.  #include standard joke: "How do you make a small
fortune in chipmaking?  Start with a very large fortune."  I can't
see what secret would be revealed by saying "set bit 7 of register 4
to 1 to enable ECC".

> Intel Red Book

So the secret books are red this week, yawn.  I remember the nightmare
of the Merced orange books and the brain-damaged "features" the chips had.
Not recommended.  I'm interested in chips that work correctly, hence the
interest in ECC and AMD.  Looked for ARM boards with ECC but didn't find
any.  Is the SPARC stuff any more reliable than it used to be?  Other
arch choices?

> The MRC setup code is a binary blob for otherwise open source boot
> firmware such as Coreboot.

So the libreboot people are forced to work on reverse engineering
these blobs?  :-(

Don:
> I don't think the current APU parts support ECC.

According to Wikipedia, Socket FM2+ does not support ECC. :-(
Kabini has support for ECC.  And Berlin, (and I assume Toronto) but
word is that Berlin and Toronto are basically dead. :-(
I think Carrizo and Turion are supposed to support ECC?  There really
ought to be a list of which CPUs/APUs/sockets/boards do or do not
support ECC.

> My experience is that many ASUS motherboards support ECC RAM and
> usually document that fact.  Many Gigabyte motherboards also
> support ECC RAM, but don't document it.

From what I've been reading, both Asus and Gigabyte make good boards.
I've seen reviews that complained about Gigabyte's firmware.
http://www.xbitlabs.com/articles/mainboards/display/gigabyte-ga-990fxa-ud5_8.html
I've also seen claims that the firmware bricked boards.
Reviewers like Asus's firmware.  I've seen complaints about Asus's support,
and their website has significant problems.

The firmware on my Tyan board is crap, and they refused to tell me
how much power it needs.  Which means I don't know how much other stuff
I can run from the same P/S.  It should have *way* more power than needed,
but experience says "not enough", so I added a 2nd P/S for the disk farm
and suddenly had fewer problems.  The 2 P/S setup does allow power-cycling
the mainboard (because of the crappy firmware) without power-cycling the disks.

Given my experience with the Tyan board, and the apparent lack of
FLOSS firmware for recent boards, I'm not real excited about the
Gigabyte boards.  Asus has a couple of AM3+ boards that I could
probably live with, if their website actually had things like
lists of exactly which CPUs and memory are approved, and firmware
updates, ... But there are also applications that could use a lower
wattage solution.

Anyone have opinions on other mainboard companies?  ECS?  Asrock?
MSI?  Zotac?  Others?

Don:
> +MCA: Bank 4, Status 0x944a400096080a13
> +MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
> +MCA: Vendor "AuthenticAMD", ID 0x100f53, APIC ID 0
> +MCA: CPU 0 COR BUSLG Responder RD Memory
> +MCA: Address 0x213e98b10
> +MCA: Bank 4, Status 0xd44a400096080a13
> +MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
> +MCA: Vendor "AuthenticAMD", ID 0x100f53, APIC ID 0
> +MCA: CPU 0 COR OVER BUSLG Responder RD Memory
> +MCA: Address 0x213e98b10

Chris:
> MCA: Bank 1, Status 0x9400000000000151
> MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
> MCA: Vendor "AuthenticAMD", ID 0x100f52, APIC ID 2
>
> MCA: Address 0x81cc0e9f0
>
> Kind of freaky. I've never had this error on this board before.
> On others tho.
>
> Try a search for MCA instead.

Is there a decoder ring for those messages?  I don't recall seeing
messages like that, although I wasn't looking for them, and they
don't leap out at you screaming ERROR! ERROR!  Digital Unix had its
problems, but at least the error messages were fairly clear.
Something like "single bit memory error at address 0x12345..."
A simple edit to sys/x86/x86/mca.c
   s/printf("UNCOR ");/printf("Uncorrectable ");/
   s/printf("COR ");/printf("Correctable ");/
would make the messages at least slightly more meaningful to a viewer
who isn't intimately familiar with the MCA.  Which most people aren't.
I used to maintain code that dealt with a memory controller, and
used a hardware circuit to inject errors into a memory board.
But looking at those messages doesn't tell me anything beyond
"Something happened, maybe I should grep through the source
code for clues about those messages."  Looking at the source
doesn't add much; you'd need documentation for the MCA.
Which most people aren't going to have.  And you'd need a lot
of time to figure it out.
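
As a partial decoder ring: the top bits of the 64-bit Status value are
flags, and the low 16 bits are the MCA error code.  The sketch below
decodes only those flags.  The bit positions are the architectural
MCi_STATUS layout and should be checked against the vendor manual for the
CPU in question; the macro, struct, and function names are made up for
illustration and are not the ones sys/x86/x86/mca.c uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits in the upper part of an MCi_STATUS value (names hypothetical). */
#define MCS_VAL		(1ULL << 63)	/* register holds a valid error */
#define MCS_OVER	(1ULL << 62)	/* an earlier error was overwritten */
#define MCS_UC		(1ULL << 61)	/* error was not corrected */
#define MCS_ADDRV	(1ULL << 58)	/* address register is valid */
#define MCS_PCC		(1ULL << 57)	/* processor context corrupt */

struct mca_summary {			/* hypothetical, for illustration */
	bool valid;
	bool overflow;
	bool uncorrected;
	bool addr_valid;
	bool context_corrupt;
	uint16_t error_code;		/* low 16 bits: MCA error code */
};

/* Split a raw status value into its flags and error code. */
struct mca_summary
mca_summarize(uint64_t status)
{
	struct mca_summary s;

	s.valid = (status & MCS_VAL) != 0;
	s.overflow = (status & MCS_OVER) != 0;
	s.uncorrected = (status & MCS_UC) != 0;
	s.addr_valid = (status & MCS_ADDRV) != 0;
	s.context_corrupt = (status & MCS_PCC) != 0;
	s.error_code = (uint16_t)(status & 0xffff);
	return (s);
}

/* Plain-English severity, in the spirit of the printf edit above. */
const char *
mca_severity(uint64_t status)
{
	return ((status & MCS_UC) ? "Uncorrectable" : "Correctable");
}
```

Applied to Don's two records, 0x944a400096080a13 comes out as valid,
corrected, address valid, error code 0x0a13, and 0xd44a400096080a13 is the
same with the overflow bit set.  That matches the "COR" and "COR OVER" in
the log lines.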

# find /var/log | xargs bzgrep -i mca
found no error messages.

I seem to be buried under a mountain of boards that would be useful,
if only they supported ECC (and had firmware that actually works...).
And I'm hardly the only one.  So how do we fix this?
Lobby AMD (and other chipmakers) to include ECC support in *all* memory
controllers and sockets?  It isn't like they have to redesign the logic
for every chip; they only need one design per memory width.  Lobby AMD
to publish documentation on how to program the memory controller?
Lobby the companies that make boards?


More information about the freebsd-hackers mailing list