ECC support
Jim Thompson
jim at netgate.com
Tue Sep 15 21:52:34 UTC 2015
ECC is implemented by a ‘hashing’ algorithm that works on eight (8) bytes (64 bits) at a time, and places the result into an 8-bit ECC ‘word’.
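To make the 64-bits-in, 8-bits-out idea concrete, here is a minimal sketch in Python of one textbook scheme, an extended Hamming (72,64) SECDED code. This is an assumption for illustration only: real memory controllers use their own check matrices, and the names encode, decode and DATA_POSITIONS are made up here.

```python
# Illustrative SECDED code: extended Hamming (72,64).
# Positions 1..71 form the Hamming codeword. Powers of two are check bits,
# the other 64 positions carry data. An eighth check bit is overall parity.
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1)]
INDEX_OF = {pos: i for i, pos in enumerate(DATA_POSITIONS)}


def _syndrome(data: int) -> int:
    """XOR together the codeword positions of every set data bit."""
    s = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            s ^= pos
    return s


def _parity(x: int) -> int:
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def encode(data: int) -> int:
    """Compute the 8-bit check word for a 64-bit data word."""
    check7 = _syndrome(data)                       # seven Hamming check bits
    overall = _parity(data) ^ _parity(check7)      # makes all 72 bits even
    return check7 | (overall << 7)


def decode(data: int, ecc: int):
    """Return (data, status) for a word and check byte as read from memory."""
    syndrome = _syndrome(data) ^ (ecc & 0x7F)
    odd = _parity(data) ^ _parity(ecc & 0xFF)
    if syndrome == 0 and not odd:
        return data, "ok"
    if odd:
        # Odd overall parity means a single flipped bit somewhere.
        if syndrome in INDEX_OF:
            return data ^ (1 << INDEX_OF[syndrome]), "corrected"
        if syndrome & (syndrome - 1) == 0:
            # Zero or a power of two: the flip hit a check bit, data is fine.
            return data, "corrected"
    # Even parity with a nonzero syndrome is a double-bit error.
    return data, "uncorrectable"
```

With word = 0x0123456789ABCDEF and ecc = encode(word), calling decode(word ^ (1 << 17), ecc) hands back the original word with status "corrected", while flipping two bits gives "uncorrectable". Hardware does the same job with XOR trees in parallel rather than a loop.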
Errors are corrected "on the fly"; the corrected data is almost never written back to memory. If the same corrupt data is read again, the correction process is repeated. Writing the corrected data back would add processing overhead that could accumulate and significantly diminish system performance. If the error was caused by a random event rather than a defect in the memory, the error at that address is cleared the next time the location is overwritten with other data.
In terms of expense, at a minimum, where you had 8 bytes making up a memory system, you will now have 9 (to hold the extra 8 bits). This means your memory, before counting the extra complexity of the controller, is 12.5% more expensive. This isn’t a huge impact at 8GB (you’ll need another 1GB of RAM), but at 1024GB you’ll need another 128GB, and that much RAM still costs enough that your wallet won’t be happy.
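The arithmetic above can be sketched in a few lines of Python. The function name ecc_extra_gb and its parameters are made up for illustration; the defaults assume 8 check bits per 64 data bits, as described above.

```python
def ecc_extra_gb(data_gb: float, data_bits: int = 64, check_bits: int = 8) -> float:
    """Extra RAM (in GB) needed to hold the check bits for data_gb of data."""
    return data_gb * check_bits / data_bits


for size in (8, 1024):
    extra = ecc_extra_gb(size)
    # The ratio is the same at every size: check_bits / data_bits = 12.5%.
    print(f"{size} GB of data needs {extra:g} GB of ECC storage "
          f"({extra / size:.1%} overhead)")
```

The percentage stays constant; only the absolute amount of extra RAM grows with the size of the system.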
The memory controller has to be able to run the ECC algorithm on every read, *and* supply the corrected data as needed, within the cycle time of the read. If you involve software in this path, the performance of your machine will be glacial.
Yes, the firmware has to program the memory controller. In principle, “program a few registers” is all you need, but the MRC setup on Intel and AMD is both complex and proprietary. Good luck getting the details for this. This is “Intel Red Book” territory, and you’ll need to be an employee with a need to know. The MRC setup code is a binary blob even in otherwise open source boot firmware such as Coreboot.
Others have answered (in the positive) about the OS reporting ECC errors on FreeBSD.
Jim
> On Sep 15, 2015, at 3:53 PM, Dieter BSD <dieterbsd at gmail.com> wrote:
>
> Many of AMD's CPU/APU parts support ECC memory. Not just the top of the
> line parts, but also many of the less expensive, less power hungry parts.
> However, many (most?) of the boards for these chips do not support ECC,
> or at least do not admit to it. They specify "non-ECC memory".
>
> Obviously there have to be connections between the memory controller and
> the memory for the extra bits. Aside from a little extra time for the
> board designer to add a few traces to the wire list, this would not
> raise the cost of the board. Despite this I have read that some boards
> lack the necessary traces.
>
> Does the firmware have to do anything to support ECC? Program a few
> registers in the memory controller perhaps? A few boards have FLOSS
> firmware available, so this code could be added, but most boards do not
> have firmware sources available.
>
> Assuming that a board does have the necessary connections but
> the firmware does not have ECC support, is there some reason that
> ECC support could not be added to the OS instead of the firmware?
> I grepped through FreeBSD 8.2 and 10.1 sources but couldn't find
> anything that looked relevant. Also did not find any code that
> reported ECC errors, other than one device. Perhaps I missed it?
>
> I've been running machines with ECC for 15-20 years and have never seen
> a report of an ECC error from either NetBSD or FreeBSD. I have seen
> reports of ECC errors from Digital Unix. And remember getting panics
> due to parity errors on machines before ECC. So I'm thinking that
> the BSDs must ignore hardware reports of single bit ECC errors. :-(