bad RAM? prove it with a crash dump?

Andrew Duane aduane at juniper.net
Thu May 6 13:11:42 UTC 2010


owner-freebsd-hackers at freebsd.org wrote:
> On Thu, 6 May 2010, Boris Kochergin wrote:
> 
>> My experience with bad memory is that if it causes the machine to
>> crash, it won't always happen while the machine is running the same
>> process (or kernel thread)--so look for it crashing in a wide
>> variety of places--and upon inspection of the core dump, a pointer
>> somewhere will be pointing to garbage.
> ============
> 
> so really i'd need to collect two or more crash dumps, and if they
> point to different addresses then i can reasonably say the RAM is bad?
> 
> thanks...

It's not just that they point to different addresses; it's that there is garbage in many completely independent places: bad registers or return addresses pulled off the stack, for example, or garbage in random, unrelated buffers, structures, and pointers. On the other hand, if you often find garbage in the same place, say some structure's "foo" pointer, that indicates a problem (maybe locking) in how your code manages setting that foo pointer. It's a subtle difference.
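To make the "many independent places vs. the same place" distinction concrete, here is a rough sketch of how one might tally where the garbage showed up across several dumps. The site labels are written down by hand after inspecting each dump, the function name is hypothetical, and the 50% threshold is an arbitrary assumption, not an established rule.

```python
from collections import Counter


def summarize_sites(sites, recurring_fraction=0.5):
    """Summarize where corruption was seen across several crash dumps.

    `sites` holds one hand-written label per dump, e.g. "struct bar.foo"
    or "stack return address". `recurring_fraction` is an arbitrary
    cutoff chosen for this sketch.
    """
    if len(sites) < 2:
        return "need more dumps"
    site, hits = Counter(sites).most_common(1)[0]
    if hits > 1 and hits / len(sites) >= recurring_fraction:
        # The same field keeps getting trashed: look at the code that sets it.
        return "recurring site (%s): suspect a software bug" % site
    # Garbage turns up somewhere different each time.
    return "scattered sites: consistent with bad memory"


dumps = ["struct bar.foo", "struct bar.foo", "mbuf data", "struct bar.foo"]
print(summarize_sites(dumps))
```

This only organizes notes you already took; it proves nothing by itself, and a handful of dumps is a small sample either way.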

It is also useful to check that the garbage itself differs from crash to crash. As mentioned before, a single-bit error in an otherwise valid value, or a missing or scrambled byte, is a good indication of a memory problem. If random places are often overwritten with something else entirely, that could just be another piece of misbehaving code writing someplace it shouldn't. I've often found code that writes a buffer into, e.g., a piece of memory it no longer owns; it looks like memory corruption until you realize the garbage is always something specific, like a vnode structure.
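As a rough illustration of the single-bit check, the sketch below XORs the garbage value against what the value should have been and counts the differing bits. It assumes you can infer the expected value (for example from a neighboring valid pointer), which is not always possible; the function name and the category labels are made up for this example.

```python
def classify_corruption(expected, observed, width_bits=64):
    """Describe how `observed` differs from `expected`.

    Both are integers read out of the dump by hand; `width_bits` is the
    size of the field being compared.
    """
    mask = (1 << width_bits) - 1
    diff = (expected ^ observed) & mask
    if diff == 0:
        return "identical"
    if bin(diff).count("1") == 1:
        # Exactly one bit differs: the classic bad-RAM signature.
        return "single-bit flip"
    for byte_index in range(width_bits // 8):
        byte_mask = 0xFF << (8 * byte_index)
        if diff & ~byte_mask == 0:
            # Every differing bit sits inside one byte.
            return "single-byte damage"
    # Large parts of the value changed: more likely a stray write.
    return "multi-byte overwrite"


print(classify_corruption(0xFFFFFE0012345678, 0xFFFFFE0012345679))
print(classify_corruption(0x12345678, 0xDEADBEEF))
```

A multi-byte overwrite is where it pays to look at what the garbage actually is, since recognizable data points back at whoever wrote it.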

/Andrew


