Re: MCE: Does this look possibly like a slot issue?
- In reply to: Rodney W. Grimes: "Re: MCE: Does this look possibly like a slot issue?"
Date: Tue, 21 Jun 2022 16:52:50 UTC
Completely agree with you, Rodney. The LGA pins on the motherboard can be bent
very easily when moving a CPU, so I wanted to recommend this last.

Larry, as Rodney mentioned, this is more or less your last option. The fault is
likely in the CPU and not the module itself. There is still a small chance that
it is motherboard/slot related. One way to determine this is to swap the CPUs
(slot 0 <----> slot 1) and see whether the error moves. As I mentioned, though,
be very cautious. I don't want you to end up in a worse-off state. I would
reseat the CPU in the problem socket before swapping the CPUs.

Best regards,
Richard Gallamore

On Tue, Jun 21, 2022 at 9:06 AM Rodney W. Grimes <freebsd-rwg@gndrsh.dnsmgr.net> wrote:

> > Swapped 2 DIMMS, now we wait for the ZFS ARC to fill and start using all
> > the memory.
>
> Depending on the results of that, one thing that is often overlooked
> when trying to troubleshoot memory systems in modern Intel systems
> is the fact that the DIMM now talks directly to the CPU chip that
> has the memory controller built into it. THUS these "slot" related
> ECC/Parity/blowup errors can actually be the CPU and/or the CPU
> socket and/or the seating of the CPU in the socket.
>
> So if the error sticks with the DIMM slot and not the DIMM
> module, the next thing I would try would be a CPU chip reseat,
> including a good inspection of the socket for a damaged
> pin. Also look at the lands on the CPU chip itself, and you
> can even try swapping CPU chips to see if it follows the
> CPU or the socket, much as you do with a DIMM.
>
> > On 06/20/2022 7:59 pm, Larry Rosenman wrote:
> >
> > > SuperMicro X8DTN+
> > >
> > > 2 Processors, 6-core/12-Thread. CPU: Intel(R) Xeon(R) CPU
> > > E5645 @ 2.40GHz (2400.20-MHz K8-class CPU)
> > >
> > > I'll bring it down and swap DIMMS around
> > >
> > > On 06/20/2022 7:57 pm, Ultima wrote:
> > >
> > > Hey Larry,
> > >
> > > One red flag I am seeing is that the error is being produced on
> > > the same CPU/bank with each error you have provided so far.
> > >
> > > Can you try and follow my original recommendation and swap a
> > > currently installed DIMM with the one in the problem DIMM slot and see
> > > if anything changes?
> > >
> > > Can you also provide the motherboard model? Also, do you
> > > have multiple CPUs installed in this system?
> > >
> > > Best regards,
> > > Richard Gallamore
> > >
> > > On Mon, Jun 20, 2022 at 5:41 PM Larry Rosenman <ler@lerctr.org> wrote:
> > >
> > > Yes and Yes.
> > >
> > > On 06/20/2022 7:37 pm, Ultima wrote:
> > >
> > > Are you sure that the module you replaced it with was good?
> > > Are you sure you replaced the correct module?
> > >
> > > Best regards,
> > > Richard Gallamore
> > >
> > > On Mon, Jun 20, 2022 at 5:23 PM Larry Rosenman <ler@lerctr.org> wrote:
> > >
> > > I'm seeing them constantly:
> > >
> > > root@freenas[~]# mcelog --dmi
> > > Hardware event. This is not a software error.
> > > MCE 0
> > > CPU 22 BANK 8 TSC 20aab486464a
> > > MISC ac29890200046444 ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 44
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > WARNING: SMBIOS data is often unreliable. Take with a grain of salt!
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 1
> > > CPU 22 BANK 8 TSC 296dfcc82582
> > > MISC ac29890200041381 ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 81
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 2
> > > CPU 22 BANK 8 TSC 2a5604a6a070
> > > MISC ac29890200044281
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory ECC error occurred during scrub
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 81
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 88000040000200cf MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > Hardware event. This is not a software error.
> > > MCE 3
> > > CPU 22 BANK 8 TSC 31e141418eb8
> > > MISC ac29890200046a4a ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 4a
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 4
> > > CPU 22 BANK 8 TSC 3a014afee106
> > > MISC ac29890200046646 ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 46
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 5
> > > CPU 22 BANK 8 TSC 41d1dbef1a6a
> > > MISC ac29890200046141 ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 41
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 6
> > > CPU 22 BANK 8 TSC 4a1b1ecef446
> > > MISC ac29890200046a4a ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 4a
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 7
> > > CPU 22 BANK 8 TSC 527bc27db776
> > > MISC ac29890200040386 ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 86
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > Hardware event. This is not a software error.
> > > MCE 8
> > > CPU 22 BANK 8 TSC 5aa4ecdd795a
> > > MISC ac29890200046646 ADDR ee2f6e800
> > > TIME 1655770989 Mon Jun 20 19:23:09 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 46
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > > root@freenas[~]#
> > >
> > > and I replaced the DIMM yesterday :(
> > >
> > > On 06/20/2022 7:19 pm, Ultima wrote:
> > >
> > > Hey Larry,
> > >
> > > It is possible it's the motherboard itself, but it's rare. The way I
> > > would determine this is to swap the DIMM with one in another
> > > populated slot on the motherboard and see whether the error migrates
> > > to the new slot or not. Also, this error doesn't necessarily mean
> > > there is a problem that needs to be addressed. If you have been
> > > running the system for many months and you see ECC errors a
> > > handful of times, it can probably be safely ignored.
> > >
> > > Best regards,
> > > Richard Gallamore
> > >
> > > On Mon, Jun 20, 2022 at 3:14 PM Larry Rosenman <ler@lerctr.org> wrote:
> > >
> > > I've gotten a BUNCH of these on my TrueNAS server. I've replaced this
> > > DIMM a couple of times, and still the MCEs continue.
> > > Is it possible it's a motherboard slot issue?
> > >
> > > Hardware event. This is not a software error.
> > > MCE 8
> > > CPU 22 BANK 8 TSC 5aa4ecdd795a
> > > MISC ac29890200046646 ADDR ee2f6e800
> > > TIME 1655762472 Mon Jun 20 17:01:12 2022
> > > MCG status:
> > > Memory read ECC error
> > > Memory corrected error count (CORE_ERR_CNT): 1
> > > Memory transaction Tracker ID (RTId): 46
> > > Memory DIMM ID of error: 0
> > > Memory channel ID of error: 1
> > > Memory ECC syndrome: ac298902
> > > STATUS 8c0000400001009f MCGSTATUS 0
> > > MCGCAP 1c09 APICID 34 SOCKETID 0
> > > CPUID Vendor Intel Family 6 Model 44 Step 2
> > > DDR3 DIMM 800 Mhz Other Width 72 Data Width 64 Size 4 GB
> > > Device Locator: P2-DIMM2C
> > > Bank Locator: BANK14
> > > Manufacturer: Hyundai
> > > Serial Number: 40F3C20F
> > > Asset Tag:
> > > Part Number: HMT151R7BFR4C-H9
> > >
> > > --
> > > Larry Rosenman http://www.lerctr.org/~ler
> > > Phone: +1 214-642-9640 E-Mail: ler@lerctr.org
> > > US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106
>
> --
> Rod Grimes
> rgrimes@freebsd.org
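
The STATUS and MISC values in the quoted mcelog output can be cross-checked by hand, which helps confirm that every record points at the same channel and DIMM position. The following is a minimal Python sketch, not mcelog's own code. The STATUS bit positions follow the architectural IA32_MCi_STATUS layout. The MISC field positions (RTId, DIMM, channel, syndrome) are an assumption inferred from matching the values in this thread against the decoded lines next to them; they are model-specific and may not hold on other CPUs. The function names `decode_status` and `decode_misc` are made up for illustration.

```python
# Sketch: decode the raw STATUS and MISC values printed by mcelog in this thread.

# Memory-controller operation encoded in bits 6:4 of the MCA error code.
MEMORY_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into its architectural fields."""
    code = status & 0xFFFF  # MCA error code, bits 15:0
    fields = {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER
        "uncorrected": bool((status >> 61) & 1),  # UC (0 means corrected)
        "miscv": bool((status >> 59) & 1),        # MISC register is valid
        "addrv": bool((status >> 58) & 1),        # ADDR register is valid
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mca_code": code,
    }
    # Memory controller errors have the form 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        fields["memory_op"] = MEMORY_OPS.get((code >> 4) & 0x7, "reserved")
    return fields


def decode_misc(misc: int) -> dict:
    """Split MISC into fields. ASSUMPTION: layout inferred from this thread."""
    return {
        "rtid": misc & 0xFF,            # transaction tracker ID
        "dimm": (misc >> 16) & 0x3,     # DIMM ID within the channel
        "channel": (misc >> 18) & 0x3,  # memory channel ID
        "syndrome": misc >> 32,         # ECC syndrome
    }


if __name__ == "__main__":
    # Values copied from MCE 0 in the quoted output.
    print(decode_status(0x8C0000400001009F))
    print(decode_misc(0xAC29890200046444))
```

For the read errors the UC bit is clear and the corrected count is 1, so these are corrected ECC events. The one scrub record (STATUS 88000040000200cf) has ADDRV clear, which is consistent with that record printing no ADDR value.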