RE: PowerMac G5 crashes with "instruction storage interrupt" on recent 13
Date: Sat, 10 Sep 2022 01:41:26 UTC
I have now tried to compare the dmesgs and sysctl of a good kernel (built at 9171b8068b92 with the workaround applied) and a recent bad kernel with the workaround applied as well.

The main differences comparing dmesg output, where the dash prefix is for the good kernel and the plus prefix is for the bad kernel:

-----
-bus_dmamem_alloc failed to align memory properly.
-firewire0: 2 nodes, maxhop <= 1 cable IRM irm(1) (me)
+firewire0: 2 nodes, maxhop <= 1 Not IRM capable irm(-1)
+pci1:5:4:0: VPD data does not start with ident (0x8)
+pci1:5:4:0: failed to read VPD data.
+pci1:5:4:0: no valid vpd ident found
+pci1:5:4:1: VPD data does not start with ident (0x8)
+pci1:5:4:1: failed to read VPD data.
+pci1:5:4:1: no valid vpd ident found
+WARNING: Current temperature (CPU A0 DIODE TEMP: 916.0 C) exceeds critical temperature (90.0 C); count=1
-----

Note here that the temperature measured seems obviously wrong once the fans spin up like crazy. And soon after this, count grows too high and the machine shuts down by itself.
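For anyone who wants to produce the same kind of dash/plus comparison, here is a minimal sketch. It assumes the two dmesg outputs have already been captured as text. The function name `changed_lines` is purely illustrative, the sample strings are just a few of the lines quoted above, and this is not necessarily how the listing above was generated.

```python
import difflib


def changed_lines(good: str, bad: str) -> list[str]:
    """Return only the differing lines, '-' for good-only and '+' for bad-only."""
    out = []
    for line in difflib.ndiff(good.splitlines(), bad.splitlines()):
        # ndiff prefixes every line with a two-character code:
        # "- " (only in first input), "+ " (only in second input),
        # "  " (common to both), "? " (intraline hint). Keep only the first two.
        if line.startswith(("- ", "+ ")):
            out.append(line[0] + line[2:])
    return out


# Sample input: a few of the dmesg lines quoted in this message.
GOOD_DMESG = """\
bus_dmamem_alloc failed to align memory properly.
firewire0: 2 nodes, maxhop <= 1 cable IRM irm(1) (me)
"""

BAD_DMESG = """\
firewire0: 2 nodes, maxhop <= 1 Not IRM capable irm(-1)
pci1:5:4:0: VPD data does not start with ident (0x8)
"""

if __name__ == "__main__":
    print("\n".join(changed_lines(GOOD_DMESG, BAD_DMESG)))
```

The same helper can be applied to saved sysctl output after filtering it down to the lines of interest.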
Looking at differences for all sysctls that mention “temp”:

-----
dev.ds1631.0.%pnpinfo: name=temp-monitor compat=ds1631
-dev.ds1631.0.sensor.mlb_inlet_amb.temp: 27.5C
+dev.ds1631.0.sensor.mlb_inlet_amb.temp: 29.6C
dev.ds1775.0.%pnpinfo: name=temp-monitor compat=ds1775
-dev.ds1775.0.sensor.drive_bay.temp: 26.5C
+dev.ds1775.0.sensor.drive_bay.temp: 29.5C
dev.max6690.0.%pnpinfo: name=temp-monitor compat=max6690
-dev.max6690.0.sensor.backside.temp: 36.1C
-dev.max6690.0.sensor.kodiak_diode.temp: 48.7C
+dev.max6690.0.sensor.backside.temp: 42.2C
+dev.max6690.0.sensor.kodiak_diode.temp: 55.2C
dev.max6690.1.%pnpinfo: name=temp-monitor compat=max6690
-dev.max6690.1.sensor.tunnel.temp: 31.2C
-dev.max6690.1.sensor.tunnel_heatsink.temp: 33.7C
+dev.max6690.1.sensor.tunnel.temp: 34.7C
+dev.max6690.1.sensor.tunnel_heatsink.temp: 39.0C
-dev.smusat.0.cpu_a0_diode_temp: 34.2C
-dev.smusat.0.cpu_a1_diode_temp: 35.0C
kstat.zfs.misc.arcstats.arc_tempreserve: 0
-----

The fact that dev.smusat.* is gone from the “bad” kernel seems suspicious, but smusat0 is detected properly in both kernels according to dmesg… Any thoughts? I can try to bisect this as well, but there are 1500+ changes to sort through so this will take a while.

Thanks!

From: Justin Hibbits <jhibbits@FreeBSD.org>
Sent: Friday, September 9, 2022 12:12
To: Julio Merino <julio@meroh.net>
Cc: freebsd-ppc@freebsd.org
Subject: Re: PowerMac G5 crashes with "instruction storage interrupt" on recent 13

That seems bizarre. There haven't been any changes to the controller thread (powermac_thermal.c) in more than 7 years. Are there any problems with sensors?

I tested the change I made back in 2015 on my dual core G5, with the intent that it would ramp the fans up sooner (non-linear), and back them down with hysteresis. So when there's load that raises the temperature significantly it will ramp the fans up as quickly as it can, hitting 100% fan long before it can reach maximum temperature.
- Justin

On Fri, 9 Sep 2022 19:01:06 +0000
Julio Merino <julio@meroh.net> wrote:

> Ah, thanks for the workaround. I applied it on top of 9171b8068b92
> and the kernel was able to boot successfully – and it seems stable so
> far.
>
> However, if I apply the hack on top of stable/13’s HEAD, there is
> still the issue of the fans going crazy at the slightest increase in
> CPU load but they do drop back down to quiet when the load subsides.
> (For example, a simple “git log” in /usr/src makes the fan spin up
> within a couple of seconds and they stop soon after that.) Any ideas
> on where this might come from?
>
>
> From: Justin Hibbits <jhibbits@FreeBSD.org>
> Sent: Friday, September 9, 2022 09:09
> To: Julio Merino <julio@meroh.net>
> Cc: freebsd-ppc@freebsd.org
> Subject: Re: PowerMac G5 crashes with "instruction storage interrupt"
> on recent 13
>
> Hi Julio,
>
> 971cb62e0b23 is the likely culprit. Alfredo has a patch at
> https://reviews.freebsd.org/D36234 that you can use until the problem
> is solved. The alternative is you could build everything into the
> kernel instead of using modules.
>
> The problem appears to be in either lld or the kernel linker.
>
> - Justin
>
> On Fri, 9 Sep 2022 16:00:33 +0000
> Julio Merino <julio@meroh.net> wrote:
>
> > Armed with a lot of patience, I was able to bisect where the crashes
> > are coming from. They seem to be due to these three consecutive and
> > related commits (because the first one broke the build and required
> > two extra fixes for powerpc’s GENERIC64 to build):
> >
> > 9171b8068b92 cpuset: Fix the KASAN and KMSAN builds
> > 01f281d0ee52 Fix the build after 47a57144
> > 971cb62e0b23 cpuset: Byte swap cpuset for compat32 on big endian
> > architectures
> >
> > Any idea on how to look into these crashes further?
> >
> > Thank you!
> >
> >
> > From: Julio Merino <julio@meroh.net>
> > Sent: Sunday, July 31, 2022 07:45
> > To: freebsd-ppc@freebsd.org
> > Subject: PowerMac G5 crashes with "instruction storage interrupt" on
> > recent 13
> >
> > Hi all,
> >
> > I have a PowerMac G5 that’s running an old build of FreeBSD 13
> > stable (from around October of last year) that I’m trying to
> > upgrade to recent stable/13.
> >
> > Booting into a new kernel brings two issues: the first is that the
> > fans spin up to jet engine levels right before transferring control
> > to userspace. An old patch I have locally to mitigate this (which I
> > got from whichever outstanding bug exists for this in the bug
> > tracker) doesn’t seem to work any longer.
> >
> > The second is that the kernel crashes (apparently) as soon as it
> > tries to mount a ZFS pool during early stages of the boot process,
> > but after successfully transferring control to userspace. Typing
> > this from a photo of the crash so omitting details that I think
> > aren’t going to be relevant here, like addresses, here is what I
> > get:
> >
> > ----
> > Setting hostid: …
> > ZFS filesystem version: 5
> > ZFS storage pool version: features support (500)
> >
> > Fatal kernel trap:
> >
> > Exception = 0x400 (instruction storage interrupt)
> > …
> > pid = 64, comm = zpool
> >
> > panic: instruction storage interrupt trap
> > cpuid = 1
> > time = …
> > KDB: stack backtrace:
> > #0 kdb_backtrace
> > #1 vpanic
> > #2 panic
> > #3 trap
> > #4 powerpc_interrupt
> > Uptime: 7s
> > ----
> >
> > Any thoughts about what I could look into? Any “recent” commits that
> > you think may be at fault?
> >
> > Thanks!