PowerMac G5 spurious sensor readings
Matthew Rezny
mrezny at hexaneinc.com
Fri Jan 18 04:38:42 UTC 2013
On Thu 13/01/17 21:59 , Matthew Rezny wrote:
>I have a G5 of the first model (PowerMac7,2) on which I've been using FreeBSD/ppc64 for over a year. Today, it suddenly rebooted. Not the first time by any means, but this is the first time I found the following log message:
>Jan 17 17:32:19 powermac kernel: WARNING: Current temperature (MLB MAX6690 AMB:127.8 C) exceeds critical temperature (80.0 C)! Shutting down!
>
>This is the first time I have seen such a message. After reboot, that sensor shows a temperature near 30C, which seems appropriate. The reading of 127.8C looks suspiciously like a max value. My only guess is there was a bad read that resulted in
>the sensor value going over the threshold. That raises a question in my mind as to whether there is any filtering or sanity checking of the data. Could a single bad read cause the threshold to be exceeded and trigger shutdown immediately, or would
>the excessive value have to be returned from that sensor multiple times for it to be believed and acted upon?
>
>$ uname -a
>FreeBSD powermac 9.1-RC1 FreeBSD 9.1-RC1 #0: Thu Aug 16 00:43:39 UTC 2012 root at anacreon.physics.wisc.edu:/usr/obj/usr/src/sys/GENERIC64 powerpc
>
>The build is a bit old, though I wouldn't expect too much change to the code in question since then. I will update to 9.1-RELEASE or -STABLE in the next few days, but as this is a problem that has happened once in over a year, I wouldn't call it
>resolved just by a quick failure to reproduce after updating.
>
>I was already planning to do an update after the box has completed its current task. I noticed a problem with excessive output causing the console to hang. A couple days ago I found the machine apparently hung in that the keyboard and mouse were
>not responsive, but I found it was still alive on the network and I could ssh in to reboot. The only clues were no buffer space for dmesg to output anything before reboot, and a rather full /var/log/messages file which had exhausted the drive.
>Under the same workload (and after freeing some drive space), the problem reoccurred in a matter of hours, but this time with me watching. While running ddrescue against a drive with some bad sectors, read errors flood the console in spurts. When
>some dozens of read errors are displayed at once, the console scrolls whole pages by in a fraction of a second, and then goes dead. Messages that should go to console are not shown on screen but are in the log. Attempts to switch virtual console or
>to reboot are not successful, but ssh access continues to work and the box is clearly still processing other workloads. The only sign of life from the console are the messages about flushing buffers just before completion of the reboot commanded
>via ssh.
>
Just a few hours later, it strikes again.
Jan 17 23:06:11 powermac kernel: WARNING: Current temperature (MLB MAX6690 AMB: 127.0 C) exceeds critical temperature (80.0 C)! Shutting down!
I took a peek in smu.c and powermac_thermal.c. In the former, smu_sensor_read() checks for an error returned from smu_run_cmd() but performs no checks on the returned data. In the latter, pmac_therm_manage_fans() invokes smu_sensor_read() and considers the returned value valid if it is greater than zero. No other sanity checks are performed.
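
One way to keep a single bad read from triggering an immediate shutdown would be to require several consecutive over-threshold readings before acting. The following is only a sketch of that idea, not code from smu.c or powermac_thermal.c: the struct, the function name crit_filter_update(), and the count of 3 are all hypothetical, and temperatures are plain integers in degrees C with -1 meaning a read error.

```c
#include <stdbool.h>

/* Assumed value, not from the source: readings needed before acting. */
#define CRIT_CONSECUTIVE 3

/* Hypothetical per-sensor state. */
struct crit_filter {
	int over_count;	/* consecutive valid readings above critical */
};

/*
 * Feed one reading into the filter. Returns true only once
 * CRIT_CONSECUTIVE valid readings in a row have exceeded crit.
 * A read error (temp < 0) is ignored and leaves the count unchanged.
 */
static bool
crit_filter_update(struct crit_filter *f, int temp, int crit)
{
	if (temp < 0)
		return (false);
	if (temp > crit)
		f->over_count++;
	else
		f->over_count = 0;
	return (f->over_count >= CRIT_CONSECUTIVE);
}
```

With a filter like this, one isolated reading of 127 C followed by a normal reading near 30 C would reset the count and never reach the shutdown path. The cost is a delay of a few polling intervals before a genuine overtemperature is acted upon.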
Looking at the datasheet[1] for the MAX6690, I see that 127C is the maximum readable temperature, which is represented as 01111111. The value 10000000 is documented as representing a diode fault. As there is no upper range check, the diode fault condition will be interpreted as slightly over 127C. I think it would be appropriate to treat as invalid any raw sensor value with the MSB set. Additionally, the check on line 105 of pmac_therm_manage_fans should really be "if (temp >= 0)" rather than just "if (temp > 0)", as 0 is a valid reading of zero degrees and all actual errors are represented as a value of -1.
I have not looked at the datasheets for other relevant sensors, but given that there are no range checks in any of the cases in smu_sensor_read(), I currently consider them all suspect pending review.
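
A minimal sketch of the two checks proposed above, assuming the raw high byte is available as an 8-bit value. The function names are hypothetical and do not exist in the FreeBSD tree; the sketch only illustrates rejecting any raw value with the MSB set and accepting zero as a valid temperature.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert the raw MAX6690 temperature byte to whole
 * degrees C. Per the proposal above, any value with the MSB set (which
 * includes the documented diode fault code 10000000) is treated as
 * invalid and reported as -1, the same value used for other errors.
 */
static int
max6690_raw_to_celsius(uint8_t raw)
{
	if (raw & 0x80)
		return (-1);
	return ((int)raw);	/* 00000000 .. 01111111 maps to 0 .. 127 C */
}

/*
 * Hypothetical validity test mirroring the suggested "temp >= 0" check:
 * zero degrees is a valid reading, only -1 signals an error.
 */
static int
reading_is_valid(int temp)
{
	return (temp >= 0);
}
```

With this conversion a diode fault becomes an ordinary read error instead of a temperature just above 127 C, so it would not be compared against the critical threshold at all.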
[1] http://datasheets.maximintegrated.com/en/ds/MAX6690.pdf (Page 11, Table 2)