PowerMac G5 spurious sensor readings

Justin Hibbits chmeeedalf at gmail.com
Sun Oct 12 23:51:27 UTC 2014


On Fri, 22 Feb 2013 22:09:18 +0100
Matthew Rezny <mrezny at hexaneinc.com> wrote:

> On Fri, 18 Jan 2013 05:38:30 +0100
> Matthew Rezny <mrezny at hexaneinc.com> wrote:
> 
> > On Thu 13/01/17 21:59 , Matthew Rezny  wrote::
> > 
> > >I have a G5 of the first model (PowerMac7,2) on which I've been
> > >using FreeBSD/ppc64 for over a year. Today, it suddenly rebooted.
> > >Not the first time by any means, but this is the first time I found
> > >the following log message: Jan 17 17:32:19 powermac kernel:
> > >WARNING: Current temperature (MLB MAX6690 AMB:127.8 C) exceeds
> > >critical temperature (80.0 C)! Shutting down!
> > >
> > >This is the first time I have seen such a message. After reboot,
> > >that sensor shows a temperature near 30C, which seems appropriate.
> > >The reading of 127.8C looks suspiciously like a max value. My only
> > >guess is there was a bad read that resulted in the sensor value
> > >going over the threshold. That raises a question in my mind as to
> > >whether there is any filtering or sanity checking of the data.
> > >Could a single bad read cause the threshold to be exceeded and
> > >trigger shutdown immediately, or would the excessive value have to
> > >be returned from that sensor multiple times for it to be believed
> > >an acted upon?
> > >
> > >$ uname -a
> > >FreeBSD powermac 9.1-RC1 FreeBSD 9.1-RC1 #0: Thu Aug 16 00:43:39
> > >UTC 2012
> > >root at anacreon.physics.wisc.edu:/usr/obj/usr/src/sys/GENERIC64
> > >powerpc
> > >
> > >The build is a bit old, though I wouldn't expect too much change to
> > >the code in question since then. I will update to 9.1-RELEASE or
> > >-STABLE in the next few days, but as this is a problem that has
> > >happened once in over a year, I wouldn't call it resolved just by a
> > >quick failure to reproduce after updating.
> > >
> > >I was already planning to do an update after the box has completed
> > >it's current task. I noticed a problem with excessive output
> > >causing the console to hang. A couple days ago I found the machine
> > >apparently hung in that the keyboard and mouse were not responsive,
> > >but I found it was still alive on the network and I could ssh in to
> > >reboot. The only clues were no buffer space for dmesg to output
> > >anything before reboot, and a rather full /var/log/messages file
> > >which had exhausted the drive. Under the same workload (and after
> > >freeing some drive space), the problem reoccurred in a matter of
> > >hours, but this time with me watching. While running ddrescue
> > >against a drive with some bad sectors, read errors flood the
> > >console in spurts. When some dozens of read errors are displayed
> > >at once, the console scrolls whole pages by in a fraction of a
> > >second, and then goes dead. Messages that should go to console are
> > >not shown on screen but are in the log. Attempts to switch virtual
> > >console or to reboot are not successful, but ssh access continues
> > >to work and the box is clearly still processing other workloads.
> > >The only sign of life from the console are the messages about
> > >flushing buffers just before completion of the reboot commanded
> > >via ssh.
> > >
> > 
> > Just a few hours later, it strikes again.
> > Jan 17 23:06:11 powermac kernel: WARNING: Current temperature (MLB
> > MAX6690 AMB: 127.0 C) exceeds critical temperature (80.0 C)!
> > Shutting down!
> > 
> > I took a peek in smu.c and powermac_thermal.c. In the former,
> > smu_sensor_read() has a check for an error returned from
> > smu_run_cmd() but no checks on the returned data. In the later,
> > pmac_therm_manage_fans() invokes smu_sensor_read() and considers the
> > returned value as valid if greater than zero. No other sanity checks
> > are performed.
> > 
> > Looking at the datasheet[1] for max6690, I see that 127C is the
> > maximum readable temperature, which is represented as 01111111. The
> > value 10000000 is documented as representing a diode fault. As there
> > is no upper range check, the diode fault condition will be
> > interpreted as slightly over 127C. I think it would be appropriate
> > to treat as invalid any raw sensor value with the MSB set.
> > Additionally, the check on line 105 of pmac_therm_manage_fans
> > should really be "if (temp >= 0)" rather than just "if (temp > 0)"
> > as a value of 0 is a valid value for zero degrees and all actual
> > errors are represented as a value of -1.
> > 
> > I have not looked at the datasheets for other relevant sensors, but
> > being that there are no range checks in any of the cases in
> > smu_sensor_read(), I currently consider them all suspect pending
> > review.
> > 
> > [1] http://datasheets.maximintegrated.com/en/ds/MAX6690.pdf (Page
> > 11, Table 2)
> > 
> 
> It has been some time since I first looked at this, but something in
> the back of my mind said my first glance at it was flawed, so I
> recently revisited the matter.
> 
> I realize I was looking in the wrong place previously. After reading
> through powermac_thermal.c I had done a search for sensor_read() and
> found a match in smu.c so I started from there. However, there is no
> SMU on this early model G5. Searching more than just the same
> directory, I found sensor_read() occurs in many places. The pertinent
> function is really in sys/dev/iicbus/max6690.c as
> max6690_sensor_read().
> 
> Looking at the correct function, I see multiple problems.
> max6690_read() is called twice, both times storing the return value
> in err. The value of err is checked only once, after both calls, so
> if the first call returns an error but the second succeeds, the error
> indicator is overwritten and calculations will be done with bad data.
> The check for err < 0 should be done after each call to
> max6690_read(). After that basic check, which would only indicate an
> error if there was a problem with the I2C bus transaction,
> computations are done without any further data validation. As stated
> before, the sensor may signal an error condition by returning the
> value 10000000. There is no check for that reserved value, so when
> the sensor attempts to indicate an error, we end up with a
> temperature of 127C being returned from max6690_sensor_read(). That
> value causes havoc for obvious reasons.
> 
> This should be easy to clean up but I don't have time at the moment to
> do the testing of the changes on the one machine to which this is
> applicable. If nobody else deals with it, I'll get around to doing so
> and post a patch when that time comes.

Found this in my email archives.  I committed the error
checking for the integer part as r273016.

Thanks!

- Justin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 473 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-ppc/attachments/20141012/4264b83b/attachment.sig>


More information about the freebsd-ppc mailing list