Overheating PC problems - is my sensor lying?
A while back one of our PCs decided to commit suicide. During normal use it suddenly powered down and refused to do anything when it powered back up. No drives spun up and there were no BIOS reports, despite the graphics card reporting correctly. It was tracked down to a blown motherboard, which was replaced.

Ever since then the PC has been prone to power down because it's getting too hot - but I think it's lying.

Two virtually identical systems sit side by side (Felix and Cutty). Cutty starts at 27C, climbs to about 40C in five minutes and sits there happily. Felix starts at 52C, climbs to about 64C in the same time, then sits there. The current BIOS shutdown temp is 65C, which means that with a little graphical load Felix powers down.

Obviously, I'm thinking that the sensor on Felix is damaged, but how can I check? Is it safe to assume that both machines are going to be running at approximately the same temperature? Most importantly, if I disable the sensor shutdown sequence, what risks am I taking with this machine? Am I merely risking releasing the chip spirit, or is there a chance I'll destroy the mboard and connected cards?

"virtually identical systems" meaning Athlon 64 3000 vs 3400 processors, but otherwise identical mboards, fans, cases, etc...
Most motherboard temp sensors are notoriously fickle. Small differences in the contact surface and location can cause wide fluctuations in the accuracy of reported temps.

For this reason, I usually use them to verify that the heatsink is working properly and that temps are below 80-90c.

I don't bother with overheating shutdown for two reasons. First, my desktop PC is only on when I'm at it, so if something bad happens I can address it, plus no fan just quits - they make noise before they fail. Second, processors are pretty cheap, especially for a machine that's a few years old.

In your case, I would suspect the power supply first, then the motherboard.

Hope this helps.
posted by Pogo_Fuzzybutt at 5:56 AM on June 18, 2007

Modern CPUs have thermistor or diode based on-die thermal sensors that are supposed to be used by motherboard firmware to shut down the machine before CPU damage occurs. Most motherboards also have additional thermal sensors in other areas of the board, which can be independently monitored and used to control secondary case fans and other cooling devices. It's possible yours has become defective, but you haven't proven it, so you have to assume that the sensor is good until you can prove that it isn't. The reason for that stance is that silicon semiconductors in thermal runaway do tend to draw lots of current, which quickly helps them complete the process of frying themselves. In those final seconds, they can certainly damage other components, too.

You'd need an appropriate external thermometer that could be attached to the CPU die to physically check the temperature of the CPU without involving a suspect on-die thermal sensor. This is a bit of an engineering rig, as you generally have to remove the CPU heat sink to install the thermocouple that is typically used, and then re-install the heat sink without creating new thermal problems. So most people choose to use their BIOS figures, compared to other case measurements, using tools like SpeedFan, which can access and report temps from other sensors on the motherboard as well. If you saw high CPU core temps but low temps from other motherboard sensors in a 15-minute period after start up, you might diagnose a poor thermal interface between your CPU and its heatsink, and fix that problem.

It's more a question of before/after comparisons than a matter of absolutes, with this latter approach.
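
A rough sketch of that before/after comparison, in Python. Everything here is an assumption for illustration: the function names, the example readings and the idea of averaging the other board sensors are made up, and the numbers would be typed in by hand from whatever your monitoring tool shows. It only tracks how far the CPU reading sits above the rest of the board, and whether that gap changed after you touched the heatsink.

```python
def cpu_to_board_gap(cpu_c, board_sensors_c):
    """CPU reading minus the mean of the other motherboard sensors (deg C)."""
    return cpu_c - sum(board_sensors_c) / len(board_sensors_c)


def gap_change(before, after):
    """Change in the CPU-to-board gap between two sessions (after - before).

    Each session is a (cpu_c, [board_sensor_c, ...]) tuple.
    """
    return cpu_to_board_gap(*after) - cpu_to_board_gap(*before)


# Hypothetical readings, noted about 15 minutes after start up.
before_reseat = (64.0, [34.0, 36.0])
after_reseat = (48.0, [33.0, 35.0])

print("gap before:", cpu_to_board_gap(*before_reseat))
print("gap after: ", cpu_to_board_gap(*after_reseat))
print("change:    ", gap_change(before_reseat, after_reseat))
```

A gap that shrinks a lot after reseating points at the thermal interface. A gap that never moves, whatever cooler is fitted, is at least consistent with the reading itself being off.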
posted by paulsc at 5:57 AM on June 18, 2007

Sorry - I should have made it clearer in the post - it's the CPU temp that is reading hot.

The chip has been reseated several times with several fans (including an Arctic64 and a copper Zalman flower type fan). No difference.

The thought of thermal runaway doesn't fill me with joy...
posted by twine42 at 6:16 AM on June 18, 2007

Other than a faulty reading, it could be a poorly seated heatsink, or it could be that the heat spreader was bonded improperly.

The strongest evidence you have that there might be a faulty sensor is the high starting temperature. You should definitely double check that reading by checking immediately after ~5 hours of being unplugged. To actually heat from ambient to ~50C in the time it takes to enter the BIOS is quite a feat.
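
To put numbers on that, here is a small Python sketch using only the four readings quoted in the question. The variable and function names are made up for illustration. It compares how much each machine warms up with how far apart the two readings are at each point.

```python
# Readings quoted in the question, in deg C: at start and after ~5 minutes.
readings = {
    "Cutty": {"start": 27, "settled": 40},
    "Felix": {"start": 52, "settled": 64},
}


def rise(machine):
    """How much the reported temperature climbs from start to settled."""
    r = readings[machine]
    return r["settled"] - r["start"]


# How much hotter Felix reads than Cutty at each point.
offset_at_start = readings["Felix"]["start"] - readings["Cutty"]["start"]
offset_settled = readings["Felix"]["settled"] - readings["Cutty"]["settled"]

print("Cutty rise:", rise("Cutty"))          # 13
print("Felix rise:", rise("Felix"))          # 12
print("offset at start:", offset_at_start)   # 25
print("offset settled: ", offset_settled)    # 24
```

Both machines warm up by about the same amount, and Felix reads roughly 25C higher throughout. That is what a shifted reading would look like, but it is suggestive rather than proof, which is why the cold-start check is worth doing.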

Swap the CPUs and see if the problem moves.
posted by Chuckles at 7:17 AM on June 18, 2007

