Why is my CPU so hot?
...or is it? For the last month or so, my PC (Windows 7, home-built ASUS P8B WS mobo, Xeon E3-1240 v2 Ivy Bridge, AMD FirePro V5900) has been abruptly shutting down (acting like a reset: everything immediately powers off, including fans, power LED, etc., then it reboots on its own), usually when doing graphically intensive things. I believe the CPU is hitting its thermal max & shutting down, but the evidence is mixed: if I take a look at the CPU temp in the BIOS monitor on restart immediately after one of these events, it's cool (40ish C), and if I continue into Windows and look at CPU temps in SpeedFan, I begin to get puzzled: Core 0 through Core 3 temps vary around 30C at idle, but the "CPU" temp is up in the 80-90C range. What could be causing the elevated temps at idle, and why the discrepancy between the temps described above?

Machine had been running fine for a year (although I never had cause to look at CPU temp).

I took it into a repair shop & they diagnosed the high temp shutdown, reapplied thermal paste, ran under 100% load for a day with CPU temps in the 60C range (not sure what they used to monitor temps). I got the machine home and got an immediate shutdown first time I tried one of the things which had been causing problems for me.

Things the shop checked at min, probably others:

Hard drive OK
memory OK
graphic card OK
Cooling fans all running
Power supply OK

BIOS and graphics drivers are up to date.

The thing guaranteed to cause shutdown: scrolling through the slide strip thingy in Adobe Lightroom. Other incidents in AutoCAD, Sketchup with large models and Dynascape (landscape design).

What the heck is going on? Things run OKish at low loads, but I need to be able to push it hard. I'm worried about frying the CPU, too.
Response by poster: Also: no sign of CPU frequency throttling, which is enabled.
posted by skyscraper at 5:53 PM on November 2, 2013

Best answer: When I was having a problem like this with my PC shutting down under load, it was actually my power supply going out/not having enough juice to run everything.
posted by OnTheLastCastle at 5:56 PM on November 2, 2013

Response by poster: Thanks, that was the repair shop's first thought. The power supply has been checked, and there's still the question of why there's that extremely elevated CPU temperature.
posted by skyscraper at 6:00 PM on November 2, 2013

Your home wall circuits (or your UPS or lightning arrestor/power strip) may be running at a lower voltage than expected, because of the high demand of your PC. If your wall voltage falls to perhaps 105 VAC from the nominal 120 VAC expected, your switching power supply has no choice but to spend more of each AC power sine wave switched on, to try to regulate current. And if it can't, because your house voltage as measured at the PC power supply input is too low, or your power supply is sized very close to the maximum demand for current as the sum of your mainboard and GPU, you're going to force it into overcurrent shutdowns pretty frequently. Recovery after power loss is usually a mainboard BIOS setting, so that also explains the auto reboot after such incidents.
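To put rough numbers on that, here is a minimal sketch of how much more input current a supply has to pull when the line sags. The 85% efficiency figure and the 300 W load are assumptions for illustration, not measurements from this machine, and the sketch ignores power factor.

```python
def input_current_amps(dc_load_watts, line_volts, efficiency=0.85):
    """AC input current needed to deliver dc_load_watts at a given line voltage.

    efficiency is an assumed conversion efficiency (0-1); power factor is ignored.
    """
    if line_volts <= 0:
        raise ValueError("line_volts must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return dc_load_watts / (efficiency * line_volts)


# Same assumed 300 W DC load at nominal and at sagging line voltage.
for volts in (120.0, 105.0):
    amps = input_current_amps(300.0, volts)
    print(f"{volts:.0f} VAC -> {amps:.2f} A")
```

The ratio between the two currents is simply 120/105, so the supply works about 14% harder on the input side for the same load.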

Your choices are probably a bigger capacity power supply, or finding and fixing what is dropping your house AC voltage to below nominal under load, if your house circuit voltage is, in fact, dropping measurably under load. Sometimes, it's just something as simple as finding that your refrigerator and some other appliances are all on the same 120 VAC "leg" of your 240 VAC mains service as the outlet supplying your PC, and that, as a result, all that combined load is pulling down one "side" of the mains box to maybe 105 VAC, while the other is forced up to 135 VAC by the house power transformer. Sometimes people notice this by observing that light bulbs flicker and burn out a lot faster in some locations than others.

And sometimes, it's just a borderline PC power supply, and a test suite that doesn't stress the GPU as much as real world use does, that leads you astray. A heftier power supply may solve everything.
posted by paulsc at 6:00 PM on November 2, 2013 [1 favorite]

"... Also: no sign of CPU frequency throttling, which is enabled."
posted by skyscraper at 8:53 PM on November 2

It may be enabled in BIOS, but are you very sure your actual CPU can throttle? Not every stepping can. Or, maybe your CPU is using other power management strategies than clock change to manage power and heat, first. You're not getting to full temp indications across all cores, so I wonder why you think this is primarily a CPU heating issue?
posted by paulsc at 6:06 PM on November 2, 2013

Response by poster: CPU temp because repair shop diagnosis + that 80-90C reading from SpeedFan. I'm not sophisticated enough to know where that reading comes from as opposed to the Core CPU readings.

I've never noticed voltage readings below about 118V at our house - just confirmed 120 at the outlet.

Current power supply 500W, seems as if that should be enough.

I am probably going to haul the box back into the repair shop, just trying to get possible leads for them as they sounded not quite sure where they would look next when I called & told them I was still having problems.
posted by skyscraper at 6:18 PM on November 2, 2013

Best answer: What sort of power supply do you have?

Often these are overlooked. My new machine had a cheap one and randomly rebooted. I paid a whopping $50 for a decent one and have had no issues since.
posted by Mario Speedwagon at 6:25 PM on November 2, 2013

Response by poster: Hmph. Searching for "speedfan CPU" turns up multiple examples of elevated readings when nothing is wrong. "Why didn't you check that before?" I hear you say? Because I'm not very bright, that's why.

I'll verify with the shop that they checked out the power supply as a possible problem - That was their guess as to the problem when I dropped it off, I hope they checked it out.

Power supply is an Antec 500W that people on Newegg seemed to think was good.
posted by skyscraper at 6:33 PM on November 2, 2013

Best answer: "Current power supply 500W, seems as if that should be enough."

Eh, aggregate wattage available from a power supply is not the only measure. It has to be able to supply enough DC amperage on each rail, to service the demand it is driving. So, a power supply with a good +5 VDC logic section, and minimal +12 and -12 VDC rails, might not work well with your particular setup.

As it stands, your processor wants about 70 watts, your video card about 75 watts, your main board probably wants another 70 watts or so for audio, ECC memory, USB, especially USB 3.0, and if you've actually got any USB 3.0 peripherals plugged in, you jump the demands on your supply by all those, too, plus another 20 or so watts for each hard drive and any optical drives you've got. So, minimum of 300+ watts for your system, on a nominal 500 watt supply.
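The tally above can be sketched as a small calculation. The wattage figures are the rough estimates from this comment; the drive count of 5 is purely hypothetical, and the idea that the CPU and video card draw mostly from the +12 V rail is an assumption for illustration.

```python
# Rough per-component estimates taken from the comment above (watts).
ESTIMATED_DRAW_WATTS = {
    "cpu": 70,
    "video_card": 75,
    "mainboard_and_peripherals": 70,
}
WATTS_PER_DRIVE = 20  # hard drives and optical drives alike


def estimated_total_watts(drive_count):
    """Sum the fixed estimates plus a per-drive allowance."""
    return sum(ESTIMATED_DRAW_WATTS.values()) + WATTS_PER_DRIVE * drive_count


def rail_amps(watts, rail_volts=12.0):
    """Current a rail must supply to deliver the given wattage."""
    return watts / rail_volts


total = estimated_total_watts(drive_count=5)  # hypothetical drive count
# Assumption: CPU and video card are fed mainly from the +12 V rail.
twelve_volt_load = ESTIMATED_DRAW_WATTS["cpu"] + ESTIMATED_DRAW_WATTS["video_card"]

print(f"Estimated total: {total} W of a nominal 500 W")
print(f"+12 V rail needs roughly {rail_amps(twelve_volt_load):.1f} A for CPU + GPU")
```

The second number is the one to compare against the per-rail amperage printed on the supply's label, since the aggregate wattage alone does not tell you whether a given rail can keep up.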

"Power supply is an Antec 500W that people on Newegg seemed to think was good."
posted by skyscraper at 9:33 PM on November 2

No, not a "bad" power supply, but not a server class supply, either. If you're running a Xeon processor and a workstation motherboard, why skimp on the power supply?
posted by paulsc at 6:39 PM on November 2, 2013

Best answer: Power supply, power supply, power supply. Failing that, perhaps video card. Failing that, probably something like a motherboard. It's extremely unlikely that your machine is shutting down due to a too-hot CPU in the situations you describe: it'll limit its clock rate long before you reach its failsafe.
posted by introp at 6:54 PM on November 2, 2013

Best answer: I had an endless series of issues like this with a high-powered system i used to own(i7-920 overclocked to 4ghz, GTX 580 over clocked with an enormous cooling system, a huge pile of hard drives, a lot of sticks of ram which DO draw more current based on the number and type of sticks, and operating above stock speeds, bla bla bla).

Xeons are well known for drawing lots and lots of current, out of proportion to their ratings. Even just an i7 2600 can draw north of 200 watts *on its own* under load without being overclocked. Many high powered GPUs, and yes i know that fire pro is a single slot card but bear with me, can draw between 200 and 300 watts.

At the time i had a very very nice high current pc power and cooling($$$$) power supply in the 610 watt range. It supplied a lot more current than many 750 or even 1000 watt power supplies.

I replaced it with a top of the line 750 watt corsair that barely put out more current, but was technically rated higher. Problem solved.

Before you do anything else though, download furmark and prime95 and run both at the same time. On even a properly configured system with headroom this will make it groan and shoot up to temps right near the limits of what either the cooling system or components can handle. That's normal. On a system with issues you'll get an instant black screen and reboot or hard lock. That i7 system did this during a test like that(note the stretched out EEERROOOR, awesome).

So now you've proved it's an issue that occurs under load, cool! now try the cpu test. still fine? try the gpu test. If it only happens when both are loaded, power supply. If it only happens when the cpu is loaded, power supply. If it only happens when the GPU is loaded, i bet you're getting a weird temperature spike. But still power supply, because seriously, running a system like that on a 500w power supply is sketchy.
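That triage can be written down as a tiny decision function. This is only a sketch of the rule of thumb in this comment, not a diagnostic tool; the parameter names are made up for illustration, and the "inconclusive" branch for a run with no crashes is an added assumption.

```python
def likely_culprit(crash_cpu_only, crash_gpu_only, crash_combined):
    """Map stress-test outcomes to the suspect suggested in this comment.

    crash_cpu_only: shutdown while running only the CPU test (e.g. prime95)
    crash_gpu_only: shutdown while running only the GPU test (e.g. furmark)
    crash_combined: shutdown while running both at once
    """
    if crash_cpu_only:
        return "power supply"
    if crash_gpu_only:
        return "GPU temperature spike, with the power supply still suspect"
    if crash_combined:
        return "power supply"
    # Not covered by the comment: nothing crashed, so the tests prove nothing.
    return "inconclusive"


# Example: stable under each test alone, but shuts down with both running.
print(likely_culprit(crash_cpu_only=False, crash_gpu_only=False, crash_combined=True))
```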

For reference even the single CPU single GPU mac pros have something like an 850-900w power supply. Apple doesn't ridiculously over spec this stuff, some of the iMacs(including mine, which has the hotter GPU and such) only have 200w power supplies. You just need a bigger, quality unit here.

If replacing the power supply doesn't do it then either RMA the fire pro or if it's not still under warranty take it apart and re-apply the thermal paste with something GOOD like arctic silver 5, arctic alumina, IC diamond, etc. I've had two high end GPUs come straight from the factory that started randomly shutting down like this because they were getting inconsistent contact with the heatsink once they reached operating temp and things expanded and shifted around some tiny amount like 1mm or less.

If you want a recommendation on which PSU to buy i can't say enough nice things about this, of which i own a slightly older revision from 2010-ish. I've used it in like... 3 systems now and it's the most bulletproof thing in the universe.

You're not using more power by having a higher wattage PSU, that's a myth, especially on an 80+ certified unit. You're just giving yourself the headroom to not have voltage sags(or weird spikes of inability to deliver current, etc) when there's a sudden spike of load.

As a closing note, you won't damage anything. The system shutting down the way it is is the failsafe to prevent damage. I'll also note that most modern video cards also have a non heat based "not enough current" shutdown deep in their vBIOS/firmware, which is what's shut down systems of mine in the past.
posted by emptythought at 7:09 PM on November 2, 2013 [5 favorites]

Response by poster: emptythought, I downloaded furmark and prime95, fired them up and ... no excitement, things just ran for 15 mins +, although at elevated temps - around 65C for both GPU and CPU cores. Bizarrely, the "CPU" reading from speedfan, which I had been worrying about, went *down* to around 45C, so this is obviously a bogus reading. No shutdown. And now I can't repeat the shutdown I've been able to provoke previously with Lightroom.

I guess I had picked up the folk wisdom that you didn't want to oversize a power supply - if the problem reappears, I'll go for a higher wattage PSU.

Thanks everyone, I'm going to mark this one as solved despite the lack of a smoking gun. Everyone that mentioned power supply (wait! that's everyone!) deserves a best answer.
posted by skyscraper at 8:55 PM on November 2, 2013

emptythought has most of this covered. What you've described does sound very much power-related.

I have often seen Speedfan offer one or two completely bogus readings; I don't think it's particularly choosy about making sure the sensors it thinks it's reading actually exist.
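One way to sanity-check a suspect sensor is to compare it against the per-core readings, along the lines of what the poster noticed. The sketch below is an assumption-laden illustration: the 30 C gap threshold is arbitrary, the function names are invented, and the readings are typed in by hand from the rough numbers in this thread rather than read from SpeedFan.

```python
from statistics import median


def core_baseline(readings_c):
    """Median of the per-core temperatures (names starting with 'Core')."""
    return median(t for name, t in readings_c.items() if name.startswith("Core"))


def suspicious_sensors(idle_c, load_c, max_gap_c=30.0):
    """Flag non-core sensors that sit far from the cores at idle,
    or that cool down while the cores heat up under load."""
    idle_base = core_baseline(idle_c)
    load_base = core_baseline(load_c)
    flagged = []
    for name, idle_temp in idle_c.items():
        if name.startswith("Core"):
            continue
        far_from_cores = abs(idle_temp - idle_base) > max_gap_c
        cools_under_load = load_base > idle_base and load_c.get(name, idle_temp) < idle_temp
        if far_from_cores or cools_under_load:
            flagged.append(name)
    return flagged


# Approximate readings reported in this thread (degrees C).
idle = {"Core 0": 30, "Core 1": 30, "Core 2": 30, "Core 3": 30, "CPU": 85}
load = {"Core 0": 65, "Core 1": 65, "Core 2": 65, "Core 3": 65, "CPU": 45}
print(suspicious_sensors(idle, load))
```

A sensor that reads hot at idle and cooler under full load is behaving in a way no real CPU temperature would, which is a decent hint that it is not measuring what its label says.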

Antec power supplies are generally quite reasonable, though I have seen one fail once (a small electrolytic capacitor inside the PSU went off like a little firecracker - probably an isolated component failure, not an actual design fault). Corsair power supplies are generally very nice indeed and I have yet to see one fail.

But before you start down the power supply swapping road, you might want to try running your box via a half decent line-interactive UPS, install its monitoring software, and see how often it ends up running in Boost mode. If your machine couldn't be faulted at the shop but regularly fails for you at home, the problem might indeed be saggy or unusually dirty home mains power.

I wouldn't necessarily rule out graphics driver issues either, given that the thing you can do to bring on the crash is graphics-related but seems unlikely to stress the CPU much. Sometimes the latest is by no means the greatest, and you can get rid of persistent crashes by stepping back a few versions.
posted by flabdablet at 9:53 PM on November 2, 2013

Oh yes: if the problem is indeed mains-related, you will probably find it's more likely to occur at certain times of day. Keep a log.
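A log like that is easy to summarize by hour of day. A minimal sketch, assuming you jot down each shutdown as an ISO-style timestamp; the entries below are hypothetical placeholders, not real events from this machine.

```python
from collections import Counter
from datetime import datetime


def shutdowns_by_hour(timestamps):
    """Count logged shutdowns per hour of day (0-23)."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


# Hypothetical log entries, one per unexpected shutdown.
log = [
    "2013-11-04T18:05:00",
    "2013-11-05T18:40:00",
    "2013-11-06T09:15:00",
    "2013-11-07T18:20:00",
]

counts = shutdowns_by_hour(log)
for hour, n in counts.most_common():
    print(f"{hour:02d}:00 - {n} shutdown(s)")
```

If most of the entries pile up in the same hour or two, that points toward something external like mains load rather than the machine itself.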
posted by flabdablet at 9:55 PM on November 2, 2013

Yes, speedfan lies. I don't think there are any modern Intel processors that don't have some kind of thermal throttling. If the processor isn't throttling, it probably doesn't have a heat problem.

Step 1: Look at the board for swollen capacitors. This is just a thing that happens. If there are any, the board is bad. You might be able to get someone to repair it.

Step 2: Run memtest86+ on your system overnight. If you get any errors, you have bad ram somewhere. Troubleshoot further by running the test on each individual stick of ram until you find the bad one(s).

Step 3: Swap out the video card if you can.

Step 4: Replace the power supply. You can try to diagnose the problem by monitoring voltages, but it is very likely that whatever problem that might be occurring will happen more quickly than your monitoring software can capture.
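If you do try monitoring voltages before swapping the supply, the usual yardstick is the ATX tolerance of plus or minus 5% on the +3.3 V, +5 V and +12 V rails. The sketch below only checks readings you already have; the function names and sample values are hypothetical, and as noted above, a software sampler may well miss a fast dip entirely.

```python
TOLERANCE = 0.05  # +/-5% on the positive ATX rails


def rail_in_spec(nominal_v, measured_v, tolerance=TOLERANCE):
    """True if a measured rail voltage is within tolerance of its nominal value."""
    return abs(measured_v - nominal_v) <= tolerance * nominal_v


def out_of_spec_samples(nominal_v, samples_v):
    """Return only the samples that fall outside the tolerance band."""
    return [v for v in samples_v if not rail_in_spec(nominal_v, v)]


# Hypothetical +12 V samples captured while the machine is under load.
samples_12v = [12.1, 11.9, 11.6, 11.2, 11.8]
print(out_of_spec_samples(12.0, samples_12v))
```

For the +12 V rail the band works out to 11.4 V through 12.6 V, so a clean-looking log does not clear the supply, but a reading outside that band is a strong reason to replace it.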
posted by gjc at 4:20 AM on November 3, 2013

A stupidly simple check: When I ran into this problem earlier in the year the problem turned out to be a toast crumb lodged in the fan and preventing it turning. Check not only the temperature that Speedfan is recording but also fan speed. A fan which is simply not turning could give you the same problems of runaway temperatures which are only noticeable under load.
posted by rongorongo at 4:44 AM on November 3, 2013

On the isolation/line interactive UPS front, you'll get a much better result from buying a properly rated(ie: 3 amp or so) actual isolation transformer like a powervar.

Most ups units that cost less than $100 or maybe even more are built like garbage. There's a pile the size of a small fridge of dead ones at my office from just the past 5 years or so.

Go to your local PC recycler and ask if they have any. Nearly ALL touch screen or IBM cash registers and POS terminals in stores use them, so they often have piles.

An isolation transformer will literally last forever. There's no battery to wear out or anything, and since isolation and smoothing is its primary function those parts are VERY robust. Oh yeah, and cheap. The most I've seen a used one cost is like $30.

Really check if you're having random sags and brownouts though. God my office has fucking awful ones. You'd need a line interactive UPS for that. Can't recommend the best way to do that, but if you notice your lights regularly going dim call the local power company and demand they send out someone with a line analyzer to leave running for a few days. They will.
posted by emptythought at 3:54 PM on November 3, 2013

"Most ups units that cost less than $100 or maybe even more are built like garbage."

Yeah, anything rated under 1500VA and not online or line-interactive is pretty much guaranteed to be a waste of time.

The Powervar units that emptythought recommends are online types, and will certainly do the best job of cleaning up your PC's mains supply if the issue is caused by noise spikes. If it's just brownouts (voltage sag), a line-interactive type like the CyberPower unit I linked earlier should deal with it quite effectively at lower cost.

Most of the really cheap UPS units that emptythought is rightly dismissive of are offline designs. These have nothing to recommend them; the only use case they make any sense for at all is mains power that's always clean and well-regulated except when it completely blacks out, which is pretty clearly not what you have.
posted by flabdablet at 5:21 PM on November 3, 2013

On swollen capacitors: unlikely to be an issue with the P8B WS mobo which uses solid types to avoid precisely that failure mode.
posted by flabdablet at 5:57 PM on November 3, 2013

