Why so many interrupts?
September 23, 2006 9:01 PM

I'm seeing a large number of interrupts on one processor of an AMD Opteron SMP system. Is this normal?

It seems excessive to have 2000-plus interrupts per second on Proc 0 of an AMD SMP system. The system is built with 2x Opteron 270 (Italy core) dual-core chips, for a total of four cores.

Here's the output of mpstat -P ALL:

09:49:54 PM  CPU   %user   %nice %system %iowait    %irq   %soft   %idle    intr/s
09:49:54 PM  all   85.58    0.00   12.53    0.08    0.07    1.17    0.57   4025.49
09:49:54 PM    0   87.05    0.00    8.37    0.04    0.24    3.71    0.59   2969.66
09:49:54 PM    1   82.22    0.00   16.79    0.03    0.01    0.31    0.64     10.81
09:49:54 PM    2   86.44    0.00   12.55    0.08    0.01    0.33    0.59    520.02
09:49:54 PM    3   86.61    0.00   12.41    0.16    0.01    0.33    0.47    525.01

The far right column is interrupts per second. I'm spending far too many proc cycles on interrupt handling for my taste.
The server is new, it's been running heavily loaded all day, and I'm getting a number of segfaults on the standard distro RPM build of Apache. I'm wondering if I might've gotten a bum proc, because I've never seen Apache segfault continuously before. Ideas?
posted by SpecialK to Computers & Internet (12 answers total)
What kind of memory stats are you getting? Swap processes? Do you have processes statically bound to processors, or are you letting the O/S allocate resources dynamically?

Not nearly enough data to analyse, reasonably. But, it's far more likely you have a bad socket or M/B, than a bad processor.
posted by paulsc at 9:25 PM on September 23, 2006

Some of the gigabit Ethernet devices generate a huge number of interrupts per second. I know at least some of them are moving to poll-mode drivers in Linux because of this... they just make the kernel work too hard otherwise.

Make sure you are running the very most recent supported Linux kernel you can find. The whole 2.6 series is flaky; they shovel in new features faster than they can really debug them, and the testing seems to be 'release it and see who screams'. I've had endless struggles with weird bugs in that kernel series.

Stick with a distro kernel, and try to get help in their forums or on their mailing lists.
posted by Malor at 9:59 PM on September 23, 2006

Response by poster: The kernel is RHEL 4.4 / Centos 4.4's standard 2.6.9. The processor is dynamically allocating. There's 4 gb of ram dedicated to each chip; Apache is taking up 3gb of that, and

I did notice that Apache is throwing a whole lot of segfaults... just found that joy in my logs a few minutes ago. Probably bad PHP modules.

"httpd[6423]: segfault at 0000007fbf3ffff8 rip 0000002a9a5d1c82 rsp 0000007fbf3fffa0 error 6" ... that probably accounts for the interrupts, now how can I get more diagnostic info out of it?
posted by SpecialK at 10:59 PM on September 23, 2006
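
One way to get a little more out of the kernel line quoted above is to decode it. The sketch below is illustrative only: the function name `decode_segfault` and the regular expression are made up for this example, and it assumes the message format of this kernel generation (fault address, rip, rsp, then the x86 page-fault error code printed in hex). The bit meanings are the standard x86 ones: bit 0 set means a protection violation on a mapped page (clear means the page was not mapped), bit 1 set means a write, bit 2 set means the access came from user mode.

```python
import re

# Assumed format of the kernel message quoted in this thread (x86_64, 2.6 era).
SEGFAULT_RE = re.compile(
    r"(?P<comm>[\w.-]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Return a dict describing one kernel segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    rsp = int(m.group("rsp"), 16)
    err = int(m.group("err"), 16)  # assumption: the error code is printed in hex
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": addr,
        "rsp": rsp,
        "addr_minus_rsp": addr - rsp,    # small value: the fault is right next to the stack pointer
        "page_present": bool(err & 1),   # False: the page was not mapped at all
        "write": bool(err & 2),          # True: the faulting access was a write
        "user_mode": bool(err & 4),      # True: the fault happened in user space
    }
```

Read this way, "error 6" is a user-mode write to an unmapped page, and the faulting address sits 0x58 bytes above rsp. That would be consistent with a process running off the end of its stack (runaway recursion in a module, for instance), but that is only a guess from the log line, not a diagnosis.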

Response by poster: Oops. "And ... " ... the rest is iocache.
posted by SpecialK at 11:04 PM on September 23, 2006

I don't know that tool, but is it counting the timer interrupts that the scheduler uses to preempt running processes?
posted by Netzapper at 11:42 PM on September 23, 2006
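
To see which sources are behind a per-CPU interrupt count, timer included, Linux exposes running totals in /proc/interrupts. The following is a rough sketch, not a polished tool: `parse_interrupts` and `busiest_sources` are hypothetical helper names, and the parser assumes the usual layout of a header row of CPU names followed by one row per interrupt source.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: (per_cpu_counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header row looks like: CPU0 CPU1 CPU2 ...
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:
            if not field.isdigit():
                break  # rows such as ERR may carry fewer counters than there are CPUs
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (counts, description)
    return table


def busiest_sources(table, cpu, top=3):
    """Return the labels with the highest counts on one CPU, highest first."""
    candidates = [lbl for lbl, (counts, _) in table.items() if cpu < len(counts)]
    candidates.sort(key=lambda lbl: table[lbl][0][cpu], reverse=True)
    return candidates[:top]


# Example use on a live Linux box:
#   with open("/proc/interrupts") as f:
#       table = parse_interrupts(f.read())
#   print(busiest_sources(table, cpu=0))
```

The counts are totals since boot, so taking two readings a few seconds apart and subtracting gives rates comparable to the intr/s column. Kernels of this era were commonly built with a 1000 Hz timer tick, so a sizeable part of CPU 0's figure may simply be the timer.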

Best answer: May be a kernel issue, as that is a pretty old kernel. Also, if you are dynamically allocating across multiple CPUs, CPU 0 is effectively "managing" seg faults on the other processors, no? So your mpstat report is "normal" for dynamically assigned child processes failing on multiple processors, I think. About 3/4 of the machine's interrupt servicing is being done by CPU 0, while CPU 1 is hardly ever interrupting, and has the highest system load [so it's doing some useful system work, I guess]. CPUs 2 and 3 are interrupting evenly, so I'm guessing you've got fairly even scheduling dispatch, and more than enough available time on the system, if it weren't faulting all the time.

What's the Apache error log say?
posted by paulsc at 12:09 AM on September 24, 2006
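
The "3/4 on CPU 0" figure can be checked against the intr/s column from the mpstat output at the top of the thread. A quick sketch of the arithmetic; the helper name `interrupt_shares` is made up for illustration, and the numbers are copied from the original post:

```python
# intr/s per CPU, copied from the mpstat output in the original post.
intr_per_sec = {"0": 2969.66, "1": 10.81, "2": 520.02, "3": 525.01}
total_reported = 4025.49  # the "all" row


def interrupt_shares(per_cpu):
    """Return each CPU's fraction of the summed interrupt rate."""
    total = sum(per_cpu.values())
    return {cpu: rate / total for cpu, rate in per_cpu.items()}


shares = interrupt_shares(intr_per_sec)
for cpu in sorted(shares):
    print(f"CPU {cpu}: {shares[cpu]:.1%} of interrupts")
```

The per-CPU rates add up to the "all" row within rounding, and CPU 0's share comes out a little under three quarters.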

Response by poster: The Apache error log is reporting Segmentation fault (11)'s. /var/log/messages is reporting the Error 6's.

"[Sat Sep 23 23:25:56 2006] [notice] child pid 6943 exit signal Segmentation fault (11)"

"Sep 23 23:25:55 sports kernel: httpd[6069]: segfault at 0000007fbf3ffd60 rip 0000002a9a5d1c7d rsp 0000007fbf3ffd00 error 6"

They're only happening when the server's under heavy load. After midnight last night, things finally slowed down and the load average dropped below 1 process per processor, and the segfaults stopped.
posted by SpecialK at 9:46 AM on September 24, 2006

Ah, well, if you're sure it is load related, I'd look to see if you are I/O bound, and you're having some kind of virtual memory paging problem. But, frankly, I think that's a long shot.

It could be that the reduction in load also means that there are no calls to Apache modules or external programs that are causing your segmentation fault. Perhaps you could fully load the server, serving only static pages, and never see the seg faults. That would point strongly towards corrupted installations of Apache modules, PHP, or other common server extensions or file/type handlers.

If you're working in 64 bit address space, I'd also wonder whether your O/S and application memory management is correctly set up for that. It's possible to munge installs on 64 bit systems, in weird ways.

I'm concerned about your upthread comment that Apache is taking 3 GB of memory space in your 4 GB memory total. I suppose that's possible, if you've got all kinds of modules loaded and tons of child processes enabled, but it's way beyond anything in my experience, and I think it would put your server in a memory squeeze immediately if anything were called that was dynamically loaded. What do you think is causing Apache's footprint to be so large? A lot of fairly complicated Apache installations with plenty of child processes run in 100MB, or less (usually, less), of memory...

Unfortunately, without a full understanding of your configuration and some process tracing, the memory hex addresses you're posting here don't mean much to me. Maybe somebody else coming along will see them as tea leaves, but in the meantime, all I can offer is some general advice to comment out various external modules in your Apache config file [httpd.conf] temporarily, and see if you can discover what is bloating up your memory footprint, and/or contributing most directly to your seg faults. Maybe go over your PHP installation (if you have PHP), and whatever other externals you're running, for proper configuration, and version compatibility.
posted by paulsc at 11:03 AM on September 24, 2006

Response by poster: Actually, 3 gb out of 8. And yes, we're running TONS of child processes in an effort to serve up the traffic. I've unloaded many apache modules; each child should take up about 4kb of static memory space plus whatever dynamic information it's loaded from files or the database. When I measured the 3gb, we had a load average over 100 and 434 apache processes alive.

Yeah, we're running 64 bit... CentOS x86_64.

What's the best way to trace I/O problems? I'm very much an amateur at this layer of sysadmin-ing.
posted by SpecialK at 11:39 AM on September 24, 2006
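
For scale, the numbers in the comment above work out as follows. This is back-of-the-envelope only: it assumes "3gb" means 3 GiB and that the whole figure is split evenly across the 434 processes, which overstates per-process use if the measurement counted shared pages more than once. The function name is hypothetical.

```python
def per_process_mib(total_gib, processes):
    """Average memory per process in MiB, assuming an even split."""
    return total_gib * 1024 / processes


avg_mib = per_process_mib(3, 434)   # figures from the comment above
runnable_per_core = 100 / 4         # load average of 100 spread over four cores

print(f"about {avg_mib:.1f} MiB per Apache process")
print(f"about {runnable_per_core:.0f} runnable processes per core")
```

That is roughly 7 MiB per child, far above a 4 KB static estimate, so most of the footprint would be coming from whatever each child loads at run time.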

"... What's the best way to trace I/O problems? ..."
posted by SpecialK at 2:39 PM EST on September 24

You could start with sysstat.
posted by paulsc at 2:05 PM on September 24, 2006
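
The sysstat tools work from kernel counters such as the ones in /proc/stat, so the same kind of %iowait figure can be worked out by hand from two samples. A minimal sketch, assuming the 2.6-era field order on the "cpu" line (user, nice, system, idle, iowait, irq, softirq); the helper names `read_cpu_line` and `percent_breakdown` are made up here:

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def read_cpu_line(stat_text, cpu="cpu"):
    """Extract one CPU's tick counters from /proc/stat-style text."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == cpu:
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError(f"no {cpu!r} line found")


def percent_breakdown(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    deltas = {k: after.get(k, 0) - before.get(k, 0) for k in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}


# Example use on a live Linux box (take two samples a few seconds apart):
#   import time
#   with open("/proc/stat") as f:
#       first = read_cpu_line(f.read())
#   time.sleep(5)
#   with open("/proc/stat") as f:
#       second = read_cpu_line(f.read())
#   print(percent_breakdown(first, second)["iowait"])
```

Passing "cpu0", "cpu1" and so on instead of "cpu" gives the per-processor view. A persistently high iowait share under load would point at the disks; a low one, as in the mpstat output posted earlier, points elsewhere.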

Also, is your database running on the same box? Are you getting any database problems, such as long queries, SQL errors, etc., that could be hanging up kernel threads? If you're blocking threads, you could be forcing kernel paging unduly at high load, and setting up scenarios where seg faults in applications could be more likely, I suppose, but it doesn't look from what you've posted that that is what is happening.

I don't deal with dual processor Opteron systems or CentOS much, but from your stats posted here, I'm seeing that CPU 0 is running 3.71 %soft, and generating the majority of the interrupts, meaning it is doing most of the process scheduling for the system. CPU 3 is your worst %iowait, but it's still only 0.16, so your disk system is keeping up with the load, for what you're doing, if I believe mpstat. CPU 1 seems to be doing something unique, in that it isn't interrupting frequently, maybe because it is bound to some process that is heavier in kernel space. If your mpstat command didn't specify any count or interval parameters, an average of 4000 or so interrupts per second since you last booted the machine, for a heavily loaded Web server, also locally hosting a database, may not be bad... If you specified mpstat interval and count parameters for a conventional 2 second report window, this may not be so good.

Seg faults are the more worrisome issues, I'd guess, if your applications aren't running smoothly.
posted by paulsc at 2:57 PM on September 24, 2006

Somewhat belatedly, I'll add one more comment.

Segfault 11 is very often a sign of bad memory. Can you take the machine offline for memory testing for a while?
posted by Malor at 9:48 PM on September 24, 2006
