Odd performance problems after upgrade from 4.11 to 6.0-Stable

Mon Jan 2 10:37:02 PST 2006

> Date: Wed, 14 Dec 2005 19:52:03 -0500
> From: Kris Kennaway <kris at obsecurity.org>
> 
> On Wed, Dec 14, 2005 at 04:45:47PM -0800, Kevin Oberman wrote:
> > > Date: Wed, 14 Dec 2005 19:34:04 -0500
> > > From: Kris Kennaway <kris at obsecurity.org>
> > >
> > > On Wed, Dec 14, 2005 at 04:26:18PM -0800, Kevin Oberman wrote:
> > >
> > > > I am attaching a dmesg. I do have a few drivers (uhci, pcm, psm,
> > > > atkbd0 and ichsmb) that are still marked as GIANT-LOCKED, but I'm not
> > > > using the USB very often. And I'm not using pcm or ichsmb during the
> > > > dump, either. I think everyone has the mouse and keyboard under GIANT,
> > > > but I can't really see those as a problem, either.
> > >
> > > A bunch of things are sharing interrupts with USB..disable it and see
> > > if that helps.  Also check vmstat -i to see if some device is
> > > storming.  If not, turn on MUTEX_PROFILING(9) in your kernel and run
> > > the dump (or something faster that also exhibits the problem), then
> > > look for what is contending with Giant.
> >
> > Yes, it may be time for MUTEX_PROFILING. I had already looked at
> > interrupts. My kernel is sans APIC so I didn't really think that
> > interrupts were a problem, and I see:
> > interrupt                          total       rate
> > irq0: clk                      207037779       1000
> > irq1: atkbd0                       50208          0
> > irq6: fdc0                             9          0
> > irq8: rtc                       26498038        128
> > irq10: pcm0 ichsmb0                    2          0
> > irq11: xl0 uhci0                18076067         87
> > irq12: psm0                       869500          4
> > irq13: npx0                            1          0
> > irq14: ata0                     10423468         50
> > irq15: ata1                          112          0
> > Total                          262955184       1270
> >
> > Clearly no storms and nothing looks obviously broken. USB and the
> > network card share an IRQ, but the USB is not connected to anything and
> > I would not think that it is generating many interrupts. The network
> > IS being used and I'm not seeing all that many interrupts on IRQ11.
> 
> Whenever there is an interrupt on irq11 from the NIC, *both* drivers
> will wake up to process it.  uhci0 will need to acquire Giant.  If
> something else is also trying to acquire Giant (bufdaemon), then they
> will serialize, degrading performance.  This may not be the cause
> since there are only a few interrupts, but MUTEX_PROFILING will tell
> you.

Well, with the holidays and such, this has taken a while, but here is an
update.

I have removed USB support. I hardly ever use it on this system, so that
was an obvious step. No improvement at all.

# vmstat -i
interrupt                          total       rate
irq0: clk                      319818027       1000
irq1: atkbd0                       15443          0
irq6: fdc0                            11          0
irq8: rtc                       40932392        128
irq10: pcm0 ichsmb0               125545          0
irq11: xl0                       3616426         11
irq12: psm0                       281380          0
irq13: npx0                            1          0
irq14: ata0                      8756176         27
irq15: ata1                          144          0
Total                          373545545       1168

Only one shared interrupt remains (IRQ 10), and both devices on it
should have been totally quiescent during my test run.
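For anyone who wants to run the same check on their own box, here is a
minimal Python sketch that parses the "vmstat -i" output above and lists
shared IRQ lines and anything that looks like a storm. The function names
and the 10000/s storm threshold are my own assumptions for illustration,
not anything defined by FreeBSD.

```python
import re

# The vmstat -i output quoted in this message, pasted verbatim.
VMSTAT_OUTPUT = """\
interrupt                          total       rate
irq0: clk                      319818027       1000
irq1: atkbd0                       15443          0
irq6: fdc0                            11          0
irq8: rtc                       40932392        128
irq10: pcm0 ichsmb0               125545          0
irq11: xl0                       3616426         11
irq12: psm0                       281380          0
irq13: npx0                            1          0
irq14: ata0                      8756176         27
irq15: ata1                          144          0
Total                          373545545       1168
"""

# One row looks like: "irqN: dev [dev ...]   <total>   <rate>"
ROW = re.compile(r"^(irq\d+):\s+(.*?)\s+(\d+)\s+(\d+)\s*$")


def parse_vmstat_i(text):
    """Return one dict per irq line; the header and Total lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        irq, devices, total, rate = m.groups()
        rows.append({
            "irq": irq,
            "devices": devices.split(),
            "total": int(total),
            "rate": int(rate),
        })
    return rows


def summarize(rows, storm_rate=10000):
    """List shared IRQs and IRQs at or above storm_rate (an assumed cutoff)."""
    shared = [r["irq"] for r in rows if len(r["devices"]) > 1]
    storms = [r["irq"] for r in rows if r["rate"] >= storm_rate]
    return shared, storms


if __name__ == "__main__":
    shared, storms = summarize(parse_vmstat_i(VMSTAT_OUTPUT))
    print("shared IRQs:", shared)
    print("possible storms:", storms)
```

On the table above this reports irq10 as the only shared line and no
storms, which matches what I see by eye.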

The test was building a glimpse index of my inbox. CPU usage sat at about
20%, yet interactive response was terrible: it took about two minutes
just to log in, and starting Gnome took roughly forever (about 10
minutes).

I collected mutex stats for just about 3 minutes and found nothing
surprising, but I may not know what to look for. Nothing shows a total
time of over 3.1 seconds. The total time for all of them is 28
seconds. The sum of all Giant lock times was only 4.65 seconds and the
largest of these was in kern_sysctl.c, so I expect it was the profiling
that ate 3.1 of those 4.65 seconds.
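To put those numbers in proportion, here is a rough Python sketch that
sums the "total" column per mutex name and expresses a hold time as a
share of the sampling window. It assumes the debug.mutex.prof.stats
layout is "max total count avg cnt_hold cnt_lock file:line (name)" with
times in microseconds, and it treats "about 3 minutes" as 180 seconds.
The two sample lines are made up purely to exercise the parser; they are
not from my profile.

```python
import re
from collections import defaultdict

# Hypothetical sample lines in the ASSUMED stats layout; not real data.
SAMPLE = """\
     max        total       count   avg  cnt_hold  cnt_lock name
      12         3000         100    30         0         0 /usr/src/sys/kern/kern_sysctl.c:100 (Giant)
       5         1000          50    20         0         0 /usr/src/sys/kern/vfs_bio.c:200 (Giant)
"""

# Six numeric columns, then "file:line", then the mutex name in parentheses.
LINE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+\((.+)\)\s*$")


def total_by_mutex(stats_text):
    """Sum the 'total' column (assumed microseconds) per mutex name."""
    totals = defaultdict(int)
    for line in stats_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skips the header and anything unparseable
        totals[m.group(8)] += int(m.group(2))
    return dict(totals)


def share_of_wall_clock(total_us, wall_seconds):
    """Fraction of the sampling window covered by total_us of hold time."""
    return total_us / (wall_seconds * 1_000_000)


if __name__ == "__main__":
    print(total_by_mutex(SAMPLE))
    # Figures from this message: 4.65 s for Giant, 28 s for all mutexes,
    # over an assumed 180 s window.
    print(f"Giant: {share_of_wall_clock(4_650_000, 180):.1%}")
    print(f"all mutexes: {share_of_wall_clock(28_000_000, 180):.1%}")
```

With the figures above, Giant works out to about 2.6% of the window and
all mutexes together to about 15.6%, which is why I do not see lock
contention in this profile as the obvious culprit.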

I am attaching a spreadsheet with the profile data in case anyone wants
to look at it. (Probably the mail system will strip it, so let me know
if I should post it.)

Still totally baffled and still feeling the pain.
-- 
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: oberman at es.net			Phone: +1 510 486-8634