insanely-high interrupt rates

Wed Apr 3 08:17:41 UTC 2019

>>>>> "Ian" == Ian Lepore <ian at freebsd.org> writes:

 Ian> I now have an rpi2b, and I see the same thing you do in the
 Ian> 'vmstat -i' output, including rates like 500K int/sec on cpu3
 Ian> timer0. But while it was behaving like that, the output from
 Ian> 'vmstat 1' showed a perfectly steady 8800-9200 int/sec, which is
 Ian> certainly more consistent with top showing under 2% being used for
 Ian> combined system+interrupt.

I'm seeing this too right now. The problem is that whatever this problem
is, it's not being accounted for in the %cpu statistics, probably
because it takes one CPU completely out of play.

So here's what I see. First, top:

last pid:  6110;  load averages:  1.00,  1.00,  1.00        up 1+19:26:47  08:35:02
107 threads:   6 running, 81 sleeping, 20 waiting
CPU:  0.3% user,  0.0% nice,  0.7% system,  0.2% interrupt, 98.8% idle
Mem: 10M Active, 611M Inact, 32M Laundry, 168M Wired, 98M Buf, 115M Free
Swap: 4096M Total, 4096M Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME     CPU COMMAND
   10 root        155 ki31      0    32K CPU2     2  43.3H  99.35% idle{idle: cpu2}
   10 root        155 ki31      0    32K RUN      1  43.3H  98.71% idle{idle: cpu1}
   10 root        155 ki31      0    32K CPU0     0  43.1H  98.49% idle{idle: cpu0}
 1186 andrew       20    0    14M  8728K select   1   0:01   0.90% xterm

Notice the load average of 1+, and that while it's reporting 98.8% idle,
only three of the four cpu idle processes show up. The fourth idle
process shows no accumulated CPU time, as shown here by ps axlH:

   0   10    0   0 155  0     0    32 -        RNL   -  2589:43.57 [idle/idle: cpu0]
   0   10    0   0 155  0     0    32 -        RNL   -  2597:14.64 [idle/idle: cpu1]
   0   10    0   0 155  0     0    32 -        RNL   -  2597:19.11 [idle/idle: cpu2]
   0   10    0   0 155  0     0    32 -        RNL   -     0:00.00 [idle/idle: cpu3]

The CPU time spent handling these interrupts is not being accounted for
anywhere.

Also, to confirm that there is something odd going on, the CPU is
running hotter than it would normally do while idle: ~42C rather than
the ~38C that I would normally see with this level of activity.

 Ian> I think the problem here is with vmstat -i, not with actual
 Ian> interrupts. Something about the stats reporting is wrong on
 Ian> armv6/v7.

There are two obvious issues with vmstat -i: it double- or triple-counts
some interrupts, and it reports a false rate once the counters wrap
(which happens quite quickly when this issue is in play). But these
don't account for the problem described here.

Here's lines from vmstat -i taken 10 seconds apart:

local_intc0,3: +                                 3085878928      19685

local_intc0,3: +                                 3094431425      19738

That's 8552497 interrupts in 10 seconds, which matches systat -vm's
report of ~850k interrupts/sec. Since these are 32-bit counters they
will wrap in an hour and a half at that rate, so the ~19700/sec rate
reported is entirely spurious (since it's just counter/uptime).

Looking at the generic_timer driver, there's an obvious problem: it
installs one interrupt handler on all the irqs and supposedly enables
only one of the timers, but the interrupt handler code simply assumes
that the timer causing the interrupt is the one that it enabled; if it
gets an interrupt from one of the others, it'll never clear it or mask
it, it'll just return with the interrupt unhandled, hence the interrupt
storm. On 32-bit arm it is only supposed to enable the timer that's
attached to local_intc0,1 and this indeed shows an interrupt rate of
~1100/sec which seems about right for the clock interrupt; the timer on
local_intc0,3 (the virtual one) is never supposed to be enabled so far
as I can tell.

-- 
Andrew.