Interrupt routine usage not shown by top in 8.0
Barney Cordoba
barney_cordoba at yahoo.com
Fri Mar 13 10:52:21 PDT 2009
--- On Fri, 3/13/09, Robert Watson <rwatson at FreeBSD.org> wrote:
> From: Robert Watson <rwatson at FreeBSD.org>
> Subject: Re: Interrupt routine usage not shown by top in 8.0
> To: "Barney Cordoba" <barney_cordoba at yahoo.com>
> Cc: current at freebsd.org
> Date: Friday, March 13, 2009, 11:41 AM
> On Fri, 13 Mar 2009, Barney Cordoba wrote:
>
> > It's difficult to have "better benchmarks" when the system being
> > tested doesn't have accounting that works. My test is designed to
> > isolate the driver receive function in a controlled way. So it
> > doesn't much matter whether the data is real or not, as long as the
> > tests generate a consistent load.
>
> Strikes me that this thread is getting a bit contentious,
> and I don't mean in a locking sense :-).
>
> FreeBSD provides two interrupt execution environments: fast
> interrupts, and ithreads. Historically (5.x ... 7.x) device
> drivers have had to select one of the two models, but in 8.x
> a hybrid mode, called interrupt filters, allows drivers to
> do both for an interrupt source. The problem you're
> running into is that "fast" interrupts borrow the
> complete context of the thread they preempt, including its
> stack and accounting characteristics. For pure ithread
> drivers, this is generally not a problem, as the sole
> purpose of that interrupt handler is to kick the scheduler
> to launch the full ithread context, which will typically
> immediately preempt (at that same point in the stack) in
> order to give the interrupt handler a full context that can
> sleep on locks, be accounted for, etc.
>
> if_em lives in a world a bit between these models, in which
> it wants both a fast context, to do a bit of low level
> interaction with the device, and a "slow" or full
> context in which to execute the network stack, perform
> memory allocation, and so on. Because interrupt filters
> weren't yet around (and are presumably too experimental
> to use in 8.x right now), it does this by creating its own
> ithread-like execution context using a task queue. The
> result is mis-billing of what is effectively an ithread as
> system time instead of interrupt time. You'll notice
> that if_em (and other drivers employing the same trick) do
> elevate the priority of the task queue thread so that the
> scheduler treats it (almost) the same way it treats an
> ithread (it will immediately preempt most stuff).
>
> > The only thing obviously "bogus" is that FreeBSD is launching
> > 16,000 tasks per second (an interrupt plus taskqueue task 8000
> > times per second), plus 2000 timer interrupts and reporting 0% cpu
> > usage. So I'm to assume that the system will never show 100% usage
> > as the entire overhead of the scheduler is not accounted for?
>
> The overhead of the scheduler is billed to a combination of
> the thread being switched out of, and the thread being
> switched to. Fast interrupt execution is billed to the
> thread it preempts. In the scenario you describe, you will
> only get mis-billing to idle if those fast interrupts
> preempt only the idle thread. Otherwise they will get
> billed to whatever is preempted. On a system where you have
> a network interface effectively keeping the CPU busy, it
> will get billed to the task queue thread (I expect) because
> the task queue is what will get preempted. Now, this might
> not strictly be true because the scheduler tries hard to
> keep ithreads running close to where the interrupt is
> delivered, but if it doesn't know the task queue thread
> is an ithread, it may do this less well. Presumably this is
> a temporary state of mind while interrupt filters are being
> adopted, only the interrupt filter work seems stalled (?).
>
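The billing rule described above can be sketched as a toy model. This is only an illustration in Python, not kernel code: the function name, the thread names, and the time units are all made up for the example. It assumes each scheduling slice belongs to one thread and that any fast-handler time inside the slice is charged to that thread.

```python
from collections import defaultdict


def bill_cpu_time(schedule, fast_interrupts):
    """Compare reported billing with where the time really went.

    schedule: list of (thread_name, duration) slices in run order.
    fast_interrupts: dict mapping a slice index to the time spent in
    fast handlers during that slice (already part of its duration).
    Returns (billed, actual) as two dicts keyed by name.
    """
    billed = defaultdict(int)
    actual = defaultdict(int)
    for i, (thread, duration) in enumerate(schedule):
        intr = fast_interrupts.get(i, 0)
        # The fast handler borrows the preempted thread's context,
        # so the whole slice is charged to that thread.
        billed[thread] += duration
        # What really happened: part of the slice was handler work.
        actual[thread] += duration - intr
        actual["fast interrupt"] += intr
    return dict(billed), dict(actual)
```

With a schedule of [("idle", 90), ("em taskq", 10)] and 30 units of handler time in the first slice, the model bills 90 units to idle even though only 60 were really idle. That is the mis-billing to idle discussed above: it only appears when the handlers preempt the idle thread.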
> > Calling handle_rxtx was a timesaver to determine the overhead of
> > forcing 8000 context switches per second (16000 in a router setup)
> > for apparently no reason. Since the OS doesn't account for these,
> > there seems no way to make this determination. It's convenient to
> > say it works well or better than something else when there is no
> > way to actually find out via measurement. I don't see how launching
> > 8000 tasks per second could be faster than not launching 8000 tasks
> > per second, but I'm also not up on the newest math.
>
> The purpose of a context switch from a fast interrupt
> context is to give interrupt code the ability to acquire
> general kernel locks, as opposed to just spin locks. If you
> run in a borrowed context (i.e., you have synchronously
> preempted a thread to run a fast interrupt), you may (will)
> generate deadlocks due to violating lock orders, since you
> don't want to (can't) release the locks already
> acquired by the thread, and may then acquire locks in the
> wrong order. If you want to acquire full sleep locks, you
> need a full context, which requires a context switch out of
> the preempted thread and into an ithread (or task queue
> thread or whatever). Passage into the normal input paths of
> the network stack will encounter normal locks, so must be
> done from a full context.
>
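The lock-order argument above can be illustrated with a very small checker. This is a sketch under stated assumptions: the lock names "A" and "B", the global order, and the function name are hypothetical, and real kernel lock-order checking is far more involved. The point is only that a handler running in a borrowed context inherits whatever the preempted thread already holds.

```python
def check_acquire(held, wanted, lock_order):
    """Return True if taking `wanted` respects `lock_order`.

    held: locks already held in this context, including any held by
    the thread whose context a fast interrupt has borrowed.
    lock_order: list of lock names, earliest-acquired first.
    """
    rank = {name: i for i, name in enumerate(lock_order)}
    # Every lock already held must come earlier in the order.
    return all(rank[h] < rank[wanted] for h in held)
```

With the order ["A", "B"], a handler that wants "A" while the preempted thread holds "B" fails the check (check_acquire(["B"], "A", ["A", "B"]) is False). The same acquisition from a full context that holds nothing passes (check_acquire([], "A", ["A", "B"]) is True).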
> > Since you know how things work better than any of us regular
> > programmers, if you could please answer these questions it would
> > save a lot of time and may result in better drivers for us all:
> >
> > 1) MSIX interrupt routines readily do "work" and pass packets up
> > the IP stack, while you claim that MSI interrupts cannot? Please
> > explain the locking differences between MSI and MSIX, and what
> > locks may be encountered by an MSI interrupt routine with "real
> > traffic" that will not be a problem for MSIX or taskqueue launched
> > tasks. It's certainly not obvious from any code or docs that I've
> > seen.
>
> It's fine to enter the network stack from any full and
> dedicated thread context, which means it's OK from an
> ithread or a task queue thread kicked by a fast interrupt,
> but it's not OK from a fast interrupt. There's no
> difference between MSI/MSIX as far as I know from this
> perspective, only in how the drivers use them.
>
> > 2) the bge and fxp (and many other) drivers happily pass traffic
> > up the IP stack directly from their interrupt routines, so why is
> > it bogus for em to do so? And why do these drivers not use the
> > taskqueue approach that you claim is superior?
>
> Only a few drivers use the fast interrupt approach; those
> that don't presumably don't because the approach of
> mixing "fast" and "slow" contexts
> hasn't been applied by their authors. If the interrupt
> filter model is going to become mainstream, I think we'd
> like to see them adopt that rather than hand-crafting fast
> interrupts and taskqueues, to the same effect, in every
> driver. On the other hand, one benefit to the task queue
> model is that you can deliver events to it that *aren't*
> interrupts, such as requests for state transitions from the
> software side of the stack.
>
> > 2b) Does this also imply that systems with bge or network drivers
> > that do the "work" in the interrupt handler will yield completely
> > bogus cpu usage numbers?
>
> All direct interrupt deliveries bill a small amount of CPU
> time to the context that they execute in. Drivers that do
> less work in the fast interrupt delivery context will bill
> less time outside of their own worker ithreads.
> There are two ways to measure CPU use, btw: one is a
> sampled approach involving timers, which works badly in fast
> interrupt contexts with interrupts disabled because the
> ticks are deferred until after interrupts are re-enabled, and
> the other is explicit time measurement, which is quite
> expensive. Currently the kernel uses the TSC, where
> available, and an estimator to map between CPU cycles and
> real time, but that also has its limitations.
>
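The difference between sampled and explicit measurement can be shown with a toy model. This is an illustrative sketch only: the function names, the tick period, and the handler intervals are assumptions, and the spans are assumed to be sorted and non-overlapping. A tick that would land inside a handler span (interrupts disabled) is deferred to the end of that span, where it samples whatever runs next.

```python
def exact_handler_time(handler_spans):
    """Explicit measurement: sum the real length of each span."""
    return sum(end - start for start, end in handler_spans)


def sample_ticks(handler_spans, tick_period, total_time, defer):
    """Count which context each periodic tick gets charged to.

    handler_spans: sorted (start, end) intervals, half-open, during
    which a fast handler runs with interrupts disabled.
    defer: if True, a tick inside a span is delivered at its end.
    """
    counts = {"handler": 0, "other": 0}
    t = tick_period
    while t <= total_time:
        when = t
        if defer:
            for start, end in handler_spans:
                if start <= when < end:
                    when = end  # delivered once interrupts come back
        in_handler = any(s <= when < e for s, e in handler_spans)
        counts["handler" if in_handler else "other"] += 1
        t += tick_period
    return counts
```

With spans [(5, 15), (25, 35), (45, 55)], a tick period of 10, and a total time of 60, the handler really uses 30 of 60 time units. Sampling without deferral would charge it 3 of 6 ticks, but with deferral it is charged 0 of 6, so the handler's time disappears into whatever ran after it.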
> > 3) The em driver drops packets well before 100% cpu usage is
> > realized. Of course I'm relying on wrong cpu usage stats, so I may
> > be mistaken. Is there a way (or what is the preferred way) to
> > increase the priority of a task relative to other system processes
> > (rather than relative to tasks in the queue) so that packets can
> > avoid being dropped while the system runs other, non-essential
> > tasks?
>
> When the em driver creates task queue threads, it assigns
> them an ithread priority. You can manually adjust that
> priority in the code, but I'm not sure we have an
> explicit management API from userspace to adjust those
> priorities without source code changes (but I may be wrong).
>
> > 3b) Is there a way to lock down a task such as a NIC receive task
> > to give absolute priority or exclusive use of a cpu? The goal is to
> > make certain that the task doesn't yield before it completes some
> > minimum amount of work.
>
> You can use cpuset to force a specific thread onto a
> specific CPU, and to force other threads not to run on that
> CPU. You can also use cpuset, I believe, to direct the
> low-level interrupt delivery for sources to specific CPUs,
> but I've not done this myself.
>
> > It's my view that it would be better to just suck packets out of
> > the ring and queue them for upper layers, but I don't yet have a
> > handle on the trade offs. Currently the system drops too many
> > packets unnecessarily at extremely high load.
>
> This is, effectively, what fast interrupt handlers + task
> queues do. There's another potential dispatch point,
> between the link layer and the network layer, which is
> controlled by the net.isr.direct flag; right now we dispatch
> the whole stack to completion in the ithread, but you can
> turn that off by setting the flag to zero. In practice, the
> context switch avoidance associated with doing that appears
> to be a significant win for many, but not all, workloads.
>
> Robert N M Watson
> Computer Laboratory
> University of Cambridge
Thanks Robert. I think Scott misunderstood: I was just trying to create
a test that generated a substantial load that wasn't accounted for in
top output, not to claim that the code was wrong or bad. Your detailed
explanation probably saved me a week of dilly-dallying. I do find it
troubling that such a large load can go unaccounted for.

What's confusing about the em driver is that the MSIX and MSI paths use
different types of interrupt handlers in the same driver. I'd assumed
that MSIX was newer and therefore the preferred method. Is the MSIX
code using MP_SAFE instead of FAST_IRQ simply because no-one had a
need to update it?

Barney