CPU affinity with ULE scheduler

Mon Nov 17 03:36:41 PST 2008

On Mon, Nov 17, 2008 at 7:11 PM, Archimedes Gaviola
<archimedes.gaviola at gmail.com> wrote:
> On Fri, Nov 14, 2008 at 12:28 AM, John Baldwin <jhb at freebsd.org> wrote:
>> On Thursday 13 November 2008 06:55:01 am Archimedes Gaviola wrote:
>>> On Wed, Nov 12, 2008 at 1:16 AM, John Baldwin <jhb at freebsd.org> wrote:
>>> > On Monday 10 November 2008 11:32:55 pm Archimedes Gaviola wrote:
>>> >> On Tue, Nov 11, 2008 at 6:33 AM, John Baldwin <jhb at freebsd.org> wrote:
>>> >> > On Monday 10 November 2008 03:33:23 am Archimedes Gaviola wrote:
>>> >> >> To Whom It May Concern:
>>> >> >>
>>> >> >> Can someone explain how the ULE scheduler (the latest is version 2, if
>>> >> >> I'm not mistaken) deals with CPU affinity? Are there any existing
>>> >> >> benchmarks of this on FreeBSD? I am currently using the 4BSD
>>> >> >> scheduler, and what I have observed, especially when processing a
>>> >> >> high network traffic load on multiple CPU cores, is that only one CPU
>>> >> >> is stressed with network interrupts while the rest are mostly idle.
>>> >> >> This is an AMD-64 (4x) dual-core IBM system with GigE Broadcom
>>> >> >> network interface cards (bce0 and bce1). Below is a snapshot of the
>>> >> >> case.
>>> >> >
>>> >> > Interrupts are routed to a single CPU.  Since bce0 and bce1 are both on the
>>> >> > same interrupt (irq 23), the CPU that interrupt is routed to is going to end
>>> >> > up handling all the interrupts for bce0 and bce1.  This is not something ULE or
>>> >> > 4BSD have any control over.
>>> >> >
>>> >> > --
>>> >> > John Baldwin
>>> >> >
>>> >>
>>> >> Hi John,
>>> >>
>>> >> I'm sorry for the wrong snapshot. Here's the right one with my concern.
>>> >>
>>> >>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
>>> >>    17 root        1 171   52     0K    16K CPU0   0  54:28 95.17% idle: cpu0
>>> >>    15 root        1 171   52     0K    16K CPU2   2  55:55 93.65% idle: cpu2
>>> >>    14 root        1 171   52     0K    16K CPU3   3  58:53 93.55% idle: cpu3
>>> >>    13 root        1 171   52     0K    16K RUN    4  59:14 82.47% idle: cpu4
>>> >>    12 root        1 171   52     0K    16K RUN    5  55:42 82.23% idle: cpu5
>>> >>    16 root        1 171   52     0K    16K CPU1   1  58:13 77.78% idle: cpu1
>>> >>    11 root        1 171   52     0K    16K CPU6   6  54:08 76.17% idle: cpu6
>>> >>    36 root        1 -68 -187     0K    16K WAIT   7   8:50 65.53% irq23: bce0 bce1
>>> >>    10 root        1 171   52     0K    16K CPU7   7  48:19 29.79% idle: cpu7
>>> >>    43 root        1 171   52     0K    16K pgzero 2   0:35  1.51% pagezero
>>> >>  1372 root       10  20    0 16716K  5764K kserel 6  58:42  0.00% kmd
>>> >>  4488 root        1  96    0 30676K  4236K select 2   1:51  0.00% sshd
>>> >>    18 root        1 -32 -151     0K    16K WAIT   0   1:14  0.00% swi4: clock s
>>> >>    20 root        1 -44 -163     0K    16K WAIT   0   0:30  0.00% swi1: net
>>> >>   218 root        1  96    0  3852K  1376K select 0   0:23  0.00% syslogd
>>> >>  2171 root        1  96    0 30676K  4224K select 6   0:19  0.00% sshd
>>> >>
>>> >> I was doing network performance testing on this system with
>>> >> FreeBSD-6.2 RELEASE and its default 4BSD scheduler, using a tool to
>>> >> generate a large amount of traffic, around 600Mbps-700Mbps,
>>> >> traversing the FreeBSD system in both directions, meaning both network
>>> >> interfaces are receiving traffic. What happened was that the CPU (cpu7)
>>> >> that handles irq 23 for both interfaces ran at around 65.53%
>>> >> utilization, which affects other running applications and services
>>> >> like sshd and httpd. They are no longer accessible while the system is
>>> >> bombarded with traffic. With only one CPU being stressed, I was
>>> >> thinking of moving to FreeBSD-7.0 RELEASE with the ULE scheduler,
>>> >> because I thought my problem had to do with how the scheduler
>>> >> distributes load across multiple CPU cores, especially for network
>>> >> processing. So, if it is more a matter of interrupt handling than of
>>> >> the scheduler, is there a way we can optimize it? If the interrupt is
>>> >> still routed to only one CPU, then for me it's still inefficient. What
>>> >> decides which CPU an interrupt is bound to, and can it be used to
>>> >> avoid the shared IRQ? Are there any improvements in FreeBSD-7.0 with
>>> >> regard to interrupt handling?
>>> >
>>> > It depends.  In all likelihood, the interrupts from bce0 and bce1 are both
>>> > hardwired to the same interrupt pin and so they will always share the same
>>> > ithread when using the legacy INTx interrupts.  However, bce(4) parts do
>>> > support MSI, and if you try a newer OS snap (6.3 or later) these devices
>>> > should use MSI in which case each NIC would be assigned to a separate CPU.  I
>>> > would suggest trying 7.0 or a 7.1 release candidate and see if it does
>>> > better.
>>> >
>>> > --
>>> > John Baldwin
>>> >
>>>
>>> Hi John,
>>>
>>> I tried the 7.0 release and each network interface is now allocated
>>> separately to a different CPU. MSI is already working here.
>>>
>>>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
>>>    12 root        1 171 ki31     0K    16K CPU6   6 123:55 100.00% idle: cpu6
>>>    15 root        1 171 ki31     0K    16K CPU3   3 123:54 100.00% idle: cpu3
>>>    14 root        1 171 ki31     0K    16K CPU4   4 123:26 100.00% idle: cpu4
>>>    16 root        1 171 ki31     0K    16K CPU2   2 123:15 100.00% idle: cpu2
>>>    17 root        1 171 ki31     0K    16K CPU1   1 123:15 100.00% idle: cpu1
>>>    37 root        1 -68    -     0K    16K CPU7   7   9:09 100.00% irq256: bce0
>>>    13 root        1 171 ki31     0K    16K CPU5   5 123:49 99.07% idle: cpu5
>>>    40 root        1 -68    -     0K    16K WAIT   0   4:40 51.17% irq257: bce1
>>>    18 root        1 171 ki31     0K    16K RUN    0 117:48 49.37% idle: cpu0
>>>    11 root        1 171 ki31     0K    16K RUN    7 115:25  0.00% idle: cpu7
>>>    19 root        1 -32    -     0K    16K WAIT   0   0:39  0.00% swi4: clock s
>>> 14367 root        1  44    0  5176K  3104K select 2   0:01  0.00% dhcpd
>>>    22 root        1 -16    -     0K    16K -      3   0:01  0.00% yarrow
>>>    25 root        1 -24    -     0K    16K WAIT   0   0:00  0.00% swi6: Giant t
>>> 11658 root        1  44    0 32936K  4540K select 1   0:00  0.00% sshd
>>> 14224 root        1  44    0 32936K  4540K select 5   0:00  0.00% sshd
>>>    41 root        1 -60    -     0K    16K WAIT   0   0:00  0.00% irq1: atkbd0
>>>     4 root        1  -8    -     0K    16K -      2   0:00  0.00% g_down
>>>
>>> The bce0 interrupt thread (irq256) is stressed and already uses
>>> 100% of CPU7, while bce1 (irq257) on CPU0 is at around 51.17%. Any more
>>> recommendations? Is there anything we can do to optimize further with
>>> MSI?
>>
>> Well, on 7.x you can try turning net.isr.direct off (sysctl).  However, it
>> seems you are hammering your bce0 interface.  You might want to try using
>> polling on bce0 and seeing if it keeps up with the traffic better.
>>
>> --
>> John Baldwin
>>
>
> With net.isr.direct=0, my IBM system shows lower CPU utilization for
> each interface's interrupt thread (bce0 and bce1), but swi1: net
> increases its utilization. Can you explain what's happening here? How
> does net.isr.direct reduce the CPU utilization of the interface
> interrupts? I would really like to know what happens internally as
> packets are received by the interfaces and processed, from the device
> interrupt up to the software interrupt level, because I am confused by
> the effect of enabling/disabling net.isr.direct in sysctl. Is there a
> tool that we can use to trace this process, to find out which part of
> the kernel is the bottleneck, especially when net.isr.direct=1? By the
> way, with device polling enabled the system experienced packet errors
> and the interface throughput was worse, so I am avoiding it.
>
>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
>
>   16 root        1 171 ki31     0K    16K CPU10  a  86:06 89.06% idle: cpu10
>   27 root        1 -44    -     0K    16K CPU1   1  34:37 82.67% swi1: net
>   52 root        1 -68    -     0K    16K WAIT   b  51:59 59.77% irq32: bce1
>   15 root        1 171 ki31     0K    16K RUN    b  69:28 43.16% idle: cpu11
>   25 root        1 171 ki31     0K    16K RUN    1 115:35 24.27% idle: cpu1
>   51 root        1 -68    -     0K    16K CPU10  a  35:21 13.48% irq31: bce0
>
>
> Regards,
> Archimedes
>
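
A rough way to picture the shift between the irq threads and swi1: net
is the toy model below. It is only a sketch and not the kernel's actual
code: it assumes that with net.isr.direct=1 the driver's interrupt
thread runs protocol input itself, and that with net.isr.direct=0 the
interrupt thread only queues the packet for the swi1: net thread. The
cost constants and the account() helper are made up for illustration.

```python
from collections import Counter

# Hypothetical per-packet costs in arbitrary units; these are
# illustrative assumptions, not measurements from this system.
DRIVER_COST = 1      # work done by the NIC's interrupt thread
PROTOCOL_COST = 3    # protocol input / forwarding work


def account(packets, direct):
    """Return the work units charged to each thread.

    direct=True  models net.isr.direct=1: the interrupt thread also
                 performs the protocol work for each packet.
    direct=False models net.isr.direct=0: the interrupt thread only
                 hands the packet off and 'swi1: net' does the rest.
    """
    load = Counter()
    for nic in packets:                 # each entry names the receiving NIC
        ithread = f"irq: {nic}"
        load[ithread] += DRIVER_COST
        if direct:
            load[ithread] += PROTOCOL_COST
        else:
            load["swi1: net"] += PROTOCOL_COST
    return dict(load)


traffic = ["bce0"] * 4 + ["bce1"] * 2   # made-up packet mix
print(account(traffic, direct=True))
print(account(traffic, direct=False))
```

In this model the total work is the same in both modes; only the
thread it is charged to changes, which would match the lower irq
numbers and the higher swi1: net number in the snapshot.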

One more thing: I observed that with net.isr.direct=1, bce0 uses
irq256 and bce1 uses irq257, while with net.isr.direct=0, bce0
uses irq31 and bce1 uses irq32. What makes the difference?
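
To compare runs more easily, the interrupt and software interrupt
threads can be pulled out of the top output with a small script. This
is a minimal sketch that assumes the column layout shown in the
snapshots quoted above; the sample rows are copied from the
net.isr.direct=0 snapshot, and the function names are my own.

```python
import re

# Rows copied from the net.isr.direct=0 snapshot quoted in this thread.
SNAPSHOT = """\
  16 root        1 171 ki31     0K    16K CPU10  a  86:06 89.06% idle: cpu10
  27 root        1 -44    -     0K    16K CPU1   1  34:37 82.67% swi1: net
  52 root        1 -68    -     0K    16K WAIT   b  51:59 59.77% irq32: bce1
  15 root        1 171 ki31     0K    16K RUN    b  69:28 43.16% idle: cpu11
  25 root        1 171 ki31     0K    16K RUN    1 115:35 24.27% idle: cpu1
  51 root        1 -68    -     0K    16K CPU10  a  35:21 13.48% irq31: bce0
"""

# WCPU percentage followed by a kernel thread name such as "irq32: bce1".
ROW = re.compile(r"(\d+\.\d+)%\s+(irq\d+|swi\d+|idle):\s*(.*)$")


def kernel_thread_load(text):
    """Map names like 'irq32: bce1' to their WCPU percentage."""
    load = {}
    for line in text.splitlines():
        match = ROW.search(line)
        if match:
            wcpu, kind, rest = match.groups()
            load[f"{kind}: {rest.strip()}"] = float(wcpu)
    return load


def busy_threads(text):
    """Return (name, WCPU) for irq and swi threads, busiest first."""
    load = kernel_thread_load(text)
    busy = [(name, pct) for name, pct in load.items()
            if not name.startswith("idle")]
    return sorted(busy, key=lambda item: item[1], reverse=True)


for name, pct in busy_threads(SNAPSHOT):
    print(f"{pct:6.2f}%  {name}")
```

Running the same script over a snapshot taken with net.isr.direct=1
would put the two sets of numbers side by side.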