Re: Periodic rant about SCHED_ULE

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Thu, 23 Mar 2023 03:15:21 UTC
On 3/22/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 3/22/23, Steve Kargl <sgk@troutmask.apl.washington.edu> wrote:
>> On Wed, Mar 22, 2023 at 07:31:57PM +0100, Matthias Andree wrote:
>>>
>>> Yes, there are reports that FreeBSD is not responsive by default - but
>>> this may make it get overall better throughput at the expense of
>>> responsiveness, because it might be doing fewer context switches.  So
>>> just complaining about a longer buildworld without seeing how much
>>> dnetc did in the same wallclock time period is useless.  Periodic
>>> rants don't fix this lack of information.
>>>
>>
>> I reported the issue with ULE some 15 to 20 years ago.
>> I gave up reporting the issue.  The individuals with the
>> requisite skills to hack on ULE did not; and yes, I lack
>> those skills.  The path of least resistance is to use
>> 4BSD.
>>
>> %  cat a.f90
>> !
>> ! Silly numerically intensive computation.
>> !
>> program foo
>>    implicit none
>>    integer, parameter :: m = 200, n = 1000, dp = kind(1.d0)
>>    integer i
>>    real(dp) x
>>    real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
>>    call random_init(.true., .true.)
>>    allocate(a(n,n), b(n,n))
>>    do i = 1, m
>>       call random_number(a)
>>       call random_number(b)
>>       c = matmul(a,b)
>>       x = sum(c)
>>       if (x < 0) stop 'Whoops'
>>    end do
>> end program foo
>> % gfortran11 -o z -O3 -march=native a.f90
>> % time ./z
>>        42.16 real        42.04 user         0.09 sys
>> % cat foo
>> #! /bin/csh
>> #
>> # Launch NCPU+1 images with a 1 second delay
>> #
>> foreach i (1 2 3 4 5 6 7 8 9)
>>    ./z &
>>    sleep 1
>> end
>> % ./foo
>>
>> In another xterm, you can watch the 9 images.
>>
>> % top
>> last pid:  1709;  load averages:  4.90,  1.61,  0.79  up 0+00:56:46  11:43:01
>> 74 processes:  10 running, 64 sleeping
>> CPU: 99.9% user,  0.0% nice,  0.1% system,  0.0% interrupt,  0.0% idle
>> Mem: 369M Active, 187M Inact, 240K Laundry, 889M Wired, 546M Buf, 14G Free
>> Swap: 16G Total, 16G Free
>>
>>   PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME     CPU COMMAND
>>  1699 kargl         1  56    0    68M    35M RUN      3   0:41  92.60% z
>>  1701 kargl         1  56    0    68M    35M RUN      0   0:41  92.33% z
>>  1689 kargl         1  56    0    68M    35M CPU5     5   0:47  91.63% z
>>  1691 kargl         1  56    0    68M    35M CPU0     0   0:45  89.91% z
>>  1695 kargl         1  56    0    68M    35M CPU2     2   0:43  88.56% z
>>  1697 kargl         1  56    0    68M    35M CPU6     6   0:42  88.48% z
>>  1705 kargl         1  55    0    68M    35M CPU1     1   0:39  88.12% z
>>  1703 kargl         1  56    0    68M    35M CPU4     4   0:39  87.86% z
>>  1693 kargl         1  56    0    68M    35M CPU7     7   0:45  78.12% z
>>
>> With 4BSD, you see the ./z's with 80% or greater CPU.  All the ./z's
>> exit after 55-ish seconds.  If you try this experiment on ULE, you'll
>> get NCPU-1 ./z's with nearly 99% CPU and 2 ./z's with something like
>> 45-ish% as the two images ping-pong on one cpu.  Back when I was
>> testing ULE vs 4BSD, this was/is due to ULE's cpu affinity where
>> processes never migrate to another cpu.  Admittedly, this was several
>> years ago.  Maybe ULE has gotten better, but George's rant seems to
>> suggest otherwise.
>>
>
> While I have not tried openmpi yet, I can confirm there is a problem
> here, but the situation is not as clear-cut as one might think.
>
> sched_4bsd *round robins* all workers across all CPUs, which takes a
> performance *hit* compared to ULE if the number of workers is <= the
> CPU count -- here ULE maintains affinity, avoiding cache busting. If
> you slap in $cpu_count + 1 workers, 4BSD keeps the round robin,
> equally penalizing everyone, while ULE mostly penalizes a select
> victim. By the end of it, you get lower total cpu time spent, but
> higher total real time. See below for an example.
>
> Two massive problems with 4BSD, apart from the mandatory round robin,
> which also happens to help by accident:
> 1. it has one *global* lock, meaning the scheduler itself just does
> not scale, and this is visible at modest contemporary scales
> 2. it does not understand topology -- no accounting is done for HT or
> NUMA, though I concede the latter won't be a factor for most people
>
> That said, ULE definitely has performance bugs which need to be fixed.
> At least for the case below, 4BSD just "lucked" into sucking less
> simply because it is so primitive.
>
> I wrote a cpu burning program (memset 1 MB in a loop, with enough
> iterations to take ~20 seconds on its own).
>
> I booted an 8-core bhyve vm, where I made sure to cpuset it to 8
> distinct cores.
>
> The test runs *9* workers; here is a sample run:
>
> 4bsd:
>        23.18 real        20.81 user         0.00 sys
>        23.26 real        20.81 user         0.00 sys
>        23.30 real        20.81 user         0.00 sys
>        23.34 real        20.82 user         0.00 sys
>        23.41 real        20.81 user         0.00 sys
>        23.41 real        20.80 user         0.00 sys
>        23.42 real        20.80 user         0.00 sys
>        23.53 real        20.81 user         0.00 sys
>        23.60 real        20.80 user         0.00 sys
> 187.31s user 0.02s system 793% cpu 23.606 total
>
> ule:
>        20.67 real        20.04 user         0.00 sys
>        20.97 real        20.00 user         0.00 sys
>        21.45 real        20.29 user         0.00 sys
>        21.51 real        20.22 user         0.00 sys
>        22.77 real        20.04 user         0.00 sys
>        22.78 real        20.26 user         0.00 sys
>        23.42 real        20.04 user         0.00 sys
>        24.07 real        20.30 user         0.00 sys
>        24.46 real        20.16 user         0.00 sys
> 181.41s user 0.07s system 741% cpu 24.465 total
>
> It reliably uses 187s user time on 4BSD and 181s on ULE. At the same
> time there is reliably a massive imbalance in total real time between
> the fastest and slowest workers, *and* ULE reliably uses more total
> real time to finish the entire thing.
>
> In other words, this is a tradeoff, but most likely a bad one.
>

So I also ran the following setup: an 8-core vm doing -j 8 buildkernel
while 8 nice -n 20 processes are cpu-bound. After the build ends, the
workers report how many ops they did in that time.

ule is way off the reservation here.

An unimpeded build takes: 441.39 real 3135.63 user 266.92 sys; similar
on both schedulers.

with cpu hoggers:
4bsd:       657.22 real      3152.02 user       253.30 sys [+49%]
ule:        4427.69 real      3225.33 user       194.86 sys [+903%]

ULE spends less time in the kernel, but the real time blows up -- over
10x the baseline. This is clearly a total non-starter.

full stats:

4bsd:
hogger pid/ops
58315: 5546013
58322: 5557294
58321: 5545052
58313: 5546347
58318: 5537874
58317: 5555303
58323: 5545116
58320: 5548530

runtimes:

      657.23 real       230.02 user         0.02 sys
      657.23 real       229.83 user         0.00 sys
      657.23 real       230.50 user         0.00 sys
      657.23 real       230.53 user         0.00 sys
      657.23 real       230.14 user         0.01 sys
      657.23 real       230.19 user         0.00 sys
      657.23 real       230.09 user         0.00 sys
      657.23 real       230.10 user         0.03 sys

kernel build:
      657.22 real      3152.02 user       253.30 sys

ule:
hogger pid/ops
77794: 95916836
77792: 95909794
77789: 96454064
77796: 95813778
77791: 96728121
77795: 96678291
77798: 97258060
77797: 96347984

     4427.70 real      4001.94 user         0.10 sys
     4427.70 real      3980.68 user         0.16 sys
     4427.70 real      3973.96 user         0.10 sys
     4427.70 real      3980.11 user         0.13 sys
     4427.70 real      4012.32 user         0.07 sys
     4427.70 real      4008.79 user         0.12 sys
     4427.70 real      4034.77 user         0.09 sys
     4427.70 real      3995.40 user         0.08 sys

kernel build:
     4427.69 real      3225.33 user       194.86 sys

-- 
Mateusz Guzik <mjguzik gmail.com>