Re: Periodic rant about SCHED_ULE
Date: Sat, 25 Mar 2023 01:46:43 UTC
On 3/23/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 3/22/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
>> On 3/22/23, Steve Kargl <sgk@troutmask.apl.washington.edu> wrote:
>>> On Wed, Mar 22, 2023 at 07:31:57PM +0100, Matthias Andree wrote:
>>>>
>>>> Yes, there are reports that FreeBSD is not responsive by default - but
>>>> this may make it get overall better throughput at the expense of
>>>> responsiveness, because it might be doing fewer context switches. So
>>>> just complaining about a longer buildworld without seeing how much
>>>> dnetc did in the same wallclock time period is useless. Periodic rants
>>>> don't fix this lack of information.
>>>>
>>>
>>> I reported the issue with ULE some 15 to 20 years ago.
>>> I gave up reporting the issue. The individuals with the
>>> requisite skills to hack on ULE did not; and yes, I lack
>>> those skills. The path of least resistance is to use
>>> 4BSD.
>>>
>>> % cat a.f90
>>> !
>>> ! Silly numerically intensive computation.
>>> !
>>> program foo
>>>    implicit none
>>>    integer, parameter :: m = 200, n = 1000, dp = kind(1.d0)
>>>    integer i
>>>    real(dp) x
>>>    real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
>>>    call random_init(.true., .true.)
>>>    allocate(a(n,n), b(n,n))
>>>    do i = 1, m
>>>       call random_number(a)
>>>       call random_number(b)
>>>       c = matmul(a,b)
>>>       x = sum(c)
>>>       if (x < 0) stop 'Whoops'
>>>    end do
>>> end program foo
>>> % gfortran11 -o z -O3 -march=native a.f90
>>> % time ./z
>>>       42.16 real        42.04 user         0.09 sys
>>> % cat foo
>>> #! /bin/csh
>>> #
>>> # Launch NCPU+1 images with a 1 second delay
>>> #
>>> foreach i (1 2 3 4 5 6 7 8 9)
>>>    ./z &
>>>    sleep 1
>>> end
>>> % ./foo
>>>
>>> In another xterm, you can watch the 9 images.
>>>
>>> % top
>>> last pid:  1709;  load averages:  4.90,  1.61,  0.79   up 0+00:56:46  11:43:01
>>> 74 processes:  10 running, 64 sleeping
>>> CPU: 99.9% user,  0.0% nice,  0.1% system,  0.0% interrupt,  0.0% idle
>>> Mem: 369M Active, 187M Inact, 240K Laundry, 889M Wired, 546M Buf, 14G Free
>>> Swap: 16G Total, 16G Free
>>>
>>>   PID USERNAME  THR PRI NICE   SIZE    RES STATE    C   TIME    CPU COMMAND
>>>  1699 kargl       1  56    0    68M    35M RUN      3   0:41  92.60% z
>>>  1701 kargl       1  56    0    68M    35M RUN      0   0:41  92.33% z
>>>  1689 kargl       1  56    0    68M    35M CPU5     5   0:47  91.63% z
>>>  1691 kargl       1  56    0    68M    35M CPU0     0   0:45  89.91% z
>>>  1695 kargl       1  56    0    68M    35M CPU2     2   0:43  88.56% z
>>>  1697 kargl       1  56    0    68M    35M CPU6     6   0:42  88.48% z
>>>  1705 kargl       1  55    0    68M    35M CPU1     1   0:39  88.12% z
>>>  1703 kargl       1  56    0    68M    35M CPU4     4   0:39  87.86% z
>>>  1693 kargl       1  56    0    68M    35M CPU7     7   0:45  78.12% z
>>>
>>> With 4BSD, you see the ./z's with 80% or greater CPU. All the ./z's exit
>>> after 55-ish seconds. If you try this experiment on ULE, you'll get NCPU-1
>>> ./z's with nearly 99% CPU and 2 ./z's with something like 45-ish% as the
>>> two images ping-pong on one cpu. Back when I was testing ULE vs 4BSD,
>>> this was/is due to ULE's cpu affinity where processes never migrate to
>>> another cpu. Admittedly, this was several years ago. Maybe ULE has
>>> gotten better, but George's rant seems to suggest otherwise.
>>>
>>
>> While I have not tried openmpi yet, I can confirm there is a problem
>> here, but the situation is not as clear cut as one might think.
>>
>> sched_4bsd *round robins* all workers across all CPUs, which comes at
>> a performance *hit* compared to ule if the number of workers is <= CPU
>> count -- here ule maintains affinity, avoiding cache busting.
>> If you slap in $cpu_count + 1 workers, 4bsd keeps the round robin
>> equally penalizing everyone, while ule mostly penalizes a select
>> victim. By the end of it, you get lower total cpu time spent, but
>> higher total real time. See below for an example.
>>
>> Two massive problems with 4bsd, apart from the mandatory round robin
>> which also happens to help by accident:
>> 1. it has one *global* lock, meaning the scheduler itself just does
>>    not scale and this is visible at modest contemporary scales
>> 2. it does not understand topology -- no accounting is done for HT or
>>    NUMA, but I concede the latter won't be a factor for most people
>>
>> That said, ULE definitely has performance bugs which need to be fixed.
>> At least for the case below, 4BSD just "lucked" into sucking less
>> simply because it is so primitive.
>>
>> I wrote a cpu burning program (memset 1 MB in a loop, with enough
>> iterations to take ~20 seconds on its own).
>>
>> I booted an 8 core bhyve vm, where I made sure to cpuset it to 8
>> distinct cores.
>>
>> The test runs *9* workers; here is a sample run:
>>
>> 4bsd:
>>       23.18 real        20.81 user         0.00 sys
>>       23.26 real        20.81 user         0.00 sys
>>       23.30 real        20.81 user         0.00 sys
>>       23.34 real        20.82 user         0.00 sys
>>       23.41 real        20.81 user         0.00 sys
>>       23.41 real        20.80 user         0.00 sys
>>       23.42 real        20.80 user         0.00 sys
>>       23.53 real        20.81 user         0.00 sys
>>       23.60 real        20.80 user         0.00 sys
>> 187.31s user 0.02s system 793% cpu 23.606 total
>>
>> ule:
>>       20.67 real        20.04 user         0.00 sys
>>       20.97 real        20.00 user         0.00 sys
>>       21.45 real        20.29 user         0.00 sys
>>       21.51 real        20.22 user         0.00 sys
>>       22.77 real        20.04 user         0.00 sys
>>       22.78 real        20.26 user         0.00 sys
>>       23.42 real        20.04 user         0.00 sys
>>       24.07 real        20.30 user         0.00 sys
>>       24.46 real        20.16 user         0.00 sys
>> 181.41s user 0.07s system 741% cpu 24.465 total
>>
>> It reliably uses 187s of user time on 4BSD and 181s on ULE. At the
>> same time it also reliably shows a massive imbalance in total real
>> time between the fastest and slowest workers, *and* ULE reliably uses
>> more total real time to finish the entire thing.
>>
>> iow this is a tradeoff, but most likely a bad one
>>
>
> So I also ran the following setup: an 8 core vm doing -j 8 buildkernel,
> while 8 nice -n 20 processes are cpu-bound. After the build ends the
> workers report how many ops they did in that time.
>
> ule is way off the reservation here.
>
> unimpeded build takes: 441.39 real 3135.63 user 266.92 sys, similar on
> both schedulers
>
> with cpu hoggers:
> 4bsd: 657.22 real 3152.02 user 253.30 sys [+49%]
> ule:  4427.69 real 3225.33 user 194.86 sys [+903%]
>
> ule spends less time in the kernel, but the real time blows up -- over
> 10x the baseline. This is clearly a total non-starter.
>
> full stats:
>
> 4bsd:
> hogger pid/ops
> 58315: 5546013
> 58322: 5557294
> 58321: 5545052
> 58313: 5546347
> 58318: 5537874
> 58317: 5555303
> 58323: 5545116
> 58320: 5548530
>
> runtimes:
>
>       657.23 real       230.02 user         0.02 sys
>       657.23 real       229.83 user         0.00 sys
>       657.23 real       230.50 user         0.00 sys
>       657.23 real       230.53 user         0.00 sys
>       657.23 real       230.14 user         0.01 sys
>       657.23 real       230.19 user         0.00 sys
>       657.23 real       230.09 user         0.00 sys
>       657.23 real       230.10 user         0.03 sys
>
> kernel build:
>       657.22 real      3152.02 user       253.30 sys
>
> ule:
> hogger pid/ops
> 77794: 95916836
> 77792: 95909794
> 77789: 96454064
> 77796: 95813778
> 77791: 96728121
> 77795: 96678291
> 77798: 97258060
> 77797: 96347984
>
>      4427.70 real      4001.94 user         0.10 sys
>      4427.70 real      3980.68 user         0.16 sys
>      4427.70 real      3973.96 user         0.10 sys
>      4427.70 real      3980.11 user         0.13 sys
>      4427.70 real      4012.32 user         0.07 sys
>      4427.70 real      4008.79 user         0.12 sys
>      4427.70 real      4034.77 user         0.09 sys
>      4427.70 real      3995.40 user         0.08 sys
>
> kernel build:
>      4427.69 real      3225.33 user       194.86 sys
>

added a probe to runq_add*:

diff --git a/sys/kern/kern_switch.c b/sys/kern/kern_switch.c
index aee0c9cbd1ae..db73226547c5 100644
--- a/sys/kern/kern_switch.c
+++ b/sys/kern/kern_switch.c
@@ -357,6 +357,9 @@ runq_setbit(struct runq *rq, int pri)
 	rqb->rqb_bits[RQB_WORD(pri)] |= RQB_BIT(pri);
 }

+SDT_PROVIDER_DECLARE(sched);
+SDT_PROBE_DEFINE3(sched, , , runq_add, "struct thread *", "int", "int");
+
 /*
  * Add the thread to the queue specified by its priority, and set the
  * corresponding status bit.
@@ -378,6 +381,7 @@ runq_add(struct runq *rq, struct thread *td, int flags)
 	} else {
 		TAILQ_INSERT_TAIL(rqh, td, td_runq);
 	}
+	SDT_PROBE3(sched, , , runq_add, td, flags, pri);
 }

 void
@@ -396,6 +400,7 @@ runq_add_pri(struct runq *rq, struct thread *td, u_char pri, int flags)
 	} else {
 		TAILQ_INSERT_TAIL(rqh, td, td_runq);
 	}
+	SDT_PROBE3(sched, , , runq_add, td, flags, pri);
 }
 /*
  * Return true if there are runnable processes of any priority on the run

and used this one-liner:

dtrace -b 16M -x aggsize=32M -x dynvarsize=32M -n '
sched:::runq_add
/args[0]->td_ucred->cr_uid == 2000/
{
	self->runq_t = timestamp;
}

sched:::on-cpu
/self->runq_t/
{
	@runqlat["runq", execname == "cpuburner-long" ? "cpuburner" : "rest"] =
	    quantize(timestamp - self->runq_t);
	self->runq_t = 0;
}

sched:::on-cpu
/curthread->td_ucred->cr_uid == 2000/
{
	self->oncpu_t = timestamp;
}

sched:::off-cpu
/self->oncpu_t/
{
	@oncpu["oncpu", execname == "cpuburner-long" ? "cpuburner" : "rest"] =
	    quantize(timestamp - self->oncpu_t);
}

tick-300s
{
	exit(0);
}'

capped at 5 minutes because dtrace starts running into aggregation drops

Key takeaway: 4bsd lets the cpu hog stay on cpu for longer than ule,
but when it is taken off, it waits a long time to get back. In
contrast, it gets back on very quickly with ule and it is everyone
else waiting big time.
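For context, the "cpuburner-long" processes matched as "cpuburner" in the
dtrace aggregations above are the cpu burning program described earlier in
the thread: a memset over a 1 MB buffer in a loop, sized to run for roughly
20 seconds on its own. A minimal sketch of such a burner follows -- the
buffer size comes from that description, while the program name, the
iteration count and the dependency trick are assumptions, not the actual
code used for the numbers above:

/*
 * cpuburner (sketch): memset a 1 MB buffer in a loop.
 * NITER is a placeholder -- pick it so one copy runs for about 20
 * seconds on an otherwise idle core.
 */
#include <string.h>

#define	BUFSZ	(1024 * 1024)
#define	NITER	200000UL	/* assumed value; tune per machine */

static char buf[BUFSZ];

int
main(void)
{
	unsigned long i;

	for (i = 0; i < NITER; i++) {
		/*
		 * Reading buf[0] creates a dependency between iterations
		 * so the compiler cannot drop the stores.
		 */
		memset(buf, buf[0] + 1, sizeof(buf));
	}
	return (buf[0] == 0);
}

Build it with something like "cc -O2 -o cpuburner-long burner.c" and launch
one copy per core (nice -n 20 in the buildkernel test above), optionally
under time(1) to get per-worker runtimes as shown earlier. The version used
for the runs above evidently also counts and reports how many ops it
completed; that bookkeeping is omitted from the sketch.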
times in nanoseconds

4bsd:

runq  cpuburner
value  ------------- Distribution ------------- count
2048 | 0
4096 | 2
8192 |@@@@@@@@@@ 4343
16384 |@@@@@ 2159
32768 |@ 363
65536 |@@@ 1359
131072 | 215
262144 | 129
524288 | 101
1048576 | 132
2097152 | 185
4194304 |@ 390
8388608 |@ 318
16777216 |@ 398
33554432 |@@ 838
67108864 |@@@@@@@ 3025
134217728 |@@@@@ 2474
268435456 |@@@ 1552
536870912 | 153
1073741824 | 0

runq  rest
value  ------------- Distribution ------------- count
2048 | 0
4096 | 637
8192 |@@@@@@@@@@@@@@@@@@@@@@@@ 364832
16384 |@@@ 52982
32768 |@ 11613
65536 |@@ 34286
131072 |@@@ 39734
262144 |@@ 23261
524288 |@ 21985
1048576 |@ 18999
2097152 |@ 10789
4194304 | 6239
8388608 | 4725
16777216 | 4598
33554432 | 4050
67108864 | 5857
134217728 | 3979
268435456 | 2024
536870912 | 2011
1073741824 | 1105
2147483648 | 841
4294967296 | 519
8589934592 | 372
17179869184 | 133
34359738368 | 37
68719476736 | 30
137438953472 | 1
274877906944 | 0

oncpu  cpuburner
value  ------------- Distribution ------------- count
2048 | 0
4096 | 20
8192 | 10
16384 | 19
32768 | 161
65536 | 137
131072 | 77
262144 | 104
524288 | 147
1048576 |@ 301
2097152 |@ 482
4194304 |@@@ 1178
8388608 |@@@@@@@@@@@@@@@ 6971
16777216 |@ 474
33554432 |@ 669
67108864 |@@@@@@@@@@@@@@@@ 7299
134217728 | 14
268435456 | 38
536870912 | 38
1073741824 | 2
2147483648 | 2
4294967296 | 0

oncpu  rest
value  ------------- Distribution ------------- count
512 | 0
1024 | 102
2048 |@@@@@@@@@@@ 146555
4096 |@@ 23373
8192 |@@@ 45165
16384 |@ 14531
32768 |@@@@@ 67664
65536 |@@@@@@ 80155
131072 |@@@@@ 64609
262144 |@@ 21509
524288 |@@ 23393
1048576 |@@ 21058
2097152 |@ 7854
4194304 |@ 8788
8388608 |@ 20242
16777216 | 2352
33554432 | 2084
67108864 | 4388
134217728 | 1123
268435456 | 755
536870912 | 135
1073741824 | 2
2147483648 | 0

ule:

runq  cpuburner
value  ------------- Distribution ------------- count
2048 | 0
4096 | 2
8192 |@@@@@@@@@@@@@@@@@@@@@@@@ 37193
16384 |@@@@@ 8612
32768 |@ 1481
65536 |@@@@@ 8430
131072 |@@ 3102
262144 |@ 938
524288 | 457
1048576 |@ 2063
2097152 | 257
4194304 | 41
8388608 | 428
16777216 | 12
33554432 | 1
67108864 | 2
134217728 | 0

runq  rest
value  ------------- Distribution ------------- count
4096 | 0
8192 |@ 649
16384 |@@@@ 1953
32768 |@ 539
65536 |@@@@@ 2369
131072 |@@@@ 1904
262144 |@ 471
524288 | 131
1048576 |@ 458
2097152 |@ 443
4194304 |@ 547
8388608 |@ 632
16777216 |@@ 984
33554432 |@@@ 1606
67108864 |@@@@@@@@ 3894
134217728 |@ 511
268435456 |@ 489
536870912 |@ 445
1073741824 |@ 475
2147483648 |@ 314
4294967296 | 188
8589934592 | 136
17179869184 | 100
34359738368 | 81
68719476736 | 39
137438953472 | 3
274877906944 | 0

oncpu  cpuburner
value  ------------- Distribution ------------- count
2048 | 0
4096 | 13
8192 | 311
16384 |@@ 2369
32768 |@ 853
65536 |@ 2302
131072 |@ 1302
262144 | 518
524288 |@ 872
1048576 |@ 1172
2097152 |@ 1435
4194304 |@@ 2706
8388608 |@@@@@@@@@@@@@@@@@@@ 29669
16777216 |@@ 2733
33554432 |@@ 3910
67108864 |@@@@@@ 9991
134217728 |@ 1800
268435456 |@ 840
536870912 | 200
1073741824 | 15
2147483648 | 7
4294967296 | 1
8589934592 | 0

oncpu  rest
value  ------------- Distribution ------------- count
512 | 0
1024 |@ 419
2048 |@@@@@@@@@ 5730
4096 |@ 778
8192 |@@@@@ 3233
16384 |@@ 1400
32768 |@@@@ 2739
65536 |@@@@@@@@@@@@ 7550
131072 |@@@ 1720
262144 | 151
524288 | 112
1048576 |@@@ 1688
2097152 | 174
4194304 | 86
8388608 |@ 411
16777216 | 7
33554432 | 3
67108864 | 0

-- 
Mateusz Guzik <mjguzik gmail.com>