Re: ULE process to resolution
- In reply to: Mateusz Guzik : "Re: ULE process to resolution"
Date: Thu, 20 Apr 2023 21:34:22 UTC
On 4/4/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> Hello,
>
> On 3/31/23, Jeff Roberson <jroberson@jroberson.net> wrote:
>> As I read these threads I can state with a high degree of confidence that
>> many of these tests worked with superior results with ULE at one time.
>> It may be that tradeoffs have changed or exposed weaknesses, it may also
>> be that it's simply been broken over time. I see a large number of
>> commits intended to address point issues and wonder whether we adequately
>> explored the consequences. Indeed I see solutions involving tunables
>> proposed here that will definitively break other cases.
>>
>
> One of the reporters claims the bug they complain about was there
> since early days. This made me curious how many problems reproduce on
> something like 7.1 (dated 2009), to that end I created an 8 core vm
> which I ran a bunch of tests on in addition to main. All 3 problems
> reported below reproduced there, no X testing though :)
>
> Bugs (one not reported in the other thread):
> 1. threads walking around the machine when spending little time off
> cpu, all while the machine is otherwise idle
>
> The problem with this on bare metal is that the victim cpu may be
> partially powered off, so now there is latency stemming from poking it
> back up, whatever other migration cost aside.
>
> I noticed this a few years back when looking at postgres -- both the
> server and pgbench would walk around everywhere, reducing perf. I
> checked this reproduces on fresh main. The box at hand has 2 sockets *
> 10 cores * 2 threads.
>
> I *suspect* this is adequately modeled with a microbenchmark
> https://github.com/antonblanchard/will-it-scale/ named
> context_switch1_processes -- it too experiences all-machine walk
> unless explicitly bound (pass -n to *not* bind it). I verified they
> walk all around on 7.1 as well, but I don't know if postgres also
> would.
>
> how to bench:
> su - postgres
> /usr/local/bin/pg_ctl -D /var/db/postgres/data15 -l logfile start
> pgbench -i -s 10
> pgbench -M prepared -S -T 800000 -c 1 -j 1 -P1 postgres
>
> ... and you are in.
>
> 2. unfairness when oversubscribing with cpu hogs
>
> Steve Kargl claims he reported this one numerous times since the early
> days of ULE, I confirmed it was a problem on 7.1 and is a problem
> today.
>
> Say an 8 core vm (with making sure these are cores pinned on the host)
>
> I'm going to copy paste my other message here:
> I wrote a cpu burning program (memset 1 MB in a loop, with enough
> iterations to take ~20 seconds on its own).
>
> I booted an 8 core bhyve vm, where I made sure to cpuset it to 8 distinct
> cores.
>
> The test runs *9* workers, here is a sample run:
> [copy]
> 4bsd:
> 23.18 real 20.81 user 0.00 sys
> 23.26 real 20.81 user 0.00 sys
> 23.30 real 20.81 user 0.00 sys
> 23.34 real 20.82 user 0.00 sys
> 23.41 real 20.81 user 0.00 sys
> 23.41 real 20.80 user 0.00 sys
> 23.42 real 20.80 user 0.00 sys
> 23.53 real 20.81 user 0.00 sys
> 23.60 real 20.80 user 0.00 sys
> 187.31s user 0.02s system 793% cpu 23.606 total
>
> ule:
> 20.67 real 20.04 user 0.00 sys
> 20.97 real 20.00 user 0.00 sys
> 21.45 real 20.29 user 0.00 sys
> 21.51 real 20.22 user 0.00 sys
> 22.77 real 20.04 user 0.00 sys
> 22.78 real 20.26 user 0.00 sys
> 23.42 real 20.04 user 0.00 sys
> 24.07 real 20.30 user 0.00 sys
> 24.46 real 20.16 user 0.00 sys
> 181.41s user 0.07s system 741% cpu 24.465 total
> [/paste]
>
> While ule spends fewer *cycles*, it spends more real time and it is
> *probably* bad.
>
> you can repro with:
> https://people.freebsd.org/~mjg/.junk/cpuburner1.c
> cc -O0 -o cpuburner1 cpuburner1.c
>
> and a magic script:
> #!/bin/sh
>
> ins=$1
>
> shift
>
> while [ $ins -ne 0 ]; do
> time ./cpuburner1 $1 $2 &
> ins=$((ins-1))
> done
>
> wait
>
> run like this, pick the second number to take 20-ish seconds on your cpu:
> sh burn.sh 1048576 500000
>
> 3. threads struggling to get back on cpu against nice -n 20 hogs
>
> This acutely affects buildkernel.
>
> I once more played around, the bug was already there in 7.1, extending
> total time from ~4 minutes to 30.
>
> The problem is introduced with the machinery to attempt to provide
> fairness for pri <= PRI_MAX_BATCH. I verified that with straight up
> removing all of it. Then buildkernel managed to finish in sensible
> time, but the cpu hogs were overly negatively affected -- little cpu
> time and very unfairly distributed between them. Key point though that
> this *can* stick close to base time.
>
> I had seen the patch from https://reviews.freebsd.org/D15985 , it does
> not fix the problem but it does alleviate it to some extent. It is
> weirdly hacky and seems to be targeting just the testcase you had
> instead of the more general problem.
>
> I applied it to a 2018-ish tree so that there are no woes from rebasing.
> stock: 290.95 real 2048.22 user 247.967 sys
> stock+hogs: 883.81 real 2111.34 user 189.42 sys
> patched+hogs: 460.84 real 2055.63 user 232.00 sys
>
> Interestingly stock kernel from that period is less affected by the
> general problem, but it is still pretty bad. With the patch things
> improve markedly, but there is still ~50% increase in real time which
> is way too much for being paired against -n 20.
>
> https://people.freebsd.org/~mjg/.junk/cpuburner2.c
>
> magic script:
> #!/bin/sh
>
> workers=$1
> n=$2
> size=$3
> bkw=$4
>
> echo workers $workers nice $n buildkernel $bkw
>
> shift
>
> while [ $workers -ne 0 ]; do
> time nice -n $n ./cpuburner $size &
> workers=$((workers-1))
> done
>
> time make -C /usr/src -ssss -j $bkw buildkernel > /dev/null
>
> # XXX webdev-style
> pkill cpuburner
>
> wait
>
> sample use: time sh burn+bk.sh 8 20 1048576 8
>
> I figured there would be a regression test suite available, with tests
> checking what happens for known cases with possibly contradictory
> requirements. Got nothing, instead I found people use hackbench (:S)
> or just a workload.
>
> All that said, I'm buggering off the subject. My interest in it was
> limited to the nice problem, since I have pretty good reasons to
> suspect this is what is causing pathological total real time instances
> for package builds.
>

Do you still plan to do anything here?

14.0 schedule has been posted and it starts with this:
head slush/KBI freeze:          April 25, 2023
[... ALPHA builds ...]:         TBD (as-needed)
stable/14 branch:               May 12, 2023
releng/14.0 branch:             May 26, 2023
BETA1 build starts:             May 26, 2023

iow there is not much time to make any fixes for the release.

That said, I had another look at your patch. It aged out of simple
forward porting:

commit 686bcb5c14aba6e67524be84e125bfdd3514db9e
Author: Jeff Roberson <jeff@FreeBSD.org>
Date:   Sun Dec 15 21:26:50 2019 +0000

    schedlock 4/4

and a follow up fixup:

commit 6d3f74a14a83b867c273c6be2599da182a9b9ec7
Author: Mark Johnston <markj@FreeBSD.org>
Date:   Thu Jul 14 10:21:28 2022 -0400

    sched_ule: Fix racy loads of pc_curthread

which whacked access to data your patch relies on. Not saying this
can't be augmented, just that it is extra churn.

I also looked into why there is still tons of cpu time for the niced
stuff and found the mechanism mostly does not work.

Here are some results from FreeBSD 7.1 (2009 vintage) running full time
cpu hogs with various nice levels:

prio 10 ops 12863
prio 0 ops 12846
prio 20 ops 12794

prio 0 ops 24949
prio 20 ops 13551

prio 0 ops 11327
prio -20 ops 19474
prio 20 ops 7575

As you can see that release had about 33/66 split for the 0 vs 20 case
which is already pretty bad and funnily enough it had equal treatment
for 0 vs 10 vs 20.

Things further changed down the road and on fresh main it looks like this:

prio 10 ops 4390
prio 0 ops 4963
prio 20 ops 3941

prio 0 ops 7235
prio 20 ops 6059

prio -20 ops 7225
prio 0 ops 3763
prio 20 ops 2547

as in nice 20 is penalized even less vs 0.

tl;dr things were already bad on 7.1.
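
To put numbers on the splits, each pair of ops counts can be turned into percentages of the pair's total. This is only a minimal sketch: the `share` helper is a made-up name for illustration (it is not part of script3.sh), and it assumes a POSIX sh with awk available.

```sh
#!/bin/sh
# share A B: print each count as a percentage of (A + B), one decimal place.
# Hypothetical helper for illustration only; not taken from script3.sh.
share() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        t = a + b
        printf "%.1f %.1f\n", 100 * a / t, 100 * b / t
    }'
}

# nice 0 vs nice 20 ops counts from the two-hog runs quoted above
share 24949 13551    # FreeBSD 7.1
share 7235 6059      # fresh main
```

With the counts above this prints 64.8 35.2 for 7.1 and 54.4 45.6 for main, so the nice 20 hog went from roughly a third of the cpu to nearly half.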

to repro:
fetch https://people.freebsd.org/~mjg/.junk/cpuburner-prio.c
fetch https://people.freebsd.org/~mjg/.junk/script3.sh
cpuset -l 2 sh script3.sh

-- 
Mateusz Guzik <mjguzik gmail.com>