Re: Periodic rant about SCHED_ULE
- In reply to: Zaphod Beeblebrox : "Re: Periodic rant about SCHED_ULE"
Date: Tue, 13 Jul 2021 22:22:05 UTC
On Tue, 2021-07-13 at 18:09 -0400, Zaphod Beeblebrox wrote:
> I opened https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=257160
> regarding the following:
>
> SCHED_4BSD seems subject to a bit of rot at this point. To wit, my
> 4-core riscv64 platform recently showed this top while doing a
> make -j4 of my own code. Note that each of the processes using more
> than 1000% CPU is single-threaded.
>
>  PID USERNAME THR PRI NICE  SIZE   RES STATE  C  TIME     WCPU COMMAND
>  604 dgilbert   1  45    0  109M   66M CPU3   3  0:02 1039.89% c++
>  605 dgilbert   1  45    0  109M   66M CPU1   1  0:02 1031.29% c++
>  606 dgilbert   1  45    0  109M   66M RUN    2  0:02 1020.32% c++
>  603 dgilbert   1  44    0  109M   66M CPU0   0  0:02 1011.41% c++
>  854 root       1  40    0   17M 4764K select 1  3:04    0.17% tmux
>  425 root       1  40    0   14M 4040K CPU2   2  0:03    0.15% top
>
> As I said there, I don't believe that this is riscv64-related --- it
> seems to me that either the data top is pulling is incorrect or top
> is interpreting it incorrectly. The WCPU value seems to
> asymptotically approach 100%, but I'm not sure of that --- I can only
> watch it for so long. The same behaviour is seen if you launch
> (while true; do true; done) & in the background.
>
> But OTOH, if you are running SCHED_ULE and you launch two of those
> while-true loops at nice -20 for each CPU, then launch one at nice
> '0', you'll find that the nice-0 process fails to get 100% CPU. To my
> mind, this is a failure of the scheduler to read my intentions of
> nice -20. In fact, at times, the processor share of the un-niced
> process will fall below some of the niced processes for a few dozen
> samples at a time. Here is a top displaying that brokenness...
>
>   PID USERNAME THR PRI NICE SIZE  RES STATE C  TIME  WCPU COMMAND
> 36410 root       1  89    0  14M 796K RUN   3  0:18 54.31% bash
> 36370 root       1 106   20  14M 800K RUN   1  0:58 49.86% bash
> 36372 root       1 105   20  14M 800K CPU1  1  0:56 49.69% bash
> 36375 root       1 106   20  14M 800K RUN   0  0:57 46.37% bash
> 36373 root       1 103   20  14M 800K RUN   3  0:56 44.94% bash
> 36371 root       1 105   20  14M 800K CPU0  0  0:57 43.51% bash
> 36376 root       1 105   20  14M 800K RUN   2  0:59 38.76% bash
> 36369 root       1 104   20  14M 920K CPU2  2  0:57 37.61% bash
> 36374 root       1 104   20  14M 800K RUN   2  0:57 32.66% bash
>
> TBH, I think SCHED_ULE is a failure, and the only reason more people
> don't think so is that processors are now largely too fast for people
> to care. Most people don't notice the scheduler because they almost
> never have more tasks than processor threads, so even really dumb
> schedulers would work out "OK" 98% of the time.
>
> I know we don't have guiding principles for nice, but I would toss
> out the +/- five rule for it --- that any process more than 5 nice
> levels lower than a cpu-busy process shouldn't preempt the higher
> process. I realize we have rtprio, but it's a pain to use. Anyways,
> don't let this last comment distract.
>
> On Thu, Jul 8, 2021 at 3:20 AM Rozhuk Ivan <rozhuk.im@gmail.com>
> wrote:
>
> > On Wed, 7 Jul 2021 13:47:47 -0400
> > George Mitchell <george+freebsd@m5p.com> wrote:
> >
> > > CPU: AMD Ryzen 5 2600X Six-Core Processor (3600.10-MHz K8-class
> > > CPU) (12 threads).
> > >
> > > FreeBSD 12.2-RELEASE-p7 r369865 GENERIC amd64 (SCHED_ULE) vs
> > > FreeBSD 12.2-RELEASE-p7 r369865 M5P amd64 (SCHED_4BSD).
> > >
> > > Comparing "make buildworld" time with misc/dnetc running vs not
> > > running. (misc/dnetc is your basic 100% compute-bound task,
> > > running at nice 20.)
> > >
> > > Three out of the four combinations build in roughly four hours,
> > > but SCHED_ULE with dnetc running takes close to twelve!
> > > (And that was overnight, with basically nothing else running.)
> > > This is an even worse disparity than I have seen in previous
> > > releases.
> >
> > I do not use dnetc, but sched_ule on a 2700 compiles world faster
> > than 4 hours. With ccache it takes ~10 minutes: world+kernel build
> > and install, and update loaders.
> >
> > # Make an SMP-capable kernel by default
> > options  SMP                   #b Symmetric MultiProcessor Kernel
> > options  NUMA                  #o Non-Uniform Memory Architecture support
> > options  EARLY_AP_STARTUP      #o
> >
> > device   cpufreq               #m for non-ACPI CPU frequency control
> > device   cpuctl                #m Provides access to MSRs, CPUID info
> >                                #  and microcode update feature.
> >
> > # Kernel base
> > options  SCHED_ULE             #b 4BSD/ULE scheduler
> > options  _KPOSIX_PRIORITY_SCHEDULING  #b POSIX P1003_1B real-time extensions
> > options  PREEMPTION            #b Enable kernel thread preemption
> >
> > and sysctl tunings on desktop only:
> >
> > # SCHEDULER
> > kern.sched.steal_thresh=1         # Minimum load on remote CPU before
> >                                   # we'll steal // workaround for freezes
> > kern.sched.balance=0              # Enables the long-term load balancer
> > kern.sched.balance_interval=1000  # Average period in stathz ticks to
> >                                   # run the long-term balancer
> > kern.sched.affinity=10000         # Number of hz ticks to keep thread
> >                                   # affinity for

top has been showing bad values for CPU% with SCHED_4BSD for many
years, on all architectures. I remember Bruce Evans once commenting
that it had something to do with changes to clock handling in the
kernel (maybe related to when eventtimers first came in, but I might
be misremembering that detail). If you ask top to display straight CPU
instead of WCPU, the results are much more sane.

I too wish that nice made a bigger difference, but that problem isn't
limited to SCHED_ULE; nice is little more than a vague hint even when
using SCHED_4BSD.
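The nice experiment Zaphod describes can be reproduced with a script
along these lines --- a sketch of my own, not from the original report,
which samples with ps(1) rather than interactive top(1):

```shell
#!/bin/sh
# Sketch: one CPU-bound busy loop per CPU at nice 20 competing with a
# single loop at nice 0, then compare their CPU shares.
NCPU=$(getconf _NPROCESSORS_ONLN 2>/dev/null || sysctl -n hw.ncpu)
PIDS=""
i=0
while [ "$i" -lt "$NCPU" ]; do
    nice -n 20 sh -c 'while :; do :; done' &
    PIDS="${PIDS:+$PIDS,}$!"
    i=$((i + 1))
done
sh -c 'while :; do :; done' &       # the un-niced competitor
PIDS="${PIDS:+$PIDS,}$!"
sleep 3                             # let the loops accumulate CPU time
ps -o pid,nice,pcpu -p "$PIDS"      # CPU share by nice level
kill $(echo "$PIDS" | tr ',' ' ')   # clean up the busy loops
```

Under a scheduler that honours nice strictly, the nice-0 process should
show close to 100% here; the complaint above is that under SCHED_ULE it
does not.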
I eventually concluded that there's just no way to run a compute-heavy
workload (such as buildworld -j<ncpu>) using nice and keep the machine
responsive enough for interactive use. I switched to running builds
with idprio, which isn't really onerous if you set
security.bsd.unprivileged_idprio=1 in /etc/sysctl.conf.

-- Ian
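Concretely, the idprio workflow described above looks something like
the following sketch (my own illustration; idprio(1) and the
security.bsd.unprivileged_idprio sysctl are FreeBSD-specific, and the
guard simply makes the sketch a no-op elsewhere):

```shell
#!/bin/sh
# Sketch of running a build at idle priority so it never competes with
# interactive work. FreeBSD-only; illustrative, not a drop-in script.

# One-time setup (as root), so ordinary users may use idle priorities:
#     sysctl security.bsd.unprivileged_idprio=1
# (put the same line in /etc/sysctl.conf to make it persistent)

if command -v idprio >/dev/null 2>&1; then
    BRANCH=freebsd
    # 31 is the lowest idle priority: the build only gets CPU time
    # when nothing else is runnable, keeping the machine responsive.
    idprio 31 make -j"$(sysctl -n hw.ncpu)" buildworld
else
    BRANCH=other
    echo "idprio(1) not found; this sketch applies to FreeBSD only"
fi
```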