Date: Mon, 27 Mar 2023 09:31:18 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Mateusz Guzik
Cc: Matthias Andree, freebsd-hackers@freebsd.org
Subject: Re: Periodic rant about SCHED_ULE

On Mon, Mar 27, 2023 at 04:47:04PM +0200, Mateusz Guzik wrote:
> (A massive amount trimmed to keep this short.)
>
> Aight, now that I've had a sober look at the code, I think I cracked
> the case.
>
> The runq mechanism used by both 4BSD and ULE provides 64(!) queues;
> a thread's priority is scaled down to pick which of those queues it
> lands in.
>
> When deciding what to run, 4BSD uses runq_choose, which iterates all
> queues from the beginning.  This means threads with lower priority
> values keep executing before the rest.  In particular, a cpu hog
> lands with a high priority value, looking worse than make -j 8
> buildkernel, and only runs when there is nothing else ready to get
> the cpu.  While this may sound decent, it is bad -- in principle, a
> steady stream of lower-value threads can starve the hogs
> indefinitely.
>
> The problem was recognized when writing ULE, but improperly fixed --
> ULE ends up distributing all threads within a given priority range
> across the queues and then performing a lookup in a given queue.
> Here the problem is that while technically everyone does get a
> chance to run, the threads not using full slices are hosed in the
> meantime, as they wait for the hog *a lot*.
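To make the mechanism concrete, here is a rough sketch of the shape
being described (all names are invented, and this is only my reading
of the description above, not the actual sys/runq.h or sched_ule.c
code):

    #include <stddef.h>

    /* 256 priority values (lower runs first) mapped onto 64 queues. */
    #define NPRI    256
    #define NQUEUES 64

    struct sketch_thread {
            struct sketch_thread *next;
            int pri;
    };

    struct sketch_thread *queues[NQUEUES];

    /* Scale a priority to a queue index: 256/64 = 4 values per queue. */
    int
    pri_to_queue(int pri)
    {
            return (pri / (NPRI / NQUEUES));
    }

    /*
     * 4BSD-style choice: always scan from queue 0 upward, so the
     * numerically lowest (most favored) priority wins every time.
     * A steady stream of low-value threads can therefore starve
     * the hogs parked in the high-value queues indefinitely.
     */
    struct sketch_thread *
    choose_4bsd(void)
    {
            for (int i = 0; i < NQUEUES; i++)
                    if (queues[i] != NULL)
                            return (queues[i]);
            return (NULL);
    }

    /*
     * ULE's timeshare variant, per the description above: the scan
     * starts from a rotating index instead of queue 0, so every
     * queue is eventually reached -- but a short-slice thread
     * parked behind a hog's queue now has to wait out the hog.
     */
    int rotor;

    struct sketch_thread *
    choose_ule_timeshare(void)
    {
            for (int i = 0; i < NQUEUES; i++) {
                    int q = (rotor + i) % NQUEUES;
                    if (queues[q] != NULL)
                            return (queues[q]);
            }
            return (NULL);
    }

With that picture in mind, the hack below amounts to restoring the
4BSD scan order on top of ULE.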
>
> A hack patch to induce the bogus-but-better 4BSD behavior of
> draining all runqs before running higher-priority threads drops the
> build time down to ~9 minutes, which is shorter than 4BSD.
>
> However, the right fix would achieve that *without* introducing
> starvation potential.
>
> I also note the runqs are a massive waste of memory and computing
> power.  I'm going to have to sleep on what to do here.
>
> For the interested, here is the hackery:
> https://people.freebsd.org/~mjg/.junk/ule-poc-hacks-dont-use.diff
>
> sysctl kern.sched.slice_nice=0
> sysctl kern.sched.preempt_thresh=400 # arbitrary number higher than any prio

Mateusz,

Thanks for taking a deeper look at the schedulers and providing your
analysis.  If you come up with any patches that you would like to see
get additional testing, feel free to ping me on or off the mailing
list.

--
Steve
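P.S.  For anyone else experimenting with the knobs above: my reading
of the preempt_thresh setting (a sketch of the idea, not the actual
sched_ule.c logic) is that an incoming thread preempts the running
one only when its priority value is at or below the threshold, so
pushing the threshold above any real priority value makes preemption
effectively unconditional:

    /*
     * Sketch of the preempt_thresh idea; invented names, and only
     * my reading of the knob's description, not sched_ule.c itself.
     * Priority values run 0..255, so a threshold of 400 lets any
     * higher-urgency (lower-value) incoming thread preempt.
     */
    int preempt_thresh = 400;

    int
    should_preempt(int incoming_pri, int running_pri)
    {
            return (incoming_pri < running_pri &&
                incoming_pri <= preempt_thresh);
    }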