Re: Periodic rant about SCHED_ULE
- In reply to: Steve Kargl : "Re: Periodic rant about SCHED_ULE"
Date: Fri, 24 Mar 2023 21:17:13 UTC
On Mar 24, 2023, at 13:25, Steve Kargl <sgk@troutmask.apl.washington.edu> wrote:

> On Fri, Mar 24, 2023 at 12:47:08PM -0700, Mark Millard wrote:
>> Steve Kargl <sgk_at_troutmask.apl.washington.edu> wrote on
>> Date: Wed, 22 Mar 2023 19:04:06 UTC :
>>
>>> I reported the issue with ULE some 15 to 20 years ago.
>>> I gave up reporting the issue. The individuals with the
>>> requisite skills to hack on ULE did not; and yes, I lack
>>> those skills. The path of least resistance is to use
>>> 4BSD.
>>>
>>> % cat a.f90
>>> !
>>> ! Silly numerically intensive computation.
>>> !
>>> program foo
>>> implicit none
>>> integer, parameter :: m = 200, n = 1000, dp = kind(1.d0)
>>> integer i
>>> real(dp) x
>>> real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
>>> call random_init(.true., .true.)
>>> allocate(a(n,n), b(n,n))
>>> do i = 1, m
>>> call random_number(a)
>>> call random_number(b)
>>> c = matmul(a,b)
>>> x = sum(c)
>>> if (x < 0) stop 'Whoops'
>>> end do
>>> end program foo
>>> % gfortran11 -o z -O3 -march=native a.f90
>>> % time ./z
>>>        42.16 real        42.04 user         0.09 sys
>>> % cat foo
>>> #! /bin/csh
>>> #
>>> # Launch NCPU+1 images with a 1 second delay
>>> #
>>> foreach i (1 2 3 4 5 6 7 8 9)
>>> ./z &
>>> sleep 1
>>> end
>>> % ./foo
>>>
>>> In another xterm, you can watch the 9 images.
>>>
>>> % top
>>> last pid: 1709;  load averages: 4.90, 1.61, 0.79  up 0+00:56:46  11:43:01
>>> 74 processes: 10 running, 64 sleeping
>>> CPU: 99.9% user, 0.0% nice, 0.1% system, 0.0% interrupt, 0.0% idle
>>> Mem: 369M Active, 187M Inact, 240K Laundry, 889M Wired, 546M Buf, 14G Free
>>> Swap: 16G Total, 16G Free
>>>
>>>   PID USERNAME   THR PRI NICE   SIZE    RES STATE    C   TIME    CPU COMMAND
>>>  1699 kargl        1  56    0    68M    35M RUN      3   0:41 92.60% z
>>>  1701 kargl        1  56    0    68M    35M RUN      0   0:41 92.33% z
>>>  1689 kargl        1  56    0    68M    35M CPU5     5   0:47 91.63% z
>>>  1691 kargl        1  56    0    68M    35M CPU0     0   0:45 89.91% z
>>>  1695 kargl        1  56    0    68M    35M CPU2     2   0:43 88.56% z
>>>  1697 kargl        1  56    0    68M    35M CPU6     6   0:42 88.48% z
>>>  1705 kargl        1  55    0    68M    35M CPU1     1   0:39 88.12% z
>>>  1703 kargl        1  56    0    68M    35M CPU4     4   0:39 87.86% z
>>>  1693 kargl        1  56    0    68M    35M CPU7     7   0:45 78.12% z
>>>
>>> With 4BSD, you see the ./z's with 80% or greater CPU. All the ./z's exit
>>> after 55-ish seconds. If you try this experiment on ULE, you'll get NCPU-1
>>> ./z's with nearly 99% CPU and 2 ./z's with something like 45-ish% as the
>>> two images ping-pong on one cpu. Back when I was testing ULE vs 4BSD,
>>> this was/is due to ULE's cpu affinity where processes never migrate to
>>> another cpu. Admittedly, this was several years ago. Maybe ULE has
>>> gotten better, but George's rant seems to suggest otherwise.
>>
>> Note: I'm only beginning to explore your report/case.
>>
>> There is a significant difference in your report and
>> George's report: his is tied to nice use (and I've
>> replicated there being SCHED_4BSD vs. SCHED_ULE
>> consequences in the same direction George reports
>> but with much larger process counts involved). In
>> those types of experiments, without the nice use
>> I did not find notable differences. But it is a
>> rather different context than your examples. Thus
>> the below as a start on separate experiments closer
>> to what you report using.
>
> Yes, I recognize George's case is different. However,
> the common problem is ULE. My testcase shows/suggests
> that ULE is unsuitable for an HPC platform.
>
>> Not (yet) having a Fortran set up, I did some simple
>> experiments with stress --cpu N (N processes looping
>> sqrt calculations) and top. In top I sorted by pid
>> to make it obvious if a fixed process was getting a
>> fixed CPU or WCPU. (I tried looking at both CPU and
>> WCPU, varying the time between samples as well. I
>> also varied stress's --backoff N.) This was on a
>> ThreadRipper 1950X (32 hardware threads, so 16 cores)
>> that was running:
>
> You only need a numerically intensive program that runs
> for 30-45 seconds.

Well, with 32 hardware threads instead of 8, the time frames
likely need to be proportionally longer: 33 processes need to
be created and run with overlapping run times.

> I use Fortran every day and wrote the
> silly example in 5 minutes. The matrix multiplication
> of two 1000x1000 double precision matrices has two
> benefits with this synthetic benchmark. It takes 40-ish
> seconds on my hardware (AMD FX-8350) and it blows out the
> cpu cache.

I've not checked on the caching issue for what I've done
below. Let me know if you expect it is important to check.

>> This seems at least suggestive that, in my context, the
>> specific old behavior that you report does not show up,
>> at least on the timescales that I was observing. It
>> still might not be something you would find appropriate,
>> but it does appear to at least be different.
>>
>> There is the possibility that stress --cpu N leads to
>> more being involved than I expect and that such is
>> contributing to the behavior that I've observed.
>
> I can repeat the openmpi testing, but it will have to
> wait for a few weeks as I have a pressing deadline.

I'll be curious to learn what you then find.

> The openmpi program is a classic boss-worker scenario
> (and an almost perfectly parallel application with little
> communication overhead). The boss starts and initializes
> the environment and then launches numerically intensive
> workers. If boss + n workers > ncpu, you get a boss and
> a worker pinned to a cpu. If the boss and a worker
> ping-pong, it stalls the entire program.

From what I've seen, a boss plus one worker ping-ponging on a
cpu is not prevented from happening for a while at times, but
it is not sustained indefinitely.

> Admittedly, I last tested years ago. ULE may have had
> improvements.

Actually, I do have a Fortran compiler: gfortran12
(automatically). (My original search had a typo.)

I'll have to adjust the parameters for your example:

# gfortran12 -o z -O3 -march=native a.f90
# time ./z
       27.91 real        27.85 user         0.06 sys

but I've got 32 hardware threads, so a loop waiting 1 second
between launches of 33 instances would have the first ones
exit before the last ones start. Looks like n=2000 would be
sufficient:

# gfortran12 -o z -O3 -march=native a.f90
# time ./z
      211.25 real       211.06 user         0.18 sys
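To be explicit about what I compiled and timed for that longer
run, the adjusted a.f90 would look like the following sketch,
assuming n is the only parameter that needs changing (m left
at 200, everything else as in your original):

!
! Steve's silly numerically intensive computation, with the
! matrix size raised from 1000 to 2000 so that each image
! runs long enough on this 32-thread machine.
!
program foo
   implicit none
   integer, parameter :: m = 200, n = 2000, dp = kind(1.d0)
   integer i
   real(dp) x
   real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
   call random_init(.true., .true.)
   allocate(a(n,n), b(n,n))
   do i = 1, m
      call random_number(a)
      call random_number(b)
      c = matmul(a,b)   ! c is allocated on assignment
      x = sum(c)
      if (x < 0) stop 'Whoops'
   end do
end program foo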
For 33 processes, things are as I described when I look with
the likes of:

# top -a -opid -s5

Using a shorter sampling interval shows the WCPU figures
moving around between the processes more; a longer interval
shows less of that WCPU variability across the processes.
(As I remember, -s defaults to 3 seconds and has a minimum
of 1 second.)

Given the 32 hardware threads, I used 33 processes via:

# more runz
#! /bin/csh
#
# Launch NCPU+1 images with a 1 second delay
#
foreach d (1 2 3)
foreach i (1 2 3 4 5 6 7 8 9 10)
./z &
sleep 1
end
end
foreach j (1 2 3)
./z &
sleep 1
end

My guess is that, if you end up seeing what you originally
described, some environmental difference would be involved in
why I see different behavior, something to then be tracked
down for what is different in the two contexts.

===
Mark Millard
marklmi at yahoo.com