Scheduler weirdness
Steve Kargl
sgk at troutmask.apl.washington.edu
Mon Oct 12 04:49:12 UTC 2009
On Mon, Oct 12, 2009 at 03:35:15PM +1100, Alex R wrote:
> Steve Kargl wrote:
> >On Mon, Oct 12, 2009 at 01:49:27PM +1100, Alex R wrote:
> >
> >>Steve Kargl wrote:
> >>
> >>>So, you have 4 cpus and 4 folding-at-home processes and you're
> >>>trying to use the system with other apps? Switch to 4BSD.
> >>>
> >>>
> >>>
> >>I thought SCHED_ULE was meant to be a much better choice in an SMP
> >>environment. Why are you suggesting he rebuild his kernel and use the
> >>legacy scheduler?
> >>
> >>
> >
> >If you have N cpus and N+1 numerically intensive applications,
> >ULE may have poor performance compared to 4BSD. In the OP's case,
> >he has 4 cpus and 4 numerically intensive (?) applications. He,
> >however, is also trying to use the system in some interactive
> >way.
> >
> >
> Ah, OK. Is this just an accepted thing by the FreeBSD devs, or are they
> trying to fix it?
>
Jeff appears to be extremely busy with other projects. He is aware of
the problem, and I have set up my system to give him access if and when
he wants it.
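As an aside, for anyone who wants to follow the earlier suggestion to try
4BSD: the scheduler is a compile-time kernel option, so switching means
rebuilding the kernel. A rough sketch of the steps, assuming a custom
config named MYKERNEL under the stock /usr/src layout:

    # check which scheduler the running kernel was built with
    sysctl kern.sched.name

    # in the kernel config (e.g. sys/amd64/conf/MYKERNEL), replace
    #   options SCHED_ULE
    # with
    #   options SCHED_4BSD

    # rebuild, install, and reboot
    cd /usr/src
    make buildkernel KERNCONF=MYKERNEL
    make installkernel KERNCONF=MYKERNEL
    shutdown -r now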
Here's the text of my last set of tests that I sent to him:
OK, I've managed to recreate the problem. User kargl launches an MPI
job on node10 that creates two images on node20. This is command z
in the top(1) info. 30 seconds later, user sgk launches an MPI job
on node10 that creates 8 images on node20. This is command rivmp in
the top(1) info. With 8 available cpus, this is a (slightly) oversubscribed
node.
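To give a feel for the setup, a sketch of how such a launch might look;
the exact mpirun and hostfile syntax depends on the MPI implementation
(Open MPI style is shown here, and the hostfile name is made up):

    # hostfile giving node20 eight slots
    echo "node20 slots=8" > hosts

    # two netpipe images on node20 (command z in top)
    mpirun -np 2 -hostfile hosts ./z &

    # ~30 seconds later, eight more images on node20 (command rivmp in
    # top), for a total of 10 cpu-bound processes on 8 cpus
    mpirun -np 8 -hostfile hosts ./rivmp &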
For 4BSD, I see
last pid: 1432; load averages: 8.68, 5.65, 2.82 up 0+01:52:14 17:07:22
40 processes: 11 running, 29 sleeping
CPU: 100% user, 0.0% nice, 0.0% system, 0.0% interrupt, 0.0% idle
Mem: 32M Active, 12M Inact, 203M Wired, 424K Cache, 29M Buf, 31G Free
Swap: 4096M Total, 4096M Free
PID  USERNAME THR PRI NICE    SIZE    RES STATE  C  TIME     CPU COMMAND
1428 sgk        1 124    0  81788K  5848K CPU3   6  1:13  78.81% rivmp
1431 sgk        1 124    0  81788K  5652K RUN    1  1:13  78.52% rivmp
1415 kargl      1 124    0  78780K  4668K CPU7   1  1:38  78.42% z
1414 kargl      1 124    0  78780K  4664K CPU0   0  1:37  77.25% z
1427 sgk        1 124    0  81788K  5852K CPU4   3  1:13  78.42% rivmp
1432 sgk        1 124    0  81788K  5652K CPU2   4  1:13  78.27% rivmp
1425 sgk        1 124    0  81788K  6004K CPU5   5  1:12  78.17% rivmp
1426 sgk        1 124    0  81788K  5832K RUN    6  1:13  78.03% rivmp
1429 sgk        1 124    0  81788K  5788K CPU6   7  1:12  77.98% rivmp
1430 sgk        1 124    0  81788K  5764K RUN    2  1:13  77.93% rivmp
Notice that the accumulated times appear reasonable. At this point in the
computation, rivmp is doing no communication between its processes. z is
the netpipe benchmark and is essentially sending messages between its two
processes over the memory bus.
For ULE, I see
last pid: 1169; load averages: 7.56, 2.61, 1.02 up 0+00:03:15 17:13:01
40 processes: 11 running, 29 sleeping
CPU: 100% user, 0.0% nice, 0.0% system, 0.0% interrupt, 0.0% idle
Mem: 31M Active, 9392K Inact, 197M Wired, 248K Cache, 26M Buf, 31G Free
Swap: 4096M Total, 4096M Free
PID  USERNAME THR PRI NICE    SIZE    RES STATE  C  TIME     CPU COMMAND
1168 sgk        1 118    0  81788K  5472K CPU6   6  1:18 100.00% rivmp
1169 sgk        1 118    0  81788K  5416K CPU7   7  1:18 100.00% rivmp
1167 sgk        1 118    0  81788K  5496K CPU5   5  1:18 100.00% rivmp
1166 sgk        1 118    0  81788K  5564K RUN    4  1:18 100.00% rivmp
1151 kargl      1 118    0  78780K  4464K CPU3   3  1:48  99.27% z
1152 kargl      1 110    0  78780K  4464K CPU0   0  1:18  62.89% z
1164 sgk        1 113    0  81788K  5592K CPU1   1  0:55  80.76% rivmp
1165 sgk        1 110    0  81788K  5544K RUN    0  0:52  62.16% rivmp
1163 sgk        1 107    0  81788K  5624K RUN    2  0:40  50.68% rivmp
1162 sgk        1 107    0  81788K  5824K CPU2   2  0:39  50.49% rivmp
In the above, processes 1162-1165 are clearly not receiving sufficient time
slices to keep up with the other 4 rivmp images. Watching top at a
1-second interval, once those 4 rivmp images hit 100% CPU, they stayed
pinned to their cpus and remained at 100% CPU. Note also that process pairs
1152/1165 and 1162/1163 are stuck on cpus 0 and 2, respectively.
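For completeness, a sketch of how the same thing can be watched with stock
tools, taking one of the starved pids above as an example:

    # top at a 1-second interval; the C column shows the cpu each
    # process last ran on
    top -s 1

    # the mask a process is allowed to run on; unless it was bound
    # explicitly it spans all cpus, so the pinning above is ULE's own
    # placement rather than an explicit binding
    cpuset -g -p 1163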
--
Steve