Heavy I/O blocks FreeBSD box for several seconds

Thu Jul 7 15:54:52 UTC 2011

On Thu, Jul 7, 2011 at 5:14 PM, Steve Kargl <sgk at troutmask.apl.washington.edu> wrote:

> On Thu, Jul 07, 2011 at 10:27:53AM +0300, Andriy Gapon wrote:
> > on 06/07/2011 21:11 Nathan Whitehorn said the following:
> > > On 07/06/11 13:00, Steve Kargl wrote:
> > >> AFAICT, it is a cpu affinity issue.  If I launch n+1 MPI images
> > >> on a system with n cpus/cores, then 2 (and sometimes 3) images
> > >> are stuck on a cpu and those 2 (or 3) images ping-pong on that
> > >> cpu.  I recall trying to use renice(8) to force some load
> > >> balancing, but vaguely remember that it did not help.
> > >
> > > I've seen exactly this problem with multi-threaded math libraries,
> > > as well.
> >
> > Exactly the same?  Let's see.
> >
> > > Using parallel GotoBLAS on FreeBSD gives terrible performance because
> > > the threads keep migrating between CPUs, causing frequent cache misses.
> >
> > So Steve reports that if he has Nthr > Ncpu, then some threads are
> > "over-glued" to a particular CPU, which results in sub-optimal
> > scheduling for those threads.  I have to guess that Steve would want
> > to see the threads being shuffled between CPUs to produce more even
> > CPU load.
>
> I'm using OpenMPI.  These are N > Ncpu processes not threads, and without
> loss of generality let N = Ncpu + 1.  It is a classic master-slave
> situation where 1 process initializes all others.  The N - 1 slave processes
> are then independent of each other.  After 20 minutes or so of number
> crunching, each slave sends a few 10s of KB of data to the master.  The
> master collects all the data, writes it to disk, and then sends the
> slaves the next set of computations to do.  The computations are nearly
> identical, so each slave finishes its task in the same amount of time. The
> problem appears to be that 2 slaves are bound to the same cpu and the
> remaining N - 3 slaves are each bound to their own cpu.  The N - 3 slaves
> finish their task, send data to the master, and then spin (chewing up
> nearly 100% cpu) waiting for the 2 ping-ponging slaves to finish.
> This causes a stall in the computation.  When a complete computation
> takes days to complete, these stalls become problematic.  So, yes, I
> want the processes to get a more uniform access to cpus via migration
> to other cpus.  This is what 4BSD appears to do.
>
>
Spinning threads are a PITA for any scheduler; it's just that in your case
4BSD computes quanta differently. Is there any way to make the software
sleep instead of spinning?
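
To make "sleep instead of spinning" concrete, here is a minimal sketch in C.
It is not taken from OpenMPI or from Steve's code. The names wait_sleeping,
ready_fn and countdown_ready are hypothetical, and the 1 ms interval in the
usage note is an arbitrary choice. The idea is only that a waiter checks for
its event, and if nothing is there yet, hands the CPU back with nanosleep()
rather than re-checking in a tight loop.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() under strict C modes */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical readiness callback: returns nonzero once the awaited
 * event (for example, a message from a peer) has arrived. */
typedef int (*ready_fn)(void *arg);

/*
 * Wait until ready(arg) returns nonzero, sleeping sleep_ns nanoseconds
 * between checks instead of busy-looping.  Returns the number of times
 * the callback was invoked.
 */
static unsigned long
wait_sleeping(ready_fn ready, void *arg, long sleep_ns)
{
    struct timespec ts;
    unsigned long polls = 0;

    assert(ready != NULL);
    assert(sleep_ns >= 0 && sleep_ns < 1000000000L);
    ts.tv_sec = 0;
    ts.tv_nsec = sleep_ns;

    for (;;) {
        polls++;
        if (ready(arg))
            return polls;
        /* Give the CPU back to the scheduler until the next check. */
        nanosleep(&ts, NULL);
    }
}

/* Toy readiness check for illustration only: the int pointed to by arg
 * is decremented on each call, and the check succeeds once it reaches 0. */
static int
countdown_ready(void *arg)
{
    int *remaining = arg;

    return --(*remaining) <= 0;
}
```

For example, with int n = 3, calling wait_sleeping(countdown_ready, &n,
1000000L) checks three times and sleeps about 1 ms between checks. In an MPI
program the callback could presumably wrap a non-blocking test such as
MPI_Iprobe() or MPI_Test(); whether that is practical for the application
discussed here is an assumption. The trade-off is added latency of up to one
sleep interval per wake-up, which should be negligible next to 20-minute
work units.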

> > On the other hand, you report that your threads keep being shuffled
> > between CPUs (I presume for Nthr == Ncpu case, where Nthr is a count of
> > the number-crunching threads).  And I guess that you want them to stay
> > glued to particular CPUs.
> >
> > So how is this the same problem?  In fact, it sounds somewhat like the
> > opposite.
> > The only thing in common is that you both don't like how ULE works.
>
> Well, it may be similar in that N - 2 threads are bound to N - 2
> cpus, and the remaining 2 threads are ping ponging on the last
> remaining cpu.  I suspect that GotoBLAS has a large amount of
> communication between threads, and once again the computation
> stalls waiting for the 2 threads to either finish battling for the
> 1 cpu or perhaps the process uses pthread_yield() in some clever
> way to try to get load balancing.
>
> --
> Steve
> _______________________________________________
> freebsd-current at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe at freebsd.org"
>

-- 
Good, fast & cheap. Pick any two.