a proposed callout API

Thu Nov 30 17:39:31 PST 2006

On Thu, 30 Nov 2006, Ivan Voras wrote:

> Not trying to take sides here, but for us willing to learn here, what exactly 
> are the problems in Matt Dillon's suggestions? From a novice's POV, having 
> per-cpu queues looks (emphasis: looks) very scalable and performant.

The implications of adopting the model Matt proposes are quite far-reaching: 
callouts don't exist in isolation, but occur in the context of data structures 
and work occurring in many threads.  If callouts are pinned to a particular 
CPU, and can only be scheduled, rescheduled, and cancelled from that CPU, that 
implies either that all work associated with that callout is also pinned to 
the CPU, or that migration or message-passing be involved if the requirement 
comes up in a thread on another CPU.

Consider the case of TCP timers: a number of TCP timers get regularly 
rescheduled (delack, retransmit, etc).  If they can only be manipulated from 
cpu0 (i.e., protected by a synchronization primitive that can't be acquired 
from another CPU -- i.e., critical sections instead of mutexes), how do you 
handle the case where a TCP packet for that connection is processed on 
cpu1 and needs to change the scheduling of the timer?  In a strict work/data 
structure pinning model, you would pin the TCP connection to cpu0, and only 
process any data leading to timer changes on that CPU.  Alternatively, you 
might pass a message from cpu1 to cpu0 to change the scheduling.
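
To make the message-passing alternative concrete, here is a minimal 
userland sketch, not kernel code.  All names (pinned_callout, 
pinned_callout_reset, pinned_callout_drain, the inbox arrays) are 
hypothetical and do not correspond to any existing or proposed API.  It 
assumes a timer may only be modified by its owning CPU, so a request from 
any other CPU is queued and takes effect only when the owner drains its 
inbox:

```c
#include <assert.h>

#define NCPU   2
#define MAXMSG 8

/* Hypothetical timer: only owner_cpu may modify `expires` directly. */
struct pinned_callout {
	int owner_cpu;
	int expires;		/* absolute tick at which the timer fires */
};

/* A request from a non-owning CPU to reschedule a timer. */
struct resched_msg {
	struct pinned_callout *c;
	int new_expires;
};

/*
 * Per-CPU inbox.  Assumption: a real system would need an IPI or a
 * cross-CPU-safe queue here; this single-threaded sketch uses a plain
 * array.
 */
static struct resched_msg inbox[NCPU][MAXMSG];
static int inbox_len[NCPU];

/*
 * Called by code running on `curcpu`.  Returns 1 if the change was
 * applied immediately, 0 if it was deferred to the owning CPU, and -1
 * if the owner's inbox is full.
 */
int
pinned_callout_reset(struct pinned_callout *c, int curcpu, int new_expires)
{
	int n;

	assert(curcpu >= 0 && curcpu < NCPU);
	assert(c->owner_cpu >= 0 && c->owner_cpu < NCPU);
	if (curcpu == c->owner_cpu) {
		c->expires = new_expires;
		return (1);
	}
	n = inbox_len[c->owner_cpu];
	if (n == MAXMSG)
		return (-1);
	inbox[c->owner_cpu][n].c = c;
	inbox[c->owner_cpu][n].new_expires = new_expires;
	inbox_len[c->owner_cpu] = n + 1;
	return (0);
}

/* Run by the owning CPU; applies queued requests and returns the count. */
int
pinned_callout_drain(int cpu)
{
	int i, n;

	assert(cpu >= 0 && cpu < NCPU);
	n = inbox_len[cpu];
	for (i = 0; i < n; i++)
		inbox[cpu][i].c->expires = inbox[cpu][i].new_expires;
	inbox_len[cpu] = 0;
	return (n);
}
```

The point of the sketch is the deferred path: a packet processed on cpu1 
cannot change a cpu0 timer in place, so the new expiry is not visible 
until cpu0 next drains its inbox.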

The idea of processing timers in multiple threads and pinning them to multiple 
CPUs clearly isn't a bad idea: we could likely benefit from parallelism (and 
generally, concurrency) in timer processing.  One of the things we discussed 
at the recent developer summit was subsystem callout threads (introducing the 
opportunity for parallelism without committing to a particular CPU scheduling 
model), as well as per-CPU callout threads but protected using mutexes so that 
reschedule/cancel/etc. can still be performed from other CPUs.  Changing the 
API so that scheduling/rescheduling/etc activities themselves must occur on a 
particular CPU has serious implications and commits us to an architectural 
approach for which there is little consensus.  If the goal is simply 
parallelism, it's possible to accomplish that without embedding assumptions 
about the synchronization model at this point.  Take a look at the USENIX 
paper by Paul Willmann (et al) at Rice for some rather interesting 
experimentation, measurement, and discussion precisely along these lines:

     http://www.ece.rice.edu/~willmann/pubs/paranet_tr06-872.pdf
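
For contrast, the "per-CPU callout threads but protected using mutexes" 
variant mentioned above might look roughly like the following userland 
sketch.  It uses POSIX mutexes as a stand-in for kernel mutexes, and all 
names (callout_queue, mtx_callout, mtx_callout_init/reset/stop) are 
hypothetical.  The assumption is that each timer is assigned to one 
per-CPU queue, but the queue's mutex, rather than the CPU, protects it, 
so any thread may reschedule or cancel:

```c
#include <assert.h>
#include <pthread.h>

#define NCPU 2

/* Hypothetical per-CPU queue state, guarded by a mutex. */
struct callout_queue {
	pthread_mutex_t lock;
	int nactive;		/* timers currently scheduled here */
};

struct mtx_callout {
	struct callout_queue *q;	/* queue this timer is assigned to */
	int active;
	int expires;			/* absolute tick at which it fires */
};

static struct callout_queue queues[NCPU] = {
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
};

/* Assign a timer to the queue of the given CPU. */
void
mtx_callout_init(struct mtx_callout *c, int cpu)
{
	assert(cpu >= 0 && cpu < NCPU);
	c->q = &queues[cpu];
	c->active = 0;
	c->expires = 0;
}

/* Schedule or reschedule; callable from a thread on any CPU. */
void
mtx_callout_reset(struct mtx_callout *c, int expires)
{
	pthread_mutex_lock(&c->q->lock);
	if (!c->active) {
		c->active = 1;
		c->q->nactive++;
	}
	c->expires = expires;
	pthread_mutex_unlock(&c->q->lock);
}

/* Cancel; returns 1 if a pending timer was stopped, 0 otherwise. */
int
mtx_callout_stop(struct mtx_callout *c)
{
	int was_active;

	pthread_mutex_lock(&c->q->lock);
	was_active = c->active;
	if (was_active) {
		c->active = 0;
		c->q->nactive--;
	}
	pthread_mutex_unlock(&c->q->lock);
	return (was_active);
}
```

Here the caller's CPU never appears in the scheduling calls, which is the 
property that keeps the synchronization model out of the API.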

Robert N M Watson
Computer Laboratory
University of Cambridge