a proposed callout API

Fri Dec 1 01:31:09 PST 2006

:The implications of adopting the model Matt proposes are quite far-reaching: 
:callouts don't exist in isolation, but occur in the context of data structures 
:and work occurring in many threads.  If callouts are pinned to a particular 
:...
:Consider the case of TCP timers: a number of TCP timers get regularly 
:rescheduled (delack, retransmit, etc).  If they can only be manipulated from 
:cpu0 (i.e., protected by a synchronization primitive that can't be acquired 
:from another CPU -- i.e., critical sections instead of mutexes), how do you 
:handle the case where a TCP packet for that connection is processed on 
:cpu1 and needs to change the scheduling of the timer?  In a strict work/data 
:structure pinning model, you would pin the TCP connection to cpu0, and only 
:process any data leading to timer changes on that CPU.  Alternatively, you 
:might pass a message from cpu1 to cpu0 to change the scheduling.

    Yes, this is all very true.  One could think of this in a more abstract
    way if that would make things more clear:  All the work processing 
    related to a particular TCP connection is accumulated into a single
    'hopper'.  The hopper is what is being serialized with a mutex, or by
    cpu-locality, or even simply by thread-locality (dedicating a single
    thread to process a single hopper).  This means that all the work
    that has accumulated in the hopper can be processed while holding a
    single serializer instead of having to acquire and release a serializer
    for each work item within the hopper.
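
    A minimal sketch of the hopper idea, assuming a toy serializer that
    only counts acquisitions.  Every name here (struct hopper,
    hopper_drain(), and so on) is hypothetical and not an existing kernel
    API.  The point is that a batch of work costs one acquisition
    instead of one per item:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not an existing kernel API. */
#define HOPPER_MAX_WORK 64

struct work_item {
    int kind;                   /* e.g. packet, timeout, user request */
};

struct hopper {
    int locked;                 /* toy serializer: 1 while held */
    unsigned long acquires;     /* times the serializer was taken */
    size_t nwork;               /* pending work items */
    struct work_item work[HOPPER_MAX_WORK];
    unsigned long processed;    /* work items handled so far */
};

static void hopper_lock(struct hopper *h)
{
    assert(!h->locked);         /* stand-in for a mutex or cpu-locality */
    h->locked = 1;
    h->acquires++;
}

static void hopper_unlock(struct hopper *h)
{
    assert(h->locked);
    h->locked = 0;
}

/*
 * Queue a work item; returns 0 on success, -1 if the hopper is full.
 * A real enqueue path needs its own synchronization, omitted here.
 */
static int hopper_enqueue(struct hopper *h, int kind)
{
    if (h->nwork == HOPPER_MAX_WORK)
        return -1;
    h->work[h->nwork++].kind = kind;
    return 0;
}

/* Drain everything that has accumulated under ONE acquisition. */
static size_t hopper_drain(struct hopper *h)
{
    size_t i, n;

    hopper_lock(h);
    n = h->nwork;
    for (i = 0; i < n; i++)
        h->processed++;         /* real code would dispatch on work[i].kind */
    h->nwork = 0;
    hopper_unlock(h);
    return n;
}
```

    Queueing ten items and draining once leaves h.acquires at 1, where
    a per-item serializer would have been taken ten times.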

    That's the gist of it.  If you have enough hoppers, statistics takes
    care of the rest.  There is nothing that says the hoppers have to
    be pinned to particular cpus; it just makes it easier for other
    system APIs if they are.

    For FreeBSD, I think the hopper abstraction might be the way to
    go.  You could then have worker threads running on each cpu (one per
    cpu) which compete for hoppers with pending work.  You can avoid
    wiring the hoppers to particular cpus (which FreeBSD people seem to
    dislike considerably) yet still reap the benefits of batch processing.

    TCP callout timers are a really good example here, because TCP callout
    timers are ONLY ever manipulated from within the TCP protocol stack,
    which means they are only manipulated in the context of a TCP work
    item (either a packet, or a timeout, or user requested work).  If you
    think about it, nearly *all* the manipulation of the TCP callout timers
    occurs during work item processing where you already hold the governing
    serializer.  That is the manipulation that needs to become optimal here.

    So the question for callouts then becomes.... can the serializer used
    for the work item processing be the SAME serializer that the callout
    API uses to control access to the callout structures?  

    In the DragonFly model the answer is: yes, easy, because the serializer
    is cpu-localized.

    In FreeBSD the same thing could be accomplished by implementing a
    callout wheel for each 'hopper', controlled by the same serializer.

    The only real performance issue is how to handle work item events
    caused by userland read() or write()'s.... do you have those operations
    send a message to the thread managing the hopper?  Or do you have 
    those operations obtain the hopper's serializer and enter the TCP
    stack directly?  For FreeBSD I would guess the latter... obtain the
    hopper's serializer and enter the TCP stack directly.  But if you
    were to implement it you could actually do it both ways and have a
    sysctl to select which method to use, then look at how that affects
    performance.

    The other main entry point for packets into the TCP stack is from the
    network interface.  The network interface layer is typically
    interrupt driven, and just as typically it is not (in my opinion) the
    best idea to try to call the TCP protocol stack from the network
    interrupt as it seriously elongates the code path and enlarges the
    cache footprint required to run through a network interface's
    RX ring.  The RX ring is likely to contain dozens or even a hundred
    or more packets bound for a smaller (but still significant) number of
    TCP connections.

    Breaking up that processing into two separate loops... getting the 
    packets off the RX ring and placing them in the correct hopper, and
    processing the hopper's work queue, would yield a far better cache
    footprint.  Again, my opinion.
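
    The two-loop split can be sketched roughly as follows.  This assumes
    packets carry a precomputed connection id and that hoppers are picked
    by a simple modulo; struct pkt, struct pkt_hopper and both functions
    are made-up names for illustration, not driver or stack code:

```c
#include <assert.h>
#include <stddef.h>

#define NHOPPERS   4
#define HOPPER_CAP 128

struct pkt {
    unsigned conn_id;           /* stand-in for a hash of the TCP 4-tuple */
};

struct pkt_hopper {
    size_t npkts;
    const struct pkt *pkts[HOPPER_CAP];
    unsigned long processed;
};

/* Loop 1: short, tight pass over the RX ring that only sorts packets. */
static void rx_ring_sort(const struct pkt *ring, size_t n,
                         struct pkt_hopper *hoppers)
{
    assert(ring != NULL && hoppers != NULL);
    for (size_t i = 0; i < n; i++) {
        struct pkt_hopper *h = &hoppers[ring[i].conn_id % NHOPPERS];

        /* a real driver would drop or apply back-pressure when full */
        if (h->npkts < HOPPER_CAP)
            h->pkts[h->npkts++] = &ring[i];
    }
}

/* Loop 2: protocol processing, one hopper at a time. */
static unsigned long hoppers_process(struct pkt_hopper *hoppers)
{
    unsigned long total = 0;

    for (size_t i = 0; i < NHOPPERS; i++) {
        struct pkt_hopper *h = &hoppers[i];

        /* the hopper's serializer would be held across this inner loop */
        for (size_t j = 0; j < h->npkts; j++)
            h->processed++;     /* stand-in for TCP input on pkts[j] */
        total += h->npkts;
        h->npkts = 0;
    }
    return total;
}
```

    Each loop touches a small, mostly disjoint set of code and data,
    which is where the better cache behavior would be expected to come
    from.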

    --

    In any case, these methodologies basically exist in order to remove
    the need to acquire a serializer that is so fine-grained that the
    overhead of the serializer becomes a serious component of the overhead
    of the work being serialized.  That is *certainly* the case for
    the callout API.  Sans serializer, the callout API is basically one or
    two TAILQ manipulations and that is it.  You can't get much faster
    than that.  I don't think it is appropriate to try to abstract away
    the serializer when the serializer becomes such a large component.
    That's like hiding something you don't like under your bed.
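
    To make the "one or two TAILQ manipulations" concrete, here is a
    stripped-down sketch using the <sys/queue.h> macros.  struct
    mini_callout and the two functions are hypothetical and much simpler
    than the kernel's struct callout; the caller is assumed to already
    hold whatever serializer protects the slots:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical, stripped-down callout; not the kernel's struct callout. */
struct mini_callout {
    TAILQ_ENTRY(mini_callout) link;
    int pending;                /* 1 while queued on a wheel slot */
    int time;                   /* absolute expiry tick */
};

TAILQ_HEAD(mini_slot, mini_callout);

/* Caller is assumed to hold the serializer that protects 'slot'. */
static void mini_callout_stop(struct mini_slot *slot, struct mini_callout *c)
{
    assert(slot != NULL && c != NULL);
    if (c->pending) {
        TAILQ_REMOVE(slot, c, link);            /* manipulation #1 */
        c->pending = 0;
    }
}

/* Move 'c' from old_slot (if queued) to new_slot with a new expiry. */
static void mini_callout_reset(struct mini_slot *old_slot,
                               struct mini_slot *new_slot,
                               struct mini_callout *c, int when)
{
    mini_callout_stop(old_slot, c);
    c->time = when;
    TAILQ_INSERT_TAIL(new_slot, c, link);       /* manipulation #2 */
    c->pending = 1;
}
```

    With no lock inside either function, a reset is one removal plus one
    insertion, so any per-call serializer would cost about as much as
    the work itself.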

    --

    Something just came to my attention... are you guys actually using 
    high 'hz' values to govern your current callout API?  In particular,
    the c->c_time field?  If that is the case the size of your callwheel
    array may be insufficient to hold even short timeouts without wrapping. 

    That could create *serious* performance problems with the callwheel
    design.  And I do mean serious.  The entire purpose of having the 
    callwheel is to support the notion that most timeouts will be removed
    or reset before they actually occur, meaning before the iterator
    (softclock_handler() in kern_timeout.c) gets to the index.  If you
    wrap, the iterator may wind up having to skip literally thousands or
    hundreds of thousands of callout structures during its scan.
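
    A toy model of the wrap problem, assuming a wheel indexed by
    (expiry tick & mask) and an iterator that fires only entries whose
    expiry equals the current tick.  The array-based struct toy_wheel
    and its functions are invented for illustration and are far simpler
    than kern_timeout.c:

```c
#include <assert.h>
#include <stddef.h>

#define WHEEL_SIZE 8                    /* tiny on purpose; power of 2 */
#define WHEEL_MASK (WHEEL_SIZE - 1)
#define SLOT_CAP   16

struct toy_wheel {
    int    times[WHEEL_SIZE][SLOT_CAP]; /* expiry tick of each entry */
    size_t count[WHEEL_SIZE];
};

static void toy_insert(struct toy_wheel *w, int expire_tick)
{
    size_t slot = (size_t)expire_tick & WHEEL_MASK;

    assert(w->count[slot] < SLOT_CAP);
    w->times[slot][w->count[slot]++] = expire_tick;
}

/*
 * Scan the slot for 'now'.  Entries that are due are fired (removed).
 * Returns how many entries had to be stepped over because they belong
 * to a later lap of the wheel.
 */
static size_t toy_scan(struct toy_wheel *w, int now, size_t *fired)
{
    size_t slot = (size_t)now & WHEEL_MASK;
    size_t kept = 0, skipped = 0;

    *fired = 0;
    for (size_t i = 0; i < w->count[slot]; i++) {
        int t = w->times[slot][i];

        if (t == now) {
            (*fired)++;
        } else {
            w->times[slot][kept++] = t; /* wrapped: skip, keep queued */
            skipped++;
        }
    }
    w->count[slot] = kept;
    return skipped;
}
```

    Inserting expiries 3, 11 and 19 into this 8-slot wheel puts all
    three in slot 3.  The scan at tick 3 fires one entry and steps over
    the other two, and that skipped work is what grows with every
    long timeout that wraps.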

    So, e.g. a typical callwheel is sized to 16384 or 32768 entries
    ('print callwheelsize' from kgdb on a live kernel).  At 100hz,
    32768 entries give us about 327 seconds of range before callout entries
    start to wrap.  At 1000hz, 32768 entries give you barely 32 seconds
    of range.   All TCP timers except the idle timer are fairly short
    lived.  The idle timer could be an issue for you.  In fact, it could
    be an issue for us too... that's something I will have a look at in
    DragonFly.  
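
    The range arithmetic is just wheel size divided by hz.  A small
    sketch, with hypothetical helper names, that also checks whether a
    given timeout fits in one lap of the wheel:

```c
#include <assert.h>

/* Seconds of timeout range a wheel covers before entries start to wrap. */
static double callwheel_range_seconds(unsigned wheel_size, unsigned hz)
{
    assert(hz > 0);
    return (double)wheel_size / (double)hz;
}

/* Nonzero if a timeout of 'seconds' fits within one lap of the wheel. */
static int timeout_fits(unsigned wheel_size, unsigned hz, unsigned seconds)
{
    return (unsigned long long)seconds * hz < wheel_size;
}
```

    callwheel_range_seconds(32768, 100) works out to 327.68 and
    callwheel_range_seconds(32768, 1000) to 32.768.  An idle timer
    measured in hours would not fit at either rate.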

    You could also be hitting another big problem by using too fine-grained
    a timer/timeout resolution, and that is destroying the natural aggregation
    of work that occurs with coarse resolutions.  It doesn't make much sense
    to have a handful of callouts at 10ms, 11ms and 12ms for example.
    It would be better to have them all in one slot (like at 12ms) so they
    can all be processed in batch.  
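
    One simple way to get that aggregation back is to round timeouts up
    to a coarser granule.  This is only a sketch of the idea; the
    function name and the 4ms granule in the example are assumptions,
    not anything either kernel does today:

```c
#include <assert.h>

/* Round a timeout up to the next multiple of 'granule_ms'. */
static unsigned coalesce_timeout_ms(unsigned timeout_ms, unsigned granule_ms)
{
    assert(granule_ms > 0);
    return (timeout_ms + granule_ms - 1) / granule_ms * granule_ms;
}
```

    With a 4ms granule, timeouts of 10ms, 11ms and 12ms all land on
    12ms and can be processed in one batch.  Rounding up rather than
    down means a callout never fires earlier than requested.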

    This is particularly true for anything that can be processed with a
    tight code loop, and the TCP protocol stack certainly applies there.
    I think Jeffrey Hsu actually counted instruction cycles for TCP
    processing through the short-cut tests (the optimal/critical path
    when incoming data packets are in-order and non-overlapping and such),
    and once he fixed some of the conditionals the number of instructions
    required to process a packet was reduced dramatically, and they
    certainly fit in the L1 cache.

    Something to think about, anyhow.  I'll read the paper you referenced.
    It looks interesting.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>