Network Stack Locking
Matthew Dillon
dillon at apollo.backplane.com
Thu May 20 18:03:27 PDT 2004
It's my guess that we will be able to remove the BGL from large
portions of the DFly network stack sometime in late June or early July,
after USENIX, at which point it will be possible to test the SMP aspects of
the localized cpu distribution method. Right now the network stack is
still under the BGL (as is most of the system; our approach to MP is
first to isolate and localize the conflicting subsystems, then to release
the BGL for that subsystem's thread(s)).
It should be noted that the biggest advantages of the distributed
approach are (1) the ability to operate on individual PCBs without
having to do any token/mutex/other locking at all, (2) cpu locality
of reference in regards to cache mastership of the PCBs and related data,
and (3) avoidance of data cache pollution across cpus (more cpus ==
better utilization of individual L1/L2 caches and far greater
scalability). The biggest disadvantage is the mandatory thread switch
(but this is mitigated as load increases, since each thread can work on
several PCBs without further switches, and because our thread scheduler
is extremely lightweight under SMP conditions). Message passing
overhead is very low since most operations already require some sort of
roll-up structure to be passed (e.g. an mbuf in the case of the network).
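To make the message-cost point concrete, here is a minimal sketch of the
idea. The types and names (pkt_msg, msgport, msgport_send) are made up
for illustration and are not the real DFly netmsg/mbuf structures; the
point is just that the buffer which must be handed to the protocol
thread anyway doubles as the message, so the send path does no extra
allocation and the queue is owned by exactly one thread.

/*
 * Sketch only: the packet buffer itself carries the message linkage
 * and handler, so queueing it to the owning protocol thread costs a
 * few pointer stores and no locks.
 */
#include <stddef.h>

struct pkt_msg {
    struct pkt_msg *next;               /* intrusive queue linkage */
    void (*handler)(struct pkt_msg *);  /* run by the target thread */
    /* ... packet data follows (think: mbuf) ... */
};

struct msgport {
    struct pkt_msg *head, *tail;        /* owned by exactly one thread */
};

/*
 * Queue a packet to the owning thread's port.  This sketch assumes the
 * caller already runs on the owning cpu; cross-cpu sends in the real
 * stack go through the IPI path described later in this mail, so no
 * mutex is needed there either.
 */
static void
msgport_send(struct msgport *port, struct pkt_msg *msg)
{
    msg->next = NULL;
    if (port->tail != NULL)
        port->tail->next = msg;
    else
        port->head = msg;
    port->tail = msg;
    /* a real implementation would wake the target thread here */
}

/* Executed by the owning protocol thread: drain and process. */
static void
msgport_run(struct msgport *port)
{
    struct pkt_msg *msg;

    while ((msg = port->head) != NULL) {
        port->head = msg->next;
        if (port->head == NULL)
            port->tail = NULL;
        msg->handler(msg);      /* touches the PCB with no locks held */
    }
}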
We are running the full-bore threaded, distributed network stack even
on UP systems now (meaning: message passing and thread switching still
occur even though there is only one target thread for a particular
protocol). We have done fairly significant testing on GigE LANs and
have not noticed any degradation in network performance, so we are
certain we are on the right track.
I do not expect cpu balancing to be all that big an issue, actually,
especially given the typically short-lived connections in these
scenarios. But mutex avoidance is *REALLY* *HUGE* if you are
processing a lot of TCP connections in parallel, due to the small quanta
of work involved.
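For reference, the cpu localization that makes the mutex avoidance
possible is just a deterministic hash of the connection identity. The
sketch below uses made-up names (tcp_select_port, tcp_ports) and a toy
hash rather than the actual dispatch code, but it shows the property
that matters: every packet of a given connection lands on the same
protocol thread, so the PCB never needs a lock and per-connection
ordering is preserved for free.

/*
 * Sketch only: steer a TCP 4-tuple to one of a fixed set of protocol
 * threads (e.g. one per cpu).  Names and hash are illustrative.
 */
#include <stdint.h>

#define NPROTO_THREADS  4               /* e.g. one tcp thread per cpu */

struct msgport;                         /* per-thread port, as above */
extern struct msgport *tcp_ports[NPROTO_THREADS];

static struct msgport *
tcp_select_port(uint32_t laddr, uint32_t faddr,
                uint16_t lport, uint16_t fport)
{
    uint32_t h;

    /*
     * Any deterministic mix of the 4-tuple works; because the result
     * never changes for a connection, all of its packets are handled
     * by the same thread, on the same cpu, with no PCB locking.
     */
    h = laddr ^ faddr ^ ((uint32_t)lport << 16) ^ fport;
    h ^= h >> 16;

    return (tcp_ports[h % NPROTO_THREADS]);
}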
In any case, if you are seriously considering any sort of distributed
methodology you should also consider formalizing a message passing
API for FreeBSD. Even if you don't like our LWKT messaging API, I
think you would love the DFly IPI messaging subsystem and it would be
very easy to port as a first step. We use it so much now in DFly
that I don't think I could live without it. E.g. for clock distribution,
interrupt distribution, thread/cpu isolation, wakeup(), MP-safe messaging
at higher levels (and hence packet routing), free()-return-to-
originating-cpu (mutexless slab allocator), SMP MMU synchronization
(the basic VM/pte-race issue with userland brought up by Alan Cox),
basic scheduler operations, signal(), and the list goes on and on.
In DFly, IPI messaging and message processing are required to be MP
safe (they always occur outside the BGL, like a cpu-localized fast
interrupt), but a critical section still holds off local reception
processing, so code that uses it can be made very clean.
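As an illustration, here is a rough sketch of the
free()-return-to-originating-cpu case mentioned above. The call shape
follows the LWKT IPI style (lwkt_send_ipiq() plus a per-cpu globaldata
pointer) but is simplified from memory, so treat it as an approximation
rather than the exact kernel interface.

/*
 * Sketch only: a chunk freed on the wrong cpu is handed back to the
 * cpu that owns its slab via an IPI message, so the allocator's free
 * lists stay strictly per-cpu and mutexless.
 */
struct globaldata;                      /* per-cpu data, incomplete here */
typedef void (*ipifunc_t)(void *arg);

extern struct globaldata *mycpu;        /* current cpu (illustrative) */
extern void lwkt_send_ipiq(struct globaldata *target,
                           ipifunc_t func, void *arg);

/*
 * Runs as an IPI handler on the owning cpu, MP safe and outside the
 * BGL; the owning cpu holds off reception with a critical section
 * while it manipulates its own free lists.
 */
static void
free_remote(void *ptr)
{
    /* ... return ptr to the local, per-cpu free list: no locks ... */
}

void
slab_free(void *ptr, struct globaldata *owner)
{
    if (owner == mycpu) {
        free_remote(ptr);               /* fast local path */
    } else {
        /* wrong cpu: ship the chunk home via an IPI message */
        lwkt_send_ipiq(owner, free_remote, ptr);
    }
}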
-Matt
:- They enable net.isr.enable by default, which provides inbound packet
:...
: consider at least some aspects of Jeffrey Hsu's work on DragonFly
: to explore providing for multiple netisr's bound to CPUs, then directing
: traffic based on protocol aware hashing that permits us to maintain
: sufficient ordering to meet higher level protocol requirements while
: avoiding the cost of maintaining full ordering. This isn't something we
: have to do immediately, but exploiting parallelism requires both
: effective synchronization and effective balancing of load.
:
: In the short term, I'm less interested in the avoidance of
: synchronization of data adopted in the DragonFly approach, since I'd
: like to see that approach validated on a larger chunk of the stack
: (i.e., across the more incestuous pieces of the network stack), and also
:...
: benefits (such as a very strong assertion model). However, as aspects
: of the DFBSD approach are validated (or not, as the case may be), we
: should consider adopting things as they make sense. The approaches
: offer quite a bit of promise, but are also very experimental and will
: require a lot of validation, needless to say. I've done a little bit of
: work to start applying the load distribution approach on FreeBSD, but
: need to work more on the netisr infrastructure before I'll be able to
: evaluate its effectiveness there.