Network Stack Locking
Matthew Dillon
dillon at apollo.backplane.com
Thu May 20 18:03:27 PDT 2004
It's my guess that we will be able to remove the BGL from large
portions of the DFly network stack sometime in late June or early July,
after USENIX, at which point it will be possible to test the SMP aspects of
the localized cpu distribution method. Right now the network stack is
still under the BGL (as is most of the system; our approach to MP is
first to isolate and localize the conflicting subsystems, then to release
the BGL for that subsystem's thread(s)).
It should be noted that the biggest advantages of the distributed
approach are (1) the ability to operate on individual PCBs without
having to do any token/mutex/other locking at all, (2) cpu locality
of reference in regards to cache mastership of the PCBs and related data,
and (3) avoidance of data cache pollution across cpus (more cpus ==
better utilization of individual L1/L2 caches and far greater
scalability). The biggest disadvantage is the mandatory thread switch
(but this is mitigated as load increases, since each thread can work on
several PCBs without further switches, and because our thread scheduler
is extremely lightweight under SMP conditions). Message passing
overhead is very low since most operations already require some sort of
roll-up structure to be passed (e.g. an mbuf in the case of the network).
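To make the message-cost point concrete, here is a minimal sketch of the
idea. The types and names (pkt_msg, msgport, msgport_send) are made up
for illustration and are not the real DFly netmsg/mbuf structures; the
point is just that the buffer which must be handed to the protocol
thread anyway doubles as the message, so the send path does no extra
allocation and the queue is owned by exactly one thread.

/*
 * Sketch only: the packet buffer itself carries the message linkage
 * and handler, so queueing it to the owning protocol thread costs a
 * few pointer stores and no locks.
 */
#include <stddef.h>

struct pkt_msg {
    struct pkt_msg *next;               /* intrusive queue linkage */
    void (*handler)(struct pkt_msg *);  /* run by the target thread */
    /* ... packet data follows (think: mbuf) ... */
};

struct msgport {
    struct pkt_msg *head, *tail;        /* owned by exactly one thread */
};

/*
 * Queue a packet to the owning thread's port.  This sketch assumes the
 * caller already runs on the owning cpu; cross-cpu sends in the real
 * stack go through the IPI path described later in this mail, so no
 * mutex is needed there either.
 */
static void
msgport_send(struct msgport *port, struct pkt_msg *msg)
{
    msg->next = NULL;
    if (port->tail != NULL)
        port->tail->next = msg;
    else
        port->head = msg;
    port->tail = msg;
    /* a real implementation would wake the target thread here */
}

/* Executed by the owning protocol thread: drain and process. */
static void
msgport_run(struct msgport *port)
{
    struct pkt_msg *msg;

    while ((msg = port->head) != NULL) {
        port->head = msg->next;
        if (port->head == NULL)
            port->tail = NULL;
        msg->handler(msg);      /* touches the PCB with no locks held */
    }
}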
We are running the full-bore threaded, distributed network stack even
on UP systems now (meaning: message passing and thread switching still
occur even though there is only one target thread for a particular
protocol). We have done fairly significant testing on GigE LANs and
have not noticed any degradation in network performance, so we are
certain we are on the right track.
I do not expect cpu balancing to be all that big an issue, actually,
especially given the typically short-lived connections in these
scenarios. But mutex avoidance is *REALLY* *HUGE* if you are
processing a lot of TCP connections in parallel, due to the small quanta
of work involved.
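For reference, the cpu localization that makes the mutex avoidance
possible is just a deterministic hash of the connection identity. The
sketch below uses made-up names (tcp_select_port, tcp_ports) and a toy
hash rather than the actual dispatch code, but it shows the property
that matters: every packet of a given connection lands on the same
protocol thread, so the PCB never needs a lock and per-connection
ordering is preserved for free.

/*
 * Sketch only: steer a TCP 4-tuple to one of a fixed set of protocol
 * threads (e.g. one per cpu).  Names and hash are illustrative.
 */
#include <stdint.h>

#define NPROTO_THREADS  4               /* e.g. one tcp thread per cpu */

struct msgport;                         /* per-thread port, as above */
extern struct msgport *tcp_ports[NPROTO_THREADS];

static struct msgport *
tcp_select_port(uint32_t laddr, uint32_t faddr,
                uint16_t lport, uint16_t fport)
{
    uint32_t h;

    /*
     * Any deterministic mix of the 4-tuple works; because the result
     * never changes for a connection, all of its packets are handled
     * by the same thread, on the same cpu, with no PCB locking.
     */
    h = laddr ^ faddr ^ ((uint32_t)lport << 16) ^ fport;
    h ^= h >> 16;

    return (tcp_ports[h % NPROTO_THREADS]);
}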
In any case, if you are seriously considering any sort of distributed
methodology you should also consider formalizing a message passing
API for FreeBSD. Even if you don't like our LWKT messaging API, I
think you would love the DFly IPI messaging subsystem and it would be
very easy to port as a first step. We use it so much now in DFly
that I don't think I could live without it. E.g. for clock distribution,
interrupt distribution, thread/cpu isolation, wakeup(), MP-safe messaging
at higher levels (and hence packet routing), free()-return-to-
originating-cpu (mutexless slab allocator), SMP MMU synchronization
(the basic VM/pte-race issue with userland brought up by Alan Cox),
basic scheduler operations, signal(), and the list goes on and on.
In DFly, IPI messaging and message processing are required to be MP
safe (they always occur outside the BGL, like a cpu-localized fast
interrupt), but a critical section still holds off local reception
processing, so code that uses it can be made very clean.
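As an illustration, here is a rough sketch of the
free()-return-to-originating-cpu case mentioned above. The call shape
follows the LWKT IPI style (lwkt_send_ipiq() plus a per-cpu globaldata
pointer) but is simplified from memory, so treat it as an approximation
rather than the exact kernel interface.

/*
 * Sketch only: a chunk freed on the wrong cpu is handed back to the
 * cpu that owns its slab via an IPI message, so the allocator's free
 * lists stay strictly per-cpu and mutexless.
 */
struct globaldata;                      /* per-cpu data, incomplete here */
typedef void (*ipifunc_t)(void *arg);

extern struct globaldata *mycpu;        /* current cpu (illustrative) */
extern void lwkt_send_ipiq(struct globaldata *target,
                           ipifunc_t func, void *arg);

/*
 * Runs as an IPI handler on the owning cpu, MP safe and outside the
 * BGL; the owning cpu holds off reception with a critical section
 * while it manipulates its own free lists.
 */
static void
free_remote(void *ptr)
{
    /* ... return ptr to the local, per-cpu free list: no locks ... */
}

void
slab_free(void *ptr, struct globaldata *owner)
{
    if (owner == mycpu) {
        free_remote(ptr);               /* fast local path */
    } else {
        /* wrong cpu: ship the chunk home via an IPI message */
        lwkt_send_ipiq(owner, free_remote, ptr);
    }
}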
-Matt
:- They enable net.isr.enable by default, which provides inbound packet
:...
: consider at least some aspects of Jeffrey Hsu's work on DragonFly
: to explore providing for multiple netisr's bound to CPUs, then directing
: traffic based on protocol aware hashing that permits us to maintain
: sufficient ordering to meet higher level protocol requirements while
: avoiding the cost of maintaining full ordering. This isn't something we
: have to do immediately, but exploiting parallelism requires both
: effective synchronization and effective balancing of load.
:
: In the short term, I'm less interested in the avoidance of
: synchronization of data adopted in the DragonFly approach, since I'd
: like to see that approach validated on a larger chunk of the stack
: (i.e., across the more incestuous pieces of the network stack), and also
:...
: benefits (such as a very strong assertion model). However, as aspects
: of the DFBSD approach are validated (or not, as the case may be), we
: should consider adopting things as they make sense. The approaches
: offer quite a bit of promise, but are also very experimental and will
: require a lot of validation, needless to say. I've done a little bit of
: work to start applying the load distribution approach on FreeBSD, but
: need to work more on the netisr infrastructure before I'll be able to
: evaluate its effectiveness there.