a proposed callout API
Matthew Dillon
dillon at apollo.backplane.com
Fri Dec 1 01:31:09 PST 2006
:The implications of adopting the model Matt proposes are quite far-reaching:
:callouts don't exist in isolation, but occur in the context of data structures
:and work occurring in many threads. If callouts are pinned to a particular
:...
:Consider the case of TCP timers: a number of TCP timers get regularly
:rescheduled (delack, retransmit, etc). If they can only be manipulated from
:cpu0 (i.e., protected by a synchronization primitive that can't be acquired
:from another CPU -- i.e., critical sections instead of mutexes), how do you
:handle the case where a TCP packet for that connection is processed on
:cpu1 and needs to change the scheduling of the timer? In a strict work/data
:structure pinning model, you would pin the TCP connection to cpu0, and only
:process any data leading to timer changes on that CPU. Alternatively, you
:might pass a message from cpu1 to cpu0 to change the scheduling.
Yes, this is all very true. One could think of this in a more abstract
way if that would make things more clear: All the work processing
related to a particular TCP connection is accumulated into a single
'hopper'. The hopper is what is being serialized with a mutex, or by
cpu-locality, or even simply by thread-locality (dedicating a single
thread to process a single hopper). This means that all the work
that has accumulated in the hopper can be processed while holding a
single serializer instead of having to acquire and release a serializer
for each work item within the hopper.
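As a rough illustration of the hopper idea, here is a minimal userland sketch in C. All of the names (struct hopper, hopper_enqueue, hopper_drain, count_work) are hypothetical, and a pthread mutex stands in for whatever serializer is actually used. The only point is that draining takes the serializer once for the whole batch instead of once per work item.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical types for illustration; not an existing kernel API. */
struct work_item {
    struct work_item *next;
    void (*fn)(struct work_item *);
};

struct hopper {
    pthread_mutex_t serializer;     /* guards the queue and the hopper's state */
    struct work_item *head;
    struct work_item **tailp;
    unsigned long drain_locks;      /* serializer acquisitions made by drains */
};

void hopper_init(struct hopper *h)
{
    pthread_mutex_init(&h->serializer, NULL);
    h->head = NULL;
    h->tailp = &h->head;
    h->drain_locks = 0;
}

/* Append one work item.  Must not be called from a work function. */
void hopper_enqueue(struct hopper *h, struct work_item *w)
{
    w->next = NULL;
    pthread_mutex_lock(&h->serializer);
    *h->tailp = w;
    h->tailp = &w->next;
    pthread_mutex_unlock(&h->serializer);
}

/* Run every accumulated item under a single serializer acquisition. */
int hopper_drain(struct hopper *h)
{
    int n = 0;

    pthread_mutex_lock(&h->serializer);
    h->drain_locks++;
    while (h->head != NULL) {
        struct work_item *w = h->head;

        h->head = w->next;
        if (h->head == NULL)
            h->tailp = &h->head;
        w->fn(w);                   /* runs with the serializer held */
        n++;
    }
    pthread_mutex_unlock(&h->serializer);
    return n;
}

/* Example work item that just bumps a counter. */
struct counted_item {
    struct work_item w;             /* must be first so the cast below is valid */
    int *counter;
};

void count_work(struct work_item *w)
{
    struct counted_item *c = (struct counted_item *)w;

    (*c->counter)++;
}
```

A real implementation would probably queue incoming work through a separate, cheaper mechanism so that producers do not wait on a long drain. The sketch reuses the one mutex for brevity.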
That's the gist of it. If you have enough hoppers, statistics takes
care of the rest. There is nothing that says the hoppers have to
be pinned to particular cpus, it just makes it easier for other
system APIs if they are.
For FreeBSD, I think the hopper abstraction might be the way to
go. You could then have worker threads running on each cpu (one per
cpu) which compete for hoppers with pending work. You can avoid
wiring the hoppers to particular cpus (which FreeBSD people seem to
dislike considerably) yet still reap the benefits of batch processing.
TCP callout timers are a really good example here, because TCP callout
timers are ONLY ever manipulated from within the TCP protocol stack,
which means they are only manipulated in the context of a TCP work
item (either a packet, or a timeout, or user requested work). If you
think about it, nearly *all* the manipulation of the TCP callout timers
occurs during work item processing where you already hold the governing
serializer. That is the manipulation that needs to become optimal here.
So the question for callouts then becomes.... can the serializer used
for the work item processing be the SAME serializer that the callout
API uses to control access to the callout structures?
In the DragonFly model the answer is: yes, easy, because the serializer
is cpu-localized.
In FreeBSD the same thing could be accomplished by implementing a
callout wheel for each 'hopper', controlled by the same serializer.
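To make the per-hopper callout wheel a little more concrete, here is a small sketch, assuming the caller already holds the hopper's serializer whenever it touches the wheel. The names (struct hopper_wheel, wheel_callout_reset, wheel_callout_stop, WHEEL_SIZE) are made up for illustration and are not the FreeBSD or DragonFly callout API. It uses the TAILQ macros from <sys/queue.h>. Note that the wheel functions take no lock of their own.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

#define WHEEL_SIZE 256                  /* hypothetical size, power of two */
#define WHEEL_MASK (WHEEL_SIZE - 1)

struct callout {
    TAILQ_ENTRY(callout) link;
    unsigned int c_time;                /* absolute expiry tick */
    int pending;                        /* nonzero while queued on the wheel */
    void (*func)(void *);
    void *arg;
};

TAILQ_HEAD(callout_list, callout);

/* One wheel per hopper, protected by that hopper's serializer. */
struct hopper_wheel {
    struct callout_list slots[WHEEL_SIZE];
    unsigned int curticks;
};

void wheel_init(struct hopper_wheel *hw)
{
    for (int i = 0; i < WHEEL_SIZE; i++)
        TAILQ_INIT(&hw->slots[i]);
    hw->curticks = 0;
}

/* Caller must hold the hopper serializer.  One TAILQ removal at most. */
void wheel_callout_stop(struct hopper_wheel *hw, struct callout *c)
{
    if (c->pending) {
        TAILQ_REMOVE(&hw->slots[c->c_time & WHEEL_MASK], c, link);
        c->pending = 0;
    }
}

/* Caller must hold the hopper serializer.  One removal plus one insertion. */
void wheel_callout_reset(struct hopper_wheel *hw, struct callout *c,
                         unsigned int ticks, void (*func)(void *), void *arg)
{
    wheel_callout_stop(hw, c);
    c->c_time = hw->curticks + ticks;
    c->func = func;
    c->arg = arg;
    c->pending = 1;
    TAILQ_INSERT_TAIL(&hw->slots[c->c_time & WHEEL_MASK], c, link);
}
```

Since TCP work item processing would already be running under the hopper's serializer, rescheduling a timer from inside the stack costs only the list manipulation.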
The only real performance issue is how to handle work item events
caused by userland read() or write()'s.... do you have those operations
send a message to the thread managing the hopper? Or do you have
those operations obtain the hopper's serializer and enter the TCP
stack directly? For FreeBSD I would guess the latter... obtain the
hopper's serializer and enter the TCP stack directly. But if you
were to implement it you could actually do it both ways and have a
sysctl to select which method to use, then look at how that affects
performance.
The other main entry point for packets into the TCP stack is from the
network interface. The network interface layer is typically
interrupt driven, and just as typically it is not (in my opinion) the
best idea to try to call the TCP protocol stack from the network
interrupt as it seriously elongates the code path and enlarges the
cache fingerprint required to run through a network interface's
RX ring. The RX ring is likely to contain dozens or even a hundred
or more packets bound for a smaller (but still significant) number of
TCP connections.
Breaking up that processing into two separate loops... getting the
packets off the RX ring and placing them in the correct hopper, and
processing the hopper's work queue, would yield a far better cache
footprint. Again, my opinion.
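The two-loop split might look roughly like the following sketch. Everything here is an assumption made for illustration: the names (struct pkt, struct pkt_hopper, rx_ring_sort, hopper_process), the fixed NHOPPERS, and the use of a simple modulus on a connection id to pick the hopper. Real code would look up the connection and would run the TCP input path where the comment says so.

```c
#include <assert.h>
#include <stddef.h>

#define NHOPPERS 4                  /* hypothetical number of hoppers */

struct pkt {
    struct pkt *next;
    unsigned int conn_id;           /* stand-in for the connection lookup key */
};

struct pkt_hopper {
    struct pkt *head;
    struct pkt **tailp;
    unsigned int processed;         /* total packets handled so far */
};

void pkt_hoppers_init(struct pkt_hopper *hoppers)
{
    for (int i = 0; i < NHOPPERS; i++) {
        hoppers[i].head = NULL;
        hoppers[i].tailp = &hoppers[i].head;
        hoppers[i].processed = 0;
    }
}

/* Loop 1: pull packets off the ring and queue each on its hopper. */
void rx_ring_sort(struct pkt *ring, size_t n, struct pkt_hopper *hoppers)
{
    for (size_t i = 0; i < n; i++) {
        struct pkt *p = &ring[i];
        struct pkt_hopper *h = &hoppers[p->conn_id % NHOPPERS];

        p->next = NULL;
        *h->tailp = p;
        h->tailp = &p->next;
    }
}

/* Loop 2: process one hopper's queue in a tight loop.  Returns the count. */
unsigned int hopper_process(struct pkt_hopper *h)
{
    unsigned int n = 0;
    struct pkt *p;

    while ((p = h->head) != NULL) {
        h->head = p->next;
        /* protocol processing for p would happen here */
        n++;
    }
    h->tailp = &h->head;
    h->processed += n;
    return n;
}
```

The first loop touches only the ring and the queue links, and the second loop stays inside the protocol code for one hopper at a time.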
--
In any case, these methodologies basically exist in order to remove
the need to acquire a serializer that is so fine-grained that the
overhead of the serializer becomes a serious component of the overhead
of the work being serialized. That is *certainly* the case for
the callout API. Sans serializer, the callout API is basically one or
two TAILQ manipulations and that is it. You can't get much faster
than that. I don't think it is appropriate to try to abstract-away
the serializer when the serializer becomes such a large component.
That's like hiding something you don't like under your bed.
--
Something just came to my attention... are you guys actually using
high 'hz' values to govern your current callout API? In particular,
the c->c_time field? If that is the case the size of your callwheel
array may be insufficient to hold even short timeouts without wrapping.
That could create *serious* performance problems with the callwheel
design. And I do mean serious. The entire purpose of having the
callwheel is to support the notion that most timeouts will be removed
or reset before they actually occur, meaning before the iterator
(softclock_handler() in kern_timeout.c) gets to the index. If you
wrap, the iterator may wind up having to skip literally thousands or
hundreds of thousands of callout structures during its scan.
So, e.g. a typical callwheel is sized to 16384 or 32768 entries
('print callwheelsize' from kgdb on a live kernel). At 100hz
32768 entries gives us 327 seconds of range before callout entries
start to wrap. At 1000hz 32768 entries barely gives you 32 seconds
of range. All TCP timers except the idle timer are fairly short
lived. The idle timer could be an issue for you. In fact, it could
be an issue for us too... that's something I will have a look at in
DragonFly.
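The range arithmetic above is easy to check with a couple of helper functions. This is only a sketch; callwheel_range_secs and callout_wraps are hypothetical helpers, not functions in kern_timeout.c.

```c
#include <assert.h>

/* Whole seconds of timeout range a callwheel covers before entries wrap. */
unsigned int callwheel_range_secs(unsigned int callwheelsize, unsigned int hz)
{
    assert(hz > 0);
    return callwheelsize / hz;
}

/*
 * Nonzero if a timeout of 'secs' seconds is at least one full turn of the
 * wheel away, so the iterator would pass over the entry before it expires.
 */
int callout_wraps(unsigned int callwheelsize, unsigned int hz,
                  unsigned int secs)
{
    return (unsigned long long)secs * hz >= callwheelsize;
}
```

With 32768 entries, callwheel_range_secs() gives 327 at 100hz and 32 at 1000hz, matching the figures above.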
You could also be hitting another big problem by using too fine-grained a
timer/timeout resolution, and that is destroying the natural aggregation
of work that occurs with coarse resolutions. It doesn't make much sense
to have a handful of callouts at 10ms, 11ms and 12ms for example.
It would be better to have them all in one slot (like at 12ms) so they
can all be processed in batch.
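One simple way to get that aggregation back, sketched here under the assumption that timeouts may fire slightly late but never early, is to round each timeout up to a coarser granule. The function name and the granule parameter are hypothetical.

```c
#include <assert.h>

/* Round a timeout up to the next multiple of granule_ms. */
unsigned int coalesce_timeout(unsigned int timeout_ms, unsigned int granule_ms)
{
    assert(granule_ms > 0);
    return (timeout_ms + granule_ms - 1) / granule_ms * granule_ms;
}
```

With a 4ms granule, timeouts of 10ms, 11ms and 12ms all land at 12ms and can be processed together.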
This is particularly true for anything that can be processed with a
tight code loop, and the TCP protocol stack certainly applies there.
I think Jeffrey Hsu actually counted instruction cycles for TCP
processing through the short-cut tests (the optimal/critical path
when incoming data packets are in-order and non-overlapping and such),
and once he fixed some of the conditionals the number of instructions
required to process a packet had been reduced dramatically and certainly
fit in the L1 cache.
Something to think about, anyhow. I'll read the paper you referenced.
It looks interesting.
-Matt
Matthew Dillon
<dillon at backplane.com>