cvs commit: src/sys/dev/bge if_bge.c
Robert Watson
rwatson at FreeBSD.org
Sun Dec 24 01:01:10 PST 2006
On Sun, 24 Dec 2006, Scott Long wrote:
>> I try this experiment every few years, and generally don't measure much
>> improvement. I'll try it again with 10gbps early next year once back in
>> the office again. The more interesting transition is between the link
>> layer and the network layer, which is high on my list of topics to look
>> into in the next few weeks. In particular, reworking the ifqueue handoff.
>> The tricky bit is balancing latency, overhead, and concurrency...
>>
>> FYI, there are several sets of patches floating around to modify if_em to
>> hand off queues of packets to the link layer, etc. They probably need
>> updating, of course, since if_em has changed quite a bit in the last year.
>> In my implementation, I add a new input routine that accepts mbuf packet
>> queues.
>
> Have you tested this with more than just your simple netblast and netperf
> tests? Have you measured CPU usage during your tests? With 10Gb coming,
> pipelined processing of RX packets is becoming an interesting topic for all
> OSes from a number of companies. I understand your feeling about the
> bottleneck being higher up than at just if_input. We'll see how this holds
> up.
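For reference, the rough shape of the queue handoff I have in mind is
below.  This is a minimal sketch with a hypothetical function name, not
the actual patches; it simply unrolls the queue into the existing
per-packet if_input path, whereas the interesting work is in what the
link layer then does with the whole queue.

    /*
     * Hypothetical sketch: a link-layer input routine that accepts a
     * queue of packets linked via m_nextpkt, so the driver pays the
     * handoff cost once per RX interrupt rather than once per packet.
     */
    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <net/if.h>
    #include <net/if_var.h>

    void
    ether_input_queue(struct ifnet *ifp, struct mbuf *mq)
    {
            struct mbuf *m;

            while ((m = mq) != NULL) {
                    mq = m->m_nextpkt;
                    m->m_nextpkt = NULL;
                    /* Fall back to the existing per-packet handoff. */
                    (*ifp->if_input)(ifp, m);
            }
    }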
In my previous test runs, I was generally testing three scenarios:
(1) Local sink - sinking small and large packet sizes to a single socket at a
high rate.
(2) Local source - sourcing small and large packet sizes via a single socket
at a high rate.
(3) IP forwarding - both unidirectional and bidirectional packet streams
across an IP forwarding host with small and large packet sizes.
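To make those concrete, the "local sink" case is essentially the
following, minus instrumentation -- a minimal userland sketch rather
than the actual netblast/netperf configuration, with an arbitrary port
number:

    /* Drain small UDP packets from a single socket as fast as possible. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            struct sockaddr_in sin;
            char buf[2048];
            unsigned long long packets = 0;
            int s;

            if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
                    perror("socket");
                    return (1);
            }
            memset(&sin, 0, sizeof(sin));
            sin.sin_family = AF_INET;
            sin.sin_addr.s_addr = htonl(INADDR_ANY);
            sin.sin_port = htons(9999);             /* arbitrary test port */
            if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
                    perror("bind");
                    return (1);
            }
            for (;;) {
                    if (recv(s, buf, sizeof(buf), 0) < 0) {
                            perror("recv");
                            break;
                    }
                    if (++packets % 1000000 == 0)
                            printf("%llu packets\n", packets);
            }
            close(s);
            return (0);
    }

The local source case is the same loop with sendto() to a fixed
destination instead of recv().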
From the perspective of optimizing these particular paths, small packet sizes
best reveal processing overhead up to about the TCP/socket buffer layer on
modern hardware (DMA, etc). The uni/bidirectional axis is interesting because
it helps reveal the impact of the direct dispatch vs. netisr dispatch choice
for the IP layer with respect to exercising parallelism. I didn't explicitly
measure CPU usage, but since the configurations max out the CPUs in my test bed,
any significant CPU reduction typically shows up as a measurable improvement in
throughput. For example, I was easily able to measure the CPU reduction when
switching from the socket reference to the file descriptor reference in
sosend() on small packet transmit, which was a relatively minor functional
change in locking and reference counting.
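For the curious, the shape of that change was roughly the following.
This is a sketch using the general FreeBSD pattern rather than the
actual diff, and the helper names may not match the commit: instead of
taking and releasing an extra socket reference around each sosend()
call, the send path relies on the struct file reference it already
holds for the descriptor.

    /* Old shape: pin the socket itself for the duration of the call. */
    error = fgetsock(td, s, &so, NULL);     /* soref() under the socket lock */
    if (error == 0) {
            error = sosend(so, NULL, &auio, NULL, NULL, flags, td);
            fputsock(so);                   /* sorele() under the socket lock */
    }

    /* New shape: the held struct file already keeps the socket alive. */
    error = getsock(td->td_proc->p_fd, s, &fp, NULL);
    if (error == 0) {
            so = fp->f_data;
            error = sosend(so, NULL, &auio, NULL, NULL, flags, td);
            fdrop(fp, td);                  /* just drop the file reference */
    }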
I have tentative plans to explicitly measure cycle counts between context
switches and during dispatches, but have not yet implemented that in the new
setup. I expect to have a chance to set up these new test runs and get back
into experimenting with the dispatch model between the device driver, link
layer, and network layer sometime in mid-January. As the test runs are very
time-consuming, I'd welcome suggestions on the testing before, rather than
after, I run them. :-)
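On the cycle counts: the measurement itself is simple -- read the time
stamp counter (or something like get_cyclecount() in the kernel) on
either side of the code of interest.  A minimal userland illustration
of the idea, x86-only and ignoring serialization and CPU migration:

    #include <stdio.h>
    #include <stdint.h>

    static inline uint64_t
    rdtsc(void)
    {
            uint32_t lo, hi;

            __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
            return (((uint64_t)hi << 32) | lo);
    }

    int
    main(void)
    {
            uint64_t before, after;
            volatile long i, sink = 0;

            before = rdtsc();
            for (i = 0; i < 1000000; i++)   /* stand-in for the dispatch path */
                    sink += i;
            after = rdtsc();
            printf("%llu cycles\n", (unsigned long long)(after - before));
            return (0);
    }

The harder part is deciding where to draw the boundaries between device
driver, link layer, and network layer dispatch, which is what the
mid-January runs are meant to explore.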
Robert N M Watson
Computer Laboratory
University of Cambridge