Network stack changes

Sun Sep 22 19:59:44 UTC 2013

On 29.08.2013 02:24, Andre Oppermann wrote:
> On 28.08.2013 20:30, Alexander V. Chernikov wrote:
>> Hello list!
>
> Hello Alexander,
Hello Andre!
I'm very sorry to answer so late.
>
> you sent quite a few things in the same email.  I'll try to respond
> as much as I can right now.  Later you should split it up to have
> more in-depth discussions on the individual parts.
>
> If you could make it to the EuroBSDcon 2013 DevSummit that would be
> even more awesome.  Most of the active network stack people will be
> there too.
I've sent a presentation describing nearly the same things to devsummit@,
so I hope this can be discussed in the Networking group.
I hope to attend DevSummit & EuroBSDcon.
>
>> There are a lot of constantly recurring discussions related to networking
>> stack performance/changes.
>>
>> I'll try to summarize current problems and possible solutions from my 
>> point of view.
>> (Generally this is one problem: stack is 
>> slooooooooooooooooooooooooooow, but we need to know why and
>> what to do).
>
> Compared to others it's not thaaaaaaat slow. ;)
>
>> Let's start with current IPv4 packet flow on a typical router:
>> http://static.ipfw.ru/images/freebsd_ipv4_flow.png
>>
>> (I'm sorry I can't provide this as text since Visio doesn't have any
>> 'ascii-art' exporter).
>>
>> Note that we are using a process-to-completion model, i.e. we process any
>> packet in the ISR until it is either
>> consumed by the L4+ stack, dropped, or put on the egress NIC queue.
>>
>> (There is also deferred ISR model implemented inside netisr but it 
>> does not change much:
>> it can help to do more fine-grained hashing (for GRE or other similar 
>> traffic), but
>> 1) it uses per-packet mutex locking which kills all performance
>> 2) it currently does not have _any_ hashing functions (see absence of 
>> flags in `netstat -Q`)
>> People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or
>> a modified PPPoE/GRE version)
>> report some gain, but without fixing (1) it can't help much
>> )
>>
>> So, let's start:
>>
>> 1) Ixgbe uses a mutex to protect each RX ring, which is perfectly fine
>> since there is nearly no contention
>> (the only thing that can happen is driver reconfiguration, which is
>> rare and, more significantly, we do this once
>> for the batch of packets received in a given interrupt). However, due
>> to some (im)possible deadlocks the current code
>> does per-packet ring unlock/lock (see ixgbe_rx_input()).
>> There was a discussion ended with nothing:
>> http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html
>>
>> 1*) Possible BPF users. Here we have one rlock if there are any 
>> readers present
>> (and mutex for any matching packets, but this is more or less OK. 
>> Additionally, there is WIP to
>> implement multiqueue BPF
>> and there is chance that we can reduce lock contention there).
>
> Rlock to rmlock?
Yes, probably.
>
>> There is also an "optimize_writers" hack permitting applications
>> like CDP to use BPF as writers but not registering them as receivers 
>> (which implies rlock)
>
> I believe longer term we should solve this with a protocol type "ethernet"
> so that one can send/receive ethernet frames through a normal socket.
Yes. AF_LINK or any similar.
>
>> 2/3) Virtual interfaces (laggs/vlans over lagg and other similar
>> constructions).
>> Currently we simply use an rlock to make s/ix0/lagg0/ and, what is even
>> funnier, we use a complex
>> vlan_hash with another rlock to
>> get the vlan interface from the underlying one.
>>
>> This is definitely not how things should be done, and this can be
>> changed more or less easily.
>
> Indeed.
>
>> There are some useful terms/techniques in world of software/hardware 
>> routing: they have clear
>> 'control plane' and 'data plane' separation.
>> The former deals with control traffic (IGP, MLD, IGMP snooping,
>> lagg hellos, ARP/NDP, etc.) and
>> some data traffic (packets with TTL=1, with options, destined to
>> hosts without ARP/NDP record, and
>> similar). The latter is done in hardware (or an efficient software
>> implementation).
>> Control plane is responsible to provide data for efficient data plane 
>> operations. This is the point
>> we are missing nearly everywhere.
>
> ACK.
>
>> What I want to say is: lagg is pure control-plane stuff and vlan is 
>> nearly the same. We can't apply
>> this approach to complex cases like 
>> lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0)
>> but we definitely can do this for most common setups like (igb* or 
>> ix* in lagg with or without vlans
>> on top of lagg).
>
> ACK.
>
>> We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can 
>> add some more. We even have
>> per-driver hooks to program HW filtering.
>
> We could.  Though for vlan it looks like it would be easier to remove the
> hardware vlan tag stripping and insertion.  It only adds complexity in
> all drivers for no gain.
No. Actually, as far as I understand, it helps the driver to perform TSO.
Anyway, IMO we should use HW capabilities if we can.
(This probably does not add much speed on 1G, but on 10/20/40G it can
help much more.)
>
>> One small step to do is to throw packet to vlan interface directly 
>> (P1), proof-of-concept(working in
>> production):
>> http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html
>>
>> Another is to change lagg packet accounting:
>> http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
>> Again, this is more like HW boxes do (aggregate all counters 
>> including errors) (and I can't imagine
>> what real error we can get from _lagg_).
>
>> 4) If we are a router, we can do either the slooow ip_input() ->
>> ip_forward() -> ip_output() cycle or use the
>> optimized ip_fastfwd(), which falls back to the 'slow' path for
>> multicast/options/local traffic (i.e. it
>> works exactly like the 'data plane' part).
>> (Btw, we can consider net.inet.ip.fastforwarding to be turned on by 
>> default at least for non-IPSEC
>> kernels)
>
> ACK.
>
>> Here we have to determine whether this is a local packet or not, i.e.
>> F(dst_ip) returning 1 or 0. Currently
>> we are simply using a standard rlock + hash of iface addresses.
>> (And some consumers like ipfw(4) do the same, but without lock).
>> We don't need to do this! We can build sorted array of IPv4 addresses 
>> or other efficient structure
>> on every address change and use it unlocked with delayed garbage 
>> collection (proof-of-concept attached)
>
> I'm a bit uneasy with unlocked access.  On very weakly ordered
> architectures this could trip over cache coherency issues.  A rmlock is
> essentially for free in the read case.
Well, I'm talking about:
1) allocate _new_ memory (unlocked)
2) commit the _new_ copy for the given address list (rlock)
3) change the pointer. As far as I understand, readers can see either the old or the new value
4) use delayed GC (how long should we wait before deletion?)

Anyway, protecting the (optimized) list with an rmlock would do.
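
A minimal userland sketch of steps 1-3 above, in portable C11 rather than kernel code. The names localip_tbl, localip_commit() and localip_match() are hypothetical and do not come from the attached patch; addresses are assumed to be in host byte order, and the grace period of step 4 is left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical immutable snapshot of the local IPv4 addresses. */
struct localip_tbl {
	size_t   count;
	uint32_t addrs[];	/* sorted ascending, host byte order */
};

/* Readers load this pointer; a writer replaces the whole table. */
static _Atomic(struct localip_tbl *) localip_cur;

static int
cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;

	return ((x > y) - (x < y));
}

/*
 * Steps 1-3: allocate a new table, fill and sort it, then publish it.
 * The previous table is handed back through oldp; the caller may free it
 * only after a grace period (step 4, not implemented here).
 * Returns 0 on success, -1 if allocation fails.
 */
static int
localip_commit(const uint32_t *addrs, size_t count, struct localip_tbl **oldp)
{
	struct localip_tbl *nt;

	assert(oldp != NULL);
	nt = malloc(sizeof(*nt) + count * sizeof(uint32_t));
	if (nt == NULL)
		return (-1);
	nt->count = count;
	if (count > 0) {
		memcpy(nt->addrs, addrs, count * sizeof(uint32_t));
		qsort(nt->addrs, count, sizeof(uint32_t), cmp_u32);
	}
	*oldp = atomic_exchange_explicit(&localip_cur, nt,
	    memory_order_acq_rel);
	return (0);
}

/* F(dst_ip): returns 1 if dst is a local address, 0 otherwise. No lock. */
static int
localip_match(uint32_t dst)
{
	struct localip_tbl *t;
	size_t lo, hi, mid;

	t = atomic_load_explicit(&localip_cur, memory_order_acquire);
	lo = 0;
	hi = (t != NULL) ? t->count : 0;
	while (lo < hi) {		/* plain binary search */
		mid = lo + (hi - lo) / 2;
		if (t->addrs[mid] == dst)
			return (1);
		if (t->addrs[mid] < dst)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (0);
}
```

The acquire/release pair is what addresses the weak-ordering concern: a reader that sees the new pointer also sees the fully built table. How long the old table must survive is exactly the open question in step 4.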
>
>> (There is another thing to discuss: maybe we can do this once 
>> somewhere in ip_input and mark mbuf as
>> 'local/non-local' ? )
>
> The problem is packet filters may change the destination address and thus
> can invalidate such a lookup.
Yes. So either the filter or the ip_input() routing should re-inspect the packet,
exactly as is done currently (ipfw fwd for IPv4/IPv6 code)
>
>> 5, 9) Currently we have L3 ingress/egress PFIL hooks protected by 
>> rmlocks. This is OK.
>>
>> However, 6) and 7) are not.
>> The firewall can use the same pfil lock as reader protection without
>> imposing its own lock. Currently the
>> pfil&ipfw code is ready to do this.
>
> The problem with the global pfil rmlock is the comparatively long time it
> is held in a locked state.  Also packet filters may have to acquire
> additional locks when they have to modify state tables.  Rmlocks are not
> made for that because they pin the thread to the cpu they're currently on.
> This is what Gleb is complaining about.
Yes, additional locks are the problem.
>
> My idea is to hold the pfil rmlock only for the lookup of the first/next
> packet filter that will run, not for the entire duration.  That would
> solve the problem.  However packet filters then have to use their own
> locks again, which could be rmlock too.
Well, we haven't changed anything yet :)
>
>> 8) Radix/rt* api. This is probably the worst place in entire stack. 
>> It is toooo generic, tooo slow
>> and buggy (do you use IPv6? you definitely know what I'm talking about).
>> A) It really is too generic and assumption that it can be 
>> (effectively) used for every family is
>> wrong. Two examples:
>> we don't need to look up all 128 bits of an IPv6 address. Subnets with
>> masks longer than /64 are not used widely
>> (actually the only reason to use them is p2p links, due to potential
>> ND problems).
>> One common solution is to look up 64 bits and build another trie
>> (or other structure) in case of
>> collision.
>> Another example is MPLS where we can simply do direct array lookup 
>> based on ingress label.
>
> Yes.  While we shouldn't throw it out, it should be run as RIB and
> allow a much more protocol specific FIB for the hot packet path.
>
>> B) It is terribly slow (AFAIR luigi@ did some performance measurements,
>> numbers available in one of the
>> netmap pdfs)
>
> Again not thaaaat slow but inefficient enough.
I've found the paper I was talking about:
http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf

It claims that our radix is able to do 6MPPS/core and it does not scale 
with number of cores.
>
>> C) It is not multipath-capable. Stateful (and non-working) multipath 
>> is definitely not the right way.
>
> Indeed.
>
>> 8*) rtentry
>> We are doing it wrong.
>> Currently _every_ lookup locks/unlocks given rte twice.
>> The first lock is related to an old-old story of trusting IP redirects
>> (and auto-adding host routes
>> for them). Fortunately it is currently disabled automatically when you
>> turn forwarding on.
>
> They're disabled.
>
>> The second one is much more complicated: we are assuming that rte's
>> with a non-zero refcount value can
>> stop the egress interface from being destroyed.
>> This is a wrong (but widely used) assumption.
>
> Not really.  The reason for the refcount is not the ifp reference but
> other code parts that may hold direct pointers to the rtentry and do
> direct dereferencing to access information in it.
Yes, but what information?
>
>> We can use delayed GC instead of locking for rte's and this won't 
>> break things more than they are
>> broken now (patch attached).
>
> Nope.  Delayed GC is not the way to go here.  To do away with rtentry
> locking and refcounting we have to change rtalloc(9) to return the
> information the caller wants (e.g. ifp, ia, others) and not the rtentry
> address anymore.
> So instead of rtalloc() we have rtlookup().
It depends on what we want to do next..
My idea (briefly) is to have
1) adjacency/nhops structures describing next hops with rewrite info and 
list of iface indices to do L2 multipath
2) "rtentry" to have link to array of nhops to do L3 multipath (more or 
less the same as Cisco CEF and others).

And, anyway, we still have to protect from interface departure.
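
For illustration, the layout from points 1 and 2 might look roughly like the sketch below. Every name here (struct nhop, struct fib_entry, fib_select(), the size limits) is hypothetical, and selecting by flowid modulo the number of candidates is just one simple assumption about how the choice could be made. It does nothing about interface departure.

```c
#include <assert.h>
#include <stdint.h>

#define NH_MAX_IFACES	4	/* hypothetical limits, for the sketch only */
#define NH_MAX_REWRITE	18	/* Ethernet header plus one VLAN tag */
#define FIB_MAX_NHOPS	8

/* 1) Adjacency: L2 rewrite data plus the interfaces it may leave through. */
struct nhop {
	uint8_t  rewrite[NH_MAX_REWRITE];	/* bytes to prepend */
	uint8_t  rewrite_len;
	uint8_t  nifaces;			/* L2 multipath width */
	uint16_t ifindex[NH_MAX_IFACES];
};

/* 2) Data-plane "rtentry": nothing but a set of equal-cost next hops. */
struct fib_entry {
	uint8_t            nnhops;		/* L3 multipath width */
	const struct nhop *nhops[FIB_MAX_NHOPS];
};

/*
 * Pick the next hop (L3 multipath) and then the egress interface
 * (L2 multipath) for a flow.  Different bits of the flowid are used for
 * the two choices so that they are not correlated.
 */
static const struct nhop *
fib_select(const struct fib_entry *e, uint32_t flowid, uint16_t *ifindexp)
{
	const struct nhop *nh;

	assert(e->nnhops > 0 && e->nnhops <= FIB_MAX_NHOPS);
	nh = e->nhops[flowid % e->nnhops];
	assert(nh->nifaces > 0 && nh->nifaces <= NH_MAX_IFACES);
	*ifindexp = nh->ifindex[(flowid >> 16) % nh->nifaces];
	return (nh);
}
```

Since the result depends only on the flowid, all packets of one flow take the same next hop and the same egress interface, so the choice is stateless and does not reorder a flow.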
>
>> We can't do the same for ifp structures since
>> a) virtual ones can assume some state in underlying physical NIC
>> b) physical ones just _can_ be destroyed (maybe regardless of whether the
>> user wants this or not, like an SFP
>> being unplugged from the NIC) or simply lead to a kernel crash due to SW/HW
>> inconsistency
>
> Here I actually believe we can do a GC or stable storage based approach.
> Ifp pointers are kept in too many places and properly refcounting it is
> very (too) hard.  So whenever an interface gets destroyed or disappears
> its callable function pointers are replaced with dummies returning an
> error.  The ifp in memory will stay for some time and even may be reused
> for another new interface later again (Cisco does it that way in their
> IOS).
Yes. But we are not holding any (relevant) lock while doing the actual
transmit (e.g. calling if_output after performing L2 rewrite in
ether_output), so some cores will see old pointers.
>
>> One possible solution is to implement stable refcounts based on
>> PCPU counters and apply those
>> counters to ifp, but this seems to be non-trivial.
>>
>>
>> Another rtalloc(9) problem is the fact that radix is used as both 
>> 'control plane' and 'data plane'
>> structure/api. Some users always want to put more information in rte, 
>> while others
>> want to make rte more compact. We just need _different_ structures 
>> for that.
>
> ACK.
>
>> Feature-rich, lot-of-data control plane one (to store everything we 
>> want to store, including, for
>> example, PID of process originating the route) - current radix can be 
>> modified to do this.
>> And another, address-family-dependent structure (array, trie, or
>> anything) which contains _only_ the data
>> necessary to put the packet on the wire.
>
> ACK.
>
>> 11) arpresolve. Currently (this was decoupled in 8.x) we have
>> a) ifaddr rlock
>> b) lle rlock.
>>
>> We don't need those locks.
>> We need to
>> a) make lle layer per-interface instead of global (and this can also 
>> solve multiple fibs and L2
>> mappings done in fib.0 issue)
>
> Yes!
>
>> b) use rtalloc(9)-provided lock instead of separate locking
>
> No.  Interface rmlock.
Discussable :)
>
>> c) actually, we need to rewrite this layer because
>> d) lle actually is the place to do real multipath:
>
> No, you can do multipath through more than one interface.  If lle is
> per interface that won't work and is not the right place.
>
>> briefly,
>> you have rte pointing to some special nexthop structure pointing to 
>> lle, which has the following data:
>> num_of_egress_ifaces: [ifindex1, ifindex2, ifindex3] | L2 data to 
>> prepend to header
>> Separate post will follow.
>
> This should be part of the RIB/FIB and select one of the ifp+nexthops
> to return on lookup.
Yes.
>
>> With the following, we can achieve lagg traffic distribution without
>> actually using lagg_transmit
>> and similar stuff (at least in the most common scenarios)
>
> This seems to be a rather nasty layering violation.
Not really. lagg is pure virtual stuff.
>
>> (for example, TCP output definitely can benefit from this, since we
>> can compute the flowid once per TCP
>> session and use it in every mbuf)
>
>> So. Imagine we have done all this. How can we estimate the difference?
>>
>> There was a thread, started a year ago, describing 'stock' 
>> performance and difference for various
>> modifications.
>> It is done on 8.x, however I've got similar results on recent 9.x
>>
>> http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032680.html
>>
>> Briefly:
>>
>> 2xE5645 @ Intel 82599 NIC.
>> Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no
>> FLOWTABLE, no firewall.
>> Ixia XM2 (traffic generator) <> ix0 (FreeBSD). Ixia sends 64byte IP packets
>> from vlan10 (10.100.0.64 -
>> 10.100.0.156) to destinations in vlan11 (10.100.1.128 - 
>> 10.100.1.192). Static arps are configured
>> for all destination addresses. Traffic level is slightly above or 
>> slightly below system performance.
>>
>> we start from 1.4MPPS (if we are using several routes to minimize 
>> mutex contention).
>>
>> My 'current' result for the same test, on same HW, with the following 
>> modifications:
>>
>> * 1) ixgbe per-packet ring unlock removed
>> * P1) ixgbe is modified to do direct vlan input (so 2,3 are not used)
>> * 4) separate lockless in_localip() version
>> * 6) - using existing pfil lock
>> * 7) using lockless version
>> * 8) radix converted to use rmlock instead of rlock. Delayed GC is 
>> used instead of mutexes
>> * 10) - using existing pfil lock
>> * 11) using radix lock to do arpresolve(). Not using lle rlock
>>
>> (so the rmlocks are the only locks used on data path).
>>
>> Additionally, ipstat counters are converted to PCPU (no real 
>> performance implications).
>> ixgbe does not do per-packet accounting (as in head).
>> if_vlan counters are converted to PCPU
>> lagg is converted to rmlock, per-packet accounting is removed (using 
>> stat from underlying interfaces)
>> lle hash size is bumped to 1024 instead of 32 (not applicable here, 
>> but slows things down for large
>> L2 domains)
>>
>> The result is 5.6 MPPS for single port (11 cores) and 6.5MPPS for 
>> lagg (16 cores), nearly the same
>> for HT on and 22 cores.
>
> That's quite good, but we want more. ;)
>
>> ..
>> while Intel DPDK claims 80MPPS (and 6WINDGate talks about 160 or so)
>> on the same-class hardware and
>> _userland_ forwarding.
>
> Those numbers sound a bit far out.  Maybe if the packet isn't touched
> or looked at at all in a pure netmap interface to interface bridging
> scenario.  I don't believe these numbers.
http://www.intel.com/content/dam/www/public/us/en/documents/presentation/dpdk-packet-processing-ia-overview-presentation.pdf
Luigi talks about very fast L4 lookups in his (and his colleagues') work.
Anyway, even a simple 8-8-8-8 multi-bit trie can be very fast.
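
To make the 8-8-8-8 idea concrete, here is a minimal userland sketch of such a trie for IPv4 longest-prefix match. The names mt_node, mt_insert() and mt_lookup() are hypothetical, next hops are small non-zero integers, addresses are in host byte order, and deletion and locking are left out entirely. Prefixes that do not end on an 8-bit boundary are expanded over all matching slots of their level.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One level of the trie: 256 slots indexed by one byte of the address. */
struct mt_node {
	struct mt_node *child[256];
	uint16_t        nhop[256];	/* 0 means "no route" */
	uint8_t         plen[256];	/* length of the prefix that set nhop[] */
};

/* Add prefix/len -> nhop.  Returns 0 on success, -1 if out of memory. */
static int
mt_insert(struct mt_node *root, uint32_t prefix, int len, uint16_t nhop)
{
	struct mt_node *n = root;
	unsigned base, span, idx, i;
	int level, bits;

	assert(len >= 0 && len <= 32 && nhop != 0);
	/* Descend through every level the prefix covers completely. */
	for (level = 0; len > 8 * (level + 1); level++) {
		idx = (prefix >> (24 - 8 * level)) & 0xff;
		if (n->child[idx] == NULL &&
		    (n->child[idx] = calloc(1, sizeof(*n))) == NULL)
			return (-1);
		n = n->child[idx];
	}
	/* Expand the remaining 0..8 bits over all slots they match. */
	bits = len - 8 * level;
	span = 1u << (8 - bits);
	base = ((prefix >> (24 - 8 * level)) & 0xff) & ~(span - 1);
	for (i = base; i < base + span; i++) {
		if (n->plen[i] <= len) {	/* keep more specific routes */
			n->nhop[i] = nhop;
			n->plen[i] = (uint8_t)len;
		}
	}
	return (0);
}

/* Longest-prefix match: at most four array accesses, no backtracking. */
static uint16_t
mt_lookup(const struct mt_node *root, uint32_t dst)
{
	const struct mt_node *n = root;
	uint16_t best = 0;
	unsigned idx;
	int shift;

	for (shift = 24; n != NULL && shift >= 0; shift -= 8) {
		idx = (dst >> shift) & 0xff;
		if (n->nhop[idx] != 0)
			best = n->nhop[idx];	/* deeper level = longer prefix */
		n = n->child[idx];
	}
	return (best);
}
```

The root is expected to be a zeroed node, e.g. from calloc(1, sizeof(struct mt_node)). The price for the fixed lookup cost is memory: every node carries 256 slots whether they are used or not.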

>
>> One of the key features making all such products possible (DPDK, netmap,
>> PacketShader, Cisco SW
>> forwarding) is the use of batching instead of the process-to-completion model.
>> Batching mitigates locking cost, batching does not wash out the CPU
>> cache, and so on.
>
> The work has to be done eventually.  Batching doesn't relieve from it.
> IMHO batch moving is only the last step we should look at.  It makes
> the stack rather complicated and introduces other issues like packet
> latency.
>
>> So maybe we can consider passing batches from NIC to at least L2 
>> layer with netisr? or even up to
>> ip_input() ?
>
> And then?  You probably won't win much in the end (if the lock path
> is optimized).
At least I can firewall them "all at once". The next steps depend on how we
can solve the egress ifp problem.
But yes, this is definitely not the first thing to do.
>
>> Another question is about making some sort of reliable GC like 
>> ("passive serialization" or other
>> similar not-to-pronounce-words about Linux and lockless objects).
>
> Rmlocks are our secret weapon and just as good.
>
>> P.S. Attached patches are 1) for 8.x 2) mostly 'hacks' showing
>> roughly how this can be done and what
>> benefit can be achieved.
>