m_pkthdr.rcvif dangling pointer problem
Vadim Goncharov
vadim_nuclight at mail.ru
Sun Jul 17 12:14:36 UTC 2011
Hi Gleb Smirnoff!
On Thu, 14 Jul 2011 19:44:57 +0400; Gleb Smirnoff wrote about 'm_pkthdr.rcvif dangling pointer problem':
> 1) Every m_pkthdr.rcvif should if_ref() the interface. Releasing
> references can be done in the mbuf allocator: mb_dtor_mbuf(),
> mb_dtor_pack(). I'm afraid this approach would degrade performance,
> since it adds at least a couple of atomic ops to every mbuf for its
> lifetime. That is +2 atomic ops per packet.
Conceptually, this is the most correct solution, though not the fastest.
Ways to improve it should be sought from this starting point. However,
are those +2 atomic ops per packet really so expensive? How many
atomic ops are already on that path? Any measurements?
> 2) kib@ suggested to allocate ifnets from a UMA_ZONE_NOFREE zone.
> I've made a compilable & working patch:
> http://people.freebsd.org/~glebius/patches/ifnet.no_free
> But on second thought I find this a bad idea; it is just fooling
> INVARIANTS. Yes, we avoid thrashing of freed memory and its rewriting
> by some other kernel allocation. But our pointer still points to an
> invalid ifnet. Even if we check the IFF_DYING flag, we still
> cannot detect that the interface has been re-allocated for a new
> instance. This would not be a panic condition, but subtle bugs in
> firewalls.
Indeed, as a long-term solution this is not the right approach.
> 3) As we now have a straight if_index table that can grow, what about
> storing the if_index in the m_pkthdr? Lookup of interface by index
> is fast enough if done lockless. Doing it lockless isn't perfect, but
> better than the current pointer dereferencing. Optionally it could be
> done with locking and with taking a reference. To avoid the situation
> of getting a re-allocated interface with the same index,
> we can use a unique cookie that is incremented in if_alloc().
> Something like:
> struct ifnet *
> mbuf_rcvif(struct mbuf *m)
> {
[...]
> ifp = ifnet_byindex_locked(m->m_pkthdr.rcvif_idx);
[...]
> The size of struct mbuf isn't going to change on amd64:
> @@ -111,7 +111,8 @@
> * Record/packet header in first mbuf of chain; valid only if M_PKTHDR is set.
> */
> struct pkthdr {
> - struct ifnet *rcvif; /* rcv interface */
> + uint32_t rcvif_idx; /* rcv interface index */
> + uint32_t rcvif_unique; /* rcv interface unique id */
> A proof-of-concept patch is available here:
> http://people.freebsd.org/~glebius/patches/pkthdr.rcvif_idx
> It doesn't cover the entire kernel; LINT won't compile. It covers kern, netinet,
> netinet6, several interface drivers and netgraph.
> One of the problems is ongoing namespace pollution: not all code that includes
> mbuf.h includes if_var.h and vice versa. I've cut ifnet from m_devget(), but
> my new function does declare it in its parameter list :(. To deal with that I had
> to declare functions in mbuf.h but implement them in if.c.
> Other problem is net80211 code that abuses the rcvif pointer in mbuf packet
> header for its own purposes casting it to private type. This can be fixed
> utilizing mbuf_tags(9), I think. I haven't made this yet, that's why LINT
> doesn't compile.
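For concreteness, the lookup in the quoted proposal could look roughly like
the following user-space sketch. The field and helper names here are my
assumptions modeled on the description above, not taken from the actual patch:

```c
#include <stddef.h>
#include <stdint.h>

struct ifnet {
	uint32_t if_unique;		/* cookie set at if_alloc() time */
};

struct pkthdr_idx {
	uint32_t rcvif_idx;		/* rcv interface index */
	uint32_t rcvif_unique;		/* rcv interface unique id */
};

/* Tiny stand-in for the global if_index table. */
#define IFTABLE_SIZE 4
static struct ifnet *iftable[IFTABLE_SIZE];

static struct ifnet *
ifnet_byindex(uint32_t idx)
{
	return (idx < IFTABLE_SIZE ? iftable[idx] : NULL);
}

/*
 * Resolve pkthdr index+cookie back to an ifnet, or NULL if the slot
 * was freed or re-used by a newer interface (cookie mismatch).
 */
static struct ifnet *
mbuf_rcvif_sketch(const struct pkthdr_idx *ph)
{
	struct ifnet *ifp = ifnet_byindex(ph->rcvif_idx);

	if (ifp == NULL || ifp->if_unique != ph->rcvif_unique)
		return (NULL);
	return (ifp);
}
```

The cookie check is what rules out the "same index, different interface" case
that a bare IFF_DYING check cannot.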
This solution has several other drawbacks:
1) It slows down the rcvif dereference on _every_ ifnet lookup during the
packet lifetime. That's not just +2 ops; there could be many of them.
2) It requires converting *all* consumers of the rcvif pointer in the kernel.
Solutions that require changing everything usually smell bad and can
introduce bugs, and at least one such case was already found: net80211.
3) It complicates not only the code, but also requires an additional 4 bytes
of pkthdr on non-64-bit architectures (i386, MIPS, ARM?).
> Any alternative ideas on solving the problem are welcome!
Yes, I have one: a hybrid of solutions 1 and 3.
First, we leave all mbuf consumers and rcvif pointer as is.
We then add two new fields:
struct pkthdr {
+ uint16_t gencount; /* iftable version */
+ uint16_t fib; /* may be reserve some bits? */
We need only 16 bits, and fib must be moved out of the mbuf flags anyway.
I'm not sure the FIB should take all 16 bits - we may want to use, e.g.,
12 bits for the FIB number and 4 bits for M_HASHTYPEBITS, which was placed
into the flags with the same hack as M_FIB. This needs thought too, but
let's continue with our problem.
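Purely as an illustration (the macro names and the 12/4 split are my
assumption, not existing code), the packing could look like:

```c
#include <stdint.h>

/* Hypothetical split of the 16-bit field: low 12 bits for the FIB
 * number, high 4 bits for the hash type. */
#define PH_FIB_BITS	12
#define PH_FIB_MASK	((1u << PH_FIB_BITS) - 1)	/* 0x0fff */
#define PH_HASH_SHIFT	PH_FIB_BITS
#define PH_HASH_MASK	0xfu

static inline uint16_t
ph_pack(uint16_t fib, uint16_t hashtype)
{
	return ((uint16_t)((fib & PH_FIB_MASK) |
	    ((hashtype & PH_HASH_MASK) << PH_HASH_SHIFT)));
}

static inline uint16_t
ph_fib(uint16_t v)
{
	return (v & PH_FIB_MASK);
}

static inline uint16_t
ph_hash(uint16_t v)
{
	return (v >> PH_HASH_SHIFT);
}
```

12 bits still allow 4096 FIBs, far more than current kernels are built with.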
So, we have a global variable:
int iftable_gencount; /* generation counter */
this should be updated (atomically) on every operation on the interface
table, that is, on every interface departure and arrival. And when an
interface is created, it gets:
struct ifnet {
+ int gencount; /* iftable version */
at the moment of creation. This value then never changes.
After that, when a new mbuf w/pkthdr is allocated, its mb_ctor_mbuf()
does:
pkthdr.gencount = iftable_gencount & 0xffff;
And this value also doesn't change during the entire mbuf lifetime. At the
same time, along with setting pkthdr.gencount, an entry in a special array,
let's call it a[], is updated:
index: gencount=1 gencount=2 gencount=3
+-----------+-----------+-----------+---.......
|atomic_int |atomic_int |atomic_int |
+-----------+-----------+-----------+---.......
Each entry a[gencount] holds the total number of mbufs in the system with
the same pkthdr.gencount (it is decremented by mb_dtor_mbuf() when an
mbuf is freed). You could say that's atomic too, as in solution 1, but wait
a little.
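A minimal user-space sketch of this accounting, with atomic(9) replaced by
C11 atomics and all names hypothetical:

```c
#include <stdatomic.h>
#include <stdint.h>

#define NGEN 65536	/* 2^16 generations addressable from pkthdr */

static atomic_int iftable_gencount;	/* bumped on if arrival/departure */
static atomic_int a[NGEN];		/* live pkthdrs per generation */

/*
 * Called from the pkthdr constructor: stamp the mbuf with the current
 * generation and count it in.  The return value goes into
 * pkthdr.gencount and never changes afterwards.
 */
static uint16_t
gen_ctor(void)
{
	uint16_t g = (uint16_t)(atomic_load(&iftable_gencount) & 0xffff);

	atomic_fetch_add(&a[g], 1);
	return (g);
}

/* Called from the pkthdr destructor with the stamped generation. */
static void
gen_dtor(uint16_t g)
{
	atomic_fetch_sub(&a[g], 1);
}
```

In-kernel these two hooks would live in mb_ctor_mbuf()/mb_dtor_mbuf() as
described above.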
Now, let's say em0 was created with gencount=1, ng0 with gencount=2, and
ng0 is now being destroyed. The global iftable_gencount becomes 3, and all
new mbufs in the system will get pkthdr.gencount=3, so the counts of
previous generations can now only decrease, never grow. ng0 is if_ref()'ed
once by the generation-counting code and is not released until the SUM over
all previous generations is zero (that is, for both gencount=1 and
gencount=2).
Observe that these counts can only decrease, so this scheme can be
converted to a lockless one! We copy array a[] into every CPU's private
memory, all these copies are accessed without locks, and what now matters
for every generation is the sum of the per-CPU values (this costs, however,
256K of memory on 32-bit archs).
Now, a callout(9) garbage-collector procedure is invoked every N secs and
scans the arrays of all CPUs without atomic(9) and sums them. If a value is
stale in some cache - nothing harmful: the numbers can only decrease, so
the if_rele() will just be deferred to the next pass.
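The summing such a collector pass would do can be sketched as follows (the
array sizes and names are illustrative, not real per-CPU kernel code):

```c
#include <stdint.h>

#define NCPU 4		/* small for illustration */
#define NGENS 8		/* 2^16 in the real scheme */

static int32_t pcpu_a[NCPU][NGENS];	/* per-CPU copies of a[] */

/*
 * An interface created at generation g may be if_rele()'ed once the
 * sum over all CPUs for every generation <= g has drained to zero.
 * Returns 1 when drained, 0 while some mbuf still references that era.
 */
static int
gen_drained(int g)
{
	for (int gen = 0; gen <= g; gen++) {
		int64_t sum = 0;

		for (int cpu = 0; cpu < NCPU; cpu++)
			sum += pcpu_a[cpu][gen];
		if (sum != 0)
			return (0);
	}
	return (1);
}
```

A stale per-CPU value only makes gen_drained() return 0 one pass longer
than necessary; it can never release an interface early.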
Of course, this requires memory, and TCP-like arithmetic for wrapping the
array in the 2^16 space. So the only remaining problem would be some mbuf
living long enough to survive 32767 interface creations and destructions,
which is very unlikely.
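The wrap-around comparison could reuse the same trick as TCP's SEQ_LT();
a hedged sketch for the 16-bit generation space:

```c
#include <stdint.h>

/*
 * True if generation a precedes b, modulo 2^16.  Correct as long as
 * the two never drift more than 32767 generations apart - exactly the
 * limit mentioned above.
 */
static inline int
gen_before(uint16_t a, uint16_t b)
{
	return ((int16_t)(a - b) < 0);
}
```

This is the same signed-difference idiom tcp_seq.h uses for 32-bit
sequence numbers, narrowed to 16 bits.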
Unfortunately, I don't know how to write per-CPU code, so I can't give you
a PoC patch, but I hope I've explained the idea in enough detail.
--
WBR, Vadim Goncharov. ICQ#166852181 mailto:vadim_nuclight at mail.ru
[Moderator of RU.ANTI-ECOLOGY][FreeBSD][http://antigreen.org][LJ:/nuclight]