m_pkthdr.rcvif dangling pointer problem
Robert N. M. Watson
rwatson at freebsd.org
Sun Jul 24 08:44:04 UTC 2011
On 24 Jul 2011, at 04:51, Ryan Stone wrote:
> I ran headlong into this issue today when trying to test some network
> stack changes. It's depressingly easy to crash HEAD by periodically
> destroying vlan interfaces while you are sending traffic over those
> interfaces -- I get a panic within minutes.
>
>> http://people.freebsd.org/~glebius/patches/ifnet.no_free
>
> This patch makes my test system survive longer but does not resolve the issue.
Unfortunately, I'm a bit preoccupied currently, so haven't had a chance to follow up as yet, but just to follow up on the general issue here: this problem existed pre-SMP as well, and could be easily triggered by DUMMYNET and removable interfaces as well (as additional queuing delays just make the problem worse). In general: we need a solution that penalises interface removal, not common-case packet processing. As many packets have their source ifnet looked up in common-case processing (worth checking this assumption) because it's cheap, any solution that causes an interface lookup on every input packet (with synchronisation) is also an issue.
Instead, I think we should go for a more radical notion, which is a bit harder to implement in our stack: the network stack needs a race-free way to "drain" all mbufs referring to a particular ifnet, which does not cause existing processing to become more expensive. This is easy in some subsystems, but more complex in others -- and the composition of subsystems makes it all much harder since we need to know that (to be 100% correct) packets aren't getting passed between subsystems (and hence belong to neither) in a way that races with a sweep through the subsystems. It may be possible to get this 99.9% right simply by providing a series of callbacks into subsystems that cause queues to be walked and drained of packets matching the doomed ifnet. It may also be quite cheap to have subsystems that "hold" packets outside of explicit queues for some period (i.e., in a thread-local pointer out of the stack) add explicit invalidation tests (i.e., for IFF_DYING) before handing off to prevent those packets from traversing into other subsystems -- which can be done synchronisation-free, but still wouldn't 100% prevent the race
Just to give an example: netisr should offer a method for netisr_drain_ifnet(struct ifnet *) that causes netisr to walk all of its queues to find matching packets and free them. Due to direct dispatch and thread-local queues during processing, netisr should also check IFF_DYING before handing off.
If we do that, I wonder how robust the system then becomes...? This may not be too hard to test. But I'd rather we penalise ifnet removal than, say, the IP input path when it needs to check a source interface property.
Robert
More information about the freebsd-net
mailing list