Multiple cores/race conditions in IPv6 RA
Jason
j at scre.ws
Tue Dec 8 05:39:06 UTC 2015
Hi,
It appears the IPv6 router advertisement code paths were written with
very little locking, on the assumption that multiple advertisements would
never be processed concurrently. We are seeing page faults in various
places while processing these messages and modifying the routing table.
We have multiple L3 devices and multiple v6 blocks broadcasting these
messages to hardware with dual uplinks in the same VLAN, which I believe
makes us susceptible to this. I believe the dual uplink alone is enough
to trigger it, though, as it can also be seen in configurations with a
single v6 block.
We are running stable/10 @ r285800, and it doesn't appear anything
relevant has changed since then. Our other widely deployed version is
8.3-RELEASE, which does not see this issue. After upgrading a machine
from 8.3 to 10, we see it start to exhibit this behavior. The only change
I see that might be relevant is r243148, but these cores are relatively
rare, so testing is tough without a considerable deployment. I'm hoping
someone with a trained eye can point us in the right direction before we
go down that road.
Every backtrace looks pretty much like this, with the location in
nd6_rtr differing:
panic: page fault
#0 doadump (textdump=1) at pcpu.h:219
#1 0xffffffff8075fa07 in kern_reboot (howto=260) at
/usr/src/sys/kern/kern_shutdown.c:451
#2 0xffffffff8075fe05 in vpanic (fmt=<value optimized out>, ap=<value
optimized out>) at /usr/src/sys/kern/kern_shutdown.c:758
#3 0xffffffff8075fc93 in panic (fmt=0x0) at
/usr/src/sys/kern/kern_shutdown.c:687
#4 0xffffffff80acdf9b in trap_fatal (frame=<value optimized out>,
eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:851
#5 0xffffffff80ace29d in trap_pfault (frame=0xfffffe0f959b0ff0,
usermode=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:674
#6 0xffffffff80acd93a in trap (frame=0xfffffe0f959b0ff0) at
/usr/src/sys/amd64/amd64/trap.c:440
#7 0xffffffff80ab3932 in calltrap () at
/usr/src/sys/amd64/amd64/exception.S:236
#8 0xffffffff808a5550 in nd6_ra_input (m=<value optimized out>,
off=<value optimized out>, icmp6len=<value optimized out>)
at /usr/src/sys/netinet6/nd6_rtr.c:739
#9 0xffffffff8087f31f in icmp6_input (mp=<value optimized out>,
offp=0xfffffe0f959b167c, proto=<value optimized out>)
at /usr/src/sys/netinet6/icmp6.c:808
#10 0xffffffff808949fc in ip6_input (m=0xfffff8002e743200) at
/usr/src/sys/netinet6/ip6_input.c:1019
#11 0xffffffff80832f02 in netisr_dispatch_src (proto=<value optimized
out>, source=<value optimized out>, m=0x1)
at /usr/src/sys/net/netisr.c:976
#12 0xffffffff8082a226 in ether_demux (ifp=<value optimized out>,
m=0xfffff8002e743200) at /usr/src/sys/net/if_ethersubr.c:851
#13 0xffffffff8082aece in ether_nh_input (m=<value optimized out>) at
/usr/src/sys/net/if_ethersubr.c:646
#14 0xffffffff80832f02 in netisr_dispatch_src (proto=<value optimized
out>, source=<value optimized out>, m=0x1)
at /usr/src/sys/net/netisr.c:976
I'll link to GH for the various relevant bits, because I know everyone
can agree it's the superior RCS. It appears that most of these are
caused by the dr struct being freed by concurrent processing:
https://github.com/freebsd/freebsd/blob/e5ee1c2b414851b17663cb491e2f2317a0af9bda/sys/netinet6/nd6_rtr.c#L578
https://github.com/freebsd/freebsd/blob/e5ee1c2b414851b17663cb491e2f2317a0af9bda/sys/netinet6/nd6_rtr.c#L654
https://github.com/freebsd/freebsd/blob/e5ee1c2b414851b17663cb491e2f2317a0af9bda/sys/netinet6/nd6_rtr.c#L728
https://github.com/freebsd/freebsd/blob/e5ee1c2b414851b17663cb491e2f2317a0af9bda/sys/netinet6/nd6_rtr.c#L739
https://github.com/freebsd/freebsd/blob/e5ee1c2b414851b17663cb491e2f2317a0af9bda/sys/netinet6/nd6_rtr.c#L800
https://github.com/freebsd/freebsd/blob/e5ee1c2b414851b17663cb491e2f2317a0af9bda/sys/netinet6/nd6_rtr.c#L1312
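To illustrate the kind of race I suspect, and one common way to close it,
here is a minimal userspace sketch in C. It is not the nd6_rtr.c code and
not a proposed patch: struct defrouter, dr_lock, dr_insert, dr_lookup_ref,
dr_release, dr_unlink and dr_freed are all hypothetical names, and a
pthread mutex stands in for whatever kernel lock would actually be used.
The idea is that a lookup takes a reference under the lock, so a
concurrent removal can unlink the entry without freeing memory that
another thread is still using.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a default-router entry; not the kernel's struct. */
struct defrouter {
	struct defrouter *next;
	int refcnt;     /* protected by dr_lock */
	int lifetime;   /* example payload */
};

static pthread_mutex_t dr_lock = PTHREAD_MUTEX_INITIALIZER;
static struct defrouter *dr_head;   /* protected by dr_lock */
static int dr_freed;                /* counts frees, for demonstration only */

/* Allocate an entry and link it; the list itself owns one reference. */
static struct defrouter *
dr_insert(int lifetime)
{
	struct defrouter *dr = calloc(1, sizeof(*dr));

	if (dr == NULL)
		return (NULL);
	dr->refcnt = 1;
	dr->lifetime = lifetime;
	pthread_mutex_lock(&dr_lock);
	dr->next = dr_head;
	dr_head = dr;
	pthread_mutex_unlock(&dr_lock);
	return (dr);
}

/* Return the first entry with an extra reference held, or NULL if empty. */
static struct defrouter *
dr_lookup_ref(void)
{
	struct defrouter *dr;

	pthread_mutex_lock(&dr_lock);
	dr = dr_head;
	if (dr != NULL)
		dr->refcnt++;
	pthread_mutex_unlock(&dr_lock);
	return (dr);
}

/* Drop one reference; the entry is freed only when the last one goes away. */
static void
dr_release(struct defrouter *dr)
{
	int last;

	pthread_mutex_lock(&dr_lock);
	assert(dr->refcnt > 0);
	last = (--dr->refcnt == 0);
	if (last)
		dr_freed++;
	pthread_mutex_unlock(&dr_lock);
	if (last)
		free(dr);
}

/* Unlink an entry and drop the list's reference; other holders stay valid. */
static void
dr_unlink(struct defrouter *dr)
{
	struct defrouter **pp;
	int found = 0;

	pthread_mutex_lock(&dr_lock);
	for (pp = &dr_head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == dr) {
			*pp = dr->next;
			found = 1;
			break;
		}
	}
	pthread_mutex_unlock(&dr_lock);
	if (found)
		dr_release(dr);
}
```

In this sketch, a thread that got its pointer from dr_lookup_ref() can
keep reading the entry after another thread calls dr_unlink() on it; the
memory goes away only once the holder calls dr_release(). Without the
reference (or without holding the lock across the whole use), the second
thread's free would leave the first with a dangling pointer, which is
what the page faults above look like to me.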
Thanks for any assistance,
Jason