lagg(4): LOR, deadlock and panic

Tue Jun 14 15:13:44 UTC 2016

tl;dr --> https://reviews.freebsd.org/D6845

Navdeep and I have been poking at a lock order reversal (LOR) that is
popping up in -current and involves lagg(4) and lagg_get_counter().

root@sysdev07:~ # ifconfig lagg0 create laggport ix0 laggproto lacp 192.168.100.11/24
lagg0: link state changed to DOWN
root@sysdev07:~ # ifconfig ix0 up
lock order reversal:
 1st 0xfffff8002d7c9190 if_addr_lock (if_addr_lock) @ /usr/home/sbruno/fbsd_head/sys/net/rtsock.c:1717
 2nd 0xfffff800271a5808 if_lagg rmlock (if_lagg rmlock) @ /usr/home/sbruno/fbsd_head/sys/modules/if_lagg/../../net/if_lagg.c:1057
stack backtrace:
#0 0xffffffff80aa5ab0 at witness_debugger+0x70
#1 0xffffffff80aa59a4 at witness_checkorder+0xe54
#2 0xffffffff80a42521 at _rm_rlock_debug+0x111
#3 0xffffffff82222b2c at lagg_get_counter+0x4c
#4 0xffffffff80b2ebd1 at if_data_copy+0xa1
#5 0xffffffff80b533bc at sysctl_rtsock+0x56c
#6 0xffffffff80a53f0a at sysctl_root_handler_locked+0x8a
#7 0xffffffff80a536c8 at sysctl_root+0x188
#8 0xffffffff80a53cbe at userland_sysctl+0x16e
#9 0xffffffff80a53b14 at sys___sysctl+0x74
#10 0xffffffff80eb5b3b at amd64_syscall+0x2db
#11 0xffffffff80e95c4b at Xfast_syscall+0xfb

Running netstat -w 1 in the background while repeatedly creating and
destroying the lagg0 interface leads to either a panic or a deadlock:

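e.g.

# Poll interface counters once a second in the background.
netstat -w 1 > /dev/null &
# Repeatedly tear down and recreate lagg0 while netstat is polling.
while true; do
	ifconfig lagg0 destroy
	ifconfig lagg0 create laggport ix0 laggproto lacp 192.168.100.11/24
done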

When the system deadlocks, breaking into the debugger on the console
shows these locks held:
KDB: enter: Break to debugger
[ thread pid 11 tid 100007 ]
Stopped at      kdb_alt_break_internal+0x18e:   movq    $0,kdb_why
db> show allocks
No such command
db> show alllocks
Process 2173 (ifconfig) thread 0xfffff8002d125a00 (100186)
exclusive rm if_lagg rmlock (if_lagg rmlock) r = 0 (0xfffff8002717e408) locked @ /usr/home/sbruno/fbsd_head/sys/modules/if_lagg/../../net/if_lagg.c:1530
exclusive sleep mutex in6_multi_mtx (in6_multi_mtx) r = 0 (0xffffffff81d7e288) locked @ /usr/home/sbruno/fbsd_head/sys/netinet6/in6_mcast.c:1142
Process 792 (netstat) thread 0xfffff80027e67a00 (100167)
shared rw if_addr_lock (if_addr_lock) r = 0 (0xfffff80103e95190) locked @ /usr/home/sbruno/fbsd_head/sys/net/rtsock.c:1717
shared rw ifnet_rw (ifnet_rw) r = 0 (0xffffffff81d7b760) locked @ /usr/home/sbruno/fbsd_head/sys/net/rtsock.c:1713
exclusive sleep mutex Giant (Giant) r = 0 (0xffffffff81d55e08) locked @ /usr/home/sbruno/fbsd_head/sys/kern/kern_sysctl.c:164

It looks like netstat triggers a call into the counter function,
lagg_get_counter(), while the interface is being created or destroyed.
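
To illustrate what the LOR report above is saying, here is a minimal
user-space sketch in C of the bookkeeping a witness-style checker does.
It is not the kernel's witness(4) code. The enum names, acquire(),
release() and lor_demo() are all hypothetical, and the "lagg lock first,
then address lock" order on the ifconfig side is an assumption made for
the demo: the report only shows the netstat side (if_addr_lock in
rtsock.c, then the lagg rmlock in lagg_get_counter()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical lock identifiers; the names mirror the locks in the report. */
enum lock_id { IF_ADDR_LOCK, IF_LAGG_RMLOCK, NLOCKS };

static const char *lock_name[NLOCKS] = { "if_addr_lock", "if_lagg rmlock" };

/* order_seen[a][b] is true once some thread has taken b while holding a. */
static bool order_seen[NLOCKS][NLOCKS];

/* Set of locks held by one simulated thread. */
struct held {
	bool locks[NLOCKS];
};

/*
 * Record that thread `t` acquires lock `l`.  Returns true if this
 * acquisition reverses an order recorded earlier, false otherwise.
 */
static bool
acquire(struct held *t, enum lock_id l)
{
	bool reversal = false;

	assert(l < NLOCKS);
	for (int h = 0; h < NLOCKS; h++) {
		if (!t->locks[h] || h == (int)l)
			continue;
		if (order_seen[l][h]) {
			/* Loosely modelled on the report; not the real format. */
			printf("lock order reversal: 1st %s, 2nd %s\n",
			    lock_name[h], lock_name[l]);
			reversal = true;
		}
		order_seen[h][l] = true;
	}
	t->locks[l] = true;
	return (reversal);
}

static void
release(struct held *t, enum lock_id l)
{
	t->locks[l] = false;
}

/* Replay both paths one after the other; returns the reversals seen. */
int
lor_demo(void)
{
	struct held ifconfig_thr = { { false } };
	struct held netstat_thr = { { false } };
	int reversals = 0;

	memset(order_seen, 0, sizeof(order_seen));

	/* ASSUMED create/destroy path: lagg lock, then address lock. */
	reversals += acquire(&ifconfig_thr, IF_LAGG_RMLOCK);
	reversals += acquire(&ifconfig_thr, IF_ADDR_LOCK);
	release(&ifconfig_thr, IF_ADDR_LOCK);
	release(&ifconfig_thr, IF_LAGG_RMLOCK);

	/* Path from the backtrace: address lock, then lagg lock. */
	reversals += acquire(&netstat_thr, IF_ADDR_LOCK);
	reversals += acquire(&netstat_thr, IF_LAGG_RMLOCK);
	release(&netstat_thr, IF_LAGG_RMLOCK);
	release(&netstat_thr, IF_ADDR_LOCK);

	return (reversals);
}
```

Calling lor_demo() from a main() replays the two paths sequentially, so
nothing actually blocks; the checker flags the second path only because
it takes the same two locks in the opposite order. If the two paths run
concurrently instead, each thread can end up holding one lock while
waiting for the other, which would be consistent with the hang seen here.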

Removing the LAGG_RLOCK() calls from lagg_get_counter() makes the
deadlock, LOR and panic go away, but it can't be that easy.  I'm
unsure what the RLOCK is for in lagg_get_counter().  It appears that
a higher-level lock in the ifnet access path already protects against
simultaneous access, but I'm very ignorant of what's going on here.

I don't see any other driver with locks in its get_counter()
functions, so I'm not sure what the best course of action here is.

Sean
