Re: git: 87ee63bac69d - main - locks: add a runtime check for missing turnstile

From: Gleb Smirnoff <glebius_at_freebsd.org>
Date: Tue, 16 Jul 2024 20:19:04 UTC
  Mateusz,

On Mon, Jul 15, 2024 at 07:59:42PM +0200, Mateusz Guzik wrote:
M> > >      locks: add a runtime check for missing turnstile
M> > >
M> > >      There are sometimes bugs which result in the unlock fast path failing,
M> > >      which in turns causes a not-helpful crash report when dereferencing a
M> > >      NULL turnstile. Help debugging such cases by pointing out what happened
M> > >      along with some debug.
M> > >
M> > >      Sponsored by:   Rubicon Communications, LLC ("Netgate")

I am quite surprised to see such commit from you.  You are known to cut a few
instruction branches even pessimizing other people setups (yeah, talking about
commits like 5091ca26507b), and now you are adding an absolutely useless branch
for literally everyone.

The fingerprint of the non-INVARIANTS panic for a destroyed or already unlocked
mutex/rwlock is well known to anybody who has been hacking on the FreeBSD
kernel for the last 20 years.  I am very much surprised it is not familiar to
you.  The sample panic quoted below just immediately rings the bell for me, and
giving feedback on this commit for many people, too.

Please revert this.

Here is what we have at Netflix and I would suggest you to use same practice at
Netgate.  If there is a panic/issue that we anticipate at our setup and we want
extra debugging to be enabled without INVARIANTS, but we don't want to pollute
the open source FreeBSD with our own hacks, we provided NASSERT macro in
kassert.h, that works exactly same as KASSERT, but is enabled always. "N"
stands for Netflix, but it would equally well work for Netgate ;)  So, for a
KASSERT that we want to be always enabled, we just flip one character K -> N
and of course this diff is super easy to resolve during merges.

M> Use-after-free, overflow, underflow, bitflip or what have you all can
M> fail the fast path.
M> 
M> Once that happens and the kernel crashes with a null pointer deref,
M> here is a crash at netgate which prodded this:
M> calltrap() at calltrap+0x8/frame 0xfffffe0106720920
M> --- trap 0xc, rip = 0xffffffff80d5ab70, rsp = 0xfffffe01067209f0, rbp
M> = 0xfffffe0106720a00 ---
M> turnstile_broadcast() at turnstile_broadcast+0x40/frame 0xfffffe0106720a00
M> __rw_wunlock_hard() at __rw_wunlock_hard+0x9e/frame 0xfffffe0106720a30
M> nd6_resolve_slow() at nd6_resolve_slow+0x2d7/frame 0xfffffe0106720aa0
M> nd6_resolve() at nd6_resolve+0x125/frame 0xfffffe0106720b10
M> ether_output() at ether_output+0x4e7/frame 0xfffffe0106720ba0
M> ip_output_send() at ip_output_send+0xdc/frame 0xfffffe0106720be0
M> ip_output() at ip_output+0x1295/frame 0xfffffe0106720ce0
M> ip_forward() at ip_forward+0x3c2/frame 0xfffffe0106720d90
M> ip_input() at ip_input+0x705/frame 0xfffffe0106720df0
M> swi_net() at swi_net+0x138/frame 0xfffffe0106720e60
M> ithread_loop() at ithread_loop+0x257/frame 0xfffffe0106720ef0
M> fork_exit() at fork_exit+0x7f/frame 0xfffffe0106720f30
M> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0106720f30
M> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
M> 
M> Neither the register dump nor anything in the backtrace indicate what happened.
M> 
M> Since the kernel is going down anyway, one may as well get some debug from it.

-- 
Gleb Smirnoff