8.3: kernel panic in bpf.c catchpacket()

Wed Oct 17 21:55:40 UTC 2012

On Oct 17, 2012, at 8:58 AM, Guy Helmer <guy.helmer at gmail.com> wrote:

> On Oct 12, 2012, at 8:54 AM, Guy Helmer <guy.helmer at gmail.com> wrote:
> 
>> 
>> On Oct 10, 2012, at 1:37 PM, Alexander V. Chernikov <melifaro at freebsd.org> wrote:
>> 
>>> On 10.10.2012 00:36, Guy Helmer wrote:
>>>> 
>>>> On Oct 8, 2012, at 8:09 AM, Guy Helmer <guy.helmer at gmail.com> wrote:
>>>> 
>>>>> I'm seeing a consistent new kernel panic in FreeBSD 8.3:
>>>>> I'm not seeing how bd_sbuf would be NULL here. Any ideas?
>>>> 
>>>> Since I've not had any replies, I hope nobody minds if I reply with more information.
>>>> 
>>>> This panic seems to be occasionally triggered now that my userland code is changing the packet filter a while after the bpf device has been opened and an initial packet filter was set (previously, my code did not change the filter after it was initially set).
>>>> 
>>>> I'm focusing on bpf_setf() since that seems to be the place that could be tickling a problem, and I see that bpf_setf() calls reset_d(d) to clear the hold buffer. I have manually verified that the BPFD lock is held during the call to reset_d(), and the lock is held every other place that the buffers are manipulated, so I haven't been able to find any place that seems vulnerable to losing one of the bpf buffers. Still searching, but any help would be appreciated.
>>> 
>>> Can you please check this code on -current?
>>> Locking has changed quite significantly some time ago, so there is a good chance that you can get rid of this panic (or discover a different one which is really "new") :).
>> 
>> I'm not ready to run this app on current, so I have merged revs 229898, 233937, 233938, 233946, 235744, 235745, 235746, 235747, 236231, 236251, 236261, 236262, 236559, and 236806 to my 8.3 checkout to get code that should be virtually identical to current without the timestamp changes.
>> 
>> Unfortunately, I have only been able to trigger the panic in my test lab once -- so I'm not sure whether a lack of problems with the updated code will be indicative of likely success in the field, where this has been triggered regularly at some sites…
>> 
>> Thanks,
>> Guy
>> 
> 
> 
> FWIW, I was able to trigger the panic with the original 8.3 code again in my test lab. With these changes resulting from merging the revs mentioned above, I have not seen any panics in my test lab setup in two days of load testing, and AFAIK, packet capturing seems to be working fine.

Of course, the test system panicked with the same problem in catchpacket() an hour after I wrote this.

(kgdb) where
#0  doadump () at pcpu.h:224
#1  0xffffffff804c8280 in boot (howto=260) at ../../../kern/kern_shutdown.c:441
#2  0xffffffff804c8703 in panic (fmt=0x0) at ../../../kern/kern_shutdown.c:614
#3  0xffffffff8069ffad in trap_fatal (frame=0xffffffff809edbc0, eva=Variable "eva" is not available.
)
    at ../../../amd64/amd64/trap.c:825
#4  0xffffffff806a02e1 in trap_pfault (frame=0xffffff800014a8a0, usermode=0)
    at ../../../amd64/amd64/trap.c:741
#5  0xffffffff806a06bf in trap (frame=0xffffff800014a8a0)
    at ../../../amd64/amd64/trap.c:478
#6  0xffffffff80687cd4 in calltrap () at ../../../amd64/amd64/exception.S:228
#7  0xffffffff8069dc06 in bcopy () at ../../../amd64/amd64/support.S:124
#8  0xffffffff8056f69e in catchpacket (d=0xffffff005aaaf000, 
    pkt=0xffffff0001f46200 "", pktlen=522, snaplen=Variable "snaplen" is not available.
) at ../../../net/bpf.c:2240
#9  0xffffffff8056fc66 in bpf_mtap (bp=0xffffff0001be8c80, 
    m=0xffffff0001f46200) at ../../../net/bpf.c:2064
#10 0xffffffff80579c15 in ether_input (ifp=0xffffff0001b73800, 
    m=0xffffff0001f46200) at ../../../net/if_ethersubr.c:635
#11 0xffffffff802b694a in em_rxeof (rxr=0xffffff0001bca200, count=99, done=0x0)
    at ../../../dev/e1000/if_em.c:4404
#12 0xffffffff802b6db8 in em_handle_que (context=Variable "context" is not available.
)
    at ../../../dev/e1000/if_em.c:1494
#13 0xffffffff80506d85 in taskqueue_run_locked (queue=0xffffff0001be1580)
    at ../../../kern/subr_taskqueue.c:250
---Type <return> to continue, or q <return> to quit---q 
Quit
(kgdb) frame 8
#8  0xffffffff8056f69e in catchpacket (d=0xffffff005aaaf000, 
    pkt=0xffffff0001f46200 "", pktlen=522, snaplen=Variable "snaplen" is not available.
) at ../../../net/bpf.c:2240
warning: Source file is more recent than executable.

2240		bpf_append_bytes(d, d->bd_sbuf, curlen, &hdr, sizeof(hdr));
(kgdb) print *d
$1 = {bd_next = {le_next = 0xffffff0023fff400, le_prev = 0xffffff0001be8c90}, 
  bd_sbuf = 0x0, bd_hbuf = 0xffffff8000ffa000 "??~P", bd_fbuf = 0x0, 
  bd_slen = 0, bd_hlen = 2068, bd_bufsize = 8388608, 
  bd_bif = 0xffffff0001be8c80, bd_rtout = 1, bd_rfilter = 0xffffff0001e6f580, 
  bd_wfilter = 0x0, bd_bfilter = 0x0, bd_rcount = 7, bd_dcount = 0, 
  bd_promisc = 1 '\001', bd_state = 0 '\0', bd_immediate = 1 '\001', 
  bd_writer = 0 '\0', bd_hdrcmplt = 1, bd_direction = 1, bd_feedback = 0, 
  bd_async = 0, bd_sig = 23, bd_sigio = 0x0, bd_sel = {si_tdlist = {
      tqh_first = 0x0, tqh_last = 0x0}, si_note = {kl_list = {
        slh_first = 0x0}, kl_lock = 0xffffffff80497920 <knlist_mtx_lock>, 
      kl_unlock = 0xffffffff804978f0 <knlist_mtx_unlock>, 
      kl_assert_locked = 0xffffffff804945d0 <knlist_mtx_assert_locked>, 
      kl_assert_unlocked = 0xffffffff804945e0 <knlist_mtx_assert_unlocked>, 
      kl_lockarg = 0xffffff005aaaf0d8}, si_mtx = 0x0}, bd_lock = {
    lock_object = {lo_name = 0xffffff0001a5fce0 "bpf", lo_flags = 16973824, 
      lo_data = 0, lo_witness = 0x0}, mtx_lock = 18446742974226712768}, 
  bd_callout = {c_links = {sle = {sle_next = 0x0}, tqe = {tqe_next = 0x0, 
        tqe_prev = 0x0}}, c_time = 0, c_arg = 0x0, c_func = 0, 
    c_lock = 0xffffff005aaaf0d8, c_flags = 0, c_cpu = 0}, bd_label = 0x0, 
  bd_fcount = 7, bd_pid = 89517, bd_locked = 0, bd_bufmode = 1, bd_wcount = 0, 
  bd_wfcount = 0, bd_wdcount = 0, bd_zcopy = 0, bd_compat32 = 0 '\0'}

Now I am thinking that the malloc() of the sbuf is failing, but I am not sure how or why -- I thought malloc(size, M_BPF, M_WAITOK) should not fail?
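
One other way to read the dump above: bd_sbuf and bd_fbuf are both NULL, bd_slen is 0, and bd_hbuf is set with bd_hlen = 2068. As I understand the store/hold/free scheme, that is the state a buffer rotation would leave behind if it ran while there was no free buffer. Below is a minimal userspace sketch of that idea, not the kernel code. The struct and function names (model_d, model_rotate, model_append) are hypothetical, and the rotation order is my assumption about what bpf.c does. It shows one possible path to this state, not a confirmed cause.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model of the three buffer pointers in struct bpf_d. */
struct model_d {
	char	*sbuf;	/* store buffer: new packets are appended here */
	char	*hbuf;	/* hold buffer: handed to the reader */
	char	*fbuf;	/* free buffer: becomes the next store buffer */
	size_t	 slen;	/* bytes used in sbuf */
	size_t	 hlen;	/* bytes used in hbuf */
};

/* Assumed rotation order: store becomes hold, free becomes store. */
static void
model_rotate(struct model_d *d)
{
	assert(d != NULL);
	d->hbuf = d->sbuf;
	d->hlen = d->slen;
	d->sbuf = d->fbuf;	/* NULL if there was no free buffer */
	d->slen = 0;
	d->fbuf = NULL;
}

/*
 * Append len bytes to the store buffer.  Returns 0 on success, or -1 when
 * the store buffer is missing or too full, which is where an unchecked
 * copy would fault.
 */
static int
model_append(struct model_d *d, const void *src, size_t len, size_t bufsize)
{
	assert(d != NULL);
	if (d->sbuf == NULL || d->slen > bufsize || len > bufsize - d->slen)
		return (-1);
	memcpy(d->sbuf + d->slen, src, len);
	d->slen += len;
	return (0);
}
```

In this model, one rotation with a free buffer available is harmless, but a second rotation before any buffer has been returned to fbuf leaves sbuf NULL with slen = 0 and the old store length in hlen, which has the same shape as the dump. If that is what is happening, the question becomes which path rotates (or resets) the buffers without a free buffer on hand.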

Guy