Re: FreeBSD-main-amd64-test - Build #19839 - Failure

From: Kristof Provost <kp_at_FreeBSD.org>
Date: Fri, 19 Nov 2021 01:50:05 UTC
On 18 Nov 2021, at 17:21, John Baldwin wrote:
> On 11/17/21 5:50 PM, jenkins-admin@FreeBSD.org wrote:
>> FreeBSD-main-amd64-test - Build #19839 
>> (4082b189d2ce00674439226c9d5a8bdcafd23d01) - Failure
>>
>> Build information: 
>> https://ci.FreeBSD.org/job/FreeBSD-main-amd64-test/19839/
>> Full change log: 
>> https://ci.FreeBSD.org/job/FreeBSD-main-amd64-test/19839/changes
>> Full build log: 
>> https://ci.FreeBSD.org/job/FreeBSD-main-amd64-test/19839/console
>>
>> Status explanation:
>> "Failure" - the build is suspected being broken by the following 
>> changes
>> "Still Failing" - the build has not been fixed by the following 
>> changes and
>>                    this is a notification to note that these changes 
>> have
>>                    not been fully tested by the CI system
>>
>> Change summaries:
>> (Those commits are likely but not certainly responsible)
>>
>> b928e924f74b0b8f882a9b735611421a93113640 by jhb:
>> rtld-elf: Use _get_tp in __tls_get_addr for aarch64 and riscv64.
>>
>> a8d885296a9dc517e731723081c83d97d2aa598f by jhb:
>> linux_linkat: Don't invert AT_* flags.
>>
>> 8b2ce7a3bbd0a754d31ff3943d918b4c84c831a3 by jhb:
>> linux_name_to_handle_at: Support AT_EMPTY_PATH.
>>
>> 1962164584a91078418afcd7c979afef13df8c4d by jhb:
>> imgact_elf: Use bool instead of boolean_t.
>>
>> 4082b189d2ce00674439226c9d5a8bdcafd23d01 by jhb:
>> elf*_brand_inuse: Change return type to bool.
>
> I don't the panic is related to these commits.
>
>> The end of the build log:
>>
>> [...truncated 4.27 MB...]
>> --- trap 0xc, rip = 0x2100c7156d9a, rsp = 0x4423f48, rbp = 0x4423f60 
>> ---
>
> The useful parts of the panic were earlier in the log:
>
> 01:50:13 sys/netpfil/common/dummynet:pf_queue_v6  ->  epair5a: 
> Ethernet address: 02:b5:98:ea:1d:0a
> 01:50:14 epair5b: Ethernet address: 02:b5:98:ea:1d:0b
> 01:50:14 epair5a: link state changed to UP
> 01:50:14 epair5b: link state changed to UP
> 01:50:14 epair5a: link state changed to DOWN
> 01:50:27 epair5b: link state changed to DOWN
> 01:50:27 passed  [13.831s]
> 01:50:27
> 01:50:27
> 01:50:27 Fatal trap 12: page fault while in kernel mode
> 01:50:27 cpuid = 0; apic id = 00
> 01:50:27 fault virtual address	= 0x10
> 01:50:27 fault code		= supervisor read data, page not present
> 01:50:27 instruction pointer	= 0x20:0xffffffff80e3c60f
> 01:50:27 stack pointer	        = 0x28:0xfffffe00a5d9ec30
> 01:50:27 frame pointer	        = 0x28:0xfffffe00a5d9ed00
> 01:50:27 code segment		= base 0x0, limit 0xfffff, type 0x1b
> 01:50:27 			= DPL 0, pres 1, long 1, def32 0, gran 1
> 01:50:27 processor eflags	= interrupt enabled, resume, IOPL = 0
> 01:50:27 current process		= 0 (dummynet)
> 01:50:27 trap number		= 12
> 01:50:27 panic: page fault
> 01:50:27 cpuid = 0
> 01:50:27 time = 1637200227
> 01:50:27 KDB: stack backtrace:
> 01:50:27 db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 
> 0xfffffe00a5d9e9f0
> 01:50:27 vpanic() at vpanic+0x17f/frame 0xfffffe00a5d9ea40
> 01:50:27 panic() at panic+0x43/frame 0xfffffe00a5d9eaa0
> 01:50:27 trap_fatal() at trap_fatal+0x385/frame 0xfffffe00a5d9eb00
> 01:50:27 trap_pfault() at trap_pfault+0xab/frame 0xfffffe00a5d9eb60
> 01:50:27 calltrap() at calltrap+0x8/frame 0xfffffe00a5d9eb60
> 01:50:27 --- trap 0xc, rip = 0xffffffff80e3c60f, rsp = 
> 0xfffffe00a5d9ec30, rbp = 0xfffffe00a5d9ed00 ---
> 01:50:27 ip6_input() at ip6_input+0x4f/frame 0xfffffe00a5d9ed00
> 01:50:27 netisr_dispatch_src() at netisr_dispatch_src+0xb1/frame 
> 0xfffffe00a5d9ed60
> 01:50:27 dummynet_send() at dummynet_send+0x1dd/frame 
> 0xfffffe00a5d9eda0
> 01:50:27 dummynet_task() at dummynet_task+0x493/frame 
> 0xfffffe00a5d9ee40
> 01:50:27 taskqueue_run_locked() at taskqueue_run_locked+0xaa/frame 
> 0xfffffe00a5d9eec0
> 01:50:27 taskqueue_thread_loop() at taskqueue_thread_loop+0xc2/frame 
> 0xfffffe00a5d9eef0
> 01:50:27 fork_exit() at fork_exit+0x80/frame 0xfffffe00a5d9ef30
> 01:50:27 fork_trampoline() at fork_trampoline+0xe/frame 
> 0xfffffe00a5d9ef30
>
> So I suspect this is a race with teardown of the epair interfaces and 
> whatever
> traffic the sys/netpfil/common/dummynet:pf_queue_v6 test was sending 
> when it
> stopped.  I've cc'd kp@ in case he has any ideas?
>
I think I’ve seen that before, while I was doing the pf/dummynet work.
Dummynet will queue packets, delay them and eventually send them out. If 
between the enqueuing and dequeuing the packet the interface goes away 
we can end up with this panic.

I have a patch to teach dummynet to walk its queues when an interface is 
removed and to drop any packets destined for that interface. I wound up 
not committing it because I was unable to reproduce the problem at the 
time.

Can we reliably trigger this panic? I’d love to be able to add a test 
case for it, especially because the patch is non-trivial.

Here’s that patch. Its commit message has exactly the same backtrace.

Br,
Kristof