Re: aarch64 main [so: 15] panic's in kyua's sys/net/if_lagg_test:status_stress [confirmed with snapshot kernel]

From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 12 Sep 2023 03:11:18 UTC
On Sep 11, 2023, at 19:40, Mark Millard <marklmi@yahoo.com> wrote:

> On Sep 11, 2023, at 01:13, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> It will be some time before I can try this with
>> an official snapshot instead of a personal build.
>> The build is based on b6ce41118bb1 :
>> 
>> # uname -apKU
>> FreeBSD CA78C-WDK23-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT aarch64 1500000 #17 main-n265279-b6ce41118bb1-dirty: Sun Sep 10 14:36:47 PDT 2023     root@CA78C-WDK23-ZFS:/usr/obj/BUILDs/main-CA78C-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA78C arm64 aarch64 1500000 1500000
>> 
>> So it was a non-debug build, although I do not
>> strip symbols and such in my builds.
>> 
>> . . .
>> sys/net/if_lagg_test:create  ->  passed  [0.105s]
>> sys/net/if_lagg_test:create_destroy_stress  ->  skipped: Skipping this test because it easily panics the machine  [0.019s]
>> sys/net/if_lagg_test:lacp_linkstate_destroy_stress  ->  passed  [60.045s]
>> sys/net/if_lagg_test:set_ether  ->  passed  [0.066s]
>> sys/net/if_lagg_test:status_stress  ->  
>> 
>> The core.txt.5 is not great, unfortunately:
>> 
>> panic: vm_fault failed: 0xffff0000006b96dc error 1
>> 
>> GNU gdb (GDB) 13.1 [GDB v13.1 for FreeBSD]
>> . . .
>> Reading symbols from /boot/kernel/kernel...
>> Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...
>> 
>> Unread portion of the kernel message buffer:
>> (dump_iface + 0x2c0)
>> elr: 0xffff0000006b96dc (dump_sa + 0x1c)
>> spsr: 0x0000000000400045
>> far: 0x44572d4338374144
>> esr: 0x0000000096000004
>> panic: vm_fault failed: 0xffff0000006b96dc error 1
>> cpuid = 2
>> time = 1694414226
>> KDB: stack backtrace:
>> db_trace_self() at db_trace_self
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>> vpanic() at vpanic+0x1a0
>> panic() at panic+0x44
>> data_abort() at data_abort+0x304
>> handle_el1h_sync() at handle_el1h_sync+0x14
>> --- exception, esr 0x96000004
>> dump_sa() at dump_sa+0x1c
>> dump_iface() at dump_iface+0x2bc
>> dump_cb() at dump_cb+0x18
>> if_foreach_sleep() at if_foreach_sleep+0x244
>> rtnl_handle_getlink() at rtnl_handle_getlink+0xec
>> rtnl_handle_message() at rtnl_handle_message+0x19c
>> nl_taskqueue_handler() at nl_taskqueue_handler+0x674
>> taskqueue_run_locked() at taskqueue_run_locked+0x194
>> taskqueue_thread_loop() at taskqueue_thread_loop+0xcc
>> fork_exit() at fork_exit+0x88
>> fork_trampoline() at fork_trampoline+0x14
>> KDB: enter: panic
>> 
>> get_curthread () at /usr/main-src/sys/arm64/include/pcpu.h:77
>> 77              __asm __volatile("ldr   %0, [x18]" : "=&r"(td));
>> (kgdb) #0  get_curthread () at /usr/main-src/sys/arm64/include/pcpu.h:77
>> #1  doadump (textdump=0, textdump@entry=4003518992)
>>   at /usr/main-src/sys/kern/kern_shutdown.c:405
>> #2  0xffff0000000f7704 in db_dump (dummy=<optimized out>,      dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>)
>>   at /usr/main-src/sys/ddb/db_command.c:591
>> #3  0xffff0000000f74e0 in db_command (last_cmdp=<optimized out>,      cmd_table=<optimized out>, dopager=true)
>>   at /usr/main-src/sys/ddb/db_command.c:504
>> #4  0xffff0000000f71b8 in db_command_loop ()
>>   at /usr/main-src/sys/ddb/db_command.c:551
>> #5  0xffff0000000fad9c in db_trap (type=<optimized out>, code=<optimized out>)
>>   at /usr/main-src/sys/ddb/db_main.c:268
>> #6  0xffff0000004f4ec4 in kdb_trap (type=60, code=0, tf=<optimized out>)
>>   at /usr/main-src/sys/kern/subr_kdb.c:790
>> #7  <signal handler called>
>> #8  <signal handler called>
>> #9  <signal handler called>
>> #10 <signal handler called>
>> #11 <signal handler called>
>> #12 <signal handler called>
>> #13 <signal handler called>
>> #14 <signal handler called>
>> #15 <signal handler called>
>> #16 <signal handler called>
>> #17 <signal handler called>
>> #18 <signal handler called>
>> #19 <signal handler called>
>> #20 <signal handler called>
>> #21 <signal handler called>
>> #22 <signal handler called>
>> Backtrace stopped: Cannot access memory at address 0x10
>> (kgdb) 
>> 
>> 
>> So here is a transcription from a picture, in order to
>> show the register values that were reported:
>> 
>> Fatal data abort:
>>   x0: 0xffff0001eea0e7f0 (_DYNAMIC + 0x6d816648)
>>   x1: 0x0000000000000001
>>   x2: 0x44572d4338374143
>>   x3: 0xffff0000005d3f90 (ifdead_ioctl + 0x0)
>>   x4: 0xffffa00b7f0d185e
>>   x5: 0xffffa0023fe4b992
>>   x6: 0x000000006767616c
>>   x7: 0x00706174016f7575
>>   x8: 0x00000000000001a4
>>   x9: 0x0000000000210005
>>  x10: 0x0000000000000800
>>  x11: 0xfefefefefefefeff
>>  x12: 0x0000000000000008
>>  x13: 0x0000000000000000
>>  x14: 0x00000000000000ff
>>  x15: 0x0000000000000700
>>  x16: 0x0000000000000008
>>  x17: 0x0000000000000007
>>  x18: 0xffff0001eea0e500 (_DYNAMIC + 0x6d816358)
>>  x19: 0xffff0001eea0e7f0 (_DYNAMIC + 0x6d816648)
>>  x20: 0xffffa00b7f0d1800
>>  x21: 0xffffa00b7f0d1858
>>  x22: 0x000000000000000c
>>  x23: 0x0000000000000005
>>  x24: 0x0000000000000000
>>  x25: 0xffff000000c68000 (sysctl___kern_features_netlink + 0x10)
>>  x26: 0x0000000000000000
>>  x27: 0xffff000000ce9000 (cap_linkat_source_rights + 0x8)
>>  x28: 0xffff0000006bb0a0 (dump_cb + 0x0)
>>  x29: 0xffff0001eea0e520 (_DYNAMIC + 0x6d816378)
>>   sp: 0xffff0001eea0e500
>>   lr: 0xffff0000006b8fe0 (dump_iface + 0x2c0)
>>  elr: 0xffff0000006b96dc (dump_sa + 0x1c)
>> spsr: 0x0000000000400045
>>  far: 0x44572d4338374144
>>  esr: 0x0000000096000004
>> panic: vm_fault failed: 0xffff0000006b96dc error 1
>> 
>> I expect that this is similar to reports I'd made
>> back in 14.0-CURRENT days. As I remember, snapshot
>> builds of the time also got the panic.
>> 
>> I will note that an earlier 14.0-BETA1 snapshot
>> kernel test run did not panic at this point in the
>> sequence (or at any point). But I do not know how
>> repeatable the panics are in the various contexts.
>> 
>> I'll note that I've tried to have the various ports
>> installed (poudriere built) that are listed at:
>> 
>> https://github.com/freebsd/freebsd-ci/blob/master/scripts/build/build-test_image-head.sh#L69-L84
>> 
>> (The ones that build for aarch64, anyway.)
>> 
>> I had in /etc/kyua/kyua.conf :
>> 
>> test_suites.FreeBSD.disks = '/dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5'
>> 
>> and used:
>> 
>> # more ~/prekyua-aarch64-mdconfig.sh 
>> #! /bin/sh
>> truncate -s 4g /var/tmp/for-md0.dat
>> truncate -s 4g /var/tmp/for-md1.dat
>> truncate -s 4g /var/tmp/for-md2.dat
>> truncate -s 4g /var/tmp/for-md3.dat
>> truncate -s 4g /var/tmp/for-md4.dat
>> truncate -s 4g /var/tmp/for-md5.dat
>> mdconfig -f /var/tmp/for-md0.dat -u md0
>> mdconfig -f /var/tmp/for-md1.dat -u md1
>> mdconfig -f /var/tmp/for-md2.dat -u md2
>> mdconfig -f /var/tmp/for-md3.dat -u md3
>> mdconfig -f /var/tmp/for-md4.dat -u md4
>> mdconfig -f /var/tmp/for-md5.dat -u md5
>> 
>> I also did a:
>> 
>> # kldload linux64
>> 
>> before doing:
>> 
>> # /usr/bin/kyua test -k /usr/tests/Kyuafile
>> 
>> (Not true of linux64.ko in 14.0-CURRENT days.)
> 
> # uname -apKU
> FreeBSD CA78C-WDK23-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT aarch64 1500000 #0 main-n265205-03a7c36ddbc0: Thu Sep  7 03:05:31 UTC 2023     root@releng3.nyi.freebsd.org:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC arm64 aarch64 1500000 1500000
> 
> # /usr/bin/kyua test -k /usr/tests/Kyuafile sys/net/if_lagg_test:status_stress
> sys/net/if_lagg_test:status_stress  ->  
> 
> got:
> 
> panic: vm_fault failed: 0xffff0000006813b4 error 1
> 
> GNU gdb (GDB) 13.1 [GDB v13.1 for FreeBSD]
> . . .
> Reading symbols from /boot/kernel/kernel...
> Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...
> 
> Unread portion of the kernel message buffer:
> <6>ue0: 3 link states coalesced
> <6>ue0: link state changed to UP
> <6>lagg0: link state changed to DOWN
> <6>ue0: link state changed to DOWN
> Fatal data abort:
>  x0: 0xffff00015df8d800 (infiniband_input.printedonce + 0x11eff68)
>  x1: 0x0000000000000001
>  x2: 0xdeadc0dedeadc0de
>  x3: 0xffff000000593e34 (ifdead_ioctl + 0x0)
>  x4: 0xffffa0004fb6285e
>  x5: 0xffffa0004fc00192
>  x6: 0x000000006767616c
>  x7: 0x6e6d760070617401
>  x8: 0x00000000000001a4
>  x9: 0xffffa0004fc00000
> x10: 0x0000000000210005
> x11: 0x000000007ffffffe
> x12: 0x0000000000000008
> x13: 0x0000000000000000
> x14: 0x0000000000010000
> x15: 0x0000000000000001
> x16: 0x0000000000010000
> x17: 0x0000000000000007
> x18: 0xffff00015df8d500
> <6>ue0: link state changed to UP
> (infiniband_input.printedonce + 0x11efc68)
> x19: 0xffff00015df8d800 (infiniband_input.printedonce + 0x11eff68)
> x20: 0xffffa0004fb62800
> x21: 0xffffa0004fb62858
> x22: 0x000000000000000c
> x23: 0x0000000000000005
> x24: 0x0000000000000000
> x25: 0xffff000000c58000 (sysctl___net_netlink_debug + 0x40)
> x26: 0x0000000000000000
> x27: 0xffff000000cd9000 (sdt_vfs_vop_vop_spare5_return + 0x10)
> x28: 0xffff000000cd9000 (sdt_vfs_vop_vop_spare5_return + 0x10)
> x29: 0xffff00015df8d520 (infiniband_input.printedonce + 0x11efc88)
>  sp: 0xffff00015df8d500
>  lr: 0xffff000000680cbc (dump_iface + 0x2c0)
> elr: 0xffff0000006813b4 (dump_sa + 0x1c)
> spsr: 0x0000000000400045
> far: 0xdeadc0dedeadc0df
> esr: 0x0000000096000004
> panic: vm_fault failed: 0xffff0000006813b4 error 1
> cpuid = 3
> time = 1694485392
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x19c
> panic() at panic+0x44
> data_abort() at data_abort+0x35c
> handle_el1h_sync() at handle_el1h_sync+0x14
> --- exception, esr 0x96000004
> dump_sa() at dump_sa+0x1c
> dump_iface() at dump_iface+0x2bc
> dump_cb() at dump_cb+0x18
> if_foreach_sleep() at if_foreach_sleep+0x254
> rtnl_handle_getlink() at rtnl_handle_getlink+0xec
> rtnl_handle_message() at rtnl_handle_message+0x19c
> nl_taskqueue_handler() at nl_taskqueue_handler+0x5dc
> taskqueue_run_locked() at taskqueue_run_locked+0x17c
> taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> 
> get_curthread () at /usr/src/sys/arm64/include/pcpu.h:77
> 77              __asm __volatile("ldr   %0, [x18]" : "=&r"(td));
> (kgdb) #0  get_curthread () at /usr/src/sys/arm64/include/pcpu.h:77
> #1  doadump (textdump=0, textdump@entry=1576585744)
>    at /usr/src/sys/kern/kern_shutdown.c:405
> #2  0xffff0000000ec18c in db_dump (dummy=<optimized out>,      dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>)
>    at /usr/src/sys/ddb/db_command.c:591
> #3  0xffff0000000ebf88 in db_command (last_cmdp=<optimized out>,      cmd_table=<optimized out>, dopager=true)
>    at /usr/src/sys/ddb/db_command.c:504
> #4  0xffff0000000ebc80 in db_command_loop ()
>    at /usr/src/sys/ddb/db_command.c:551
> #5  0xffff0000000ef440 in db_trap (type=<optimized out>, code=<optimized out>)
>    at /usr/src/sys/ddb/db_main.c:268
> #6  0xffff0000004b4860 in kdb_trap (type=60, code=0, tf=<optimized out>)
>    at /usr/src/sys/kern/subr_kdb.c:790
> #7  <signal handler called>
> #8  <signal handler called>
> #9  <signal handler called>
> #10 <signal handler called>
> #11 <signal handler called>
> #12 <signal handler called>
> #13 <signal handler called>
> #14 <signal handler called>
> #15 <signal handler called>
> #16 <signal handler called>
> #17 <signal handler called>
> #18 <signal handler called>
> #19 <signal handler called>
> #20 <signal handler called>
> #21 <signal handler called>
> #22 <signal handler called>
> #23 <signal handler called>
> Backtrace stopped: Cannot access memory at address 0x10
> (kgdb) 
> 
> (Again, kgdb's stack frames #7 and larger are not particularly
> useful.)
> 
> Possibly interesting are the slightly different values:
> 
>  x2: 0xdeadc0dedeadc0de
> and:
> far: 0xdeadc0dedeadc0df
> 
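
In both dumps far is exactly x2 + 1, which would fit dump_sa()
loading a one-byte field at offset 1 from the pointer held in x2
(struct sockaddr has sa_len at offset 0 and sa_family at offset 1).
That reading is an assumption, not something checked against the
disassembly of dump_sa+0x1c. 0xdeadc0de is the pattern that
INVARIANTS kernels write over freed memory, and the x2 value from
the non-debug build decodes as ASCII text. A minimal Python sketch
of that check, using only values copied from the dumps above (the
names DUMPS, as_le_bytes, and printable are just for illustration):

```python
# Register values copied from the two "Fatal data abort" dumps quoted above.
DUMPS = {
    "personal non-debug build": {"x2": 0x44572D4338374143, "far": 0x44572D4338374144},
    "snapshot debug build": {"x2": 0xDEADC0DEDEADC0DE, "far": 0xDEADC0DEDEADC0DF},
}


def as_le_bytes(value):
    """Return the 8 bytes of a 64-bit value in little-endian (memory) order."""
    return value.to_bytes(8, "little")


def printable(raw):
    """Render bytes as ASCII, replacing non-printable bytes with '.'."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


for name, regs in DUMPS.items():
    offset = regs["far"] - regs["x2"]
    text = printable(as_le_bytes(regs["x2"]))
    print(f"{name}: far - x2 = {offset}, x2 as bytes = {text!r}")
```

For the non-debug build the eight bytes come out as "CA78C-WD",
the start of the hostname CA78C-WDK23-ZFS shown by uname. That
suggests (but does not prove) that the pointer was loaded from
memory that had been freed and reused.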

So, I again tried the 14.0-BETA1 snapshot:

# uname -apKU
FreeBSD generic 14.0-BETA1 FreeBSD 14.0-BETA1 aarch64 1400097 #0 releng/14.0-n265060-4e027ca1514f: Fri Sep  8 11:17:15 UTC 2023     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC arm64 aarch64 1400097 1400097

and again it did not panic:

# /usr/bin/kyua test -k /usr/tests/Kyuafile sys/net/if_lagg_test:status_stress
sys/net/if_lagg_test:status_stress  ->  passed  [60.111s]

Results file id is usr_tests.20230909-084231-927014
Results saved to /root/.kyua/store/results.usr_tests.20230909-084231-927014.db

1/1 passed (0 failed)


The problem seems to be specific in some way to main [so: 15
at this point]: the 14.0-BETA1 snapshot kernel does not show it.

Given that my personal non-debug builds of main [so: 15]
get the panic and the debug kernel in the snapshot does
as well, it is likely not a debug vs. non-debug issue.
(Although I do not strip symbols or such in my builds.)
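
Both panics report esr 0x96000004. As a sanity check on what kind
of fault that is, here is a small Python sketch that splits the
value into the ESR_ELx fields from the Arm architecture reference
(EC, IL, and within ISS the WnR and DFSC bits for data aborts).
The lookup tables are deliberately partial and the names are
illustrative. The 48-bit virtual address width used for the
canonical-address check is an assumption about the kernel
configuration.

```python
ESR = 0x96000004  # esr reported in both panics
FARS = [0x44572D4338374144, 0xDEADC0DEDEADC0DF]  # far from the two dumps

# Partial tables: only the encodings relevant to these reports.
EC_NAMES = {
    0x24: "data abort from a lower exception level",
    0x25: "data abort taken without a change in exception level",
}
DFSC_NAMES = {
    0x04: "translation fault, level 0",
    0x05: "translation fault, level 1",
    0x06: "translation fault, level 2",
    0x07: "translation fault, level 3",
}


def decode_esr(esr):
    """Split an ESR_ELx value into the fields relevant to a data abort."""
    iss = esr & 0x1FFFFFF  # instruction specific syndrome, bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,  # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,   # 1 = 32-bit instruction
        "wnr": (iss >> 6) & 0x1,   # 0 = read, 1 = write
        "dfsc": iss & 0x3F,        # data fault status code
    }


def is_canonical(addr, va_bits=48):
    """True if bits [63:va_bits] are all 0 or all 1 (va_bits is assumed)."""
    top = addr >> va_bits
    return top == 0 or top == (1 << (64 - va_bits)) - 1


fields = decode_esr(ESR)
print("EC  :", hex(fields["ec"]), EC_NAMES.get(fields["ec"], "not in table"))
print("WnR :", "write" if fields["wnr"] else "read")
print("DFSC:", hex(fields["dfsc"]), DFSC_NAMES.get(fields["dfsc"], "not in table"))
for far in FARS:
    print(hex(far), "canonical" if is_canonical(far) else "not canonical")
```

This gives EC 0x25 (a data abort taken at the same exception level,
so in the kernel), a read rather than a write, and DFSC 0x04
(translation fault, level 0). Neither far value is a canonical
48-bit address, which fits a level 0 translation fault on a garbage
pointer in both the debug and the non-debug kernels.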


===
Mark Millard
marklmi at yahoo.com