FreeBSD 10.3-RELEASE. Kernel panic.

Thu Oct 20 04:23:39 UTC 2016

On 19/10/2016 3:23 AM, Cassiano Peixoto wrote:
> Hi guys,
> 
> I have an update on this issue. After my last email I had 3 crashes.
> Two of them had the same message in the kernel debugger:
> 
> (kgdb) list *0xffffffff8228c918
> 0xffffffff8228c918 is in trim_map_seg_compare
> (/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/trim_map.c:108).
> 103    trim_map_seg_compare(const void *x1, const void *x2)
> 104    {
> 105        const trim_seg_t *s1 = x1;
> 106        const trim_seg_t *s2 = x2;
> 107
> 108        if (s1->ts_start < s2->ts_start) {
> 109            if (s1->ts_end > s2->ts_start)
> 110                return (0);
> 111            return (-1);
> 112        }
> Current language:  auto; currently minimal
> (kgdb) bt
> #0  doadump (textdump=<value optimized out>) at pcpu.h:221
> #1  0xffffffff80ad8e69 in kern_reboot (howto=260) at
> /usr/src/sys/kern/kern_shutdown.c:366
> #2  0xffffffff80ad941b in vpanic (fmt=<value optimized out>, ap=<value
> optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
> #3  0xffffffff80ad9253 in panic (fmt=0x0) at
> /usr/src/sys/kern/kern_shutdown.c:690
> #4  0xffffffff80fa0d31 in trap_fatal (frame=0xfffffe02374957f0,
> eva=4294967343) at /usr/src/sys/amd64/amd64/trap.c:841
> #5  0xffffffff80fa0f23 in trap_pfault (frame=0xfffffe02374957f0,
> usermode=0) at /usr/src/sys/amd64/amd64/trap.c:691
> #6  0xffffffff80fa04cc in trap (frame=0xfffffe02374957f0) at
> /usr/src/sys/amd64/amd64/trap.c:442
> #7  0xffffffff80f84141 in calltrap () at
> /usr/src/sys/amd64/amd64/exception.S:236
> #8  0xffffffff8228c918 in trim_map_seg_compare (x1=0xfffffe0237495920,
> x2=0x100000007) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/trim_map.c:108
> #9  0xffffffff821a98e1 in avl_find (tree=<value optimized out>,
> value=<value optimized out>, where=0x0) at
> /usr/src/sys/cddl/contrib/opensolaris/common/avl/avl.c:268
> #10 0xffffffff8228ce9e in trim_map_write_start (zio=<value optimized out>)
> at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/trim_map.c:363
> #11 0xffffffff822592df in zio_vdev_io_start (zio=0xfffff802191ea000) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:2866
> #12 0xffffffff82255b26 in zio_execute (zio=<value optimized out>) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1556
> #13 0xffffffff822551e9 in zio_nowait (zio=0xfffff802191ea000) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1610
> #14 0xffffffff8223c738 in vdev_queue_io_done (zio=<value optimized out>) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c:884
> #15 0xffffffff822594a9 in zio_vdev_io_done (zio=0xfffff8006daad000) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:2895
> #16 0xffffffff82255b26 in zio_execute (zio=<value optimized out>) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1556
> #17 0xffffffff80b363ca in taskqueue_run_locked (queue=<value optimized
> out>) at /usr/src/sys/kern/subr_taskqueue.c:449
> #18 0xffffffff80b372d8 in taskqueue_thread_loop (arg=<value optimized out>)
> at /usr/src/sys/kern/subr_taskqueue.c:703
> #19 0xffffffff80a90055 in fork_exit (callout=0xffffffff80b371f0
> <taskqueue_thread_loop>, arg=0xfffff8001006b920, frame=0xfffffe0237495c00)
> at /usr/src/sys/kern/kern_fork.c:1038
> #20 0xffffffff80f8467e in fork_trampoline () at
> /usr/src/sys/amd64/amd64/exception.S:611
> #21 0x0000000000000000 in ?? ()
> (kgdb) up 8
> #8  0xffffffff8228c918 in trim_map_seg_compare (x1=0xfffffe0237495920,
> x2=0x100000007) at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/trim_map.c:108
> 108        if (s1->ts_start < s2->ts_start) {
> 
> But my last crash had a different message:
> 
> (kgdb) list *0xffffffff80b3a89c
> 0xffffffff80b3a89c is in turnstile_broadcast
> (/usr/src/sys/kern/subr_turnstile.c:837).
> 832
> 833        /*
> 834         * Transfer the blocked list to the pending list.
> 835         */
> 836        mtx_lock_spin(&td_contested_lock);
> 837        TAILQ_CONCAT(&ts->ts_pending, &ts->ts_blocked[queue], td_lockq);
> 838        mtx_unlock_spin(&td_contested_lock);
> 839
> 840        /*
> 841         * Give a turnstile to each thread.  The last thread gets
> Current language:  auto; currently minimal
> (kgdb) bt
> #0  doadump (textdump=<value optimized out>) at pcpu.h:221
> #1  0xffffffff80ad8e69 in kern_reboot (howto=260) at
> /usr/src/sys/kern/kern_shutdown.c:366
> #2  0xffffffff80ad941b in vpanic (fmt=<value optimized out>, ap=<value
> optimized out>) at /usr/src/sys/kern/kern_shutdown.c:759
> #3  0xffffffff80ad9253 in panic (fmt=0x0) at
> /usr/src/sys/kern/kern_shutdown.c:690
> #4  0xffffffff80fa0d31 in trap_fatal (frame=0xfffffe0237384870, eva=48) at
> /usr/src/sys/amd64/amd64/trap.c:841
> #5  0xffffffff80fa0f23 in trap_pfault (frame=0xfffffe0237384870,
> usermode=0) at /usr/src/sys/amd64/amd64/trap.c:691
> #6  0xffffffff80fa04cc in trap (frame=0xfffffe0237384870) at
> /usr/src/sys/amd64/amd64/trap.c:442
> #7  0xffffffff80f84141 in calltrap () at
> /usr/src/sys/amd64/amd64/exception.S:236
> #8  0xffffffff80b3a89c in turnstile_broadcast (ts=0x0, queue=1) at
> /usr/src/sys/kern/subr_turnstile.c:837
> #9  0xffffffff80ad48cf in __rw_wunlock_hard (c=0xfffff8024f3c2960,
> tid=<value optimized out>, file=<value optimized out>, line=<value
> optimized out>)
>     at /usr/src/sys/kern/kern_rwlock.c:1027
> #10 0xffffffff80e1a75c in vm_map_delete (map=<value optimized out>,
> start=<value optimized out>, end=<value optimized out>) at
> /usr/src/sys/vm/vm_map.c:2960
> #11 0xffffffff80e1828e in vmspace_exit (td=<value optimized out>) at
> /usr/src/sys/vm/vm_map.c:3077
> #12 0xffffffff80a88686 in exit1 (td=0xfffff80015533a00, rval=268849920,
> signo=0) at /usr/src/sys/kern/kern_exit.c:398
> #13 0xffffffff80a87e1d in sys_sys_exit (td=0x0, uap=<value optimized out>)
> at /usr/src/sys/kern/kern_exit.c:178
> #14 0xffffffff80fa168e in amd64_syscall (td=<value optimized out>,
> traced=0) at subr_syscall.c:135
> #15 0xffffffff80f8442b in Xfast_syscall () at
> /usr/src/sys/amd64/amd64/exception.S:396
> #16 0x0000000800b661aa in ?? ()
> Previous frame inner to this frame (corrupt stack?)
> (kgdb) up 8
> #8  0xffffffff80b3a89c in turnstile_broadcast (ts=0x0, queue=1) at
> /usr/src/sys/kern/subr_turnstile.c:837
> 837        TAILQ_CONCAT(&ts->ts_pending, &ts->ts_blocked[queue], td_lockq);
> 
> As you can see, we are dealing with random crashes. I feel I'm not moving
> forward here. It's not a hardware problem, because I have 3 different
> servers with the same issue.
> 
> Donald, did you have a chance to try 11-RELEASE? Any other behavior?
> 
> Does anyone have an idea that could help?
> 
> Thanks.
> 
> 
> On Thu, Oct 13, 2016 at 12:24 PM, Cassiano Peixoto <
> peixotocassiano at gmail.com> wrote:
> 
>> Hi guys,
>>
>> First of all, thanks for sharing your thoughts about this issue. I think it’s
>> really important to find a solution for this issue together.
>>
>> I can see two related behaviors, but I believe the root cause is the same:
>>
>> 1- mpd5 process stuck in the umtxn state
>> 2- system crash
>>
>> I’ve tested recently on FreeBSD 10.3 and FreeBSD-11-RC3. I’ve tried all
>> suggested tunings with no success.
>>
>> My environment is:
>> - About 430 clients connected (but I can add more)
>> - Using ZFS
>> - igb NICs.
>> - Generic kernel
>>
>> Two days ago I updated my system to FreeBSD 11-RELEASE-p1, and since then
>> my system has seemed stable for almost 3 days. No crashes anymore. I need more
>> days to feel confident that something has changed. But anyway, my crashes
>> before happened every day.
>>
>> If it crashes again I’ll apply Donald's recommendations and let you guys know.
>>
>> Let’s keep in touch and try to finally fix it.
>>
>> Thanks.
>>
>> On Wed, Oct 12, 2016 at 8:24 PM, Donald Baud via freebsd-net <
>> freebsd-net at freebsd.org> wrote:
>>
>>> On 10/12/16 3:24 PM, Zaphod Beeblebrox wrote:
>>>
>>>> While my mpd5 servers are possibly less busy (I haven't had frequent
>>>> crashes), I have noticed a "group" of problems.
>>>>
>>>> 1. The carrier dropping communication (ie: fiber cut or l2 switch
>>>> breakage) of the L2TP streams can leave mpd5 in a state where it will not
>>>> die and will not destroy interfaces (requires reboot to clear).
>>>>
>>> I've encountered that once on 10.3, and I tweaked some sysctl values
>>> while monitoring with:
>>> vmstat -z | head -1; vmstat -z | grep -i netgraph
>>>
>>> you might want to search other people's experience with the following
>>> values:
>>> # net.graph.maxdgram   #this is set in /etc/sysctl.conf
>>> # net.graph.recvspace    #this is set in /etc/sysctl.conf
>>> # net.graph.maxdata  #this is set in /boot/loader.conf
>>> # net.graph.maxalloc #this is set in /boot/loader.conf
>>>
>>> I'll leave it to others to comment on which values work best, based on their
>>> experience with FreeBSD 10.3.
>>> In my case, as I explained, one of the recipes that worked for me was
>>> to comment those settings out and leave the kernel values at their defaults.
>>>
>>> I've read on the mpd5 mailing list that FreeBSD 11 has had
>>> upgrades to the netgraph modules.
>>> I am now using FreeBSD 11 and it looks like I don't need any of the
>>> kernel tweaks that I've described.
>>>
>>> Also, may I suggest you troubleshoot the fiber-cut or L2 switch breakage
>>> by using an ipfw rule to simulate a fiber cut,
>>> e.g.: ipfw add 100 deny ip from 10.10.10.10 to me
>>>
>>>> 2. There are race conditions between quagga and mpd5 for adding/dropping
>>>> routes.
>>>>
>>> While troubleshooting the mpd5 crashes, I removed net/quagga
>>> and installed net/bird instead.
>>> I've written a little howto to get you started with net/bird;
>>> see: https://forums.freebsd.org/threads/56988/
>>>
>>>> 3. If A is a PPPoE client and B is the mpd5 server, A cannot access TCP
>>>> services on B.  It can access TCP services _beyond_ B, but not on B (there
>>>> is a ticket open for this).
>>>>
>>>> On Wed, Oct 12, 2016 at 10:51 AM, Donald Baud via freebsd-net <
>>>> freebsd-net at freebsd.org> wrote:
>>>>
>>>>
>>>>     On 10/12/16 1:13 AM, Julian Elischer wrote:
>>>>
>>>>         On 11/10/2016 8:56 PM, Donald Baud via freebsd-net wrote:
>>>>
>>>>             I've been plagued with these =daily= panics until I tried
>>>>             the following recipes and the server has been up for 30
>>>>             days so far:
>>>>
>>>>             Normally I should experiment more to see which one of the
>>>>             recipes is really the fix, but I'm just glad that the
>>>>             server is stable for now.
>>>>
>>>>
>>>>         this is really great information.
>>>>         It makes debugging a lot more possible.
>>>>         I know it is a hard question, but do you have a way to
>>>>         simulate this workload?
>>>>
>>>>         I have no real way to simulate this kind of workload
>>>>
>>>>
>>>>     Sadly, I don't have a way to simulate the workload but I am very
>>>>     interested to help fix these crashes since as Cassiano said, this
>>>>     makes mpd5/freebsd useless for pppoe/l2tp termination.
>>>>
>>>>     At this point, I would suggest that Cassiano and Андрей confirm
>>>>     that they don't get panics when they apply the recipes that I am
>>>>     using.
>>>>
>>>>     I am still running many other cisco-vpdn gateways that I would
>>>>     convert into mpd5/freebsd but my plan was stalled with the daily
>>>>     crashes.
>>>>     I'll wait a couple of weeks to be sure that my recipes are a valid
>>>>     workaround before converting my remaining cisco gateways to mpd5.
>>>>
>>>>     -Dbaud
>>>>
>>>>
>>>>
>>>>             recipe-1: Don't let mpd5 start automatically when server
>>>>             boots:
>>>>             i.e. in: /etc/rc.conf
>>>>             mpd5_enable="NO"
>>>>             and wait about 5 minutes after server boots then issue:
>>>>             /usr/local/etc/rc.d/mpd5 onestart
>>>>
>>>>
>>>>             recipe-2: recompile the kernel with the NETGRAPH_DEBUG option:
>>>>             options         NETGRAPH
>>>>             options         NETGRAPH_DEBUG
>>>>             options         NETGRAPH_KSOCKET
>>>>             options         NETGRAPH_L2TP
>>>>             options         NETGRAPH_SOCKET
>>>>             options         NETGRAPH_TEE
>>>>             options         NETGRAPH_VJC
>>>>             options         NETGRAPH_PPP
>>>>             options         NETGRAPH_IFACE
>>>>             options         NETGRAPH_MPPC_COMPRESSION
>>>>             options         NETGRAPH_MPPC_ENCRYPTION
>>>>             options         NETGRAPH_TCPMSS
>>>>             options         IPFIREWALL
>>>>
>>>>             recipe-3: recompile the kernel and disable the IPv6 and
>>>>             SCTP options:
>>>>             nooptions       INET6
>>>>             nooptions       SCTP
>>>>
>>>>             recipe-4: Don't use any of the sysctl optimizations
>>>>             in other words I commented out all values in sysctl.conf:
>>>>             # net.graph.maxdgram=20480  (this is the default)
>>>>             # net.graph.recvspace=20480  (this is the default)
>>>>
>>>>             recipe-5: Don't use any of the loader.conf optimizations
>>>>             in other words I commented out all values in loader.conf
>>>>             # net.graph.maxdata=4096  (this is the default)
>>>>             # net.graph.maxalloc=4096 (this is the default)
>>>>
>>>>             ================================
>>>>             In my case, I had the panics with 10.3 and 11-PRERELEASE
>>>>             11.0-PRERELEASE FreeBSD 11.0-PRERELEASE #2 r305587
>>>>
>>>>             With those recipes, I have been running without any crash
>>>>             for a month and counting.  That's 300 l2tp tunnels and
>>>>             1400 l2tp sessions generating 700Mbit/s.
>>>>
>>>>
>>>>             -DBaud
>>>>
>>>>
>>>>             On Tuesday, October 11, 2016 7:30 AM, Cassiano Peixoto
>>>>             <peixotocassiano at gmail.com> wrote:
>>>>             Hi,
>>>>
>>>>             There are many users complaining about this:
>>>>
>>>>             https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=186114
>>>>
>>>>             I've been dealing with this issue for one year with no
>>>>             solution. mpd5 as a PPPoE server on FreeBSD is useless
>>>>             with this bug.
>>>>
>>>>             I would really like to see it working again; I think it's
>>>>             quite important to both the project and many users.
>>>>
>>>>             Thanks.
>>>>
>>>>             On Tue, Oct 11, 2016 at 3:24 AM, Eugene Grosbein
>>>>             <eugen at grosbein.net> wrote:
>>>>
>>>>                 11.10.2016 11:02, Андрей Леушкин wrote:
>>>>
>>>>                     Hello. I have problem with "FreeBSD nas
>>>>                     10.3-RELEASE FreeBSD 10.3-RELEASE
>>>>                     #0: Fri Oct  7 21:12:56 YEKT 2016
>>>>                     nas at nas:/usr/obj/usr/src/sys/nasv3
>>>>                        amd64"
>>>>
>>>>                     Kernel panic is repeated at intervals of 2-3 days.
>>>>                     At first I thought that
>>>>                     the problem is in the hardware, but the problem
>>>>                     did not go away after
>>>>                     replacing the server platform.
>>>>
>>>>                     Coredumps and more info at this link:
>>>>                     https://drive.google.com/open?id=0BxciMy2q7ZjTTkIxem9wTE1tM2M
>>>>
>>>>                     Sorry for my english.
>>>>                     I'll wait for an answer.
>>>>
>>>>                 This is a known and long-standing problem in the FreeBSD
>>>>                 network stack.
>>>>                 It shows up when you have lots of network interfaces
>>>>                 created/removed frequently,
>>>>                 as in your case of a Network Access Server (PPTP,
>>>>                 PPPoE etc).
>>>>
>>>>                 Generally, people run into this problem using the mpd5
>>>>                 network daemon.
>>>>                 mpd5 uses the NETGRAPH kernel subsystem to process
>>>>                 traffic, and if an interface disappears (e.g., a user
>>>>                 disconnected) while the kernel is still processing
>>>>                 traffic obtained from this interface, it panics.
>>>>
>>>>                 There have been lots of reports of this problem. No one
>>>>                 seems to be working on it at the moment.
>>>>                 You should file a PR using Bugzilla and attach your
>>>>                 logs to it.
>>>>
>>>>                 Eugene Grosbein
>>>>
>>>>
>>> _______________________________________________
>>> freebsd-net at freebsd.org mailing list
>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
>>> To unsubscribe, send any mail to "freebsd-net-unsubscribe at freebsd.org"
>>>
>>
>>
> 

For anyone experiencing these mpd hangs/crashes, if you believe your
issue is the same as that described in Issue 186114 [1], please add your
comments there including full system version information and crash
backtraces (*as attachments*) if experiencing panics.
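
The details requested above could be gathered along the following lines.
This is only a minimal sketch, not an official procedure: it assumes crash
summaries are in /var/crash and named core.txt.N (as crashinfo(8) normally
writes them), and the function name collect_report and the output directory
are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: gather version info and crash summaries for a PR.
# Assumption: crashinfo(8) summaries live in /var/crash as core.txt.N.

collect_report() {
    outdir=$1
    crashdir=${2:-/var/crash}
    mkdir -p "$outdir" || return 1

    # Full system version information.
    {
        uname -a
        # freebsd-version(1) exists only on FreeBSD; skip it elsewhere.
        if command -v freebsd-version >/dev/null 2>&1; then
            freebsd-version -ku
        fi
    } > "$outdir/version.txt" 2>&1

    # Netgraph sysctl values discussed in this thread; any error message
    # (e.g. netgraph not loaded) is recorded in the file instead.
    sysctl net.graph > "$outdir/sysctl-netgraph.txt" 2>&1 || true

    # Copy the text crash summaries, which normally contain the backtrace.
    for f in "$crashdir"/core.txt.*; do
        if [ -f "$f" ]; then
            cp "$f" "$outdir/"
        fi
    done
    return 0
}

collect_report ./pr186114-report || echo "could not write report" >&2
```

The files left in ./pr186114-report can then be added to the PR as
attachments rather than pasted inline.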

Resolution of this problem is contingent on a clear test/reproduction
case (ideally as reduced as possible).

[1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=186114

./koobs