From nobody Fri Sep 01 13:05:28 2023 X-Original-To: dev-commits-src-all@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4RcdZP5cmFz4sTBX; Fri, 1 Sep 2023 13:06:05 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-yw1-x112b.google.com (mail-yw1-x112b.google.com [IPv6:2607:f8b0:4864:20::112b]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1D4" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4RcdZP2wNRz4PYt; Fri, 1 Sep 2023 13:06:05 +0000 (UTC) (envelope-from mavbsd@gmail.com) Authentication-Results: mx1.freebsd.org; none Received: by mail-yw1-x112b.google.com with SMTP id 00721157ae682-58dfe2d5b9aso25722257b3.1; Fri, 01 Sep 2023 06:06:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1693573564; x=1694178364; darn=freebsd.org; h=content-transfer-encoding:in-reply-to:subject:from:references:cc:to :content-language:user-agent:mime-version:date:message-id:sender :from:to:cc:subject:date:message-id:reply-to; bh=OCO1weCbfhqf+u9KyqLeVA0T1sU7cwxAhDE8XxLi/ds=; b=e4aRg83cwfN9LAtHlIB/nPz2usVqOGd3O3u70uAUoAosxPTdp+nAjJvad/Sehe90HV d2+ltSnT3QkXh9z1v56ZR82lvmyH/uHWRIjk+mEVzDsduc0be84EvQ0S/sDqVRePdIJA 1Sgw6pVf0zowquc0tnh4bTvxU9K5O1WuyGf/lgbbBuAGGUbZFdTda9A+6vW5toZ8e+qd q4sEHpmZ0ODc1c2AoAIsD/1XXXksXmgCc2j/pso6i9AglaTHTN+gxsBSFtkKnA8ypul0 6z6Ak+Oxn4ohtQZfhY6UJlQ1CLsqpJEEbqMApQrAmiQAVI8xo2P20ElcRc1RK0cpHMIt YWUQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1693573564; x=1694178364; h=content-transfer-encoding:in-reply-to:subject:from:references:cc:to :content-language:user-agent:mime-version:date:message-id:sender :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=OCO1weCbfhqf+u9KyqLeVA0T1sU7cwxAhDE8XxLi/ds=; b=RP69FlOqjPfW42sTMsewkneLq581HI+dJjkcMZdk1WlGU2SoH+QCPKT0LXaLv4U2t8 p3Pu2HjOiVvH6JIy4v6SIM0Cx9QXkwAb/0XDrUv0+s8Db9vu8H+b6ia+k/PODd+n4vj+ CzJXOKInh9nV30BuNW03uwZgquwW7sg5f8t9JMN1g5XwAMCQPVAwLCLMJRBH0odsdB1o 2q7dqqUyZSEI3X3CN4k/YMLuE3zxdlGERZpZhalLdS7hAPLkhxCeKSG7H5G3uO4piLPS mLPJKjM1afeU0ADqo/cNnGbM8wFMX6BkHZ0gMSzdttHN52Vk6qKmJxe3Z3YNkD94wM9Z 4Gig== X-Gm-Message-State: AOJu0YyOiodCh8QxdNilxaJoNpDAF95p8QkJS+TDjbyFVez24MjVCOST d1l/uhasWxpm9yEr0q5gaqw= X-Google-Smtp-Source: AGHT+IEJ3TIMA85ApzZyUj5Ouj0AxbFDLYzACrN6KgFRpJQwnTDMFB0ikgp5DP7rzXSP/zQeMU7EBg== X-Received: by 2002:a81:a013:0:b0:592:975b:376a with SMTP id x19-20020a81a013000000b00592975b376amr3636637ywg.0.1693573564161; Fri, 01 Sep 2023 06:06:04 -0700 (PDT) Received: from [10.230.45.5] ([38.32.73.2]) by smtp.gmail.com with ESMTPSA id t76-20020a0dea4f000000b005922c29c025sm1031745ywe.108.2023.09.01.06.06.02 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 01 Sep 2023 06:06:03 -0700 (PDT) Message-ID: <3a6f0cf5-2c1d-74fc-8223-1c106dbde2a2@FreeBSD.org> Date: Fri, 1 Sep 2023 09:05:28 -0400 List-Id: Commit messages for all branches of the src repository List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-dev-commits-src-all@freebsd.org X-BeenThere: dev-commits-src-all@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0 Content-Language: en-US To: Cy Schubert Cc: Warner Losh , Gleb Smirnoff , Drew Gallatin , Martin Matuska , src-committers , dev-commits-src-all@freebsd.org, dev-commits-src-main@freebsd.org References: <20230901001959.E463A1C3@slippy.cwsent.com> From: Alexander Motin Subject: Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad In-Reply-To: <20230901001959.E463A1C3@slippy.cwsent.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Spamd-Bar: ---- X-Rspamd-Pre-Result: action=no action; module=replies; Message is reply to one we originated X-Spamd-Result: default: False [-4.00 / 15.00]; REPLY(-4.00)[]; ASN(0.00)[asn:15169, ipnet:2607:f8b0::/32, country:US] X-Rspamd-Queue-Id: 4RcdZP2wNRz4PYt On 31.08.2023 20:19, Cy Schubert wrote: > Cy Schubert writes: >> In message <20230831234437.0EB38133@slippy.cwsent.com>, Cy Schubert writes: >>> In message <20230831234023.917DFAB@slippy.cwsent.com>, Cy Schubert writes: >>>>>> One of the two machines is hung. >>>>>> >>>>>> cwfw# ping bob >>>>>> PING bob (10.1.1.7): 56 data bytes >>>>>> ^C >>>>>> --- bob ping statistics --- >>>>>> 2 packets transmitted, 0 packets received, 100.0% packet loss >>>>>> cwfw# console bob >>>>>> [Enter `^Ec?' for help] >>>>>> [halt sent] >>>>>> KDB: enter: Break to debugger >>>>>> [ thread pid 31259 tid 100913 ] >>>>>> Stopped at kdb_break+0x48: movq $0,0xa1069d(%rip) >>>>>> db> bt >>>>>> Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000 >>>>>> kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef2d0 >>>>>> uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef310 >>>>>> intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef3 >> 80 >>>>>> intr_execute_handlers() at intr_execute_handlers+0x63/frame >>>>>> 0xfffffe00c53ef3b0 >>>>>> Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef3b0 >>>>>> --- interrupt, rip =3D 0xffffffff806d5c70, rsp =3D 0xfffffe00c53ef480 >> , >>> rb >>>> = >>>>> p =3D >>>>>> 0xfffffe00c53ef480 --- >>>>>> getbinuptime() at getbinuptime+0x30/frame 0xfffffe00c53ef480 >>>>>> arc_access() at arc_access+0x250/frame 0xfffffe00c53ef4d0 >>>>>> arc_buf_access() at arc_buf_access+0xd0/frame 0xfffffe00c53ef4f0 >>>>>> dbuf_hold_impl() at dbuf_hold_impl+0xf3/frame 0xfffffe00c53ef580 >>>>>> dbuf_hold() at dbuf_hold+0x25/frame 0xfffffe00c53ef5b0 >>>>>> dnode_hold_impl() at dnode_hold_impl+0x194/frame 0xfffffe00c53ef670 >>>>>> dmu_bonus_hold() at dmu_bonus_hold+0x20/frame 0xfffffe00c53ef6a0 >>>>>> zfs_zget() at zfs_zget+0x20d/frame 0xfffffe00c53ef750 >>>>>> zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7 >> a0 >>>>>> zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0 >>>>>> zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0 >>>>>> zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame >>>>>> 0xfffffe00c53ef9e0 >>>>>> vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30 >>>>>> vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0 >>>>>> namei() at namei+0x2e1/frame 0xfffffe00c53efb20 >>>>>> vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0 >>>>>> kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0 >>>>>> ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30 >>>>>> int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff8 >> 9d >>> c >>>>>> db> >>>>>> >>>>>> I'll let it continue. Hopefully the watchdog timer will pop and we ge >> t >>> a >>>>>> dump. >>>>>> >>>>> >>>>> >>>>> Might also be interesting to see if this moves around or is really hung >>>>> getting the time. I suspect it's live lock given this traceback. >>>> >>>> It's moving around. >>>> >>>> >>>> db> c >>>> [halt sent] >>>> KDB: enter: Break to debugger >>>> [ thread pid 31259 tid 100913 ] >>>> Stopped at kdb_break+0x48: movq $0,0xa1069d(%rip) >>>> db> bt >>>> Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000 >>>> kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef410 >>>> uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef450 >>>> intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef4c0 >>>> intr_execute_handlers() at intr_execute_handlers+0x63/frame >>>> 0xfffffe00c53ef4f0 >>>> Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef4f0 >>>> --- interrupt, rip = 0xffffffff807117c4, rsp = 0xfffffe00c53ef5c0, rbp = >>>> 0xfffffe00c53ef5c0 --- >>>> lock_delay() at lock_delay+0x14/frame 0xfffffe00c53ef5c0 >>>> _sx_xlock_hard() at _sx_xlock_hard+0x1a9/frame 0xfffffe00c53ef660 >>>> _sx_xlock() at _sx_xlock+0xb0/frame 0xfffffe00c53ef6a0 >>>> zfs_zget() at zfs_zget+0x1fa/frame 0xfffffe00c53ef750 >>>> zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0 >>>> zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0 >>>> zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0 >>>> zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame >>>> 0xfffffe00c53ef9e0 >>>> vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30 >>>> vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0 >>>> namei() at namei+0x2e1/frame 0xfffffe00c53efb20 >>>> vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0 >>>> kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0 >>>> ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30 >>>> int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89dc >>>> db> c >>> >>> It seems to have settled there for now. >> >> I manually panicked the machine. It captured a good dump of the above >> backtrace. >> >> This is the amd64 machine that was building i386 packages under poudriere. >> I purposely set -J high to stress it enough to panic the machine. >> >> This ker c¦`bfnel includes amotin's patch from https://github.com/openzfs/zfs/pull >> /15228. This manually triggered panic is different from the one it suffered >> two days ago. >> >> The machine building amd64 packages is still running nicely but it's >> building some large packages. Because of this there's not as much stress on >> zfs. The panicked machine was building a lot of small packages. As such it >> was doing a lot of zfs clones, destroys, and rollbacks. > > An initial look at the dump: > > (kgdb) bt > #0 __curthread () at /opt/src/git-src/sys/amd64/include/pcpu_aux.h:57 > #1 doadump (textdump=textdump@entry=1) > at /opt/src/git-src/sys/kern/kern_shutdown.c:405 > #2 0xffffffff806c1b30 in kern_reboot (howto=260) > at /opt/src/git-src/sys/kern/kern_shutdown.c:526 > #3 0xffffffff806c202f in vpanic (fmt=0xffffffff80b8653c "from debugger", > ap=ap@entry=0xfffffe00c53eef30) > at /opt/src/git-src/sys/kern/kern_shutdown.c:970 > #4 0xffffffff806c1dd3 in panic (fmt=) > at /opt/src/git-src/sys/kern/kern_shutdown.c:894 > #5 0xffffffff80413937 in db_panic (addr=, > have_addr=, count=, modif=) > at /opt/src/git-src/sys/ddb/db_command.c:531 > #6 0xffffffff80412ddd in db_command (last_cmdp=, > cmd_table=, dopager=true) > at /opt/src/git-src/sys/ddb/db_command.c:504 > #7 0xffffffff80412a9d in db_command_loop () > at /opt/src/git-src/sys/ddb/db_command.c:551 > #8 0xffffffff80416176 in db_trap (type=, code= out>) > at /opt/src/git-src/sys/ddb/db_main.c:268 > #9 0xffffffff80710993 in kdb_trap (type=type@entry=3, code=code@entry=0, > tf=tf@entry=0xfffffe00c53ef340) at /opt/src/git-src/sys/kern/subr_kdb.c: > 790 > #10 0xffffffff80a72cb9 in trap (frame=0xfffffe00c53ef340) > at /opt/src/git-src/sys/amd64/amd64/trap.c:608 > #11 > #12 kdb_enter (why=, msg=) > at /opt/src/git-src/sys/kern/subr_kdb.c:556 > #13 kdb_break () at /opt/src/git-src/sys/kern/subr_kdb.c:346 > #14 0xffffffff804dc7f7 in uart_intr_break (arg=0xfffff80006b84400) > at /opt/src/git-src/sys/dev/uart/uart_core.c:298 > #15 uart_intr (arg=0xfffff80006b84400) > at /opt/src/git-src/sys/dev/uart/uart_core.c:439 > #16 0xffffffff8067ae8b in intr_event_handle (ie=ie@entry=0xfffff80005814c00, > > frame=frame@entry=0xfffffe00c53ef500) > at /opt/src/git-src/sys/kern/kern_intr.c:1410 > #17 0xffffffff80a37653 in intr_execute_handlers (isrc=0xfffff8000580c9c8, > --Type for more, q to quit, c to continue without paging--c > frame=0xfffffe00c53ef500) > at /opt/src/git-src/sys/x86/x86/intr_machdep.c:354 > #18 > #19 lock_delay (la=la@entry=0xfffffe00c53ef600) > at /opt/src/git-src/sys/kern/subr_lock.c:124 > #20 0xffffffff806ccfb9 in _sx_xlock_hard (sx=sx@entry=0xfffff80123712340, > x=, opts=opts@entry=0, > file=file@entry=0xffffffff81910597 "/opt/src/git-src/sys/contrib/openzfs > /module/os/freebsd/zfs/zfs_znode.c", line=line@entry=949) > at /opt/src/git-src/sys/kern/kern_sx.c:693 > #21 0xffffffff806ccdb0 in _sx_xlock (sx=sx@entry=0xfffff80123712340, > opts=opts@entry=0, > file=0xffffffff81910597 "/opt/src/git-src/sys/contrib/openzfs/module/os/ > freebsd/zfs/zfs_znode.c", line=line@entry=949) > at /opt/src/git-src/sys/kern/kern_sx.c:330 > #22 0xffffffff816b4ffa in zfs_zget (zfsvfs=zfsvfs@entry=0xfffff80123712000, > obj_num=80130, zpp=zpp@entry=0xfffffe00c53ef760) > at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_znode. > c:949 > #23 0xffffffff816a018d in zfs_dirent_lookup (dzp=dzp@entry=0xfffff80131e6dae > 0, > name=0xfffffe00c53ef8c0 "xlocale", zpp=zpp@entry=0xfffffe00c53ef7b0, > flag=flag@entry=2) > at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_dir.c: > 190 > #24 0xffffffff816a028f in zfs_dirlook (dzp=dzp@entry=0xfffff80131e6dae0, > name=0xfffffe00c4bbdc80 "", name@entry=0xfffffe00c53ef8c0 "xlocale", > zpp=zpp@entry=0xfffffe00c53ef860) > at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_dir.c: > 246 > #25 0xffffffff816b0d20 in zfs_lookup (dvp=0xfffff800c2da9380, > nm=nm@entry=0xfffffe00c53ef8c0 "xlocale", vpp=0xfffffe00c53efd10, > cnp=cnp@entry=0xfffffe00c53efd38, nameiop=0, cr=, > flags=0, > cached=1) > at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_ > os.c:923 > #26 0xffffffff816ab907 in zfs_freebsd_lookup (ap=0xfffffe00c53ef9f0, > cached=1) > at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_ > os.c:4558 > #27 zfs_freebsd_cachedlookup (ap=0xfffffe00c53ef9f0) > at /opt/src/git-src/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_ > os.c:4566 > #28 0xffffffff80797d86 in VOP_CACHEDLOOKUP (dvp=0xfffff800c2da9380, > vpp=0xfffffe00c53efd10, cnp=0xfffffe00c53efd38) at ./vnode_if.h:101 > #29 vfs_cache_lookup (ap=) > at /opt/src/git-src/sys/kern/vfs_cache.c:3068 > #30 0xffffffff807a9487 in VOP_LOOKUP (dvp=0xfffff800c2da9380, > vpp=0xfffffe00c53efd10, cnp=0xfffffe00c53efd38) at ./vnode_if.h:67 > #31 vfs_lookup (ndp=ndp@entry=0xfffffe00c53efcb8) > at /opt/src/git-src/sys/kern/vfs_lookup.c:1252 > #32 0xffffffff807a8531 in namei (ndp=ndp@entry=0xfffffe00c53efcb8) > at /opt/src/git-src/sys/kern/vfs_lookup.c:684 > #33 0xffffffff807d0095 in vn_open_cred (ndp=ndp@entry=0xfffffe00c53efcb8, > flagp=flagp@entry=0xfffffe00c53efdc4, cmode=cmode@entry=0, > vn_open_flags=vn_open_flags@entry=16, cred=0xfffff801ea492800, > fp=0xfffff801b9a54690) at /opt/src/git-src/sys/kern/vfs_vnops.c:331 > #34 0xffffffff807c61f7 in kern_openat (td=0xfffffe00c4eca000, fd=-100, > path=0xffff8a24 , > pathseg=UIO_USERSPACE, flags=1048577, mode=) > at /opt/src/git-src/sys/kern/vfs_syscalls.c:1165 > #35 0xffffffff80a993a6 in syscallenter (td=0xfffffe00c4eca000) > at /opt/src/git-src/sys/amd64/ia32/../../kern/subr_syscall.c:187 > #36 ia32_syscall (frame=0xfffffe00c53eff40) > at /opt/src/git-src/sys/amd64/ia32/ia32_syscall.c:222 > #37 0xffffffff80a47f8f in int0x80_syscall_common () > at /opt/src/git-src/sys/amd64/ia32/ia32_exception.S:75 > #38 0x0000000026a12000 in ?? () > #39 0x0000000000000000 in ?? () > (kgdb) Cy, considering the machine didn't crash or deadlock immediately, and that the only trace you've got from it shows only a lock waiting in VFS layer, it can be anything, including something in vnode life cycle. If you have more than one CPU, then trace from the one arbitrary CPU where serial interrupt is routed may show only some consequences, not the source of problem. To diagnose some deadlocks we would need something like `procstat -akk`. I've asked you in private email yesterday to try experiment with setting sync=disabled on your pools/datasets to bypass ZIL code and see whether the errors persist. It means you may loose few seconds of updates in case of crash, so I do not propose to do it normally, but if you see the problems even after that, it may mean there is more than one problem and they should be handled separately. -- Alexander Motin