Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad
- In reply to: Cy Schubert: "Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad"
Date: Fri, 01 Sep 2023 03:44:11 UTC
On Thu, 31 Aug 2023 20:38:01 -0700 Cy Schubert <Cy.Schubert@cschubert.com> wrote:
> On Thu, 31 Aug 2023 21:18:30 -0500
> Kyle Evans <kevans@FreeBSD.org> wrote:
> 
> > On 8/30/23 20:01, Alexander Motin wrote:
> > > On 30.08.2023 20:09, Gleb Smirnoff wrote:
> > >> Hi Martin,
> > >>
> > >> we hit this panic with this update:
> > >>
> > >> panic: VERIFY(avl_is_empty(&lwb->lwb_vdev_tree)) failed
> > >>
> > >> cpuid = 8
> > >> time = 1693432004
> > >> KDB: stack backtrace:
> > >> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe09519aa800
> > >> vpanic() at vpanic+0x132/frame 0xfffffe09519aa930
> > >> spl_panic() at spl_panic+0x3a/frame 0xfffffe09519aa990
> > >> zil_free_lwb() at zil_free_lwb+0x1de/frame 0xfffffe09519aa9c0
> > >> zil_sync() at zil_sync+0x3cb/frame 0xfffffe09519aaaa0
> > >> [Wed Aug 30 21:46:44 2023]dmu_objset_sync() at dmu_objset_sync+0x51b/frame 0xfffffe09519aab70
> > >> dsl_pool_sync() at dsl_pool_sync+0x11d/frame 0xfffffe09519aabf0
> > >> spa_sync() at spa_sync+0xc68/frame 0xfffffe09519aae20
> > >> txg_sync_thread() at txg_sync_thread+0x2eb/frame 0xfffffe09519aaef0
> > >> fork_exit() at fork_exit+0x82/frame 0xfffffe09519aaf30
> > >> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe09519aaf30
> > >> --- trap 0x26d90a2f, rip = 0xbba54c81b25fd8d4, rsp = 0xff60c2f6f9874bcd, rbp = 0x37940e3e9793bd40 ---
> > >>
> > >> I have two cores dumped for this panic.
> > >>
> > >> Our previous version is based on 9c0a2d522f3a18fc795a7dd6266d2be2aff4b084,
> > >> which I guess has openzfs/zfs@009d3288d as its most recent ZFS merge.
> > >>
> > > Hi Gleb,
> > >
> > > It is the first time I have seen a panic like this. I'll think about it
> > > tomorrow. But I'd appreciate any information on what your workload is
> > > and what you are doing related to the ZIL (O_SYNC, fsync(), sync=always,
> > > etc.) to trigger it. What is your pool configuration?
> > >
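A note on the assertion itself: "VERIFY(...) failed" comes from OpenZFS's
VERIFY() macro, which, unlike ASSERT(), stays armed in production builds and
panics through spl_panic(), the frame visible in the backtrace above. A
minimal userland sketch of the mechanism (simplified and hypothetical;
spl_panic_sketch() is an invented stand-in, and the real macro and
spl_panic() signature in the SPL headers differ in detail):

    #include <stdio.h>
    #include <stdlib.h>

    /* Invented stand-in: report the failed expression, then abort. */
    static void
    spl_panic_sketch(const char *file, const char *func, int line,
        const char *expr)
    {
        fprintf(stderr, "panic: VERIFY(%s) failed\n", expr);
        fprintf(stderr, "  %s:%d (%s)\n", file, line, func);
        abort();
    }

    /* Unlike ASSERT(), this check is not compiled out in release builds. */
    #define VERIFY(expr)                                        \
        do {                                                    \
            if (!(expr))                                        \
                spl_panic_sketch(__FILE__, __func__,            \
                    __LINE__, #expr);                           \
        } while (0)

    int
    main(void)
    {
        int tree_is_empty = 0;    /* simulate a non-empty lwb_vdev_tree */

        VERIFY(tree_is_empty);    /* panics, as in the trace above */
        return (0);
    }

Read this way, the panic means zil_free_lwb() found vdevs still queued in
the lwb's flush tree at the moment the lwb was being freed.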
> > Hello!
> >
> > Probably not related, but I seem to have hit a use-after-free in the
> > ZIL bits; is this something you're interested in? Backtrace is:
> >
> > #17 zio_nowait (zio=0xdeadc0dedeadc0de)
> >     at /usr/src.kbsd/sys/contrib/openzfs/module/zfs/zio.c:2367
> > #18 0xffff00000160720c in zil_lwb_write_issue
> >     (zilog=zilog@entry=0xffffa00021631800, lwb=0xffffa00933d34c40)
> >     at /usr/src.kbsd/sys/contrib/openzfs/module/zfs/zil.c:1935
> > #19 0xffff0000016031c0 in zil_commit_writer (zilog=0xffffa00021631800,
> >     zcw=0xffffa000fb01e140)
> >     at /usr/src.kbsd/sys/contrib/openzfs/module/zfs/zil.c:2985
> > #20 zil_commit_impl (zilog=0xffffa00021631800, foid=<optimized out>)
> >     at /usr/src.kbsd/sys/contrib/openzfs/module/zfs/zil.c:3489
> > #21 0xffff000001602f44 in zil_commit (zilog=<optimized out>,
> >     foid=<optimized out>)
> >     at /usr/src.kbsd/sys/contrib/openzfs/module/zfs/zil.c:3452
> > #22 0xffff0000015fdac8 in zfs_fsync (zp=0xffffa007e27b23a0,
> >     syncflag=<optimized out>, cr=<optimized out>)
> >     at /usr/src.kbsd/sys/contrib/openzfs/module/zfs/zfs_vnops.c:75
> > #23 0xffff00000056a864 in VOP_FSYNC (vp=0xffffa001d2003000, waitfor=1,
> >     td=0xffff00015aa94620) at ./vnode_if.h:763
> > #24 kern_fsync (td=0xffff00015aa94620, fd=<optimized out>,
> >     fullsync=true) at /usr/src.kbsd/sys/kern/vfs_syscalls.c:3556
> > #25 0xffff000000800630 in syscallenter (td=0xffff00015aa94620)
> >     at /usr/src.kbsd/sys/arm64/arm64/../../kern/subr_syscall.c:187
> > #26 svc_handler (td=0xffff00015aa94620, frame=<optimized out>)
> >     at /usr/src.kbsd/sys/arm64/arm64/trap.c:198
> > #27 do_el0_sync (td=0xffff00015aa94620, frame=<optimized out>)
> >     at /usr/src.kbsd/sys/arm64/arm64/trap.c:625
> >
> > If I pop up to the zil_lwb_write_issue frame:
> >
> > (kgdb) print *lwb
> > $4 = {lwb_zilog = 0xdeadc0dedeadc0de, lwb_blk = {blk_dva = {{dva_word =
> >     {16045693110842147038, 16045693110842147038}}, {dva_word =
> >     {16045693110842147038, 16045693110842147038}}, {dva_word =
> >     {16045693110842147038, 16045693110842147038}}},
> >     blk_prop = 16045693110842147038,
> >     blk_pad = {16045693110842147038, 16045693110842147038},
> >     blk_phys_birth = 16045693110842147038,
> >     blk_birth = 16045693110842147038, blk_fill = 16045693110842147038,
> >     blk_cksum = {zc_word = {16045693110842147038, 16045693110842147038,
> >     16045693110842147038, 16045693110842147038}}},
> >     lwb_slim = 3735929054, lwb_slog = 3735929054,
> >     lwb_error = -559038242, lwb_nmax = -559038242,
> >     lwb_nused = -559038242, lwb_nfilled = -559038242,
> >     lwb_sz = -559038242, lwb_state = 3735929054,
> >     lwb_buf = 0xdeadc0dedeadc0de <error: Cannot access memory at
> >     address 0xdeadc0dedeadc0de>,
> >     lwb_child_zio = 0xdeadc0dedeadc0de,
> >     lwb_write_zio = 0xdeadc0dedeadc0de,
> >     lwb_root_zio = 0xdeadc0dedeadc0de,
> >     lwb_issued_timestamp = -2401050962867404578,
> >     lwb_issued_txg = 16045693110842147038,
> >     lwb_alloc_txg = 16045693110842147038,
> >     lwb_max_txg = 16045693110842147038,
> >     lwb_node = {list_next = 0xdeadc0dedeadc0de,
> >     list_prev = 0xdeadc0dedeadc0de},
> >     lwb_issue_node = {list_next = 0xdeadc0dedeadc0de,
> >     list_prev = 0xdeadc0dedeadc0de},
> >     lwb_itxs = {list_size = 16045693110842147038,
> >     list_offset = 16045693110842147038,
> >     list_head = {list_next = 0xdeadc0dedeadc0de,
> >     list_prev = 0xdeadc0dedeadc0de}},
> >     lwb_waiters = {list_size = 16045693110842147038,
> >     list_offset = 16045693110842147038,
> >     list_head = {list_next = 0xdeadc0dedeadc0de,
> >     list_prev = 0xdeadc0dedeadc0de}},
> >     lwb_vdev_tree = {avl_root = 0xdeadc0dedeadc0de,
> >     avl_compar = 0xdeadc0dedeadc0de,
> >     avl_offset = 16045693110842147038,
> >     avl_numnodes = 16045693110842147038},
> >     lwb_vdev_lock = {lock_object = {lo_name = 0xdeadc0dedeadc0de
> >     <error: Cannot access memory at address 0xdeadc0dedeadc0de>,
> >     lo_flags = 3735929054, lo_data = 3735929054,
> >     lo_witness = 0xdeadc0dedeadc0de},
> >     sx_lock = 16045693110842147038}}
> >
> > It seems to have clearly been stomped on by UMA trashing. Encountered
> > while running a pkgbase build, I think while it was in the packaging
> > phase. I note in particular in that frame:
> >
> > (kgdb) p/x lwb->lwb_issued_timestamp
> > $4 = 0xdeadc0dedeadc0de
> >
> > So I guess it was freed sometime during one of the previous two
> > zio_nowait() calls.
> >
> > Thanks,
> >
> > Kyle Evans
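A note on the repeated 0xdeadc0de pattern in that dump: with UMA trashing
enabled, freed items are stamped with a 32-bit junk word so that any later
dereference reads obvious poison rather than plausible data; every repeated
16045693110842147038 above is just 0xdeadc0dedeadc0de in decimal, and
3735929054 is 0xdeadc0de. A minimal userland sketch of the idea (loosely
modeled on the trash handling in sys/vm/uma_dbg.c; trash_fill(), JUNK_WORD,
and fake_lwb are invented names, not kernel identifiers):

    #include <stdint.h>
    #include <stdio.h>

    /* The 32-bit poison word; two of them form 0xdeadc0dedeadc0de. */
    #define JUNK_WORD 0xdeadc0deU

    /* On free, stamp the whole item with the junk word. */
    static void
    trash_fill(void *mem, size_t size)
    {
        uint32_t *p = mem;
        size_t i;

        for (i = 0; i < size / sizeof(*p); i++)
            p[i] = JUNK_WORD;
    }

    int
    main(void)
    {
        uint64_t fake_lwb[4] = { 0 };

        trash_fill(fake_lwb, sizeof(fake_lwb));
        /* A stale pointer now reads back the poison, as in the dump. */
        printf("%#jx\n", (uintmax_t)fake_lwb[0]);  /* 0xdeadc0dedeadc0de */
        return (0);
    }

That every field of *lwb decodes to exactly this pattern is what points at
a use-after-free of a trashed UMA item rather than random bit corruption.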
> I think the four panics on my two downstairs machines two days ago,
> today's panic on one of those same two machines, the email data
> corruption my laptop has been experiencing, and the backtraces I've
> seen Gleb and you post give me the uneasy sense that there may be some
> kind of general memory corruption due to bad or corrupted addresses
> being dereferenced.

Though on second thought, these could be separate and unrelated problems.
The spin loop that forced me to manually panic one of my machines reminds
me of a similar situation on a mainframe, where page tables were being
overwritten by kernel code under development at the time, which likewise
resulted in a spin loop over a corrupted lock.

-- 
Cheers,
Cy Schubert <Cy.Schubert@cschubert.com>
FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
NTP:           <cy@nwtime.org>    Web:  https://nwtime.org

			e^(i*pi)+1=0