Re: kernel panic while copying files

From: Gary Jennejohn <gljennjohn_at_gmail.com>
Date: Tue, 08 Jun 2021 15:54:05 UTC
On Tue, 8 Jun 2021 06:27:04 -0600
Warner Losh <imp@bsdimp.com> wrote:

> On Tue, Jun 8, 2021 at 2:47 AM Gary Jennejohn <gljennjohn@gmail.com> wrote:
> 
[snip old stuff]
> > Here the kgdb backtrace:
> >
> > Unread portion of the kernel message buffer:
> > panic: Duplicate free of 0xfffff800356b9000 from zone
> > 0xfffffe00dcbdd800(da_ccb) slab 0xfffff800356b9fd8(0)
> > cpuid = 8
> > time = 1623140519
> > KDB: stack backtrace:
> > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
> > 0xfffffe00c5f398c0
> > vpanic() at vpanic+0x181/frame 0xfffffe00c5f39910
> > panic() at panic+0x43/frame 0xfffffe00c5f39970
> > uma_dbg_free() at uma_dbg_free+0x1e1/frame 0xfffffe00c5f399b0
> > uma_zfree_arg() at uma_zfree_arg+0x147/frame 0xfffffe00c5f39a00
> > camperiphdone() at camperiphdone+0x1b7/frame 0xfffffe00c5f39b20
> > xpt_done_process() at xpt_done_process+0x3dd/frame 0xfffffe00c5f39b60
> > xpt_done_td() at xpt_done_td+0xf5/frame 0xfffffe00c5f39bb0
> > fork_exit() at fork_exit+0x80/frame 0xfffffe00c5f39bf0
> > fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00c5f39bf0
> > --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
> > KDB: enter: panic
> >
> > __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
> > 55              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n"
> > (offsetof(struct pcpu,
> > (kgdb) bt
> > #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
> > #1  doadump (textdump=textdump@entry=0)
> >     at /usr/src/sys/kern/kern_shutdown.c:399
> > #2  0xffffffff8040c39a in db_dump (dummy=<optimized out>,
> >     dummy2=<unavailable>, dummy3=<unavailable>, dummy4=<unavailable>)
> >     at /usr/src/sys/ddb/db_command.c:575
> > #3  0xffffffff8040c192 in db_command (last_cmdp=<optimized out>,
> >     cmd_table=<optimized out>, dopager=dopager@entry=1)
> >     at /usr/src/sys/ddb/db_command.c:482
> > #4  0xffffffff8040beed in db_command_loop ()
> >     at /usr/src/sys/ddb/db_command.c:535
> > #5  0xffffffff8040f616 in db_trap (type=<optimized out>, code=<optimized  
> > out>)  
> >     at /usr/src/sys/ddb/db_main.c:270
> > #6  0xffffffff8066b1c4 in kdb_trap (type=type@entry=3, code=code@entry=0,
> >     tf=<optimized out>, tf@entry=0xfffffe00c5f397f0)
> >     at /usr/src/sys/kern/subr_kdb.c:727
> > #7  0xffffffff809a4e96 in trap (frame=0xfffffe00c5f397f0)
> >     at /usr/src/sys/amd64/amd64/trap.c:604
> > #8  <signal handler called>
> > #9  kdb_enter (why=0xffffffff80a61a23 "panic", msg=<optimized out>)
> >     at /usr/src/sys/kern/subr_kdb.c:506
> > #10 0xffffffff806207a2 in vpanic (fmt=<optimized out>, ap=<optimized out>,
> >     ap@entry=0xfffffe00c5f39950) at /usr/src/sys/kern/kern_shutdown.c:907
> > #11 0xffffffff80620533 in panic (
> >     fmt=0xffffffff80d635c8 <cnputs_mtx> ".\024\244\200\377\377\377\377")
> >     at /usr/src/sys/kern/kern_shutdown.c:843
> > #12 0xffffffff808e12b1 in uma_dbg_free (zone=0xfffffe00dcbdd800,
> >     slab=0xfffff800356b9fd8, item=0xfffff800356b9000)
> >     at /usr/src/sys/vm/uma_core.c:5664
> > #13 0xffffffff808d9de7 in item_dtor (zone=0xfffffe00dcbdd800,
> >     item=0xfffff800356b9000, size=544, udata=0x0, skip=SKIP_NONE)
> >     at /usr/src/sys/vm/uma_core.c:3418
> > #14 uma_zfree_arg (zone=0xfffffe00dcbdd800, item=0xfffff800356b9000,
> >     udata=udata@entry=0x0) at /usr/src/sys/vm/uma_core.c:4374
> > #15 0xffffffff802da503 in uma_zfree (zone=0xffffffff80d635c8 <cnputs_mtx>,
> >     item=0x200) at /usr/src/sys/vm/uma.h:404
> >  
> 
> OK. This is a bad stack trace. camperiphdone doesn't call uma_zfree()...
> It does call
> xpt_free_ccb, though, and that's likely what's going wrong. And that
> matches the line
> numbers. Most likely this is llvm's tail call optimizations...  Can you
> compile the kernel
> either -O0 or with -fno-optimize-sibling-calls? That will give a better
> call stack.
> 
> However, it's likely the new UMA stuff trasz committed (or it's providing
> better
> diagnostics than the old malloc based code which seems more likely) that
> can be
> disabled by the tunable kern.cam.da.enable_uma_ccbs=0.
> 
> The lines in question:
>         saved_ccb = (union ccb *)done_ccb->ccb_h.saved_ccb_ptr;
>         bcopy(saved_ccb, done_ccb, sizeof(*done_ccb));
>         xpt_free_ccb(saved_ccb);
> 
> So we overwrite the done_ccb with the saved_ccb's contents and then free
> the saved ccb.
> That's likely OKish, though.
> 
> We copy entire CCBs around in this code a lot, and I've not traced through
> it. But we're
> sending a scsi start unit in response to some error that is being reported
> via cam_periph_error()
> 
> #16 0xffffffff802d9117 in camperiphdone (periph=0xfffff800061e2c00,
> >     done_ccb=0xfffff800355d6cc0) at /usr/src/sys/cam/cam_periph.c:1427
> > #17 0xffffffff802dfebd in xpt_done_process (ccb_h=0xfffff800355d6cc0)
> >     at /usr/src/sys/cam/cam_xpt.c:5491
> > #18 0xffffffff802e1ec5 in xpt_done_td (
> >     arg=arg@entry=0xffffffff80d33d80 <cam_doneqs>)
> >     at /usr/src/sys/cam/cam_xpt.c:5546
> > #19 0xffffffff805dad80 in fork_exit (callout=0xffffffff802e1dd0
> > <xpt_done_td>,
> >     arg=0xffffffff80d33d80 <cam_doneqs>, frame=0xfffffe00c5f39c00)
> >     at /usr/src/sys/kern/kern_fork.c:1083
> > #20 <signal handler called>
> >
> > Apparently caused by recent changes to CAM.
> >
> > Let me know if you want more information.
> >  
> 
> what's *periph say?
> 

Well, flags is CAM_PERIPH_RECOVERY_INPROG|CAM_PERIPH_LOCK_WANTED
|CAM_PERIPH_LOCKED.

Here's the entire dump:
(kgdb) p/x *(struct cam_periph *)periph
$1 = {periph_start = 0xffffffff80328a00, periph_oninval = 0xffffffff803287e0,
  periph_dtor = 0xffffffff803288a0, periph_name = 0xffffffff80e545d7,
  path = 0xfffff8000611cdc0, softc = 0xfffffe01117ca000,
  sim = 0xfffff80005873600, unit_number = 0x0, type = 0x0, flags = 0x26,
  scheduled_priority = 0xffffffff, immediate_priority = 0xffffffff,
  periph_allocating = 0x0, periph_allocated = 0x1, refcount = 0x5, ccb_list = {
    slh_first = 0x0}, periph_links = {sle_next = 0xfffff8005d68be00},
  unit_links = {tqe_next = 0x0, tqe_prev = 0xffffffff810530a8},
  deferred_callback = 0x0, deferred_ac = 0x0, periph_run_task = {ta_link = {
      stqe_next = 0x0}, ta_pending = 0x0, ta_priority = 0x0, ta_flags = 0x0,
    ta_func = 0xffffffff802dd4d0, ta_context = 0xfffff8005d68bd00},
  ccb_zone = 0xfffffe00dc9d9800}

path does not seem very useful because it contains a bunch of pointers.

(kgdb) x/s ((struct cam_periph *)periph)->periph_name
0xffffffff80e545d7:     "da"

> camperiphdone is only use when we send a scsi start/stop is sent to try to
> recover
> in a couple of cases:
> switch(ccb->csio.scsi_status) { ...
>         case SCSI_STATUS_CMD_TERMINATED:
>         case SCSI_STATUS_CHECK_COND:
>                 error = camperiphscsisenseerror(ccb, orig_ccb,
> ...
> is the path. It might be profitable to use the debugger to trap here to see
> where we're
> kicking off this error recovery process.
> 

Ok, I set a breakpoint in ddb at camperiphdone.  Once ddb was in that
routine I set a breakpoint in camperiphscsisenseerror, but that was
never hit.  So, apparently that routine is not being entered and the
kernel paniced.

With kern.cam.da.enable_uma_ccbs=0 there is no panic.

Here's the kgdb backtrace with the -O0 kernel:

(kgdb) bt
#0  0xffffffff8081d706 in doadump (textdump=0)
    at /usr/src/sys/kern/kern_shutdown.c:398
#1  0xffffffff804ef15a in db_dump (dummy=-2138500043, dummy2=false, dummy3=-1,
    dummy4=0xfffffe00c62a11b0 "") at /usr/src/sys/ddb/db_command.c:575
#2  0xffffffff804eef5f in db_command (
    last_cmdp=0xffffffff8114d380 <db_last_command>, cmd_table=0x0, dopager=1)
    at /usr/src/sys/ddb/db_command.c:482
#3  0xffffffff804eeb38 in db_command_loop ()
    at /usr/src/sys/ddb/db_command.c:535
#4  0xffffffff804f38ef in db_trap (type=3, code=0)
    at /usr/src/sys/ddb/db_main.c:270
#5  0xffffffff80891d02 in kdb_trap (type=3, code=0, tf=0xfffffe00c62a1680)
    at /usr/src/sys/kern/subr_kdb.c:727
#6  0xffffffff80dd53c3 in trap (frame=0xfffffe00c62a1680)
    at /usr/src/sys/amd64/amd64/trap.c:604
#7  0xffffffff80dd6718 in trap_check (frame=0xfffffe00c62a1680)
    at /usr/src/sys/amd64/amd64/trap.c:664
#8  <signal handler called>
#9  breakpoint () at /usr/src/sys/amd64/include/cpufunc.h:66
#10 0xffffffff808910d0 in kdb_enter (why=0xffffffff80eaaf0b "panic",
    msg=0xffffffff80eaaf0b "panic") at /usr/src/sys/kern/subr_kdb.c:505
#11 0xffffffff8081dbfe in vpanic (
    fmt=0xffffffff80e80f73 "Duplicate free of %p from zone %p(%s) slab %p(%d)", ap=0xfffffe00c62a1850) at /usr/src/sys/kern/kern_shutdown.c:906
#12 0xffffffff8081d6b0 in panic (
    fmt=0xffffffff80e80f73 "Duplicate free of %p from zone %p(%s) slab %p(%d)")
    at /usr/src/sys/kern/kern_shutdown.c:843
#13 0xffffffff80caaec5 in uma_dbg_free (zone=0xfffffe00dc9d9800,
    slab=0xfffff80007ee0fd8, item=0xfffff80007ee0000)
    at /usr/src/sys/vm/uma_core.c:5664
#14 0xffffffff80c9faf5 in item_dtor (zone=0xfffffe00dc9d9800,
    item=0xfffff80007ee0000, size=544, udata=0x0, skip=SKIP_NONE)
    at /usr/src/sys/vm/uma_core.c:3418
#15 0xffffffff80c9eec7 in uma_zfree_arg (zone=0xfffffe00dc9d9800,
    item=0xfffff80007ee0000, udata=0x0) at /usr/src/sys/vm/uma_core.c:4374
#16 0xffffffff802e5a89 in uma_zfree (zone=0xfffffe00dc9d9800,
    item=0xfffff80007ee0000) at /usr/src/sys/vm/uma.h:404
#17 0xffffffff802dcfa6 in xpt_free_ccb (free_ccb=0xfffff80007ee0000)
    at /usr/src/sys/cam/cam_xpt.c:4674
#18 0xffffffff802db639 in camperiphdone (periph=0xfffff8005d68bd00,
    done_ccb=0xfffff80007797cc0) at /usr/src/sys/cam/cam_periph.c:1427
#19 0xffffffff802e59b6 in xpt_done_process (ccb_h=0xfffff80007797cc0)
    at /usr/src/sys/cam/cam_xpt.c:5491
#20 0xffffffff802e811e in xpt_done_td (arg=0xffffffff81143c00 <cam_doneqs>)
    at /usr/src/sys/cam/cam_xpt.c:5546
#21 0xffffffff807ac0ea in fork_exit (callout=0xffffffff802e7f20 <xpt_done_td>,
    arg=0xffffffff81143c00 <cam_doneqs>, frame=0xfffffe00c62a1c00)
    at /usr/src/sys/kern/kern_fork.c:1083
#22 <signal handler called>

-- 
Gary Jennejohn