Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot
- Reply: Mark Millard via freebsd-current : "Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot"
- In reply to: Mark Millard via freebsd-current : "Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Sun, 23 May 2021 07:46:28 UTC
On 2021-May-23, at 00:08, Mark Millard via freebsd-current <freebsd-current at freebsd.org> wrote: > On 2021-May-22, at 22:16, Warner Losh <imp at bsdimp.com> wrote: > >> On Sat, May 22, 2021 at 10:44 PM Mark Millard via freebsd-arm <freebsd-arm@freebsd.org> wrote: >> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/ >> # diff -r /usr/ports/ /mnt/ | more >> nvme0: cpl does not map to outstanding cmd >> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0 >> panic: received completion for unknown cmd >> >> cid 0x7e has no currently active command. The cid is used by the driver >> to map completions back to requests. >> >> So, there's usually 3 possibilities that I've seen this with. >> >> (1) There's a missing cache flush so you get a bogus cpl back because something stale >> was read. It's unlikely to be this one because the rest of this look like a successful >> command completed: sc = 0 is successful completion and sct is a generic command queued. >> >> (2) We're looking at the completion record twice because we failed to properly update the >> head pointer and we've already completed the command. I've only ever seen this in a >> panic situation where we interrupt the completion routine because something else >> paniced. >> >> (3) There's something that's corrupting the act_tr array in the qpair. I've not seen this, >> but if something else smashes that area (zeroing it in this case), then that could cause >> an error like this. > > Of note may be that I buildworld and buildkernel with extra > tuning enabled, targeting the cortex-a72. In one past example > this lead to finding a missing synchronization related to XHCI > handling that was fixed. (The fix was not aarch64 specific at > all.) For that: A cortex-a53 did not show the problem with or > without that tuning. A cortex-a72 showed the problem only with > the cortex-a72 tuning, not with targeting a cortex-a53 tuning > or generic armv7, for example. > > Not that I've any evidence specifically suggesting such would > be involved here. But it might be good to keep in mind as a > possaibility. > >> Or it could be something new I've not seen nor thought about before. >> >> cpuid = 3 >> time = 1621743752 >> KDB: stack backtrace: >> db_trace_self() at db_trace_self >> db_trace_self_wrapper() at db_trace_self_wrapper+0x30 >> vpanic() at vpanic+0x188 >> panic() at panic+0x44 >> nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc >> nvme_timeout() at nvme_timeout+0x3c >> softclock_call_cc() at softclock_call_cc+0x124 >> softclock() at softclock+0x60 >> ithread_loop() at ithread_loop+0x2a8 >> fork_exit() at fork_exit+0x74 >> fork_trampoline() at fork_trampoline+0x14 >> KDB: enter: panic >> [ thread pid 12 tid 100028 ] >> Stopped at kdb_enter+0x48: undefined f904411f >> db> >> >> Based on the "nvme" references, I expect this is tied to >> handling the Optane 480 GiByte that is in the PCIe slot >> and is the boot/only media for the machine doing the diff. >> >> "db> dump" seems to have worked. >> >> After the reboot, zpool scrub found no errors. >> >> For reference: >> >> # uname -apKU >> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021 root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72 arm64 aarch64 1400013 1400013 >> >> If you have the dump, I suggest starting to make sure that the act_tr array looks sane. Make >> sure all the live pointers point to a sane looking tr. Make sure that tr is on the active list, etc >> >> It will take a fair amount of driver reading, though, to see how we got here. I'd also check to >> make sure that qpair->num_enttries > cpl.cid (0x3e in this case). >> > > Okay. I got this while trying to test an odd diff -r over NFS > issue with the more recent software. So the two will potentially > compete for time. > > As investigation will be exploratory for me, not familiar, I'll > probably publish periodic notes on things as I go along looking > at stuff. > > My first is that the /var/crash/core.txt.0 has a gdb backtrace: > > . . . > #10 0xffff00000047900c in panic ( > fmt=0x12 <error: Cannot access memory at address 0x12>) > at /usr/main-src/sys/kern/kern_shutdown.c:843 > #11 0xffff0000002226b4 in nvme_qpair_process_completions ( > qpair=qpair@entry=0xffffa00008724300) > at /usr/main-src/sys/dev/nvme/nvme_qpair.c:617 > #12 0xffff000000223354 in nvme_timeout (arg=arg@entry=0xffffa0000b053980) > at /usr/main-src/sys/dev/nvme/nvme_qpair.c:938 > #13 0xffff000000495bf8 in softclock_call_cc (c=0xffffa0000b0539a0, > cc=cc@entry=0xffff000000de3500 <cc_cpu+768>, direct=0) > at /usr/main-src/sys/kern/kern_timeout.c:696 > #14 0xffff000000495fb0 in softclock (arg=0xffff000000de3500 <cc_cpu+768>) > at /usr/main-src/sys/kern/kern_timeout.c:816 > #15 0xffff0000004356dc in intr_event_execute_handlers (p=<optimized out>, > ie=0xffffa000058bc700) at /usr/main-src/sys/kern/kern_intr.c:1168 > #16 ithread_execute_handlers (p=<optimized out>, ie=0xffffa000058bc700) > at /usr/main-src/sys/kern/kern_intr.c:1181 > #17 ithread_loop (arg=<optimized out>, arg@entry=0xffffa000058aef60) > at /usr/main-src/sys/kern/kern_intr.c:1269 > #18 0xffff000000431f6c in fork_exit ( > callout=0xffff000000435430 <ithread_loop>, arg=0xffffa000058aef60, > frame=0xffff0000eb7cc990) at /usr/main-src/sys/kern/kern_fork.c:1083 > #19 <signal handler called> > > So via kgdb . . . > > (kgdb) up 11 > #11 0xffff0000002226b4 in nvme_qpair_process_completions (qpair=qpair@entry=0xffffa00008724300) at /usr/main-src/sys/dev/nvme/nvme_qpair.c:617 > 617 KASSERT(0, ("received completion for unknown cmd")); > > (kgdb) print/x cpl.cid > $4 = 0x7e > (kgdb) print/x qpair->num_entries > $5 = 0x100 > > Based on also seeing the code: > > qpair->act_tr = malloc_domainset(sizeof(struct nvme_tracker *) * > qpair->num_entries, M_NVME, DOMAINSET_PREF(qpair->domain), > M_ZERO | M_WAITOK); > > (kgdb) print qpair->act_tr > $6 = (struct nvme_tracker **) 0xffffa00008725800 > (kgdb) x/256g 0xffffa00008725800 > 0xffffa00008725800: 0x0000000000000000 0x0000000000000000 > 0xffffa00008725810: 0x0000000000000000 0x0000000000000000 > . . . > 0xffffa00008725fe0: 0x0000000000000000 0x0000000000000000 > 0xffffa00008725ff0: 0x0000000000000000 0x0000000000000000 > > It was all zeros (null pointers). No "live" pointers and, so, > no tr's to inspect. > > As none of this is familiar context beyond general programming > concepts, it may be some time before I find anything else > potentially of interest to report. If you have other specific > things you would like me to look at, let me know. > A fairly obvious thing I should have provided: (kgdb) print/x *qpair $15 = {ctrlr = 0xffff0000fe154000, id = 0x3, domain = 0x0, cpu = 0x2, vector = 0x3, rid = 0x4, res = 0xffffa000086ded80, tag = 0xffffa0000877b780, num_entries = 0x100, num_trackers = 0x80, sq_tdbl_off = 0x1018, cq_hdbl_off = 0x101c, phase = 0x1, sq_head = 0x1f, sq_tail = 0x20, cq_head = 0x20, num_cmds = 0x420, num_intr_handler_calls = 0xe66c, num_retries = 0x0, num_failures = 0x0, cmd = 0xffff000100ebb000, cpl = 0xffff000100ebf000, dma_tag = 0xffffa0000b093e00, dma_tag_payload = 0xffffa000059ef000, queuemem_map = 0xffffa00005a07700, cmd_bus_addr = 0xacbb000, cpl_bus_addr = 0xacbf000, free_tr = {tqh_first = 0xffffa0000b053a80, tqh_last = 0xffffa0000869da80}, outstanding_tr = {tqh_first = 0xffffa0000b053980, tqh_last = 0xffffa0000b053980}, queued_req = { stqh_first = 0x0, stqh_last = 0xffffa000087243c8}, act_tr = 0xffffa00008725800, is_enabled = 0x1, lock = {lock_object = {lo_name = 0xffff00000090321f, lo_flags = 0x1030000, lo_data = 0x0, lo_witness = 0xffffa0043fd96080}, mtx_lock = 0x0}} Looks like I need to boot into the non-debug builds for the other problem I'm testing for repeatability after a commit. === Mark Millard marklmi at yahoo.com ( dsl-only.net went away in early 2018-Mar)