Re: Dell Latitude 7400 - nvme0: Missing interrupt
- In reply to: Pavel Timofeev : "Re: Dell Latitude 7400 - nvme0: Missing interrupt"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Fri, 14 Jan 2022 18:53:06 UTC
вс, 17 окт. 2021 г. в 17:52, Pavel Timofeev <timp87@gmail.com>: > > > вс, 17 окт. 2021 г. в 11:19, Alexander Motin <mavbsd@gmail.com>: > >> It may be a noise, but comparing logs I see in reboot case also >> "acpi_ec0: not getting interrupts, switched to polled mode". I am >> thinking whether the problem may be caused not by SSD, but by some >> resource conflict/misconfiguration in the system. Pavel, can you >> compare `devinfo -vr` and `lspci -vvvvv` in both cases. looking for any >> differences? Are you running the latest BIOS? >> >> On 12.10.2021 15:56, Warner Losh wrote: >> > One piece of data that would be good to have: >> > >> > nvmecontrol identify nvme0 >> > >> > There's an optional feature that none of my drives have, but that the >> Linux >> > driver does oddly >> > weird things when enabled. The output of that command will help me >> > understand if that may >> > be in play. Maybe we need to do oddly weird things too :) >> > >> > Warner >> > >> > On Sun, Oct 10, 2021 at 11:00 PM Warner Losh <imp@bsdimp.com> wrote: >> > >> >> >> >> >> >> On Sun, Oct 10, 2021 at 10:48 PM Pavel Timofeev <timp87@gmail.com> >> wrote: >> >> >> >>> сб, 9 окт. 2021 г. в 14:59, Warner Losh <imp@bsdimp.com>: >> >>> >> >>>> >> >>>> >> >>>> On Sat, Oct 9, 2021, 8:44 AM Pavel Timofeev <timp87@gmail.com> >> wrote: >> >>>> >> >>>>> >> >>>>> >> >>>>> пт, 8 окт. 2021 г. в 14:49, Warner Losh <imp@bsdimp.com>: >> >>>>> >> >>>>>> >> >>>>>> >> >>>>>> On Fri, Oct 8, 2021 at 2:42 PM Pavel Timofeev <timp87@gmail.com> >> >>>>>> wrote: >> >>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> сб, 21 авг. 2021 г. в 15:22, Warner Losh <imp@bsdimp.com>: >> >>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> On Sat, Aug 21, 2021 at 3:06 PM Pavel Timofeev <timp87@gmail.com >> > >> >>>>>>>> wrote: >> >>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Warner Losh <imp@bsdimp.com>: >> >>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> On Fri, Aug 20, 2021 at 10:42 PM Pavel Timofeev < >> timp87@gmail.com> >> >>>>>>>>>> wrote: >> >>>>>>>>>> >> >>>>>>>>>>> Pavel Timofeev <timp87@gmail.com>: >> >>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> Chuck Tuffli <ctuffli@gmail.com>: >> >>>>>>>>>>>> >> >>>>>>>>>>>>> On Mon, Aug 16, 2021 at 7:43 PM Pavel Timofeev < >> >>>>>>>>>>> timp87@gmail.com> wrote: >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Hello >> >>>>>>>>>>>>>> I've got a Dell Latitude 7400 and tried installing the >> latest >> >>>>>>>>>>>>> 14.0-CURRENT >> >>>>>>>>>>>>>> (main-n248636-d20e9e02db3) on it. >> >>>>>>>>>>>>>> Despite other things the weird one which concerns me is >> >>>>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>>>> message I get sometimes on the console. >> >>>>>>>>>>>>>> It seems like I get it only after the reboot of the laptop, >> >>>>>>>>>>> i. e. not >> >>>>>>>>>>>>>> getting that message if I power cycle the laptop, at least >> I >> >>>>>>>>>>> haven't >> >>>>>>>>>>>>> seen >> >>>>>>>>>>>>>> them for now in such cases. >> >>>>>>>>>>>>>> So when the laptop is rebooted I can't even take advantage >> of >> >>>>>>>>>>>>>> nvmecontrol(8) quickly. >> >>>>>>>>>>>>>> Well, it still works, but it takes tens of seconds to >> return >> >>>>>>>>>>> the output. >> >>>>>>>>>>>>> ... >> >>>>>>>>>>>>>> dmesg when power cycled - >> >>>>>>>>>>>>>> >> >>>>>>>>>>> >> https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ >> >>>>>>>>>>>>>> dmesg when rebooted - >> >>>>>>>>>>>>>> >> >>>>>>>>>>> >> https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> I'm sort of curious about the time stamps for the log >> messages >> >>>>>>>>>>> in the >> >>>>>>>>>>>>> failing case. Something like: >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> --chuck >> >>>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> Well, I can't see timestamps in the verbose boot log. Am I >> >>>>>>>>>>> missing some >> >>>>>>>>>>>> configuration for that? >> >>>>>>>>>>>> >> >>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages >> >>>>>>>>>>>> nvme0: <Generic NVMe Device> mem >> >>>>>>>>>>>> >> >>>>>>>>>>> >> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device >> >>>>>>>>>>>> 0.0 on pci6 >> >>>>>>>>>>>> nvme0: attempting to allocate 5 MSI-X vectors (17 supported) >> >>>>>>>>>>>> nvme0: using IRQs 133-137 for MSI-X >> >>>>>>>>>>>> nvme0: CapLo: 0x140103ff: MQES 1023, CQR, TO 20 >> >>>>>>>>>>>> nvme0: CapHi: 0x00000030: DSTRD 0, NSSRS, CSS 1, MPSMIN 0, >> >>>>>>>>>>> MPSMAX 0 >> >>>>>>>>>>>> nvme0: Version: 0x00010300: 1.3 >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace >> >>>>>>>>>>>> GEOM: new disk nvd0 >> >>>>>>>>>>>> nvd0: 488386MB (1000215216 512 byte sectors) >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> Ah, sorry, provided wrong output. >> >>>>>>>>>>> Here is what you requested: >> >>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: <Generic NVMe Device> >> mem >> >>>>>>>>>>> >> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff >> >>>>>>>>>>> at device >> >>>>>>>>>>> 0.0 on pci6 >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: attempting to >> allocate 5 >> >>>>>>>>>>> MSI-X >> >>>>>>>>>>> vectors (17 supported) >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: using IRQs 133-137 for >> >>>>>>>>>>> MSI-X >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapLo: 0x140103ff: >> MQES >> >>>>>>>>>>> 1023, CQR, >> >>>>>>>>>>> TO 20 >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapHi: 0x00000030: >> DSTRD >> >>>>>>>>>>> 0, NSSRS, >> >>>>>>>>>>> CSS 1, MPSMIN 0, MPSMAX 0 >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Version: 0x00010300: >> 1.3 >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: <PC611 NVMe SK hynix >> >>>>>>>>>>> 512GB> NVMe >> >>>>>>>>>>> namespace >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: GEOM: new disk nvd0 >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: 488386MB (1000215216 >> 512 >> >>>>>>>>>>> byte >> >>>>>>>>>>> sectors) >> >>>>>>>>>>> Aug 21 04:34:42 nostromo kernel: nvme0: Missing interrupt >> >>>>>>>>>>> Aug 21 04:35:36 nostromo kernel: nvme0: Missing interrupt >> >>>>>>>>>>> Aug 21 04:35:50 nostromo kernel: nvme0: Missing interrupt >> >>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> What happens if you set hw.nvme.use_nvd=0 and >> >>>>>>>>>> hw.cam.nda.nvd_compat=1 >> >>>>>>>>>> in the boot loader and reboot? Same thing except nda where nvd >> >>>>>>>>>> was? Or does >> >>>>>>>>>> it work? >> >>>>>>>>>> >> >>>>>>>>>> Something weird is going on in the interrupt assignment, I >> think, >> >>>>>>>>>> but I >> >>>>>>>>>> wanted to get any nvd vs nda issues out of the way first. >> >>>>>>>>>> >> >>>>>>>>>> Warner >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Do you mean kern.cam.nda.nvd_compat instead >> >>>>>>>>> of hw.cam.nda.nvd_compat? >> >>>>>>>>> kern.cam.nda.nvd_compat is 1 by default now. >> >>>>>>>>> >> >>>>>>>>> So I tried to set hw.nvme.use_nvd to 1 as suggested, but I >> still >> >>>>>>>>> see >> >>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>> and now also >> >>>>>>>>> Root mount waiting for: CAM >> >>>>>>>>> messages besides those >> >>>>>>>>> >> >>>>>>>> >> >>>>>>>> OK. That all makes sense. I'd forgotten that nvd_compat=1 by >> default >> >>>>>>>> these >> >>>>>>>> days. >> >>>>>>>> >> >>>>>>>> I'll take a look on monday starting at the differences in >> interrupt >> >>>>>>>> assignment that >> >>>>>>>> are apparent when you cold boot vs reboot. >> >>>>>>>> >> >>>>>>>> Thanks for checking... I'd hoped this was a cheap fix, but also >> >>>>>>>> didn't really >> >>>>>>>> expect it to be. >> >>>>>>>> >> >>>>>>>> Warner >> >>>>>>>> >> >>>>>>>> >> >>>>>>> I've recently upgraded to main-n249974-17f790f49f5 and it got even >> >>>>>>> worse now. >> >>>>>>> So clean poweron works as before. >> >>>>>>> But if rebooted nvme drive refuses to work, while before the code >> >>>>>>> upgrade it was just complaining about missing interrupts. >> >>>>>>> >> >>>>>>> currently dmesg show this: >> >>>>>>> nvme0: <Generic NVMe Device> mem >> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff >> at device >> >>>>>>> 0.0 on pci6 >> >>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace >> >>>>>>> nvd0: 488386MB (1000215216 512 byte sectors) >> >>>>>>> nvme0: <Generic NVMe Device> mem >> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff >> at device >> >>>>>>> 0.0 on pci6 >> >>>>>>> >> >>>>>> >> >>>>>> Why is this showing up twice? Or is everything above this line left >> >>>>>> over from the first, working boot? >> >>>>>> >> >>>>>> >> >>>>>>> nvme0: RECOVERY_START 9585870784 vs 9367036288 >> >>>>>>> nvme0: timeout with nothing complete, resetting >> >>>>>>> nvme0: Resetting controller due to a timeout. >> >>>>>>> nvme0: RECOVERY_WAITING >> >>>>>>> nvme0: resetting controller >> >>>>>>> nvme0: aborting outstanding admin command >> >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 >> >>>>>>> cdw11:00000000 >> >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >> >>>>>>> nvme0: nvme_identify_controller failed! >> >>>>>>> nvme0: waiting >> >>>>>>> >> >>>>>> >> >>>>>> Clearly something bad is going on with the drive here... We looked >> >>>>>> into the completion queues when we didn't get an interrupt and >> there was >> >>>>>> nothing complete there.... >> >>>>>> >> >>>>>> The only thing I can think of is that this means there's a phase >> error >> >>>>>> between the drive and the system. I recently removed a second >> reset and >> >>>>>> made it an option NVME_2X_RESET. Can you see if adding >> >>>>>> 'options NVME_2X_RESET' to your kernel config fixes this? >> >>>>>> >> >>>>>> Warner >> >>>>>> >> >>>>>> >> >>>>>>> nvme0: <Generic NVMe Device> mem >> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff >> at device >> >>>>>>> 0.0 on pci6 >> >>>>>>> nvme0: RECOVERY_START 9362778467 vs 9361830561 >> >>>>>>> nvme0: timeout with nothing complete, resetting >> >>>>>>> nvme0: Resetting controller due to a timeout. >> >>>>>>> nvme0: RECOVERY_WAITING >> >>>>>>> nvme0: resetting controller >> >>>>>>> nvme0: aborting outstanding admin command >> >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 >> >>>>>>> cdw11:00000000 >> >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >> >>>>>>> nvme0: nvme_identify_controller failed! >> >>>>>>> nvme0: waiting >> >>>>>>> >> >>>>>>> >> >>>>> >> >>>>> Sorry, it's showing twice due to multiple reboots. For one boot it's >> >>>>> like: >> >>>>> nvme0: <Generic NVMe Device> mem >> >>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff >> at device >> >>>>> 0.0 on pci6 >> >>>>> nvme0: RECOVERY_START 9633303481 vs 9365971423 >> >>>>> nvme0: timeout with nothing complete, resetting >> >>>>> nvme0: Resetting controller due to a timeout. >> >>>>> nvme0: RECOVERY_WAITING >> >>>>> nvme0: resetting controller >> >>>>> nvme0: aborting outstanding admin command >> >>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 >> cdw11:00000000 >> >>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >> >>>>> nvme0: nvme_identify_controller failed! >> >>>>> nvme0: waiting >> >>>>> >> >>>>> Well, neither Windows not Linux have any problems with the device. I >> >>>>> understand they may be hiding it or workaround somehow. >> >>>>> >> >>>> >> >>>> Yea, I'm trying to figure out why your machine is different than >> mine, >> >>>> and what Windows or Linux do that is different. It may be dodgy >> hardware, >> >>>> but others have no trouble... >> >>>> >> >>>> I'll try setting NVME_2X_RESET in the kernel config and report back >> in a >> >>>>> while. >> >>>>> >> >>>> >> >>>> Thanks. If it helps, that tells me something. If it doesn't, that >> tells >> >>>> me something else. >> >>>> >> >>>> I suspect that it is somewhere else in the system, tbh, but I need to >> >>>> find it systematically. >> >>>> >> >>>> Warner >> >>>> >> >>> >> >>> Surprisingly, setting NVME_2X_RESET in the kernel config hasn't >> changed >> >>> anything. I. e it didn't help. >> >>> >> >> >> >> While it would have been nice to have this be the fix, I'm not that >> >> surprised either. >> >> It was the biggest change of late, apart from the big re-arrangement >> that >> >> I'd done. >> >> >> >> So the other changes have moved from the occasional missing interrupt >> >> message >> >> (which the old code would get when a command wasn't completed in the >> >> timeout >> >> period, but that we found to be done when we scanned the completion >> queue. >> >> Now >> >> the device is detected fine (as before), but then doesn't do I/O at all >> >> (including not >> >> completing the identify command!) and is worse. This is unexpected and >> I'm >> >> trying >> >> understand what happens on reboot that 'changes'the working state and >> why >> >> the >> >> new code behaves so differently. >> >> >> >> Warner >> >> >> > >> >> -- >> Alexander Motin >> > > > Thanks for the reply. > It's using the latest firmware. This is the first thing I do in such case. > > > Attaching devinfo and lspci output. > These are diffs showing the difference between clean boot and a reboot: > > $ diff -u devinfo.ok devinfo.nok > --- devinfo.ok 2021-10-17 17:48:07.964346000 -0600 > +++ devinfo.nok 2021-10-17 17:48:07.886881000 -0600 > @@ -214,10 +214,6 @@ > nvme0 pnpinfo vendor=0x1c5c device=0x1639 subvendor=0x1c5c > subdevice=0x1639 class=0x010802 at slot=0 function=0 dbsf=pci0:59:0:0 > handle=\_SB_.PCI0.RP13.PXSX > Interrupt request lines: > 0x85 > - 0x86 > - 0x87 > - 0x88 > - 0x89 > pcib7 memory window: > 0xcc100000-0xcc103fff > 0xcc104000-0xcc104fff > > $ diff -u lspci.ok lspci.nok > --- lspci.ok 2021-10-17 17:48:15.894470000 -0600 > +++ lspci.nok 2021-10-17 17:48:15.341379000 -0600 > @@ -132,7 +132,7 @@ > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA > PME(D0+,D1-,D2-,D3hot+,D3cold+) > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- > Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+ > - Address: 00000000fee06000 Data: 0033 > + Address: 00000000fee06000 Data: 0034 > Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00 > DevCap: MaxPayload 128 bytes, PhantFunc 0 > ExtTag- RBE- FLReset+ > > Hi, I hope everyone is doing well. So several BIOS updates passed, now the BIOS version 1.15.1, but it works the same. At least on CURRENT built several days ago (main-n252414-0e8181c0123). What is interesting iwn(4) and iwlwifi(4) work the same way - only full power cycle makes wifi functional, simple reboot brakes it in most cases. Does anybody have any idea?