Re: Dell Latitude 7400 - nvme0: Missing interrupt
- Reply: Warner Losh : "Re: Dell Latitude 7400 - nvme0: Missing interrupt"
- In reply to: Warner Losh : "Re: Dell Latitude 7400 - nvme0: Missing interrupt"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Wed, 13 Oct 2021 00:51:29 UTC
вт, 12 окт. 2021 г. в 13:56, Warner Losh <imp@bsdimp.com>: > One piece of data that would be good to have: > > nvmecontrol identify nvme0 > > There's an optional feature that none of my drives have, but that the > Linux driver does oddly > weird things when enabled. The output of that command will help me > understand if that may > be in play. Maybe we need to do oddly weird things too :) > > Warner > > On Sun, Oct 10, 2021 at 11:00 PM Warner Losh <imp@bsdimp.com> wrote: > >> >> >> On Sun, Oct 10, 2021 at 10:48 PM Pavel Timofeev <timp87@gmail.com> wrote: >> >>> сб, 9 окт. 2021 г. в 14:59, Warner Losh <imp@bsdimp.com>: >>> >>>> >>>> >>>> On Sat, Oct 9, 2021, 8:44 AM Pavel Timofeev <timp87@gmail.com> wrote: >>>> >>>>> >>>>> >>>>> пт, 8 окт. 2021 г. в 14:49, Warner Losh <imp@bsdimp.com>: >>>>> >>>>>> >>>>>> >>>>>> On Fri, Oct 8, 2021 at 2:42 PM Pavel Timofeev <timp87@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> сб, 21 авг. 2021 г. в 15:22, Warner Losh <imp@bsdimp.com>: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Sat, Aug 21, 2021 at 3:06 PM Pavel Timofeev <timp87@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Warner Losh <imp@bsdimp.com>: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Aug 20, 2021 at 10:42 PM Pavel Timofeev <timp87@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Pavel Timofeev <timp87@gmail.com>: >>>>>>>>>>> >>>>>>>>>>> > >>>>>>>>>>> > Chuck Tuffli <ctuffli@gmail.com>: >>>>>>>>>>> > >>>>>>>>>>> >> On Mon, Aug 16, 2021 at 7:43 PM Pavel Timofeev < >>>>>>>>>>> timp87@gmail.com> wrote: >>>>>>>>>>> >> > >>>>>>>>>>> >> > Hello >>>>>>>>>>> >> > I've got a Dell Latitude 7400 and tried installing the >>>>>>>>>>> latest >>>>>>>>>>> >> 14.0-CURRENT >>>>>>>>>>> >> > (main-n248636-d20e9e02db3) on it. >>>>>>>>>>> >> > Despite other things the weird one which concerns me is >>>>>>>>>>> >> > nvme0: Missing interrupt >>>>>>>>>>> >> > message I get sometimes on the console. >>>>>>>>>>> >> > It seems like I get it only after the reboot of the laptop, >>>>>>>>>>> i. e. not >>>>>>>>>>> >> > getting that message if I power cycle the laptop, at least >>>>>>>>>>> I haven't >>>>>>>>>>> >> seen >>>>>>>>>>> >> > them for now in such cases. >>>>>>>>>>> >> > So when the laptop is rebooted I can't even take advantage >>>>>>>>>>> of >>>>>>>>>>> >> > nvmecontrol(8) quickly. >>>>>>>>>>> >> > Well, it still works, but it takes tens of seconds to >>>>>>>>>>> return the output. >>>>>>>>>>> >> ... >>>>>>>>>>> >> > dmesg when power cycled - >>>>>>>>>>> >> > >>>>>>>>>>> https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ >>>>>>>>>>> >> > dmesg when rebooted - >>>>>>>>>>> >> > >>>>>>>>>>> https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh >>>>>>>>>>> >> >>>>>>>>>>> >> I'm sort of curious about the time stamps for the log >>>>>>>>>>> messages in the >>>>>>>>>>> >> failing case. Something like: >>>>>>>>>>> >> >>>>>>>>>>> >> $ grep "nv\(me\|d\)" /var/log/messages >>>>>>>>>>> >> >>>>>>>>>>> >> --chuck >>>>>>>>>>> >> >>>>>>>>>>> > >>>>>>>>>>> > Well, I can't see timestamps in the verbose boot log. Am I >>>>>>>>>>> missing some >>>>>>>>>>> > configuration for that? >>>>>>>>>>> > >>>>>>>>>>> > $ grep "nv\(me\|d\)" /var/log/messages >>>>>>>>>>> > nvme0: <Generic NVMe Device> mem >>>>>>>>>>> > >>>>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device >>>>>>>>>>> > 0.0 on pci6 >>>>>>>>>>> > nvme0: attempting to allocate 5 MSI-X vectors (17 supported) >>>>>>>>>>> > nvme0: using IRQs 133-137 for MSI-X >>>>>>>>>>> > nvme0: CapLo: 0x140103ff: MQES 1023, CQR, TO 20 >>>>>>>>>>> > nvme0: CapHi: 0x00000030: DSTRD 0, NSSRS, CSS 1, MPSMIN 0, >>>>>>>>>>> MPSMAX 0 >>>>>>>>>>> > nvme0: Version: 0x00010300: 1.3 >>>>>>>>>>> > nvme0: Missing interrupt >>>>>>>>>>> > nvme0: Missing interrupt >>>>>>>>>>> > nvme0: Missing interrupt >>>>>>>>>>> > nvme0: Missing interrupt >>>>>>>>>>> > nvme0: Missing interrupt >>>>>>>>>>> > nvme0: Missing interrupt >>>>>>>>>>> > nvme0: Missing interrupt >>>>>>>>>>> > nvme0: Missing interrupt >>>>>>>>>>> > nvme0: Missing interrupt >>>>>>>>>>> > nvme0: Missing interrupt >>>>>>>>>>> > nvme0: Missing interrupt >>>>>>>>>>> > nvme0: Missing interrupt >>>>>>>>>>> > nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace >>>>>>>>>>> > GEOM: new disk nvd0 >>>>>>>>>>> > nvd0: 488386MB (1000215216 512 byte sectors) >>>>>>>>>>> > >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Ah, sorry, provided wrong output. >>>>>>>>>>> Here is what you requested: >>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: <Generic NVMe Device> mem >>>>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff >>>>>>>>>>> at device >>>>>>>>>>> 0.0 on pci6 >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: attempting to allocate 5 >>>>>>>>>>> MSI-X >>>>>>>>>>> vectors (17 supported) >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: using IRQs 133-137 for >>>>>>>>>>> MSI-X >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapLo: 0x140103ff: MQES >>>>>>>>>>> 1023, CQR, >>>>>>>>>>> TO 20 >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapHi: 0x00000030: DSTRD >>>>>>>>>>> 0, NSSRS, >>>>>>>>>>> CSS 1, MPSMIN 0, MPSMAX 0 >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Version: 0x00010300: 1.3 >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: <PC611 NVMe SK hynix >>>>>>>>>>> 512GB> NVMe >>>>>>>>>>> namespace >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: GEOM: new disk nvd0 >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: 488386MB (1000215216 512 >>>>>>>>>>> byte >>>>>>>>>>> sectors) >>>>>>>>>>> Aug 21 04:34:42 nostromo kernel: nvme0: Missing interrupt >>>>>>>>>>> Aug 21 04:35:36 nostromo kernel: nvme0: Missing interrupt >>>>>>>>>>> Aug 21 04:35:50 nostromo kernel: nvme0: Missing interrupt >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> What happens if you set hw.nvme.use_nvd=0 and >>>>>>>>>> hw.cam.nda.nvd_compat=1 >>>>>>>>>> in the boot loader and reboot? Same thing except nda where nvd >>>>>>>>>> was? Or does >>>>>>>>>> it work? >>>>>>>>>> >>>>>>>>>> Something weird is going on in the interrupt assignment, I think, >>>>>>>>>> but I >>>>>>>>>> wanted to get any nvd vs nda issues out of the way first. >>>>>>>>>> >>>>>>>>>> Warner >>>>>>>>>> >>>>>>>>> >>>>>>>>> Do you mean kern.cam.nda.nvd_compat instead >>>>>>>>> of hw.cam.nda.nvd_compat? >>>>>>>>> kern.cam.nda.nvd_compat is 1 by default now. >>>>>>>>> >>>>>>>>> So I tried to set hw.nvme.use_nvd to 1 as suggested, but I still >>>>>>>>> see >>>>>>>>> nvme0: Missing interrupt >>>>>>>>> and now also >>>>>>>>> Root mount waiting for: CAM >>>>>>>>> messages besides those >>>>>>>>> >>>>>>>> >>>>>>>> OK. That all makes sense. I'd forgotten that nvd_compat=1 by >>>>>>>> default these >>>>>>>> days. >>>>>>>> >>>>>>>> I'll take a look on monday starting at the differences in interrupt >>>>>>>> assignment that >>>>>>>> are apparent when you cold boot vs reboot. >>>>>>>> >>>>>>>> Thanks for checking... I'd hoped this was a cheap fix, but also >>>>>>>> didn't really >>>>>>>> expect it to be. >>>>>>>> >>>>>>>> Warner >>>>>>>> >>>>>>>> >>>>>>> I've recently upgraded to main-n249974-17f790f49f5 and it got even >>>>>>> worse now. >>>>>>> So clean poweron works as before. >>>>>>> But if rebooted nvme drive refuses to work, while before the code >>>>>>> upgrade it was just complaining about missing interrupts. >>>>>>> >>>>>>> currently dmesg show this: >>>>>>> nvme0: <Generic NVMe Device> mem >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device >>>>>>> 0.0 on pci6 >>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace >>>>>>> nvd0: 488386MB (1000215216 512 byte sectors) >>>>>>> nvme0: <Generic NVMe Device> mem >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device >>>>>>> 0.0 on pci6 >>>>>>> >>>>>> >>>>>> Why is this showing up twice? Or is everything above this line left >>>>>> over from the first, working boot? >>>>>> >>>>>> >>>>>>> nvme0: RECOVERY_START 9585870784 vs 9367036288 >>>>>>> nvme0: timeout with nothing complete, resetting >>>>>>> nvme0: Resetting controller due to a timeout. >>>>>>> nvme0: RECOVERY_WAITING >>>>>>> nvme0: resetting controller >>>>>>> nvme0: aborting outstanding admin command >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 >>>>>>> cdw11:00000000 >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >>>>>>> nvme0: nvme_identify_controller failed! >>>>>>> nvme0: waiting >>>>>>> >>>>>> >>>>>> Clearly something bad is going on with the drive here... We looked >>>>>> into the completion queues when we didn't get an interrupt and there was >>>>>> nothing complete there.... >>>>>> >>>>>> The only thing I can think of is that this means there's a phase >>>>>> error between the drive and the system. I recently removed a second reset >>>>>> and made it an option NVME_2X_RESET. Can you see if adding >>>>>> 'options NVME_2X_RESET' to your kernel config fixes this? >>>>>> >>>>>> Warner >>>>>> >>>>>> >>>>>>> nvme0: <Generic NVMe Device> mem >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device >>>>>>> 0.0 on pci6 >>>>>>> nvme0: RECOVERY_START 9362778467 vs 9361830561 >>>>>>> nvme0: timeout with nothing complete, resetting >>>>>>> nvme0: Resetting controller due to a timeout. >>>>>>> nvme0: RECOVERY_WAITING >>>>>>> nvme0: resetting controller >>>>>>> nvme0: aborting outstanding admin command >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 >>>>>>> cdw11:00000000 >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >>>>>>> nvme0: nvme_identify_controller failed! >>>>>>> nvme0: waiting >>>>>>> >>>>>>> >>>>> >>>>> Sorry, it's showing twice due to multiple reboots. For one boot it's >>>>> like: >>>>> nvme0: <Generic NVMe Device> mem >>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device >>>>> 0.0 on pci6 >>>>> nvme0: RECOVERY_START 9633303481 vs 9365971423 >>>>> nvme0: timeout with nothing complete, resetting >>>>> nvme0: Resetting controller due to a timeout. >>>>> nvme0: RECOVERY_WAITING >>>>> nvme0: resetting controller >>>>> nvme0: aborting outstanding admin command >>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 cdw11:00000000 >>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >>>>> nvme0: nvme_identify_controller failed! >>>>> nvme0: waiting >>>>> >>>>> Well, neither Windows not Linux have any problems with the device. I >>>>> understand they may be hiding it or workaround somehow. >>>>> >>>> >>>> Yea, I'm trying to figure out why your machine is different than mine, >>>> and what Windows or Linux do that is different. It may be dodgy hardware, >>>> but others have no trouble... >>>> >>>> I'll try setting NVME_2X_RESET in the kernel config and report back in >>>>> a while. >>>>> >>>> >>>> Thanks. If it helps, that tells me something. If it doesn't, that tells >>>> me something else. >>>> >>>> I suspect that it is somewhere else in the system, tbh, but I need to >>>> find it systematically. >>>> >>>> Warner >>>> >>> >>> Surprisingly, setting NVME_2X_RESET in the kernel config hasn't changed >>> anything. I. e it didn't help. >>> >> >> While it would have been nice to have this be the fix, I'm not that >> surprised either. >> It was the biggest change of late, apart from the big re-arrangement that >> I'd done. >> >> So the other changes have moved from the occasional missing interrupt >> message >> (which the old code would get when a command wasn't completed in the >> timeout >> period, but that we found to be done when we scanned the completion >> queue. Now >> the device is detected fine (as before), but then doesn't do I/O at all >> (including not >> completing the identify command!) and is worse. This is unexpected and >> I'm trying >> understand what happens on reboot that 'changes'the working state and why >> the >> new code behaves so differently. >> >> Warner >> > Sure, here it is: Controller Capabilities/Features ================================ Vendor ID: 1c5c Subsystem Vendor ID: 1c5c Serial Number: ND03N46381010423H Model Number: PC611 NVMe SK hynix 512GB Firmware Version: 11000111 Recommended Arb Burst: 4 IEEE OUI Identifier: 2e e4 ac Multi-Path I/O Capabilities: Not Supported Max Data Transfer Size: 262144 bytes Sanitize Crypto Erase: Not Supported Sanitize Block Erase: Supported Sanitize Overwrite: Not Supported Sanitize NDI: Not Supported Sanitize NODMMAS: Undefined Controller ID: 0x0001 Version: 1.3.0 Admin Command Set Attributes ============================ Security Send/Receive: Supported Format NVM: Supported Firmware Activate/Download: Supported Namespace Management: Not Supported Device Self-test: Supported Directives: Not Supported NVMe-MI Send/Receive: Not Supported Virtualization Management: Not Supported Doorbell Buffer Config: Not Supported Get LBA Status: Not Supported Sanitize: block, Abort Command Limit: 4 Async Event Request Limit: 8 Number of Firmware Slots: 3 Firmware Slot 1 Read-Only: No Per-Namespace SMART Log: No Error Log Page Entries: 256 Number of Power States: 5 Total NVM Capacity: 0 bytes Unallocated NVM Capacity: 0 bytes Firmware Update Granularity: 00 (Not Reported) Host Buffer Preferred Size: 0 bytes Host Buffer Minimum Size: 0 bytes NVM Command Set Attributes ========================== Submission Queue Entry Size Max: 64 Min: 64 Completion Queue Entry Size Max: 16 Min: 16 Number of Namespaces: 1 Compare Command: Supported Write Uncorrectable Command: Supported Dataset Management Command: Supported Write Zeroes Command: Supported Save Features: Supported Reservations: Not Supported Timestamp feature: Supported Verify feature: Not Supported Fused Operation Support: Not Supported Format NVM Attributes: Per-NS Erase, Per-NS Format Volatile Write Cache: Present NVM Subsystem Name: