Re: Dell Latitude 7400 - nvme0: Missing interrupt

From: Pavel Timofeev <timp87_at_gmail.com>
Date: Wed, 13 Oct 2021 00:51:29 UTC
вт, 12 окт. 2021 г. в 13:56, Warner Losh <imp@bsdimp.com>:

> One piece of data that would be good to have:
>
> nvmecontrol identify nvme0
>
> There's an optional feature that none of my drives have, but that the
> Linux driver does oddly
> weird things when enabled. The output of that command will help me
> understand if that may
> be in play. Maybe we need to do oddly weird things too :)
>
> Warner
>
> On Sun, Oct 10, 2021 at 11:00 PM Warner Losh <imp@bsdimp.com> wrote:
>
>>
>>
>> On Sun, Oct 10, 2021 at 10:48 PM Pavel Timofeev <timp87@gmail.com> wrote:
>>
>>> сб, 9 окт. 2021 г. в 14:59, Warner Losh <imp@bsdimp.com>:
>>>
>>>>
>>>>
>>>> On Sat, Oct 9, 2021, 8:44 AM Pavel Timofeev <timp87@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> пт, 8 окт. 2021 г. в 14:49, Warner Losh <imp@bsdimp.com>:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 8, 2021 at 2:42 PM Pavel Timofeev <timp87@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> сб, 21 авг. 2021 г. в 15:22, Warner Losh <imp@bsdimp.com>:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Aug 21, 2021 at 3:06 PM Pavel Timofeev <timp87@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Warner Losh <imp@bsdimp.com>:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 20, 2021 at 10:42 PM Pavel Timofeev <timp87@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>  Pavel Timofeev <timp87@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>> >
>>>>>>>>>>> > Chuck Tuffli <ctuffli@gmail.com>:
>>>>>>>>>>> >
>>>>>>>>>>> >> On Mon, Aug 16, 2021 at 7:43 PM Pavel Timofeev <
>>>>>>>>>>> timp87@gmail.com> wrote:
>>>>>>>>>>> >> >
>>>>>>>>>>> >> > Hello
>>>>>>>>>>> >> > I've got a Dell Latitude 7400 and tried installing the
>>>>>>>>>>> latest
>>>>>>>>>>> >> 14.0-CURRENT
>>>>>>>>>>> >> > (main-n248636-d20e9e02db3) on it.
>>>>>>>>>>> >> > Despite other things the weird one which concerns me is
>>>>>>>>>>> >> >   nvme0: Missing interrupt
>>>>>>>>>>> >> > message I get sometimes on the console.
>>>>>>>>>>> >> > It seems like I get it only after the reboot of the laptop,
>>>>>>>>>>> i. e. not
>>>>>>>>>>> >> > getting that message if I power cycle the laptop, at least
>>>>>>>>>>> I haven't
>>>>>>>>>>> >> seen
>>>>>>>>>>> >> > them for now in such cases.
>>>>>>>>>>> >> > So when the laptop is rebooted I can't even take advantage
>>>>>>>>>>> of
>>>>>>>>>>> >> > nvmecontrol(8) quickly.
>>>>>>>>>>> >> > Well, it still works, but it takes tens of seconds to
>>>>>>>>>>> return the output.
>>>>>>>>>>> >> ...
>>>>>>>>>>> >> > dmesg when power cycled -
>>>>>>>>>>> >> >
>>>>>>>>>>> https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ
>>>>>>>>>>> >> > dmesg when rebooted -
>>>>>>>>>>> >> >
>>>>>>>>>>> https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh
>>>>>>>>>>> >>
>>>>>>>>>>> >> I'm sort of curious about the time stamps for the log
>>>>>>>>>>> messages in the
>>>>>>>>>>> >> failing case. Something like:
>>>>>>>>>>> >>
>>>>>>>>>>> >> $ grep "nv\(me\|d\)" /var/log/messages
>>>>>>>>>>> >>
>>>>>>>>>>> >> --chuck
>>>>>>>>>>> >>
>>>>>>>>>>> >
>>>>>>>>>>> > Well, I can't see timestamps in the verbose boot log. Am I
>>>>>>>>>>> missing some
>>>>>>>>>>> > configuration for that?
>>>>>>>>>>> >
>>>>>>>>>>> > $ grep "nv\(me\|d\)" /var/log/messages
>>>>>>>>>>> > nvme0: <Generic NVMe Device> mem
>>>>>>>>>>> >
>>>>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>>>>>>>> > 0.0 on pci6
>>>>>>>>>>> > nvme0: attempting to allocate 5 MSI-X vectors (17 supported)
>>>>>>>>>>> > nvme0: using IRQs 133-137 for MSI-X
>>>>>>>>>>> > nvme0: CapLo: 0x140103ff: MQES 1023, CQR, TO 20
>>>>>>>>>>> > nvme0: CapHi: 0x00000030: DSTRD 0, NSSRS, CSS 1, MPSMIN 0,
>>>>>>>>>>> MPSMAX 0
>>>>>>>>>>> > nvme0: Version: 0x00010300: 1.3
>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>> > nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace
>>>>>>>>>>> > GEOM: new disk nvd0
>>>>>>>>>>> > nvd0: 488386MB (1000215216 512 byte sectors)
>>>>>>>>>>> >
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Ah, sorry, provided wrong output.
>>>>>>>>>>> Here is what you requested:
>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: <Generic NVMe Device> mem
>>>>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff
>>>>>>>>>>> at device
>>>>>>>>>>> 0.0 on pci6
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: attempting to allocate 5
>>>>>>>>>>> MSI-X
>>>>>>>>>>> vectors (17 supported)
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: using IRQs 133-137 for
>>>>>>>>>>> MSI-X
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapLo: 0x140103ff: MQES
>>>>>>>>>>> 1023, CQR,
>>>>>>>>>>> TO 20
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapHi: 0x00000030: DSTRD
>>>>>>>>>>> 0, NSSRS,
>>>>>>>>>>> CSS 1, MPSMIN 0, MPSMAX 0
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Version: 0x00010300: 1.3
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: <PC611 NVMe SK hynix
>>>>>>>>>>> 512GB> NVMe
>>>>>>>>>>> namespace
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: GEOM: new disk nvd0
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: 488386MB (1000215216 512
>>>>>>>>>>> byte
>>>>>>>>>>> sectors)
>>>>>>>>>>> Aug 21 04:34:42 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>> Aug 21 04:35:36 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>> Aug 21 04:35:50 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What happens if you set hw.nvme.use_nvd=0 and
>>>>>>>>>> hw.cam.nda.nvd_compat=1
>>>>>>>>>> in the boot loader and reboot? Same thing except nda where nvd
>>>>>>>>>> was? Or does
>>>>>>>>>> it work?
>>>>>>>>>>
>>>>>>>>>> Something weird is going on in the interrupt assignment, I think,
>>>>>>>>>> but I
>>>>>>>>>> wanted to get any nvd vs nda issues out of the way first.
>>>>>>>>>>
>>>>>>>>>> Warner
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Do you mean kern.cam.nda.nvd_compat instead
>>>>>>>>> of hw.cam.nda.nvd_compat?
>>>>>>>>> kern.cam.nda.nvd_compat is 1 by default now.
>>>>>>>>>
>>>>>>>>> So I tried to set  hw.nvme.use_nvd to 1 as suggested, but I still
>>>>>>>>> see
>>>>>>>>>   nvme0: Missing interrupt
>>>>>>>>> and now also
>>>>>>>>>   Root mount waiting for: CAM
>>>>>>>>> messages besides those
>>>>>>>>>
>>>>>>>>
>>>>>>>> OK. That all makes sense. I'd forgotten that nvd_compat=1 by
>>>>>>>> default these
>>>>>>>> days.
>>>>>>>>
>>>>>>>> I'll take a look on monday starting at the differences in interrupt
>>>>>>>> assignment that
>>>>>>>> are apparent when you cold boot vs reboot.
>>>>>>>>
>>>>>>>> Thanks for checking... I'd hoped this was a cheap fix, but also
>>>>>>>> didn't really
>>>>>>>> expect it to be.
>>>>>>>>
>>>>>>>> Warner
>>>>>>>>
>>>>>>>>
>>>>>>> I've recently upgraded to main-n249974-17f790f49f5 and it got even
>>>>>>> worse now.
>>>>>>> So clean poweron works as before.
>>>>>>> But if rebooted nvme drive refuses to work, while before the code
>>>>>>> upgrade it was just complaining about missing interrupts.
>>>>>>>
>>>>>>> currently dmesg show this:
>>>>>>> nvme0: <Generic NVMe Device> mem
>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>>>> 0.0 on pci6
>>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace
>>>>>>> nvd0: 488386MB (1000215216 512 byte sectors)
>>>>>>> nvme0: <Generic NVMe Device> mem
>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>>>> 0.0 on pci6
>>>>>>>
>>>>>>
>>>>>> Why is this showing up twice? Or is everything above this line left
>>>>>> over from the first, working boot?
>>>>>>
>>>>>>
>>>>>>> nvme0: RECOVERY_START 9585870784 vs 9367036288
>>>>>>> nvme0: timeout with nothing complete, resetting
>>>>>>> nvme0: Resetting controller due to a timeout.
>>>>>>> nvme0: RECOVERY_WAITING
>>>>>>> nvme0: resetting controller
>>>>>>> nvme0: aborting outstanding admin command
>>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001
>>>>>>> cdw11:00000000
>>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>>>>>>> nvme0: nvme_identify_controller failed!
>>>>>>> nvme0: waiting
>>>>>>>
>>>>>>
>>>>>> Clearly something bad is going on with the drive here... We looked
>>>>>> into the completion queues when we didn't get an interrupt and there was
>>>>>> nothing complete there....
>>>>>>
>>>>>> The only thing I can think of is that this means there's a phase
>>>>>> error between the drive and the system. I recently removed a second reset
>>>>>> and made it an option NVME_2X_RESET. Can you see if adding
>>>>>> 'options NVME_2X_RESET' to your kernel config fixes this?
>>>>>>
>>>>>> Warner
>>>>>>
>>>>>>
>>>>>>> nvme0: <Generic NVMe Device> mem
>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>>>> 0.0 on pci6
>>>>>>> nvme0: RECOVERY_START 9362778467 vs 9361830561
>>>>>>> nvme0: timeout with nothing complete, resetting
>>>>>>> nvme0: Resetting controller due to a timeout.
>>>>>>> nvme0: RECOVERY_WAITING
>>>>>>> nvme0: resetting controller
>>>>>>> nvme0: aborting outstanding admin command
>>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001
>>>>>>> cdw11:00000000
>>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>>>>>>> nvme0: nvme_identify_controller failed!
>>>>>>> nvme0: waiting
>>>>>>>
>>>>>>>
>>>>>
>>>>> Sorry, it's showing twice due to multiple reboots. For one boot it's
>>>>> like:
>>>>> nvme0: <Generic NVMe Device> mem
>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>> 0.0 on pci6
>>>>> nvme0: RECOVERY_START 9633303481 vs 9365971423
>>>>> nvme0: timeout with nothing complete, resetting
>>>>> nvme0: Resetting controller due to a timeout.
>>>>> nvme0: RECOVERY_WAITING
>>>>> nvme0: resetting controller
>>>>> nvme0: aborting outstanding admin command
>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 cdw11:00000000
>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>>>>> nvme0: nvme_identify_controller failed!
>>>>> nvme0: waiting
>>>>>
>>>>> Well, neither Windows not Linux have any problems with the device. I
>>>>> understand they may be hiding it or workaround somehow.
>>>>>
>>>>
>>>> Yea, I'm trying to figure out why your machine is different than mine,
>>>> and what Windows or Linux do that is different. It may be dodgy hardware,
>>>> but others have no trouble...
>>>>
>>>> I'll try setting NVME_2X_RESET in the kernel config and report back in
>>>>> a while.
>>>>>
>>>>
>>>> Thanks. If it helps, that tells me something. If it doesn't, that tells
>>>> me something else.
>>>>
>>>> I suspect that it is somewhere else in the system, tbh, but I need to
>>>> find it systematically.
>>>>
>>>> Warner
>>>>
>>>
>>> Surprisingly, setting NVME_2X_RESET in the kernel config hasn't changed
>>> anything. I. e it didn't help.
>>>
>>
>> While it would have been nice to have this be the fix, I'm not that
>> surprised either.
>> It was the biggest change of late, apart from the big re-arrangement that
>> I'd done.
>>
>> So the other changes have moved from the occasional missing interrupt
>> message
>> (which the old code would get when a command wasn't completed in the
>> timeout
>> period, but that we found to be done when we scanned the completion
>> queue. Now
>> the device is detected fine (as before), but then doesn't do I/O at all
>> (including not
>> completing the identify command!) and is worse. This is unexpected and
>> I'm trying
>> understand what happens on reboot that 'changes'the working state and why
>> the
>> new code behaves so differently.
>>
>> Warner
>>
>

Sure, here it is:
Controller Capabilities/Features
================================
Vendor ID:                   1c5c
Subsystem Vendor ID:         1c5c
Serial Number:               ND03N46381010423H
Model Number:                PC611 NVMe SK hynix 512GB
Firmware Version:            11000111
Recommended Arb Burst:       4
IEEE OUI Identifier:         2e e4 ac
Multi-Path I/O Capabilities: Not Supported
Max Data Transfer Size:      262144 bytes
Sanitize Crypto Erase:       Not Supported
Sanitize Block Erase:        Supported
Sanitize Overwrite:          Not Supported
Sanitize NDI:                Not Supported
Sanitize NODMMAS:            Undefined
Controller ID:               0x0001
Version:                     1.3.0

Admin Command Set Attributes
============================
Security Send/Receive:       Supported
Format NVM:                  Supported
Firmware Activate/Download:  Supported
Namespace Management:        Not Supported
Device Self-test:            Supported
Directives:                  Not Supported
NVMe-MI Send/Receive:        Not Supported
Virtualization Management:   Not Supported
Doorbell Buffer Config:      Not Supported
Get LBA Status:              Not Supported
Sanitize:                    block,
Abort Command Limit:         4
Async Event Request Limit:   8
Number of Firmware Slots:    3
Firmware Slot 1 Read-Only:   No
Per-Namespace SMART Log:     No
Error Log Page Entries:      256
Number of Power States:      5
Total NVM Capacity:          0 bytes
Unallocated NVM Capacity:    0 bytes
Firmware Update Granularity: 00 (Not Reported)
Host Buffer Preferred Size:  0 bytes
Host Buffer Minimum Size:    0 bytes

NVM Command Set Attributes
==========================
Submission Queue Entry Size
  Max:                       64
  Min:                       64
Completion Queue Entry Size
  Max:                       16
  Min:                       16
Number of Namespaces:        1
Compare Command:             Supported
Write Uncorrectable Command: Supported
Dataset Management Command:  Supported
Write Zeroes Command:        Supported
Save Features:               Supported
Reservations:                Not Supported
Timestamp feature:           Supported
Verify feature:              Not Supported
Fused Operation Support:     Not Supported
Format NVM Attributes:       Per-NS Erase, Per-NS Format
Volatile Write Cache:        Present

NVM Subsystem Name: