Re: nvme timeout issues with hardware and bhyve vm's

From: Warner Losh <imp_at_bsdimp.com>
Date: Thu, 07 Dec 2023 23:59:08 UTC
On Thu, Dec 7, 2023 at 4:09 PM Tomoaki AOKI <junchoon@dec.sakura.ne.jp>
wrote:

> On Thu, 7 Dec 2023 14:38:37 -0800
> Pete Wright <pete@nomadlogic.org> wrote:
>
> >
> >
> > On 10/13/23 7:34 PM, Warner Losh wrote:
> > >
> >
> > >
> > >     the messages i posted in the start of the thread are from the VM
> > >     itself (13.2-RELEASE).  The zpool on the hypervisor (13.2-RELEASE)
> > >     showed no such issues.
> > >
> > >     Based on your comment about the improvements in 14 I'll focus my
> > >     efforts on my workstation, it seemed to happen regularly so
> > >     hopefully i can find a repro case.
> > >
> > >
> > > Let me know if you see similar messages in stable/14. I think I've
> > > fixed all the issues with timeouts, though you shouldn't ever see
> > > them in a vm setup unless something else weird is going on.
> > >
> >
> >
> > Hi Warner, just resurfacing this thread because I've had a few lockups
> > on my workstation running 14.0-STABLE.  I was able to capture a photo of
> > the hang and this seems to be the most important line:
> >
> > nvme0: Resetting controller due to a timeout and possible hot unplug.
> >
> > When I scan the device after reboot I don't see any errors, but if there
> > is a particular thing I should check via nvmecontrol please let me know.
> > Also, since it mentions possible hot unplug I wonder if this is
> > hardware/firmware related to my system?
> >
> > Anyway, haven't found a repro case yet but it has locked up a few times
> > the past two weeks.
> >
> > -pete
> >
> >
> > --
> > Pete Wright
> > pete@nomadlogic.org
>
> If I myself encounter this kind of problem ON BARE METAL HARDWARE,
> I would usually suspect
>
>  *Overheating caused hang of NVMe controller or PCI bridge on SSD, or
>

Yes. Most drives' firmware resets when the drive overheats. There might be
something that the PCI code can do when this happens to retrain the link,
reprogram the config registers, etc.


>  *Unstable physical connection (bad contact)
>

Yeah, a hot-plug controller is required for this, but the connection will
be bouncing.

Warner