From: Warner Losh
Date: Thu, 7 Dec 2023 15:49:03 -0700
Subject: Re: nvme timeout issues with hardware and bhyve vm's
To: Pete Wright
Cc: FreeBSD Current
List-Archive: https://lists.freebsd.org/archives/freebsd-current

On Thu, Dec 7, 2023 at 3:38 PM Pete Wright <pete@nomadlogic.org> wrote:

> On 10/13/23 7:34 PM, Warner Losh wrote:
> >
> > > The messages I posted at the start of the thread are from the VM
> > > itself (13.2-RELEASE). The zpool on the hypervisor (13.2-RELEASE)
> > > showed no such issues.
> > >
> > > Based on your comment about the improvements in 14, I'll focus my
> > > efforts on my workstation. It seemed to happen regularly, so
> > > hopefully I can find a repro case.
> >
> > Let me know if you see similar messages in stable/14. I think I've
> > fixed all the issues with timeouts, though you shouldn't ever see
> > them in a VM setup unless something else weird is going on.
>
> Hi Warner, just resurfacing this thread because I've had a few lockups
> on my workstation running 14.0-STABLE. I was able to capture a photo of
> the hang, and this seems to be the most important line:
>
> nvme0: Resetting controller due to a timeout and possible hot unplug.
>
> When I scan the device after reboot I don't see any errors, but if there
> is a particular thing I should check via nvmecontrol, please let me know.
> Also, since it mentions a possible hot unplug, I wonder if this is
> hardware/firmware related to my system?
>
> Anyway, I haven't found a repro case yet, but it has locked up a few
> times in the past two weeks.

What the message means is that (a) we stopped getting interrupts from the
device and (b) when we went to check on the status of the device, it read
back as if the hardware were missing.

So is this from inside the VM running under bhyve, or on the host that's
running the VM? The next steps differ depending on which it is.

Warner
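
For readers following along, here is a minimal sketch of the two-part check described in (a) and (b) above. It is an illustration under stated assumptions, not the FreeBSD nvme driver's code (which is C): the helper name `classify_timeout` and the returned strings are hypothetical. It assumes only two general facts: a register read from a PCIe device that is no longer answering typically returns all ones, and bit 1 of the NVMe controller status (CSTS) register is the Controller Fatal Status flag.

```python
# Illustrative sketch only; not the FreeBSD nvme(4) driver code.
# Assumption: a read from an absent PCIe device returns all ones.
CSTS_ALL_ONES = 0xFFFFFFFF
# Controller Fatal Status (CFS) is bit 1 of the NVMe CSTS register.
CSTS_CFS = 1 << 1


def classify_timeout(csts: int) -> str:
    """Classify a command timeout from a 32-bit CSTS read (hypothetical helper)."""
    if csts == CSTS_ALL_ONES:
        # The register read back like missing hardware.
        return "timeout and possible hot unplug"
    if csts & CSTS_CFS:
        # The controller is present but reports a fatal error.
        return "timeout and fatal controller status"
    # The controller looks healthy; only the completion was not seen in time.
    return "timeout"


if __name__ == "__main__":
    for value in (0x00000001, 0x00000003, 0xFFFFFFFF):
        print(f"CSTS=0x{value:08x}: {classify_timeout(value)}")
```

In this sketch an all-ones read is tested first, because every bit (including the fatal-status bit) is set in that value and the more specific "missing hardware" interpretation would otherwise be masked.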