Date: Fri, 8 Dec 2023 08:09:29 +0900
From: Tomoaki AOKI
To: freebsd-current@freebsd.org
Subject: Re: nvme timeout issues with hardware and bhyve vm's
Message-Id: <20231208080929.cfd9fca421fea81d89d2380b@dec.sakura.ne.jp>
References: <90d3e532-8ea7-4eea-8e31-8c363285a156@nomadlogic.org> <0ad493d5-1c1e-4370-977a-118f46ebd677@nomadlogic.org> <0c4f8149-89dd-4635-a5ed-4766fffd2553@nomadlogic.org>
Organization: Junchoon corps
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current

On Thu, 7 Dec 2023 14:38:37 -0800
Pete Wright wrote:

> On 10/13/23 7:34 PM, Warner Losh wrote:
> >
> >     the messages i posted in the start of the thread are from the VM
> >     itself (13.2-RELEASE).  The zpool on the hypervisor (13.2-RELEASE)
> >     showed no such issues.
> >
> >     Based on your comment about the improvements in 14 I'll focus my
> >     efforts on my workstation, it seemed to happen regularly so
> >     hopefully i can find a repro case.
> >
> >
> > Let me know if you see similar messages in stable/14. I think I've
> > fixed all the issues with timeouts, though you shouldn't ever see
> > them in a vm setup unless something else weird is going on.
> >
> >
> Hi Warner, just resurfacing this thread because I've had a few lockups
> on my workstation running 14.0-STABLE.  I was able to capture a photo of
> the hang and this seems to be the most important line:
>
> nvme0: Resetting controller due to a timeout and possible hot unplug.
>
> When I scan the device after reboot I don't see any errors, but if there
> is a particular thing I should check via nvmecontrol please let me know.
> Also, since it mentions possible hot unplug I wonder if this is
> hardware/firmware related to my system?
>
> Anyway, haven't found a repro case yet but it has locked up a few times
> the past two weeks.
>
> -pete
>
>
> --
> Pete Wright
> pete@nomadlogic.org

If I ran into this kind of problem ON BARE METAL HARDWARE, I would
usually suspect one of the following first:

 * Overheating that hangs the NVMe controller or the PCI bridge on the SSD
 * An unstable physical connection (bad contact)

-- 
Tomoaki AOKI