Date: Mon, 12 Feb 2024 20:15:31 -0800 (PST)
From: Don Lewis
Subject: Re: nvme controller reset failures on recent -CURRENT
To: Maxim Sobolev
cc: FreeBSD current, John Baldwin
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current

On 12 Feb, Maxim Sobolev wrote:
> Might be an overheating. Today's nvme drives are notoriously flaky if you
> run them without proper heat sink attached to it.

I don't think it is a thermal problem.  According to the drive health
page, the device temperature has never reached Temperature 2, whatever
that is.  The room temperature is around 65F.  The system was stable
last summer when the room temperature spent a lot of time in the 80-85F
range.  The device temperature depends a lot on the I/O rate, and the
last panic happened when the I/O rate had been below 40tps for quite a
while.

> On Mon, Feb 12, 2024, 4:28 PM Don Lewis wrote:
>
>> I just upgraded my package build machine to:
>>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
>> from:
>>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
>> and I've had two nvme-triggered panics in the last day.
>>
>> nvme is being used for swap and L2ARC.  I'm not able to get a crash
>> dump, probably because the nvme device has gone away and I get an error
>> about not having a dump device.  It looks like a low-memory panic
>> because free memory is low and zfs is calling malloc().
>>
>> This shows up in the log leading up to the panic:
>> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
>> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
>> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
>> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
>> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
>>
>> The device looks healthy to me:
>> SMART/Health Information Log
>> ============================
>> Critical Warning State:         0x00
>>  Available spare:               0
>>  Temperature:                   0
>>  Device reliability:            0
>>  Read only:                     0
>>  Volatile memory backup:        0
>> Temperature:                    312 K, 38.85 C, 101.93 F
>> Available spare:                100
>> Available spare threshold:      10
>> Percentage used:                3
>> Data units (512,000 byte) read: 5761183
>> Data units written:             29911502
>> Host read commands:             471921188
>> Host write commands:            605394753
>> Controller busy time (minutes): 32359
>> Power cycles:                   110
>> Power on hours:                 19297
>> Unsafe shutdowns:               14
>> Media errors:                   0
>> No. error info log entries:     0
>> Warning Temp Composite Time:    0
>> Error Temp Composite Time:      0
>> Temperature 1 Transition Count: 5231
>> Temperature 2 Transition Count: 0
>> Total Time For Temperature 1:   41213
>> Total Time For Temperature 2:   0
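
For anyone who wants to check the thermal fields of a health page like the one quoted above without reading it by eye, here is a minimal sketch in Python. It assumes the text keeps the "field: value" layout shown above; the names parse_health and thermal_summary are made up for illustration and are not part of nvmecontrol or any other tool.

```python
import re

# Thermal fields copied from the health page quoted above.
SAMPLE = """\
Temperature:                    312 K, 38.85 C, 101.93 F
Warning Temp Composite Time:    0
Error Temp Composite Time:      0
Temperature 1 Transition Count: 5231
Temperature 2 Transition Count: 0
Total Time For Temperature 1:   41213
Total Time For Temperature 2:   0
"""


def parse_health(text):
    """Return {field: value} for every 'field: value' line.

    Leading mail quote markers ('>' and spaces) are stripped first.
    If a field name appears twice, the later line wins.
    """
    fields = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def thermal_summary(fields):
    """Pull out the thermal indicators discussed in this thread."""
    # The Temperature line starts with the reading in kelvin, e.g. "312 K, ...".
    kelvin = int(re.match(r"(\d+)\s*K", fields["Temperature"]).group(1))
    celsius = kelvin - 273.15
    return {
        "celsius": round(celsius, 2),
        "fahrenheit": round(celsius * 9 / 5 + 32, 2),
        "temp2_transitions": int(fields["Temperature 2 Transition Count"]),
        "warning_time": int(fields["Warning Temp Composite Time"]),
        "error_time": int(fields["Error Temp Composite Time"]),
    }


if __name__ == "__main__":
    print(thermal_summary(parse_health(SAMPLE)))
```

With the values quoted above, the Temperature 2 transition count and both composite-time counters come out as zero, which is the basis for the "not thermal" reading in this message. The sketch only restates what the health page already reports; it does not query the drive.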