From: bugzilla-noreply@freebsd.org
To: freebsd-arm@FreeBSD.org
Subject: [Bug 257670] RAS CONTROLLER: Fatal unrecoverable error detected with SAS3008
Date: Sat, 07 Aug 2021 07:21:34 +0000
List-Id: Porting FreeBSD to ARM processors
List-Archive: https://lists.freebsd.org/archives/freebsd-arm

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=257670

Bug ID: 257670
Summary: RAS CONTROLLER: Fatal unrecoverable error detected with SAS3008
Product: Base System
Version: CURRENT
Hardware: arm64
OS: Any
Status: New
Severity: Affects Some People
Priority: ---
Component: arm
Assignee: freebsd-arm@FreeBSD.org
Reporter: daniel@morante.net
Attachment #227004 mime type: text/plain

Created attachment 227004
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=227004&action=edit
capture of boot via serial

I am testing FreeBSD-14.0-CURRENT-arm64-aarch64-20210805-f3a3b061216-248478 on a Cavium ThunderX2 (Gigabyte R281-T91).

This system has an onboard SAS3008 PCI-Express Fusion-MPT SAS-3 controller.

```
mpr0@pci0:14:0:0: class=0x010700 rev=0x02 hdr=0x00 vendor=0x1000 device=0x0097 subvendor=0x1458 subdevice=0x3008
    vendor     = 'Broadcom / LSI'
    device     = 'SAS3008 PCI-Express Fusion-MPT SAS-3'
    class      = mass storage
    subclass   = SAS
```

I load the `mpr` driver by having `mpr_load="YES"` in `/boot/loader.conf`.
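
For reference, a minimal sketch of the `/boot/loader.conf` fragment described above. Only the `mpr_load` line comes from this report. The commented-out `hw.mpr.debug_level` line is an assumption about what might help capture more detail before the fault: the tunable is documented in mpr(4), but the value shown is purely illustrative and was not used here.

```
# /boot/loader.conf
# Load the mpr(4) driver for the onboard SAS3008 at boot (as used in this report).
mpr_load="YES"

# Assumption, not part of the original setup: raise driver verbosity.
# See mpr(4) for the meaning of each debug flag before enabling this.
#hw.mpr.debug_level="0x3"
```

If the tunable is enabled, the extra driver output should appear in the same serial capture as the messages in the attachment.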
So far so good, except for the weird messages in dmesg (see attachment).

There are currently 8 HDDs attached to it and I set up 3 ZFS pools. This goes well until I finally start to put some load on them. The kernel then panics and the system halts with the following in dmesg:

```
mpr0: IOC Fault 0x4000265d, Resetting
mpr0: Reinitializing controller
...
RAS CONTROLLER: Fatal unrecoverable error detected
```

This is not to say the problem is with ZFS. I suspect the mpr driver is just unstable.

The system can no longer boot into multi-user mode. The kernel panics with the same error as soon as it tries to start ZFS.

```
mountroot: waiting for device /dev/nda0p2...
WARNING: / was not properly dismounted
Dual Console: Video Primary, Serial Secondary
witness_lock_list_get: witness exhausted
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
RAS CONTROLLER: Fatal unrecoverable error detected
*** NBU Error ***
...
```

In order to get a functional system I disable ZFS in `/etc/rc.conf` while in single-user mode. Now back in multi-user mode I can do a `service zfs onestart` and try to import one of the pools. The kernel then panics again.

I detail the full specs of this system in bug #254651 (where I have a problem with the onboard SATA controllers) and in my forum post at https://forums.freebsd.org/threads/aarch64-trouble-with-cn99xx-ahci-and-fastlinq-ql41000-controllers.79556/ (where I explain the lack of a driver for the onboard Ethernet).

Also, for some weird reason I can no longer boot 13.0-RELEASE on this system. It panics with "panic: NVME polled command failed to complete within 10s". I think it doesn't like the add-on PCIe NVMe. However, when it was working (prior to adding in the NVMe) the SAS controller was just as unstable.

Seeing how most of the hardware is still very new, I don't expect FreeBSD (especially arm64) to support it. I'd like to help any way that I can should someone be interested.
The system has an IPMI and I'd be willing to offer remote access to it for as long as it's required via VPN (if that's a thing that's normally done), on a dedicated network with any other required resources.

-- 
You are receiving this mail because:
You are the assignee for the bug.