From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 255930] ocs_fc Lost all connected devices after some use.
Date: Sun, 16 May 2021 18:30:53 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=255930

            Bug ID: 255930
           Summary: ocs_fc Lost all connected devices after some use.
           Product: Base System
           Version: Unspecified
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: arne@Steinkamm.COM

Created attachment 225001
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=225001&action=edit
Message file with all described problems. See the bug report for time stamps.

I connected an HP Proliant 380 Gen9 server with Emulex FC HBAs to two simple
FC setups and attached a NetApp FlashFiler EF550 unit. To get the most out of
ZFS, I assigned all 24 flash modules to the Proliant without using the EF550
RAID features.

I use geom_multipath to handle the redundant connections to the flash filer
and made a ZFS pool with 3 x 7-disk raidz1 vdevs, one spare, one log and one
cache disk.
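As an illustration of the pool layout described above (3 x 7 + 1 + 1 + 1 = 24
modules), the following dry-run sketch only prints a zpool create command; it
does not touch any disks. The multipath/efN provider names are hypothetical
placeholders, since the report does not give the real geom_multipath labels.

```shell
#!/bin/sh
# Dry-run sketch: build and print a "zpool create" command line for the
# layout in the report. Nothing is created. The multipath/efN names are
# hypothetical placeholders, not the labels used on the reporter's system.

pool_layout_cmd() {
    pool=$1
    cmd="zpool create $pool"
    n=1
    # three raidz1 vdevs with seven disks each (modules 1..21)
    for vdev in 1 2 3; do
        cmd="$cmd raidz1"
        i=0
        while [ "$i" -lt 7 ]; do
            cmd="$cmd multipath/ef$n"
            n=$((n + 1))
            i=$((i + 1))
        done
    done
    # the remaining three modules: one spare, one log, one cache (22..24)
    cmd="$cmd spare multipath/ef$n"; n=$((n + 1))
    cmd="$cmd log multipath/ef$n"; n=$((n + 1))
    cmd="$cmd cache multipath/ef$n"
    echo "$cmd"
}
```

Calling pool_layout_cmd zone prints the command for a pool named "zone", the
pool name that appears later in this report.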

The read/write speed is good (2.5 GB/s according to zpool iostat), but after
minutes of heavy use I got

kernel: ocs_fc0: ocs_initiator_io: device LOST 0

messages and all FC-connected disks are gone.

I found no way to recover from this error situation other than a reboot, a
panic (ZFS is not happy about the situation) or a hardware reset.

Further observations:

- Reported topologies and link speeds are correct.
- EF550 replaced with an identical spare unit: no change.
- Changed FC ports: no effect.
- Used different Emulex cards (alone, mixed): no effect; the problem happens
  with any combination of installed Emulex cards.
- Tried QLogic cards (driver: isp(4)): no problems, works 100% stable, but
  with slightly slower I/O performance.
- Tried 12.1-RELEASE, 12.2-RELEASE and 13.0-RELEASE, the last one with an
  unmodified GENERIC kernel. All FC devices were lost every time.
- Booted with the switch FC ports disabled: after a portenable of the
  Brocades' ports the FC links came up, but the disks were not attached
  automatically. A "camcontrol rescan all" was not successful; thousands of
  "device not ready" messages flooded the console. The only way to get the
  flash modules online is to boot the server with a working FC setup.
- Bumping the Emulex cards to the newest available firmware had no visible
  effect.
- Playing with the HBA-related BIOS settings "HP Shared Memory Feature",
  "Brocade FA-PWWN" and "PLOGI Retry Timer" had no visible effect.
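
The "device LOST" kernel messages quoted above can be pulled out of a messages
file with a small filter like the one below. This is only a sketch: the
default path and the assumption that the timestamp fills the first three
whitespace-separated fields describe a stock syslog setup, not anything
confirmed by this report.

```shell
#!/bin/sh
# Sketch: list ocs_fc "device LOST" events from a syslog-style file.
# Assumptions: default path /var/log/messages, and a "Mon DD HH:MM:SS"
# timestamp in the first three fields of every line.

device_lost_events() {
    logfile=${1:-/var/log/messages}
    # For each matching line, print the timestamp and the ocs_fc instance
    # (for example ocs_fc0) that reported the loss.
    grep 'ocs_initiator_io: device LOST' "$logfile" |
        awk '{
            for (i = 1; i <= NF; i++)
                if ($i ~ /^ocs_fc[0-9]+:$/) {
                    sub(/:$/, "", $i)
                    print $1, $2, $3, $i
                    break
                }
        }'
}
```

Run as device_lost_events /path/to/messages; the first line printed gives the
time at which the first HBA instance reported a lost device.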

More details of the last try with 13.0-RELEASE GENERIC:

uname -a:

FreeBSD vwcnctd00fs003.dev.kpdm01.group.vwg 13.0-RELEASE FreeBSD 13.0-RELEASE #0 releng/13.0-n244733-ea31abc261f: Fri Apr 9 04:24:09 UTC 2021 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

pciconf -lv:

ocs_fc0@pci0:8:0:0: class=0x0c0400 rev=0x01 hdr=0x00 vendor=0x10df device=0xe300 subvendor=0x1590 subdevice=0x0214
    vendor     = 'Emulex Corporation'
    device     = 'LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter'
    class      = serial bus
    subclass   = Fibre Channel
ocs_fc1@pci0:8:0:1: class=0x0c0400 rev=0x01 hdr=0x00 vendor=0x10df device=0xe300 subvendor=0x1590 subdevice=0x0214
    vendor     = 'Emulex Corporation'
    device     = 'LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter'
    class      = serial bus
    subclass   = Fibre Channel
ocs_fc2@pci0:129:0:0: class=0x0c0400 rev=0x30 hdr=0x00 vendor=0x10df device=0xe200 subvendor=0x103c subdevice=0x197f
    vendor     = 'Emulex Corporation'
    device     = 'LPe15000/LPe16000 Series 8Gb/16Gb Fibre Channel Adapter'
    class      = serial bus
    subclass   = Fibre Channel
ocs_fc3@pci0:129:0:1: class=0x0c0400 rev=0x30 hdr=0x00 vendor=0x10df device=0xe200 subvendor=0x103c subdevice=0x197f
    vendor     = 'Emulex Corporation'
    device     = 'LPe15000/LPe16000 Series 8Gb/16Gb Fibre Channel Adapter'
    class      = serial bus
    subclass   = Fibre Channel

HP device names:

HPE SN1200E 16Gb 2p FC HBA
    Product Part Number: Q0L14-63001
    Assembly Number: 870002-001
HP SN1100E 16Gb 2P FC HBA
    Product Part Number: C8R39-60001
    Assembly Number: 719212-001

The EF550 has two independent controllers, both connected to all flash module
bays. Each controller has two FC ports. These ports are connected to two
independent Brocade FC switches (no interlink fibre).
One port of each Emulex card is connected to one of the FC switches.
The other port of each Emulex card is not in use (connected to an enterprise
fabric network independent of my laboratory setup, but the ports are disabled
on the switch side).
Using only one of the Emulex cards does not change the effect. I tried all
possible permutations.

To get valid data for this bug report I installed 13.0-RELEASE with a minimal
setup:

/boot/device.hints:

hint.ocs_fc.0.initiator="1"
hint.ocs_fc.2.initiator="1"
hint.ocs_fc.0.topology="1"
hint.ocs_fc.2.topology="1"
hint.ocs_fc.0.speed="16000"
hint.ocs_fc.2.speed="16000"

/etc/sysctl.conf:

dev.ocs_fc.1.port_state=offline
dev.ocs_fc.3.port_state=offline

In the attached messages file you will find this:

May 15 19:21:43 - 19:29:22
    First boot and configuring network connectivity on the shell.
May 15 19:44:24
    Enabling FC ports on both Brocades.
May 15 19:47:21
    camcontrol rescan all
    (all rescans successful according to camcontrol)
May 15 19:59:15
    reboot, now with enabled FC links. It will find the flash modules.
May 15 20:06:36
    kldload geom_multipath.ko
    geom_multipath finds four preconfigured links to each flash module.
    This is correct.

Now I did a zpool import zone and started a couple of test tools.

Output of zpool iostat zone 1:

               capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zone        14.7T   486G  40.1K      0  2.61G      0
zone        14.7T   486G  39.0K    436  2.58G  1.94M
zone        14.7T   486G  41.6K      0  2.60G      0
zone        14.7T   486G  39.4K      0  2.60G      0
zone        14.7T   486G  39.4K      0  2.62G      0
zone        14.7T   486G  40.7K      0  2.57G      0
zone        14.7T   486G  39.9K    420  2.54G  1.94M
zone        14.7T   486G  39.5K      0  2.58G      0
zone        14.7T   486G  39.6K      0  2.64G      0
zone        14.7T   486G  39.3K      0  2.57G      0
zone        14.7T   486G  39.4K      0  2.62G      0
...

May 15 20:15:15
    The problem starts.
May 15 20:16:18
    Attempt of a camcontrol rescan with no success.

My short-term solution is to use QLogic cards with the isp driver, which works
100% stable without any changes necessary.