From: Rick Macklem <rmacklem@uoguelph.ca>
To: Andreas Kempe
CC: freebsd-fs@freebsd.org
Subject: Re: FreeBSD 12.3/13.1 NFS client hang
Date: Mon, 30 May 2022 14:51:32 +0000
List-Archive: https://lists.freebsd.org/archives/freebsd-fs

Andreas Kempe wrote:
[lots of stuff snipped]
>
> I guess this means you think the error is at a protocol handling level
> and the issues aren't caused by locking issues in the code? I was
> wondering whether the hangs that were not slot related could possibly
> be due to some race condition when locking since it happens so
> seemingly randomly.
Anything is possible, but the locking is pretty straightforward and no one
has found a bug in it for ages. (You can certainly run a kernel with
WITNESS, DEBUG_VFS_LOCKS, etc., but there will be a performance
penalty.)

My experience is that most hangs (other than the business with sessions
for soft or intr mounts) are caused by network fabric issues.
A couple of examples:
As I noted, having TSO fail for some specific segment. Then the retransmit
of the segment fails again, and again...

In 13.0, there was a bug in TCP that
caused the receive socket upcall to not happen under certain circumstances
and that could cause a hang. The bug is not in 12.n or 13.1 and the hang
was normally observed when a Linux client had a FreeBSD server mounted,
not vice versa.

After a network partitioning healed, Linux and FreeBSD would get into
what I might call an "RST storm". Every time one end would try to
establish a new TCP connection, the other end would RST it.
(Sorry, but it has been a while and I cannot remember exactly how to cause it
or if it even got resolved.)

> > If you can reproduce it for a hard mount, you could capture packets via:
> > # tcpdump -s 0 -w out.pcap host
> > Tcpdump is useless at decoding NFS, but wireshark can decode the out.pcap
> > quite nicely. I can look at the out.pcap or, if you do so, you start by looking for
> > NFSv4 specific errors.
> > --> The client will usually log if it gets one of these. It will be an error # > 10000.
> >
>
> With us not knowing the NFSv4 protocol, we were holding off on even
> trying to get Wireshark dumps since we wouldn't know what to look for
> and would have to learn the protocol first. You having a look would be
> greatly appreciated! As I wrote above, I'll try to get dumps if we can
> find a reproducer.
I certainly don't mind looking, but you might be surprised at how good
wireshark is at this stuff.
It not only decodes the RPCs for you, it flags anything that looks "sketchy"
in yellow and anything obviously broken in red.
It was wireshark that spotted and flagged the RSTs I mentioned above.
Beyond that, you just try and get to the place where things broke (a hang
might be at the end of the capture, for example) and then work backwards.
It is true that you need to know the protocol to spot things, other than
server error returns, that are not going as planned.

The big challenge is getting a packet capture that is less than petabytes
in size. Although starting a packet capture after a hang has occurred can
be useful, it is usually too late, since the breakage has already happened.

rick

> Good luck with it, rick
> > rick
> >
>
> Cordially,
> Andreas Kempe
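
For the two practical points above (keeping the capture to a manageable
size, and recognizing an NFSv4-specific error number above 10000), here is
a rough sketch and not anything taken from the thread. It assumes tcpdump's
-C (rotate after N million bytes) and -W (keep at most N files) options
behave on your system as documented upstream. The host name, sizes and
function names are placeholders made up for illustration, and the error
table is only a small subset of the codes in RFC 7530 and RFC 5661.

```shell
#!/bin/sh
# Sketch only: host name, sizes and file counts are placeholders.

# Print a tcpdump command that writes a ring buffer of capture files,
# so a capture left running until the hang occurs stays bounded in size.
capture_cmd() {
    server="$1"        # NFS server name or address (placeholder)
    size="${2:-100}"   # -C: start a new file after about this many million bytes
    count="${3:-20}"   # -W: keep this many files, then overwrite the oldest
    echo "tcpdump -s 0 -C ${size} -W ${count} -w out.pcap host ${server}"
}

# Map a few NFSv4 status codes to their names (subset of RFC 7530 / RFC 5661).
nfs4_errname() {
    case "$1" in
        10008) echo NFS4ERR_DELAY ;;
        10011) echo NFS4ERR_EXPIRED ;;
        10013) echo NFS4ERR_GRACE ;;
        10022) echo NFS4ERR_STALE_CLIENTID ;;
        10023) echo NFS4ERR_STALE_STATEID ;;
        10024) echo NFS4ERR_OLD_STATEID ;;
        10025) echo NFS4ERR_BAD_STATEID ;;
        10026) echo NFS4ERR_BAD_SEQID ;;
        10052) echo NFS4ERR_BADSESSION ;;
        10053) echo NFS4ERR_BADSLOT ;;
        10063) echo NFS4ERR_SEQ_MISORDERED ;;
        *)
            # Anything else above 10000 is still NFSv4-specific, just not listed here.
            if [ "$1" -gt 10000 ] 2>/dev/null; then
                echo "NFSv4-specific error (not in this table)"
            else
                echo "not an NFSv4-specific error"
            fi
            ;;
    esac
}

# Print the capture command for review; run it as root once it looks right.
capture_cmd nfs-server.example.org
# Look up an error number of the kind the client might log.
nfs4_errname 10052
```

The ring buffer means the most recent traffic before a hang is kept while
older files are overwritten, so the capture can be started well before the
problem shows up. How much history the chosen size and count actually cover
depends on the traffic rate, so those numbers would need tuning.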