From nobody Thu May 30 13:53:59 2024
X-Original-To: freebsd-hackers@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4VqnmN61TDz5Mh4V
	for <freebsd-hackers@mlmmj.nyi.freebsd.org>; Thu, 30 May 2024 13:54:12 +0000 (UTC)
	(envelope-from wlosh@bsdimp.com)
Received: from mail-ej1-x629.google.com (mail-ej1-x629.google.com [IPv6:2a00:1450:4864:20::629])
	(using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
	 client-signature RSA-PSS (2048 bits) client-digest SHA256)
	(Client CN "smtp.gmail.com", Issuer "WR4" (verified OK))
	by mx1.freebsd.org (Postfix) with ESMTPS id 4VqnmN46yyz4cYK
	for <freebsd-hackers@freebsd.org>; Thu, 30 May 2024 13:54:12 +0000 (UTC)
	(envelope-from wlosh@bsdimp.com)
Authentication-Results: mx1.freebsd.org;
	none
Received: by mail-ej1-x629.google.com with SMTP id a640c23a62f3a-a630ff4ac84so77910266b.1
        for <freebsd-hackers@freebsd.org>; Thu, 30 May 2024 06:54:12 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=bsdimp-com.20230601.gappssmtp.com; s=20230601; t=1717077251; x=1717682051; darn=freebsd.org;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:from:to:cc:subject:date:message-id:reply-to;
        bh=yGrkIKz/V9QgsvYa362eniO8y97woF+Q+Y/FPy94zCw=;
        b=nOORm8N0SflDTAf3QP9iJWgmvv+o//xof1//6N1tEA3p8NNkoSrNlGK+7vm3vl22vQ
         eFhkdUPMo+0Yn5TILtDWJmjDBWxFsVy9pAGrLdbPTVxGWeFSaOxEskl5uej6eJCx210r
         dd+0h5lx0c0ESQgAAmjMd3fekveDujPTI5Ki+K0BWeIZbdE7VYqX1eU+Dzqme3tTgKE2
         rq6+zm0btPjIvsYrZ4DsUfnyzE2DBksDWt72oiLUq40VcKQwvpavU9cr0u+e3ynIPgsy
         6B1wCA1u9SKgmPzN3jsK+m8Bap9cgvlhHAJx+mE1+Xmo9VPxswJpDK/xvMXnooPwZmti
         lH4Q==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1717077251; x=1717682051;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=yGrkIKz/V9QgsvYa362eniO8y97woF+Q+Y/FPy94zCw=;
        b=SESyusDjnktHcCnfV2xRCUYn1zyD3tMdZHRinFJvMzEgOTgm8Z4iel0xdFJtINIW5Y
         HMHcmSlpFnUJs4UzRHwpzvbQHcypPGr3vLYM4RQr3V94WwQzW4hpDCMVndfxOGKgCmhP
         +4QYl/TZ2zyBZDl65AAz+JqmU3W26xp/wtpwN/yidPq0kVhgnHOn2mnaUhJ3ykZODzG3
         FZfdcC2jGAR7BFx4fdVl9WOOfL596Js7FpSiTiwzNBH7EglY6A9vWK5gQT3dneVL9isz
         MG9IO4W9qLicePGTiOeBlfIItUZro8GVzlA8DYamIdT2u+0MSLJFkfjXyO3Rt/oFtYym
         CyfQ==
X-Gm-Message-State: AOJu0Yx4UUoepNKn7l0EHns8rudjqG1BBk/KJbMSNT5zGbKr3PHkpp9m
	Z+XSD7ES7+reKvMZipJZSEkWkPwM/hqDkb3K0OsRCF8U3lirMlPPEmsGVEqO7CgpBrq/LYK29i2
	JfBwjBI/DI79S67l40RQ3fOXSE8VCyp0sMarbbWA0TMcZLpR4AChzFg==
X-Google-Smtp-Source: AGHT+IFLrshXYXJ9agDFXZ3+rM780alj8BLizhwYs7Vd2ZCUHjwOysw2qpiUQS8QgFiZutApVum0PtMDjPaNQ8pNS8o=
X-Received: by 2002:a17:906:1717:b0:a5a:423:a69f with SMTP id
 a640c23a62f3a-a65e8d1857amr159078466b.9.1717077250838; Thu, 30 May 2024
 06:54:10 -0700 (PDT)
List-Id: Technical discussions relating to FreeBSD <freebsd-hackers.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
List-Help: <mailto:freebsd-hackers+help@freebsd.org>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Subscribe: <mailto:freebsd-hackers+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-hackers+unsubscribe@freebsd.org>
Sender: owner-freebsd-hackers@FreeBSD.org
MIME-Version: 1.0
References: <CAG6t_XAcUDK+pPHiUZ9Bwu2fE5wg6vwK_zcuEYe94sb15HnUPg@mail.gmail.com>
In-Reply-To: <CAG6t_XAcUDK+pPHiUZ9Bwu2fE5wg6vwK_zcuEYe94sb15HnUPg@mail.gmail.com>
From: Warner Losh <imp@bsdimp.com>
Date: Thu, 30 May 2024 09:53:59 -0400
Message-ID: <CANCZdfqUtDvpgTpHx3P5ENmSQ+o=W+9X5x-G5Zgu2UOkF_iiGQ@mail.gmail.com>
Subject: Re: Upperlimit for bwait()
To: Kumara Babu <nkumarababu@gmail.com>
Cc: freebsd-hackers@freebsd.org
Content-Type: multipart/alternative; boundary="00000000000096b2160619ac3307"
X-Spamd-Bar: ----
X-Rspamd-Pre-Result: action=no action;
	module=replies;
	Message is reply to one we originated
X-Spamd-Result: default: False [-4.00 / 15.00];
	REPLY(-4.00)[];
	ASN(0.00)[asn:15169, ipnet:2a00:1450::/32, country:US]
X-Rspamd-Queue-Id: 4VqnmN46yyz4cYK

--00000000000096b2160619ac3307
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Thu, May 30, 2024 at 1:54 AM Kumara Babu <nkumarababu@gmail.com> wrote:

> Hello,
>
> There have been a few incidents reported on Juniper devices with
> FreeBSD, where buffer IO operations sleep for more than 30 mins.
> Theoretically, this can happen due to faulty hardware or in virtual
> platforms due to faulty connection between guest and host, filesystem
> corruption, too many buffer IO operations, and/or host not responding
> due to various reasons. When that happens, as this buffer IO writes
> hold a lock before going to sleep, the threads waiting for that lock
> would starve for so long. There is no upper limit for this bwait() as
> of now. If that wait goes beyond 30 mins for a sleeping thread OR 15
> mins for a thread blocked on turnstile, deadlkres crashes the kernel
> assuming a possible deadlock.
>
Why isn't the I/O timing out? That's the real problem.

> We perhaps could gracefully handle such lengthy buffer IO operations by
> adding a timeout in bwait() - like say 10 minutes. If the buffer IO is
> not completed in a few mins, it probably would not complete forever
> and/or would be slowing down the entire system. So it is better to stop
> such faulty IO operations.
>
I think that's a terrible idea. Why aren't the I/Os timing out?

> For now, since we had seen these instances only with BIO operations, I
> have a patch to set this value only from bufwait(). Please find the patch
> attached. I am not very sure if 10 mins is a good upper limit for all the
> scenarios for bwait(). If it is, then we could just change msleep() in
> bwait() to set a 10 mins upper limit by default.
>
I've never seen this on any of the thousands of systems I've used.

> Please let me know if this approach works for all the usecases - If not,
> is there a better alternative ?  And is 10 mins okay for a timeout ?
>
Making sure that the I/Os time out.

And by that, I mean doing what we do in CAM. All the SIMs ensure that
transactions posted to the device will time out. Most of the SIMs create a
timeout per transaction, which expires and completes the CCB with a
timeout status. The periph drivers then see this status and fail the I/O
with a timed-out status, or maybe retry it a couple of times, depending on
the hardware and its recovery methods (e.g., whether the timeout is due to
the drive, the link, the HBA, etc. results in different recovery). NVMe
nvd does similar things: a timeout causes the NVMe card to be reset and we
try again, but eventually fail the I/O.
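
As a rough illustration of that pattern, here is a toy user-space model in
Python. It is a sketch under assumptions, not CAM code: the names
(Transaction, ToySim, submit, complete, tick) are made up for the example,
and a real driver would typically arm a timer per request rather than poll
a clock like this.

```python
# Toy user-space model of per-transaction timeouts. This is NOT CAM code;
# every name here (Transaction, ToySim, submit, complete, tick) is
# hypothetical and exists only for illustration.
from dataclasses import dataclass


@dataclass
class Transaction:
    tag: int
    deadline: float            # absolute time at which this attempt expires
    retries_left: int
    status: str = "inflight"   # "inflight", "ok", or "timeout"


class ToySim:
    """Stand-in for a SIM: every posted transaction gets a deadline."""

    def __init__(self, timeout_s, max_retries):
        self.timeout_s = timeout_s
        self.max_retries = max_retries
        self.now = 0.0
        self.inflight = {}
        self.next_tag = 0

    def submit(self):
        """Post a transaction and arm its timeout."""
        t = Transaction(self.next_tag, self.now + self.timeout_s,
                        self.max_retries)
        self.inflight[t.tag] = t
        self.next_tag += 1
        return t

    def complete(self, tag):
        """The device answered in time: finish the transaction normally."""
        t = self.inflight.pop(tag, None)
        if t is not None:
            t.status = "ok"

    def tick(self, dt):
        """Advance the clock and expire anything past its deadline."""
        self.now += dt
        for t in list(self.inflight.values()):
            if self.now < t.deadline:
                continue
            if t.retries_left > 0:
                # Retry: re-arm the timeout for another attempt.
                t.retries_left -= 1
                t.deadline = self.now + self.timeout_s
            else:
                # Out of retries: fail the I/O with a timed-out status.
                t.status = "timeout"
                del self.inflight[t.tag]
```

The point of the model is that whoever is waiting on the I/O always gets
an answer within (retries + 1) * timeout, whether or not the device ever
responds, so nothing upstream needs its own open-ended timer.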

One might also wonder why 30s is the timeout for most of the commands. I
get that 'special' commands might need a longer timeout, but we likely
should look at lowering this somewhat. 15s is almost certainly safe. 10s is
probably safe. 5s will work, but you start to get P99.9999 outliers on
popular, completely working spinning rust, and P99.9 on marginal drives, so
it can be a bit tricky to change (we'll have to phase it in). That could
make things a bit better in terms of worst-case recovery time.
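
To put rough numbers on that trade-off, here is a small Python sketch. The
retry count of 4 and the 100 IOPS load are assumptions picked for
illustration, and it reads "P99.9999" as roughly one I/O in a million
taking longer than the timeout (and "P99.9" as one in a thousand).

```python
# Back-of-the-envelope numbers only. The retry count and the IOPS figure
# are assumptions for illustration, not measured or configured values.

def worst_case_recovery_s(timeout_s, retries):
    """Upper bound on time before an I/O is failed: first try plus retries."""
    return timeout_s * (retries + 1)


def seconds_between_outliers(one_in_n, iops):
    """Mean time between I/Os slow enough to trip the timeout."""
    return one_in_n / iops


for timeout_s in (30, 15, 10, 5):
    print(timeout_s, "s timeout ->",
          worst_case_recovery_s(timeout_s, retries=4), "s worst case")

# Healthy drive: about one I/O in a million exceeds a 5s timeout.
print(seconds_between_outliers(1_000_000, iops=100), "s between outliers")
# Marginal drive: about one I/O in a thousand does.
print(seconds_between_outliers(1_000, iops=100), "s between outliers")
```

Under those assumptions a 30s timeout means up to 150s before the I/O is
failed and a 5s timeout means up to 25s, while a healthy drive at 100 IOPS
would trip a 5s timeout about once every 10,000 seconds and a marginal one
about every 10 seconds.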

So the real question here is why the I/Os aren't timing out.

Warner


> Thanks and Regards,
>
> Kumara
>

--00000000000096b2160619ac3307--