From: Warner Losh
Date: Thu, 30 May 2024 09:53:59 -0400
Subject: Re: Upperlimit for bwait()
To: Kumara Babu
Cc: freebsd-hackers@freebsd.org
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers


On Thu, May 30, 2024 at 1:54 AM Kumara Babu <nkumarababu@gmail.com> wrote:

> Hello,
>
> There have been a few incidents reported on Juniper devices with FreeBSD,
> where buffer IO operations sleep for more than 30 mins. Theoretically, this
> can happen due to faulty hardware or in virtual platforms due to faulty
> connection between guest and host, filesystem corruption, too many buffer
> IO operations, and/or host not responding due to various reasons. When that
> happens, as this buffer IO writes hold a lock before going to sleep, the
> threads waiting for that lock would starve for so long. There is no upper
> limit for this bwait() as of now. If that wait goes beyond 30 mins for a
> sleeping thread OR 15 mins for a thread blocked on turnstile, deadlkres
> crashes the kernel assuming a possible deadlock.

Why isn't the I/O timing out? That's the real problem.

> We perhaps could gracefully handle such lengthy buffer IO operations by
> adding a timeout in bwait() - like say 10 minutes. If the buffer IO is not
> completed in a few mins, it probably would not complete forever and/or
> would be slowing down the entire system. So it is better to stop such
> faulty IO operations.

I think that's a terrible idea. Why aren't the I/Os timing out?

> For now, since we had seen these instances only with BIO operations, I
> have a patch to set this value only from bufwait(). Please find the patch
> attached. I am not very sure if 10 mins is a good upper limit for all the
> scenarios for bwait(). If it is, then we could just change msleep() in
> bwait() to set a 10 mins upper limit by default.

I have never seen this on any of the thousands of systems I've used.

> Please let me know if this approach works for all the usecases - If not,
> is there a better alternative ? And is 10 mins okay for a timeout ?

Making sure that the I/Os time out.

And by that, I mean doing what we do in CAM. All the SIMs ensure that
transactions posted to the device will time out. Most of the SIMs create a
per-transaction timeout which, when it expires, completes the CCB with a
timeout status. The periph drivers then see this status and fail the I/O
with a timed-out status, or maybe retry it a couple of times, depending on
the hardware and its recovery methods (e.g. whether the timeout is due to
the drive, the link, the HBA, etc. will result in different recovery in the
face of timeouts). NVMe nvd does similar things: a timeout will cause the
nvme card to be reset and we try again, but eventually fail.
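
The pattern described above can be modelled in a few lines. This is a
userspace sketch in Python, not kernel code, and every name in it
(Transaction, submit, do_io, max_retries) is hypothetical and is not a CAM
API. It assumes a fixed retry budget and simulated device latencies, and it
only illustrates the point that every posted transaction is completed one
way or the other, so whoever waits on it is always woken in bounded time.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence, Tuple


class Status(Enum):
    COMPLETE = "complete"
    TIMEOUT = "timeout"


@dataclass(frozen=True)
class Transaction:
    name: str
    timeout_s: float  # per-transaction deadline, armed when it is posted


def submit(txn: Transaction, device_latency_s: Optional[float]) -> Status:
    """Hypothetical stand-in for the SIM layer.

    device_latency_s is how long the simulated device takes to answer;
    None models a device that never answers. Either way the transaction
    is completed: with success, or with a timeout once its deadline expires.
    """
    if device_latency_s is not None and device_latency_s <= txn.timeout_s:
        return Status.COMPLETE
    return Status.TIMEOUT


def do_io(
    txn: Transaction,
    latencies: Sequence[Optional[float]],
    max_retries: int = 2,
) -> Tuple[Status, int]:
    """Hypothetical stand-in for a periph driver: retry on timeout, then fail."""
    attempts = 0
    # One initial attempt plus at most max_retries retries.
    for latency in latencies[: max_retries + 1]:
        attempts += 1
        if submit(txn, latency) is Status.COMPLETE:
            return Status.COMPLETE, attempts
    # Retry budget exhausted: report the failure instead of waiting forever.
    return Status.TIMEOUT, attempts


read = Transaction("read", timeout_s=30.0)
print(do_io(read, [None, 0.02]))        # device hangs once, then answers
print(do_io(read, [None, None, None]))  # device never answers: I/O fails
```

In this model the second call gives up after three attempts, i.e. after at
most 3 x 30s = 90s of simulated waiting, and the caller gets an error back
rather than sleeping indefinitely.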

One might also wonder why 30s is the timeout for most of the commands. I
get that 'special' commands might need a longer timeout, but we likely
should look at lowering this somewhat. 15s is almost certainly safe. 10s is
probably safe. 5s will work, but you start to get P99.9999 outliers on
popular, completely working spinning rust, and P99.9 on marginal drives, so
it can be a bit tricky to change (we'll have to phase it in). That could
make things a bit better in terms of worst-case recovery time.
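
To put those percentiles in perspective, here is a small arithmetic sketch.
It reads "P99.9999 outliers" as roughly one command in a million exceeding
the timeout, and it assumes a purely hypothetical load of 100 commands per
second; neither number comes from a measurement.

```python
from decimal import Decimal


def expected_spurious_timeouts(percentile: str, commands: int) -> Decimal:
    """Expected number of commands slower than the given latency percentile."""
    tail_fraction = (Decimal(100) - Decimal(percentile)) / Decimal(100)
    return tail_fraction * commands


# Hypothetical load: 100 commands per second for one day.
COMMANDS_PER_DAY = 100 * 86_400

healthy = expected_spurious_timeouts("99.9999", COMMANDS_PER_DAY)
marginal = expected_spurious_timeouts("99.9", COMMANDS_PER_DAY)

print(f"healthy drive:  about {healthy:.2f} commands/day over the timeout")
print(f"marginal drive: about {marginal:.0f} commands/day over the timeout")
```

Under these assumptions a healthy drive would trip the timeout a handful of
times per day (8.64) and a marginal one thousands of times (8640), which is
one way to see why a shorter timeout would need to be phased in.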

So the real question here is: why aren't the I/Os timing out?

Warner

> Thanks and Regards,
>
> Kumara
