From: Warner Losh <wlosh@bsdimp.com>
Date: Fri, 14 Jul 2023 12:05:38 -0600
Subject: Re: ASC/ASCQ Review
To: Alan Somers <asomers@freebsd.org>
Cc: scsi@freebsd.org
List-Archive: https://lists.freebsd.org/archives/freebsd-scsi

On Fri, Jul 14, 2023, 11:12 AM Alan Somers wrote:
> On Thu, Jul 13, 2023 at 12:14 PM Warner Losh wrote:
> >
> > Greetings,
> >
> > I've been looking closely at failed drives for $WORK lately. I've
> > noticed that a lot of errors that sound like fatal errors have
> > SS_RDEF set on them.
> >
> > What's the process for evaluating whether those error codes are worth
> > retrying? There are several errors that we seem to be seeing (on a
> > preliminary read of the data) before the drive gives up the ghost
> > altogether. For those cases, I'd like to post more specific lists.
> > Should I do that here?
> >
> > Independent of that, I may want a more aggressive 'fail fast' policy
> > than is appropriate for the general case, given my work load (we have
> > a lot of data that's a copy of a copy of a copy, so if we lose it, we
> > don't care: we'll just delete any files we can't read and get on with
> > life, though I know others will have a more conservative attitude
> > towards data that might be precious and unique). I can set the number
> > of retries lower, and I can do other hacks that tell the disk to fail
> > faster, but I think part of the solution is going to have to be
> > failing for some sense-code/ASC/ASCQ tuples that we don't want to
> > fail in upstream or the general case. I was thinking of identifying
> > those and creating a 'global quirk table' that gets applied after the
> > drive-specific quirk table, which would let $WORK override the
> > defaults while letting others keep the current behavior. IMHO, it
> > would be better to have these separate rather than in the global data
> > for tracking upstream...
> >
> > Is that clear, or should I give concrete examples?
> >
> > Comments?
> >
> > Warner
>
> Basically, you want to change the retry counts for certain ASC/ASCQ
> codes only, on a site-by-site basis? That sounds reasonable. Would
> it be configurable at runtime or only at build time?

I'd like to change the default actions. But maybe we just do that for
everyone and assume modern drives...

> Also, I've been thinking lately that it would be real nice if READ
> UNRECOVERABLE could be translated to EINTEGRITY instead of EIO. That
> would let consumers know that retries are pointless, but that the data
> is probably healable.
Unlikely, unless you've tuned things to not try for long at recovery...

But regardless... do you have a concrete example of a use case? There are a
number of places that map any error to EIO, and I'd like a use case before
we expand the errors the lower layers return...

Warner