add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom

Sat Nov 25 17:50:09 UTC 2017

On 25/11/2017 19:38, Warner Losh wrote:
> 
> 
> On Sat, Nov 25, 2017 at 9:58 AM, Andriy Gapon <avg at freebsd.org
> <mailto:avg at freebsd.org>> wrote:
> 
>     On 25/11/2017 18:25, Warner Losh wrote:
>     >
>     >
>     > On Fri, Nov 24, 2017 at 10:17 AM, Andriy Gapon <avg at freebsd.org <mailto:avg at freebsd.org>
>     > <mailto:avg at freebsd.org <mailto:avg at freebsd.org>>> wrote:
>     >
>     >     On 24/11/2017 16:57, Scott Long wrote:
>     >     >
>     >     >
>     >     >> On Nov 24, 2017, at 6:34 AM, Andriy Gapon <avg at FreeBSD.org> wrote:
>     >     >>
>     >     >> On 24/11/2017 15:08, Warner Losh wrote:
>     >     >>>
>     >     >>>
>     >     >>> On Fri, Nov 24, 2017 at 3:30 AM, Andriy Gapon <avg at freebsd.org <mailto:avg at freebsd.org>
>     >     <mailto:avg at freebsd.org <mailto:avg at freebsd.org>>
>     >     >>> <mailto:avg at freebsd.org <mailto:avg at freebsd.org> <mailto:avg at freebsd.org
>     <mailto:avg at freebsd.org>>>> wrote:
>     >     >>>
>     >     >>>
>     >     >>>    https://reviews.freebsd.org/D13224 <https://reviews.freebsd.org/D13224>
>     >     <https://reviews.freebsd.org/D13224
>     <https://reviews.freebsd.org/D13224>> <https://reviews.freebsd.org/D13224
>     <https://reviews.freebsd.org/D13224>
>     >     <https://reviews.freebsd.org/D13224 <https://reviews.freebsd.org/D13224>>>
>     >     >>>
>     >     >>>    Anyone interested is welcome to join the review.
>     >     >>>
>     >     >>>
>     >     >>> I think it's a really bad idea. It introduces a 'one-size-fits-all'
>     >     notion of
>     >     >>> QoS that seems misguided. It conflates a shorter timeout with don't
>     >     retry. And
>     >     >>> why is retrying bad? It seems more a notion of 'fail fast' or so other
>     >     concept.
>     >     >>> There's so many other ways you'd want to use it. And it uses the
>     same return
>     >     >>> code (EIO) to mean something new. It's generally meant 'The lower
>     layers
>     >     have
>     >     >>> retried this, and it failed, do not submit it again as it will not
>     >     succeed' with
>     >     >>> 'I gave it a half-assed attempt, and that failed, but resubmission
>     might
>     >     work'.
>     >     >>> This breaks a number of assumptions in the BUF/BIO layer as well as
>     >     parts of CAM
>     >     >>> even more than they are broken now.
>     >     >>>
>     >     >>> So let's step back a bit: what problem is it trying to solve?
>     >     >>
>     >     >> A simple example.  I have a mirror, I issue a read to one of its
>     >     members.  Let's
>     >     >> assume there is some trouble with that particular block on that
>     >     particular disk.
>     >     >> The disk may spend a lot of time trying to read it and would still
>     fail. 
>     >     With
>     >     >> the current defaults I would wait 5x that time to finally get the
>     error back.
>     >     >> Then I go to another mirror member and get my data from there.
>     >     >
>     >     > There are many RAID stacks that already solve this problem by having
>     a policy
>     >     > of always reading all disk members for every transaction, and throwing
>     >     away the
>     >     > sub-transactions that arrive late.  It’s not a policy that is always
>     >     desired, but it
>     >     > serves a useful purpose for low-latency needs.
>     >
>     >     That's another possible and useful strategy.
>     >
>     >     >> IMO, this is not optimal.  I'd rather pass BIO_NORETRY to the first
>     read, get
>     >     >> the error back sooner and try the other disk sooner.  Only if I
>     know that there
>     >     >> are no other copies to try, then I would use the normal read with
>     all the retrying.
>     >     >>
>     >     >
>     >     > I agree with Warner that what you are proposing is not correct.  It
>     weakens the
>     >     > contract between the disk layer and the upper layers, making it less
>     clear who is
>     >     > responsible for retries and less clear what “EIO” means.  That
>     contract is already
>     >     > weak due to poor design decisions in VFS-BIO and GEOM, and Warner and I
>     >     > are working on a plan to fix that.
>     >
>     >     Well...  I do realize now that there is some problem in this area,
>     both you and
>     >     Warner mentioned it.  But knowing that it exists is not the same as
>     knowing what
>     >     it is :-)
>     >     I understand that it could be rather complex and not easy to describe
>     in a short
>     >     email...
>     >
>     >     But then, this flag is optional, it's off by default and no one is
>     forced to
>     >     used it.  If it's used only by ZFS, then it would not be horrible.
>     >
>     >
>     > Except that it isn't the same flag as what Solaris has (its B_FAILFAST does
>     > something different: it isn't about limiting retries but about failing ALL the
>     > queued I/O for a unit, not just trying one retry), and the problems that it
>     > solves are quite rare. And if you return a different errno, then the EIO
>     > contract is still fulfilled. 
> 
>     Yes, it isn't the same.
>     I think that illumos flag does even more.
> 
> 
> Since it isn't the same, and there's not other systems that do a similar thing,
> that ups the burden of proof that this is a good idea.
> 
>     >     Unless it makes things very hard for the infrastructure.
>     >     But I am circling back to not knowing what problem(s) you and Warner are
>     >     planning to fix.
>     >
>     >
>     > The middle layers of the I/O system are a bit fragile in the face of I/O errors.
>     > We're fixing that.
> 
>     What are the middle layers?
> 
> 
> The buffer cache and lower layers of the UFS code is where the problems chiefly lie.

Those are the upper layers from the point of view of GEOM and things below it.
If they don't set that flag on the bio, then it is not going to magically appear
in their I/O path.

>     > Of course, you still haven't articulated why this approach would be better
> 
>     Better than what?
> 
> 
> Well, anything?

I think that I have described how it is better than what we have now, which I
think is a part of 'anything'.

> 
>     > nor
>     > show any numbers as to how it makes things better.
> 
>     By now, I have.  See my reply to Scott's email.
> 
> 
> I just checked my email, I've seen no such reply. I checked it before I
> replied.  Maybe it's just delayed.

Sorry, my mistake.  I thought that I sent that email in the morning, but I
didn't.  I have just sent it.
Apologies again.

-- 
Andriy Gapon