Re: bio re-ordering
- In reply to: Warner Losh : "Re: bio re-ordering"
Date: Fri, 18 Feb 2022 17:47:09 UTC
On Fri, Feb 18, 2022 at 7:31 PM Warner Losh <imp@bsdimp.com> wrote:

> So I spent some time looking at what BIO_ORDERED means in today's kernel
> and flavored it with my indoctrination of the ordering guarantees with BIO
> requests from when I wrote the CAM I/O scheduler. It's kinda long, but
> spells out what BIO_ORDERED means, where it can come from and who depends
> on it for what.
>
> On Fri, Feb 18, 2022 at 1:36 AM Peter Jeremy <peterj@freebsd.org> wrote:
>
>> On 2022-Feb-17 17:48:14 -0800, John-Mark Gurney <jmg@funkthat.com> wrote:
>> >Peter Jeremy wrote this message on Sat, Feb 05, 2022 at 20:50 +1100:
>> >> I've raised https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261731 to
>> >> make geom_gate support BIO_ORDERED. Exposing the BIO_ORDERED flag to
>> >> userland is quite easy (once a decision is made as to how to do that).
>> >> Enhancing the geom_gate clients to correctly implement BIO_ORDERED is
>> >> somewhat harder.
>> >
>> >The clients are single threaded wrt IOs, so I don't think updating them
>> >is required.
>>
>> ggatec(8) and ggated(8) will not reorder I/Os. I'm not sure about hast.
>>
>> >I do have patches to improve things by making ggated multithreaded to
>> >improve IOPs, and so making this improvement would allow those patches
>> >to be useful.
>>
>> Likewise, I found ggatec and ggated to be too slow for my purposes and
>> so I've implemented my own variant (not network API compatible) that
>> can/does reorder requests. That was when I noticed that BIO_ORDERED
>> wasn't implemented.
>>
>> >I do have a question though, what is the exact semantics of _ORDERED?
>>
>> I can't authoritatively answer this, sorry.
>
> This is under-documented. Clients, in general, are expected to cope with
> I/O that completes in an arbitrary order. They are expected not to schedule
> new I/O that depends on old I/O completing for whatever reason (usually
> on-media consistency). BIO_ORDERED is used to create a full barrier in the
> stream of I/Os. The comments in the code say vaguely:
>
> /*
>  * This bio must be executed after all previous bios in the queue have been
>  * executed, and before any successive bios can be executed.
>  */
>
> Drivers implement this as a partitioning of requests. All requests before
> it are completed, then the BIO_ORDERED operation is done, then requests
> after it are scheduled with the device.
>
> BIO_FLUSH, I think, is the only remaining operation that's done as
> BIO_ORDERED directly. xen.../blkback.c, geom_io.c and ffs_softdep.c are the
> only ones that set it, and all on BIO_FLUSH operations. bio/buf clients
> depend on this to ensure metadata on the drive is in a consistent state
> after it's been updated.
>
> xen/.../blkback.c also sets it for all BLKIF_OP_WRITE_BARRIER operations
> (so write barriers).
>
> In the upper layers, we have struct buf instead of struct bio to describe
> future I/Os that the buffer cache may need to do. There's a flag B_BARRIER
> that gets turned into BIO_ORDERED in geom_vfs. B_BARRIER is set in only two
> places (and copied in one other) in vfs_bio.c: babarrierwrite and
> bbarrierwrite, for async vs sync writes respectively.
>
> CAM will set BIO_ORDERED for all BIO_ZONE commands for reasons that are
> at best unclear to me, but which won't matter for this discussion.
>
> ffs_alloc.c (so UFS again) is the only place that uses babarrierwrite. It
> is used to ensure that all inode initializations are completed before the
> cylinder group bitmap is written out.
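To make that partitioning concrete, here is a minimal sketch of a driver-style
start routine that treats a BIO_ORDERED bio as a full barrier. The toy_softc
layout, the toy_* names and hw_submit() are hypothetical, and locking and error
handling are omitted; this is not code from any real driver, only the shape of
the logic described above.

    #include <sys/param.h>
    #include <sys/bio.h>

    struct toy_softc {
            struct bio_queue_head   sc_queue;       /* pending bios */
            int                     sc_outstanding; /* bios at the hardware */
    };

    static void hw_submit(struct toy_softc *, struct bio *); /* hypothetical */

    /*
     * Incoming requests are queued; bioq_disksort() keeps ordered bios as
     * partition points in the queue.
     */
    static void
    toy_strategy(struct toy_softc *sc, struct bio *bp)
    {
            bioq_disksort(&sc->sc_queue, bp);
            toy_start(sc);
    }

    static void
    toy_start(struct toy_softc *sc)
    {
            struct bio *bp;

            for (;;) {
                    bp = bioq_first(&sc->sc_queue);
                    if (bp == NULL)
                            return;
                    /*
                     * An ordered bio may only be issued once every earlier
                     * bio has completed...
                     */
                    if ((bp->bio_flags & BIO_ORDERED) != 0 &&
                        sc->sc_outstanding > 0)
                            return;         /* wait for the device to drain */
                    (void)bioq_takefirst(&sc->sc_queue);
                    sc->sc_outstanding++;
                    hw_submit(sc, bp);
                    /* ...and nothing behind it may start until it is done. */
                    if ((bp->bio_flags & BIO_ORDERED) != 0)
                            return;
            }
    }

    /* Completion path: retire the bio and see if more work can be issued. */
    static void
    toy_done(struct toy_softc *sc, struct bio *bp)
    {
            sc->sc_outstanding--;
            biodone(bp);
            toy_start(sc);
    }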
> This is done with newfs, when new cylinder groups are created with growfs,
> and apparently in a few other cases where additional inodes are created in
> newly-created UFS2 filesystems. This can be disabled with
> vfs.ffs.doasyncinodeinit=0 when barrier writes aren't working as
> advertised, but there's a big performance hit from doing so until all the
> inodes for the filesystem have been lazily populated.
>
> No place uses bbarrierwrite that I can find.
>
> Based on all of that, CAM's dynamic I/O scheduler will reorder reads
> around a BIO_ORDERED operation, but not writes, trims or flushes. Since,
> in general, operations happen in an arbitrary order, scheduling both a
> read and a write at the same time for the same block will result in
> undefined results.
>
> Different drivers handle this differently. CAM will honor the BIO_ORDERED
> flag by scheduling the I/O with an ordering tag so that the SCSI hardware
> will properly order the result. The simpler ATA version will use a non-NCQ
> request to force the proper ordering (since to send a non-NCQ request, you
> have to drain the queue, do that one command, and then start up again).
> nvd will just throw the I/O at the device until it encounters a
> BIO_ORDERED request. Then it will queue everything until all the current
> requests complete, then do the ordered request, then do the rest of the
> queued I/O as if it had just shown up.
>
> Most drivers use bioq_disksort(), which will queue the request to the end
> of the bioq and mark things so all I/Os after that are in their new
> 'elevator car' for its elevator sort algorithm. This means that CAM's
> normal ways of dequeuing the request will preserve ordering through the
> periph driver's start routine (where the dynamic scheduler will honor it
> for writes, but not reads, but the default scheduler will honor it for
> both).
>
>> >And right now, the ggate protocol (from what I remember) doesn't have
>> >a way to know when the remote kernel has received notification that an
>> >IO is complete.
>>
>> A G_GATE_CMD_START write request will be sent to the remote system and
>> issued as a pwrite(2), then an acknowledgement packet will be returned
>> and passed back to the local kernel via G_GATE_CMD_DONE. There's no
>> support for BIO_FLUSH or BIO_ORDERED, so there's no way for the local
>> kernel to know when the write has been written to non-volatile store.
>
> That's unfortunate. UFS can work around the BIO_ORDERED problem with
> a simple setting, but not the BIO_FLUSH problem.
>
>> >> I've done some experiments and OpenZFS doesn't generate BIO_ORDERED
>> >> operations so I've also raised https://github.com/openzfs/zfs/issues/13065
>> >> I haven't looked into how difficult that would be to fix.
>>
>> Unrelated to the above but for completeness: OpenZFS avoids the need
>> for BIO_ORDERED by not issuing additional I/Os until previous I/Os have
>> been retired when ordering is important. (It does rely on BIO_FLUSH.)
>
> To be clear: OpenZFS won't schedule new I/Os until the BIO_FLUSH it sends
> down w/o the BIO_ORDERED flag completes, right? The parenthetical confuses
> me on how to parse it: BIO_FLUSH is needed and ZFS depends on it completing
> with all blocks flushed to stable media, or ZFS depends on BIO_FLUSH being
> strongly ordered relative to other commands. I think you mean the former,
> but want to make sure.
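Referring back to the ggate exchange above: if ggated were made multithreaded,
one way the server side could honour BIO_ORDERED is a simple barrier around the
backing-store I/O, with BIO_FLUSH mapped to fsync(2). The sketch below is
purely illustrative userland code; the server struct, request_begin(),
request_end(), handle_write() and handle_flush() are made-up names, not the
real ggated code or wire protocol, and "earlier" here means "already begun
servicing" (a real server would tie ordering to the order requests arrive from
the kernel).

    #include <sys/types.h>
    #include <pthread.h>
    #include <unistd.h>

    struct server {
            int             fd;         /* backing store file descriptor */
            pthread_mutex_t mtx;
            pthread_cond_t  cv;
            int             inflight;   /* requests currently being serviced */
            int             barrier;    /* an ordered request holds the barrier */
    };

    /* Called by a worker thread before servicing a request. */
    static void
    request_begin(struct server *sv, int ordered)
    {
            pthread_mutex_lock(&sv->mtx);
            if (ordered) {
                    while (sv->barrier)             /* one barrier at a time */
                            pthread_cond_wait(&sv->cv, &sv->mtx);
                    sv->barrier = 1;                /* later requests now wait */
                    while (sv->inflight > 0)        /* earlier requests finish */
                            pthread_cond_wait(&sv->cv, &sv->mtx);
            } else {
                    while (sv->barrier)             /* don't overtake the barrier */
                            pthread_cond_wait(&sv->cv, &sv->mtx);
            }
            sv->inflight++;
            pthread_mutex_unlock(&sv->mtx);
    }

    /* Called by a worker thread after a request has been serviced. */
    static void
    request_end(struct server *sv, int ordered)
    {
            pthread_mutex_lock(&sv->mtx);
            sv->inflight--;
            if (ordered)
                    sv->barrier = 0;
            pthread_cond_broadcast(&sv->cv);
            pthread_mutex_unlock(&sv->mtx);
    }

    /* A write request; 'ordered' would mirror BIO_ORDERED on the original bio. */
    static ssize_t
    handle_write(struct server *sv, const void *buf, size_t len, off_t off,
        int ordered)
    {
            ssize_t n;

            request_begin(sv, ordered);
            n = pwrite(sv->fd, buf, len, off);
            request_end(sv, ordered);
            return (n);
    }

    /* A flush request, standing in for BIO_FLUSH. */
    static int
    handle_flush(struct server *sv, int ordered)
    {
            int error;

            request_begin(sv, ordered);
            error = fsync(sv->fd);
            request_end(sv, ordered);
            return (error);
    }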
>
> The root of this problem, I think, is the following:
>
> % man 9 bio
> No manual entry for bio
>
> I think I'll have to massage this email into an appropriate man page.
> At the very least, I should turn some/all of the above into a blog post :)
>
> Warner

The above sentence is WONDERFUL ...

In some of my messages, I have been suggesting that we:

- Make the Handbook sections and man pages into a "blog"-like system,
- Attach the related mailing list messages to those pages,
- Relay comments/questions about those pages to the mailing lists,
- After a while, or at suitable times, move the "knowledge" (meaning "what
  to do") in those messages into the related pages.

My opinion is that my ideas are not very effective. If the above sentence
can "converge" to such a structure, it may be really WONDERFUL ...

With my best wishes,

Mehmet Erol Sanliturk