Unmapped I/O
Jeff Roberson
jroberson at jroberson.net
Fri Dec 28 23:20:32 UTC 2012
On Sat, 29 Dec 2012, Konstantin Belousov wrote:
> On Fri, Dec 28, 2012 at 12:36:57PM -1000, Jeff Roberson wrote:
>> On Wed, 19 Dec 2012, Konstantin Belousov wrote:
>>
>>> One of the known FreeBSD I/O path performance bottlenecks is the
>>> necessity to map the pages of each I/O buffer into KVA. The problem
>>> is that on multi-core machines the mapping must be flushed from the
>>> TLB on all cores, because the buffer pages are mapped globally into
>>> the kernel. This means that buffer creation and destruction disrupt
>>> execution on all other cores to perform the TLB shootdown through
>>> IPIs, and the thread initiating the shootdown must wait for all
>>> other cores to execute the flush and report back.
>>>
>>> The patch at
>>> http://people.freebsd.org/~kib/misc/unmapped.4.patch
>>> implements 'unmapped buffers': the ability to create a VMIO struct
>>> buf that carries no KVA mapping of the buffer pages into the kernel
>>> address space. Since there is no mapping, the kernel does not need
>>> to flush the TLB. Unmapped buffers are marked with the new
>>> B_NOTMAPPED flag and must be requested explicitly by passing the
>>> GB_NOTMAPPED flag to the buffer allocation routines. If a mapped
>>> buffer is requested but an unmapped buffer already exists, the
>>> buffer subsystem maps the pages automatically.
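To make the buffer-cache side concrete, here is a minimal sketch of how
a filesystem could ask for such a buffer. GB_NOTMAPPED and B_NOTMAPPED
are the flag names used by the patch; vp, lbn and size are placeholders
and error handling is omitted:

struct buf *bp;

/* Ask the buffer cache not to map the pages into KVA. */
bp = getblk(vp, lbn, size, 0, 0, GB_NOTMAPPED);
if (bp->b_flags & B_NOTMAPPED) {
        /* No KVA: only the bp->b_pages[] array may be used. */
} else {
        /* A mapping already existed: bp->b_data is valid as usual. */
}
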
>>>
>>> The clustering code is also made aware of unmapped buffers, but this
>>> required a KPI change, which accounts for the diff in the non-UFS
>>> filesystems.
>>>
>>> UFS is adapted to request unmapped buffers when the kernel does not
>>> need to access the contents, i.e. mostly for file data. A new helper
>>> function, vn_io_fault_pgmove(), operates on the unmapped array of
>>> pages; it calls the new pmap method pmap_copy_pages() to move the
>>> data to and from usermode.
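As a sketch only (the patch may differ in detail), an amd64
pmap_copy_pages() can presumably walk both page arrays through the
direct map; the signature below is assumed from the description above:

void
pmap_copy_pages(vm_page_t ma[], vm_offset_t a_offset, vm_page_t mb[],
    vm_offset_t b_offset, int xfersize)
{
        char *a_cp, *b_cp;
        vm_offset_t a_pg_offset, b_pg_offset;
        int cnt;

        while (xfersize > 0) {
                a_pg_offset = a_offset & PAGE_MASK;
                b_pg_offset = b_offset & PAGE_MASK;
                cnt = min(xfersize, PAGE_SIZE - a_pg_offset);
                cnt = min(cnt, PAGE_SIZE - b_pg_offset);
                /* The direct map gives a KVA for any physical page. */
                a_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
                    ma[a_offset >> PAGE_SHIFT])) + a_pg_offset;
                b_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
                    mb[b_offset >> PAGE_SHIFT])) + b_pg_offset;
                bcopy(a_cp, b_cp, cnt);
                a_offset += cnt;
                b_offset += cnt;
                xfersize -= cnt;
        }
}
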
>>>
>>> Besides unmapped buffers, unmapped BIOs are introduced, marked with
>>> the BIO_NOTMAPPED flag. Unmapped buffers translate directly into
>>> unmapped BIOs. GEOM providers may indicate that they accept unmapped
>>> BIOs; if a provider does not handle unmapped I/O requests, GEOM now
>>> automatically establishes a transient mapping for the I/O pages.
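From a provider's point of view the split roughly looks like this;
BIO_NOTMAPPED is the flag from the patch, while the strategy routine and
the page-array field names (bio_ma, bio_ma_n) are only illustrative:

static void
xxx_strategy(struct bio *bp)
{
        if ((bp->bio_flags & BIO_NOTMAPPED) != 0) {
                /*
                 * No KVA mapping exists; only the page array
                 * (bio_ma/bio_ma_n here) describes the data, and a
                 * capable provider hands it straight to busdma.
                 */
        } else {
                /* Classic case: bio_data points at kernel VA. */
        }
}
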
>>>
>>> Swap- and malloc-backed md(4) are changed to accept unmapped BIOs.
>>> The gpart providers advertise unmapped BIO support if the underlying
>>> provider can do unmapped I/O. I also hacked ahci(4) to handle
>>> unmapped I/O, but this should be reworked to use the proper busdma
>>> interface once Jeff's physbio patch is committed.
>>>
>>> In addition, the swap pager does unmapped swapping if the swap
>>> partition indicated that it can do unmapped I/O. At Jeff's request,
>>> the buffer allocation code may reserve KVA for an unmapped buffer in
>>> advance. Unmapped page-in for the vnode pager is also implemented if
>>> the filesystem supports it, but page-out is not. Page-out, as well
>>> as vnode-backed md(4), currently requires mappings, mostly due to
>>> the use of VOP_WRITE().
>>>
>>> The patch works in my test environment, which uses ahci-attached
>>> SATA disks with GPT partitions, md(4) and UFS. I see no
>>> statistically significant difference in buildworld -j 10 times on a
>>> 4-core machine with HT. On the other hand, when running sha1 over a
>>> 5GB file, the system time was reduced by 30%.
>>>
>>> Unfinished items:
>>> - Integration with physbio; will be done after physbio is committed
>>>   to HEAD.
>>> - The key per-architecture function needed for unmapped I/O is
>>>   pmap_copy_pages(). I have implemented it for amd64 and i386; it
>>>   still needs to be done for all other architectures.
>>> - The sizing of the submap used for transient mapping of BIOs is
>>>   naive and should be adjusted, especially for KVA-lean
>>>   architectures.
>>> - Conversion of the other filesystems. Low priority.
>>>
>>> I am interested in reviews, tests and suggestions. Note that this
>>> currently only works for md(4) and ahci(4); for other drivers the
>>> patched kernel falls back to mapped I/O.
>>>
>>> sys/amd64/amd64/pmap.c | 24 +++
>>> sys/cam/ata/ata_da.c | 5 +-
>>> sys/cam/cam_ccb.h | 30 ++++
>>> sys/dev/ahci/ahci.c | 53 +++++-
>>> sys/dev/md/md.c | 255 ++++++++++++++++++++++++-----
>>> sys/fs/cd9660/cd9660_vnops.c | 2 +-
>>> sys/fs/ext2fs/ext2_balloc.c | 2 +-
>>> sys/fs/ext2fs/ext2_vnops.c | 9 +-
>>> sys/fs/msdosfs/msdosfs_vnops.c | 4 +-
>>> sys/fs/udf/udf_vnops.c | 5 +-
>>> sys/geom/geom.h | 1 +
>>> sys/geom/geom_disk.c | 2 +
>>> sys/geom/geom_disk.h | 1 +
>>> sys/geom/geom_io.c | 44 ++++-
>>> sys/geom/geom_vfs.c | 10 +-
>>> sys/geom/part/g_part.c | 1 +
>>> sys/i386/i386/pmap.c | 42 +++++
>>> sys/kern/vfs_bio.c | 356 +++++++++++++++++++++++++++++++++--------
>>> sys/kern/vfs_cluster.c | 118 +++++++-------
>>> sys/kern/vfs_vnops.c | 39 +++++
>>> sys/sys/bio.h | 7 +
>>> sys/sys/buf.h | 22 ++-
>>> sys/sys/mount.h | 1 +
>>> sys/sys/vnode.h | 2 +
>>> sys/ufs/ffs/ffs_alloc.c | 10 +-
>>> sys/ufs/ffs/ffs_balloc.c | 58 ++++---
>>> sys/ufs/ffs/ffs_vfsops.c | 3 +-
>>> sys/ufs/ffs/ffs_vnops.c | 35 ++--
>>> sys/ufs/ufs/ufs_extern.h | 1 +
>>> sys/vm/pmap.h | 2 +
>>> sys/vm/swap_pager.c | 43 +++--
>>> sys/vm/swap_pager.h | 1 +
>>> sys/vm/vm.h | 2 +
>>> sys/vm/vm_init.c | 6 +-
>>> sys/vm/vm_kern.c | 9 +-
>>> sys/vm/vnode_pager.c | 30 +++-
>>> 36 files changed, 989 insertions(+), 246 deletions(-)
>>>
>>>
>>
>> A few comments:
>>
>> 1) If the BIO is mapped at all, you have to pass the virtual address
>> to busdma. We cannot lose the fact that it is mapped somewhere, or
>> architectures with virtual caches will fail. I think when we
>> integrate with physbio we can do this properly by implementing a load
>> routine on the bio.
> This is weird. How does the buffer cache work for such architectures
> then? VMIO buffer pages can be mapped by the kernel and by any
> usermode process.
pmap_remove_write() in vfs_busy_pages() handles this. No new writes are
allowed until the I/O is complete.
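In other words (sketched here, not quoted from vfs_busy_pages() itself),
before the write is issued every page backing the buffer bp loses write
permission in all pmaps:

vm_page_t m;
int i;

for (i = 0; i < bp->b_npages; i++) {
        m = bp->b_pages[i];
        /* The real code also busies the page for the duration of the I/O. */
        pmap_remove_write(m);   /* revoke write access in every mapping */
}
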
>
> Anyway, I believe I never pass BIO_NOTMAPPED down if a kernel mapping
> for the buffer exists. I should probably document this somewhere.
OK, we just need to be sure that at every place where we make the
transition from unmapped to mapped we use the right pointer.
>
> Would you prefer to have both the mapping address and the vm_page
> array passed down when KVA is available?
Yes, I think that's best.
>>
>> 2) Why does mdstart_swap() need a cpu_flush_dcache()? Shouldn't
>> sf_bufs on the appropriate architectures do the right flushing when
>> they are unmapped?
> I believe it was added by Marcel. The sf_buf code cannot know whether
> the [d]cache should be flushed, since there is no indication from the
> caller about the kind of access to the mapping, read-only or
> read-write.
Maybe we should change the free function to indicate this. It seems
sloppy; not your fault, though.
>
>>
>> 3) I find the NOTMAPPED negative flag awkward grammatically. UNMAPPED
>> seems more natural. Or a positive MAPPED flag would be better. Minor
>> concern, bikeshedding, etc.
> So you prefer BIO/B_UNMAPPED? I will change this.
I think I'd prefer the positive flag BIO_MAPPED, but I recognize that
it's a lot of work to switch the sense throughout your patch now, so
feel free to ignore me.
>
>>
>> 4) It would be better to have some wrapper functions around the bio
>> transient map and/or sf_buf handling. I will need them to map
>> unmapped CAM CCBs in device drivers. We need to come to some
>> agreement on this API. There should be a fast page-by-page version
>> and a potentially blocking all-at-once linear version.
> No, I do not think this is needed. I thought about it and decided to
> handle it centrally in g_down instead of requiring driver code to be
> changed.
>
> A driver should indicate whether it accepts unmapped BIOs. If it does
> not, the patch already contains code in the g_down path to create the
> transient mapping. PIO-mode ATA drivers should not set the flag on the
> disk. There might be a need to dynamically clear the flag when the
> channel mode is switched to PIO.
>
> In fact, I do not see why we should not simply require always-mapped
> BIOs for legacy ATA. See DISKFLAG_NOTMAPPED_BIO.
I guess this would be acceptable for a first cut, but I don't think it
will fly long-term. I audited all of the CAM and block devices recently;
there are cases that are difficult to predict ahead of time, which would
mean some devices would always be operating in a slower mode.
Also, my primary test machines are 16-core boxes that still use legacy
ATA. That is not uncommon, and it would be a shame if many of our users
saw no benefit from this. For ATA PIO, though, we could provide an
iterator that uses the direct map or sf_bufs; that may be sufficient. It
would be nice if it were encapsulated in a simple API so that every user
didn't have to know the details of pinning, sf_bufs, and the memory
layout. What would you think of that as the in-between compromise?
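To make that concrete, the iterator could look something like the
following; all of these names are hypothetical and exist neither in the
tree nor in the patch:

struct bio_map_iter;    /* hides the sf_buf vs. direct map details */

void    bio_map_iter_init(struct bio_map_iter *it, struct bio *bp);
/* Map the next chunk; return its KVA and length, or NULL when done. */
void   *bio_map_iter_next(struct bio_map_iter *it, size_t *lenp);
void    bio_map_iter_fini(struct bio_map_iter *it);
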
>
>>
>> 5) Should the transient bio space not come from the buffer_map? We
>> don't have any KVA to spare on 32-bit architectures. Or did it come
>> from pbufs? We could consolidate somewhat here, but there are a lot
>> of potential deadlock issues.
> No, exactly because it is impossible to request buffer_map KVA
> defragmentation from g_down. The worst that can happen with the
> current code is pauses during intensive I/O when the transient map is
> exhausted or fragmented; g_down is paused until enough in-flight
> transactions have completed.
>
How is there room for this new transient map? Can't pbufs give up their
space to it completely now? Shouldn't you always allocate BKVASIZE to
avoid fragmentation and to speed up the search?
> Peter has tested it under a load where a lot of parallel I/O was
> issued and the transient map was constantly exhausted, with the ad(4)
> driver requiring fallback mappings. I added counters to watch the
> situation.
>
>>
>> 6) Thank you for adding the KVAALLOC flag. Isilon needs this internally.
> Please note that I have not tested this yet :/.
>
>>
>> 7) All of the code that exploded into getblk() should be refactored
>> into some support functions. It is hard to read, and getblk() is too
>> big already.
> Agreed.
>
>>
>> All in all, it looks like we have the right pieces coming together;
>> it just needs a little refactoring and then a lot of testing. I'm
>> almost ready to commit the first phase of physbio. It doesn't yet
>> have the code necessary to temporarily map I/O for things like ATA
>> PIO. I'm hoping that you'll
> It is there in the central place.
>
>> provide the functions to do that. It does abstract out most of the
>> details of the memory formats so that we can change them at all, and
>> it adds support to busdma for physical addresses so that bounce
>> pages, alignment, sizes, and boundaries work.
>>
>> Thanks,
>> Jeff
>
Jeff