Unmapped I/O
Jeff Roberson
jroberson at jroberson.net
Wed Dec 19 19:28:40 UTC 2012
On Wed, 19 Dec 2012, Alan Cox wrote:
> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov <kostikbel at gmail.com>wrote:
>
>> One of the known FreeBSD I/O path performance bottlenecks is the
>> necessity to map each I/O buffer's pages into KVA. The problem is that
>> on multi-core machines the mapping must flush the TLB on all cores,
>> because the buffer pages are globally mapped into the kernel. This
>> means that buffer creation and destruction disrupt execution on all
>> other cores to perform the TLB shootdown through IPIs, and the thread
>> initiating the shootdown must wait for all other cores to execute it
>> and report back.
>>
>> The patch at
>> http://people.freebsd.org/~kib/misc/unmapped.4.patch
>> implements 'unmapped buffers': the ability to create a VMIO struct
>> buf that does not point to a KVA mapping of the buffer pages at
>> kernel addresses. Since there is no mapping, the kernel does not
>> need to invalidate the TLB. Unmapped buffers are marked with the new
>> B_NOTMAPPED flag and must be requested explicitly by passing the
>> GB_NOTMAPPED flag to the buffer allocation routines. If a mapped
>> buffer is requested but an unmapped buffer already exists, the buffer
>> subsystem automatically maps the pages.
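
A minimal sketch of how a filesystem might ask for such a buffer, assuming
the GB_NOTMAPPED/B_NOTMAPPED flags described above together with the
existing getblk() interface (illustrative only, not code taken from the
patch):

static struct buf *
get_unmapped_block(struct vnode *vp, daddr_t blkno, int size)
{
    struct buf *bp;

    /* Ask for a buffer whose pages are not entered into the KVA. */
    bp = getblk(vp, blkno, size, 0, 0, GB_NOTMAPPED);
    if ((bp->b_flags & B_NOTMAPPED) != 0) {
        /*
         * No mapping exists: b_data must not be dereferenced.  The
         * data lives in bp->b_pages[] and has to be moved with
         * page-array helpers instead of bcopy() on b_data.
         */
    } else {
        /* The buffer subsystem fell back to a mapped buffer. */
    }
    return (bp);
}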
>>
>> The clustering code is also made aware of unmapped buffers, but
>> this required a KPI change, which accounts for the diff in the non-UFS
>> filesystems.
>>
>> UFS is adapted to request unmapped buffers when the kernel does not
>> need to access the content, i.e. mostly for file data. The new helper
>> function vn_io_fault_pgmove() operates on the unmapped array of pages.
>> It calls the new pmap method pmap_copy_pages() to move the data to and
>> from usermode.
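
On architectures with a direct map, pmap_copy_pages() needs no temporary
mappings at all. A rough sketch of what an amd64 version might look like;
the signature and loop are guessed from the description above and are not
necessarily what the patch contains:

void
pmap_copy_pages(vm_page_t ma[], vm_offset_t a_offset,
    vm_page_t mb[], vm_offset_t b_offset, int xfersize)
{
    void *a_cp, *b_cp;
    vm_offset_t a_pg_offset, b_pg_offset;
    int cnt;

    while (xfersize > 0) {
        /* Address the source page through the direct map. */
        a_pg_offset = a_offset & PAGE_MASK;
        cnt = min(xfersize, PAGE_SIZE - a_pg_offset);
        a_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
            ma[a_offset >> PAGE_SHIFT])) + a_pg_offset;
        /* Likewise for the destination page. */
        b_pg_offset = b_offset & PAGE_MASK;
        cnt = min(cnt, PAGE_SIZE - b_pg_offset);
        b_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
            mb[b_offset >> PAGE_SHIFT])) + b_pg_offset;
        bcopy(a_cp, b_cp, cnt);
        a_offset += cnt;
        b_offset += cnt;
        xfersize -= cnt;
    }
}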
>>
>> Besides unmapped buffers, unmapped BIOs are introduced, marked
>> with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated
>> to unmapped BIOs. Geom providers may indicate that they accept
>> unmapped BIOs. If a provider does not handle unmapped i/o requests,
>> geom now automatically establishes a transient mapping for the i/o
>> pages.
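
To make the fallback concrete, the transient mapping could look roughly
like the sketch below. The helper name, the bio fields used here (bio_ma,
bio_ma_n, bio_ma_offset) and the transient_map submap are assumptions for
illustration and may not match the patch:

static vm_map_t transient_map;  /* hypothetical submap of the kernel map */

static int
transient_map_bio(struct bio *bp)
{
    vm_offset_t kva;

    /* Borrow KVA from a submap dedicated to transient i/o mappings. */
    kva = kmem_alloc_nofault(transient_map, ptoa(bp->bio_ma_n));
    if (kva == 0)
        return (ENOMEM);
    /* Enter the pages; the mapping is private to this i/o request. */
    pmap_qenter(kva, bp->bio_ma, bp->bio_ma_n);
    bp->bio_data = (caddr_t)kva + bp->bio_ma_offset;
    bp->bio_flags &= ~BIO_NOTMAPPED;
    return (0);
}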
>>
>> The swap- and malloc-backed md(4) is changed to accept unmapped BIOs.
>> The gpart providers indicate unmapped BIO support if the underlying
>> provider can do unmapped i/o. I also hacked ahci(4) to handle
>> unmapped i/o, but this should be changed to use the proper busdma
>> interface after Jeff's physbio patch is committed.
>>
>> In addition, the swap pager does unmapped swapping if the swap
>> partition indicated that it can do unmapped i/o. At Jeff's request,
>> the buffer allocation code may reserve KVA for an unmapped buffer in
>> advance. Unmapped page-in for the vnode pager is also implemented if
>> the filesystem supports it, but page-out is not. Both page-out and
>> the vnode-backed md(4) currently require mappings, mostly due to
>> the use of VOP_WRITE().
>>
>> The patch works in my test environment, where I used ahci-attached
>> SATA disks with GPT partitions, md(4) and UFS. I see no statistically
>> significant difference in buildworld -j 10 times on a 4-core machine
>> with HT. On the other hand, when computing sha1 over a 5GB file, the
>> system time was reduced by 30%.
>>
>> Unfinished items:
>> - Integration with physbio; this will be done after physbio is
>> committed to HEAD.
>> - The key per-architecture function needed for unmapped i/o is
>> pmap_copy_pages(). I have implemented it for amd64 and i386; it
>> still needs to be done for all other architectures.
>> - The sizing of the submap used for transient mapping of BIOs is
>> naive. It should be adjusted, especially for KVA-lean architectures.
>> - Conversion of the other filesystems. Low priority.
>>
>> I am interested in reviews, tests and suggestions. Note that this
>> currently only works for md(4) and ahci(4); for other drivers the
>> patched kernel falls back to mapped i/o.
>>
>>
> Here are a couple things for you to think about:
>
> 1. A while back, I developed the patch at
> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in trying to
> reduce the number of TLB shootdowns by the buffer map. The idea is simple:
> Replace the calls to pmap_q{enter,remove}() with calls to a new
> machine-dependent function that opportunistically sets the buffer's kernel
> virtual address to the direct map for physically contiguous pages.
> However, if the pages are not physically contiguous, it calls pmap_qenter()
> with the kernel virtual address from the buffer map.
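
A small illustrative sketch of that fast path, assuming amd64's direct map
and the usual struct buf fields; this is not the code in buf_maps5.patch:

static void
buf_map_pages(struct buf *bp)
{
    vm_paddr_t pa;
    int i;

    /* Check whether the buffer's pages happen to be physically contiguous. */
    pa = VM_PAGE_TO_PHYS(bp->b_pages[0]);
    for (i = 1; i < bp->b_npages; i++)
        if (VM_PAGE_TO_PHYS(bp->b_pages[i]) != pa + ptoa(i))
            break;
    if (i == bp->b_npages) {
        /*
         * Contiguous: point b_data at the direct map.  No PTE
         * writes now, and no TLB shootdown later on unmap.
         */
        bp->b_data = (caddr_t)PHYS_TO_DMAP(pa);
    } else {
        /* Not contiguous: fall back to KVA from the buffer map. */
        pmap_qenter((vm_offset_t)bp->b_kvabase, bp->b_pages,
            bp->b_npages);
        bp->b_data = bp->b_kvabase;
    }
}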
>
> This eliminated about half of the TLB shootdowns for a buildworld, because
> there is a decent amount of physical contiguity that occurs by "accident".
> Using a buddy allocator for physical page allocation tends to promote this
> contiguity. However, in a few places, it occurs by explicit action, e.g.,
> mapped files, including large executables, using superpage reservations.
>
> So, how does this fit with what you've done? You might think of using what
> I describe above as a kind of "fast path". As you can see from the patch,
> it's very simple and non-intrusive. If the pages aren't physically
> contiguous, then instead of using pmap_qenter(), you fall back to whatever
> approach for creating ephemeral mappings is appropriate to a given
> architecture.
I think these are complementary. Kib's patch gives us the fastest
possible path for user data. Alan's patch will improve metadata
performance for things that really require the buffer cache. I see no
reason not to clean up and commit both.
>
> 2. As for managing the ephemeral mappings on machines that don't support a
> direct map. I would suggest an approach that is loosely inspired by
> copying garbage collection (or the segment cleaners in log-structured file
> systems). Roughly, you manage the buffer map as a few spaces (or
> segments). When you create a new mapping in one of these spaces (or
> segments), you simply install the PTEs. When you decide to "garbage
> collect" a space (or spaces), then you perform a global TLB flush.
> Specifically, you do something like toggling the bit in the cr4 register
> that enables/disables support for the PG_G bit. If the spaces are
> sufficiently large, then the number of such global TLB flushes should be
> quite low. Every space would have an epoch number (or flush number). In
> the buffer, you would record the epoch number alongside the kernel virtual
> address. On access to the buffer, if the epoch number was too old, then
> you have to recreate the buffer's mapping in a new space.
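
Read literally, the bookkeeping could be as small as the following sketch
(all names invented for illustration; locking, space selection and bounds
checks are omitted, and reclaiming one space conservatively invalidates
every older mapping):

struct buf_space {
    vm_offset_t base;       /* first KVA of this space */
    vm_offset_t next;       /* next free KVA within it */
    vm_offset_t limit;
};

static u_long buf_map_epoch;        /* bumped on every global flush */
static u_long oldest_live_epoch;    /* mappings older than this are stale */

/* Recorded in the buffer alongside its kernel virtual address. */
struct buf_kva {
    vm_offset_t kva;
    u_long      epoch;
};

/*
 * On access: if the recorded epoch predates the last reclamation, the
 * space holding the old PTEs may already have been reused, so re-enter
 * the pages in the current space.  The new PTEs need no shootdown;
 * stale translations are only ever discarded by the batched global
 * flush below.
 */
static vm_offset_t
buf_kva_access(struct buf_kva *m, vm_page_t *pages, int npages,
    struct buf_space *cur)
{
    if (m->epoch < oldest_live_epoch) {
        m->kva = cur->next;
        cur->next += ptoa(npages);
        pmap_qenter(m->kva, pages, npages);
        m->epoch = buf_map_epoch;
    }
    return (m->kva);
}

/*
 * "Garbage collect" a space: one global TLB flush (e.g. by toggling
 * CR4.PGE so that even PG_G entries are dropped) invalidates every
 * mapping in it at once, instead of one shootdown per buffer.
 */
static void
buf_space_reclaim(struct buf_space *sp)
{
    oldest_live_epoch = ++buf_map_epoch;
    /* ... issue the global TLB flush on all CPUs here ... */
    sp->next = sp->base;
}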
Are the machines that don't have a direct map performance critical? My
expectation is that they are legacy or embedded. This seems like a great
project to do when the rest of the pieces are stable and fast. Until then
they could just use something like pbufs?
Jeff
>
> Alan