Unmapped I/O
Konstantin Belousov
kostikbel at gmail.com
Wed Dec 19 19:28:49 UTC 2012
On Wed, Dec 19, 2012 at 12:58:41PM -0600, Alan Cox wrote:
> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov <kostikbel at gmail.com>wrote:
>
> > One of the known FreeBSD I/O path performance bottlenecks is the
> > necessity to map the pages of each I/O buffer into KVA. The problem is
> > that on multi-core machines, the mapping must flush the TLB on all
> > cores, due to the global mapping of the buffer pages into the kernel.
> > This means that buffer creation and destruction disrupt execution on
> > all other cores to perform the TLB shootdown through IPIs, and the
> > thread initiating the shootdown must wait for all other cores to
> > execute and report back.
> >
> > The patch at
> > http://people.freebsd.org/~kib/misc/unmapped.4.patch
> > implements 'unmapped buffers', i.e. the ability to create a VMIO
> > struct buf which does not carry a KVA mapping of the buffer pages into
> > the kernel address space. Since there is no mapping, the kernel does
> > not need to flush the TLB. Unmapped buffers are marked with the new
> > B_NOTMAPPED flag, and must be requested explicitly by passing the
> > GB_NOTMAPPED flag to the buffer allocation routines. If a mapped
> > buffer is requested but an unmapped buffer already exists, the buffer
> > subsystem automatically maps the pages.
> >
> > The clustering code is also made aware of unmapped buffers, but this
> > required a KPI change, which accounts for the diff in the non-UFS
> > filesystems.
> >
> > UFS is adapted to request unmapped buffers when the kernel does not
> > need to access the content, i.e. mostly for file data. The new helper
> > function vn_io_fault_pgmove() operates on the unmapped array of pages.
> > It calls the new pmap method pmap_copy_pages() to move the data to and
> > from usermode.
> >
> > Besides unmapped buffers, unmapped BIOs are introduced, marked with
> > the flag BIO_NOTMAPPED. Unmapped buffers are directly translated to
> > unmapped BIOs. GEOM providers may indicate acceptance of unmapped
> > BIOs. If a provider does not handle unmapped I/O requests, GEOM now
> > automatically establishes a transient mapping for the I/O pages.
> >
> > Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The
> > gpart providers indicate unmapped BIO support if the underlying
> > provider can do unmapped I/O. I also hacked ahci(4) to handle
> > unmapped I/O, but this should be changed to use the proper busdma
> > interface after Jeff's physbio patch is committed.
> >
> > Besides that, the swap pager does unmapped swapping if the swap
> > partition indicates that it can do unmapped I/O. At Jeff's request,
> > the buffer allocation code may reserve KVA for an unmapped buffer in
> > advance. Unmapped page-in for the vnode pager is also implemented if
> > the filesystem supports it, but page-out is not. Page-out, as well as
> > vnode-backed md(4), currently requires mappings, mostly due to the use
> > of VOP_WRITE().
> >
> > As it stands, the patch works in my test environment, where I used
> > ahci-attached SATA disks with GPT partitions, md(4) and UFS. I see no
> > statistically significant difference in buildworld -j 10 times on a
> > 4-core machine with HT. On the other hand, when doing sha1 over a
> > 5GB file, the system time was reduced by 30%.
> >
> > Unfinished items:
> > - Integration with physbio; will be done after physbio is
> > committed to HEAD.
> > - The key per-architecture function needed for unmapped I/O is
> > pmap_copy_pages(). I have implemented it for amd64 and i386; it
> > still needs to be done for all other architectures.
> > - The sizing of the submap used for transient mapping of BIOs is
> > naive. It should be adjusted, especially for KVA-lean architectures.
> > - Conversion of the other filesystems. Low priority.
> >
> > I am interested in reviews, tests and suggestions. Note that this
> > currently only works for md(4) and ahci(4); for other drivers the
> > patched kernel should fall back to mapped I/O.
> >
> >
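[The flag-driven flow described in the quoted mail can be sketched in much-simplified form. This is a hypothetical userland stand-in, not the actual kernel struct buf or getblk(); only the flag names come from the patch:]

```c
#include <stdlib.h>

/* Flag names from the patch; everything else here is illustrative. */
#define B_NOTMAPPED  0x0001   /* buffer carries no KVA mapping */
#define GB_NOTMAPPED 0x0001   /* caller tolerates an unmapped buffer */

struct buf {
    void *b_kva;     /* NULL while the buffer is unmapped */
    int   b_flags;
    void *b_pages;   /* stand-in for the wired VM pages */
};

/*
 * Stand-in for mapping the buffer pages into the kernel map -- the
 * expensive step that triggers TLB shootdown IPIs on real hardware.
 */
static void map_buffer(struct buf *bp)
{
    bp->b_kva = bp->b_pages;
    bp->b_flags &= ~B_NOTMAPPED;
}

/* Return a buffer; map it only when the caller did not pass GB_NOTMAPPED. */
struct buf *get_buf(int gbflags)
{
    struct buf *bp = calloc(1, sizeof(*bp));
    bp->b_pages = malloc(4096);
    bp->b_flags = B_NOTMAPPED;
    if ((gbflags & GB_NOTMAPPED) == 0)
        map_buffer(bp);   /* mapped buffer requested: map immediately */
    return bp;
}
```

[A caller that later needs KVA access would test B_NOTMAPPED and map the pages, mirroring the automatic mapping the buffer subsystem performs when a mapped buffer is requested but an unmapped one already exists.]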
> Here are a couple things for you to think about:
>
> 1. A while back, I developed the patch at
> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in trying to
> reduce the number of TLB shootdowns by the buffer map. The idea is simple:
> Replace the calls to pmap_q{enter,remove}() with calls to a new
> machine-dependent function that opportunistically sets the buffer's kernel
> virtual address to the direct map for physically contiguous pages.
> However, if the pages are not physically contiguous, it calls pmap_qenter()
> with the kernel virtual address from the buffer map.
>
> This eliminated about half of the TLB shootdowns for a buildworld, because
> there is a decent amount of physical contiguity that occurs by "accident".
> Using a buddy allocator for physical page allocation tends to promote this
> contiguity. However, in a few places, it occurs by explicit action, e.g.,
> mapped files, including large executables, using superpage reservations.
>
> So, how does this fit with what you've done? You might think of using what
> I describe above as a kind of "fast path". As you can see from the patch,
> it's very simple and non-intrusive. If the pages aren't physically
> contiguous, then instead of using pmap_qenter(), you fall back to whatever
> approach for creating ephemeral mappings is appropriate to a given
> architecture.
I remember this.
I did not measure the change in the number of IPIs issued during a
buildworld, but I do account for the mapped/unmapped buffer space in
the patch. Under the buildworld load, mapped buffers make up 5-10% of
all buffers, which coincides with the intuitive size of the metadata
for the sources. Since unmapped buffers eliminate the IPIs at creation
and reuse, I can safely guess that the IPI reduction is of a comparable
magnitude.
The pmap_map_buf() patch is orthogonal to the work I did, and it should
nicely reduce the overhead of handling the metadata buffers. I can
finish it, if you want. I do not think it should be added to the already
large patch; instead it could be done and committed separately.
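[As I read buf_maps5.patch, the fast path amounts to something like the following. This is an illustrative userland sketch; the function name comes from the discussion, but the signature and the direct-map constant are assumptions, not the patch's actual code:]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096
/* amd64-style direct map base; illustrative value only. */
#define DMAP_BASE 0xfffff80000000000ULL

/*
 * Opportunistic mapping: if the physical page run is contiguous, reuse
 * the direct map (no new PTEs, hence no TLB shootdown). Otherwise report
 * failure, and the caller falls back to pmap_qenter() with buffer-map KVA.
 */
bool pmap_map_buf(const uint64_t *pa, size_t npages, uint64_t *kva)
{
    for (size_t i = 1; i < npages; i++)
        if (pa[i] != pa[0] + i * PAGE_SIZE)
            return false;          /* not contiguous: use pmap_qenter() */
    *kva = DMAP_BASE + pa[0];      /* contiguous: direct-map address */
    return true;
}
```

[The effectiveness of this fast path then depends on how much accidental physical contiguity the buddy allocator and superpage reservations produce, as described above.]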
>
> 2. As for managing the ephemeral mappings on machines that don't support a
> direct map. I would suggest an approach that is loosely inspired by
> copying garbage collection (or the segment cleaners in log-structured file
> systems). Roughly, you manage the buffer map as a few spaces (or
> segments). When you create a new mapping in one of these spaces (or
> segments), you simply install the PTEs. When you decide to "garbage
> collect" a space (or spaces), then you perform a global TLB flush.
> Specifically, you do something like toggling the bit in the cr4 register
> that enables/disables support for the PG_G bit. If the spaces are
> sufficiently large, then the number of such global TLB flushes should be
> quite low. Every space would have an epoch number (or flush number). In
> the buffer, you would record the epoch number alongside the kernel virtual
> address. On access to the buffer, if the epoch number was too old, then
> you have to recreate the buffer's mapping in a new space.
Could you, please, describe the idea in more detail? For which mappings
should the described mechanism be used?
Do you mean the pmap_copy_pages() implementation, or the fallback
mappings for BIOs?
Note that the pmap_copy_pages() implementation on i386 is shamelessly
stolen from pmap_copy_page() and uses the per-CPU ephemeral mappings for
copying.
For BIOs this might be used, but I am also quite satisfied with the
submap and pmap_qenter().
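[For reference, the per-CPU copying scheme amounts to roughly this. It is a userland simulation with plain pointers standing in for the mapping windows; the real i386 code uses reserved per-CPU PTE slots and a CPU-local invlpg, never an IPI:]

```c
#include <string.h>

/*
 * Simulated per-CPU mapping window. In the kernel, mapping a page here
 * would install a PTE in a per-CPU slot and invalidate only the local
 * TLB entry; in this sketch the "window" is just the page itself.
 */
static char *percpu_window_map(char *page) { return page; }
static void percpu_window_unmap(void) { /* local invlpg only */ }

/* Copy a fragment between two pages through the per-CPU windows. */
void copy_page_fragment(char *src_page, size_t src_off,
                        char *dst_page, size_t dst_off, size_t len)
{
    char *s = percpu_window_map(src_page);
    char *d = percpu_window_map(dst_page);
    memcpy(d + dst_off, s + src_off, len);
    percpu_window_unmap();
}
```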