Unmapped I/O
Jeff Roberson
jroberson at jroberson.net
Wed Dec 19 19:28:40 UTC 2012
On Wed, 19 Dec 2012, Alan Cox wrote:
> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov <kostikbel at gmail.com>wrote:
>
>> One of the known FreeBSD I/O path performance bottlenecks is the
>> necessity to map each I/O buffer's pages into KVA. The problem is that
>> on multi-core machines the mapping must flush the TLB on all cores,
>> because the buffer pages are globally mapped into the kernel. This
>> means that buffer creation and destruction disrupt execution on all
>> other cores to perform the TLB shootdown through IPIs, and the thread
>> initiating the shootdown must wait for all other cores to execute it
>> and report back.
>>
>> The patch at
>> http://people.freebsd.org/~kib/misc/unmapped.4.patch
>> implements 'unmapped buffers': the ability to create a VMIO struct
>> buf that does not point to a KVA mapping of the buffer pages at
>> kernel addresses. Since there is no mapping, the kernel does not
>> need to invalidate the TLB. Unmapped buffers are marked with the new
>> B_NOTMAPPED flag and must be requested explicitly by passing the
>> GB_NOTMAPPED flag to the buffer allocation routines. If a mapped
>> buffer is requested but an unmapped buffer already exists, the buffer
>> subsystem automatically maps the pages.
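
A minimal sketch of how a filesystem might ask for such a buffer, assuming
the GB_NOTMAPPED/B_NOTMAPPED flags described above together with the
existing getblk() interface (illustrative only, not code taken from the
patch):

static struct buf *
get_unmapped_block(struct vnode *vp, daddr_t blkno, int size)
{
    struct buf *bp;

    /* Ask for a buffer whose pages are not entered into the KVA. */
    bp = getblk(vp, blkno, size, 0, 0, GB_NOTMAPPED);
    if ((bp->b_flags & B_NOTMAPPED) != 0) {
        /*
         * No mapping exists: b_data must not be dereferenced.  The
         * data lives in bp->b_pages[] and has to be moved with
         * page-array helpers instead of bcopy() on b_data.
         */
    } else {
        /* The buffer subsystem fell back to a mapped buffer. */
    }
    return (bp);
}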
>>
>> The clustering code is also made aware of unmapped buffers, but
>> this required a KPI change, which accounts for the diff in the non-UFS
>> filesystems.
>>
>> UFS is adapted to request unmapped buffers when the kernel does not
>> need to access the content, i.e. mostly for file data. The new helper
>> function vn_io_fault_pgmove() operates on the unmapped array of pages.
>> It calls the new pmap method pmap_copy_pages() to move the data to and
>> from usermode.
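
On architectures with a direct map, pmap_copy_pages() needs no temporary
mappings at all. A rough sketch of what an amd64 version might look like;
the signature and loop are guessed from the description above and are not
necessarily what the patch contains:

void
pmap_copy_pages(vm_page_t ma[], vm_offset_t a_offset,
    vm_page_t mb[], vm_offset_t b_offset, int xfersize)
{
    void *a_cp, *b_cp;
    vm_offset_t a_pg_offset, b_pg_offset;
    int cnt;

    while (xfersize > 0) {
        /* Address the source page through the direct map. */
        a_pg_offset = a_offset & PAGE_MASK;
        cnt = min(xfersize, PAGE_SIZE - a_pg_offset);
        a_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
            ma[a_offset >> PAGE_SHIFT])) + a_pg_offset;
        /* Likewise for the destination page. */
        b_pg_offset = b_offset & PAGE_MASK;
        cnt = min(cnt, PAGE_SIZE - b_pg_offset);
        b_cp = (char *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(
            mb[b_offset >> PAGE_SHIFT])) + b_pg_offset;
        bcopy(a_cp, b_cp, cnt);
        a_offset += cnt;
        b_offset += cnt;
        xfersize -= cnt;
    }
}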
>>
>> Besides unmapped buffers, unmapped BIOs are introduced, marked
>> with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated
>> to unmapped BIOs. Geom providers may indicate that they accept
>> unmapped BIOs. If a provider does not handle unmapped i/o requests,
>> geom now automatically establishes a transient mapping for the i/o
>> pages.
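
To make the fallback concrete, the transient mapping could look roughly
like the sketch below. The helper name, the bio fields used here (bio_ma,
bio_ma_n, bio_ma_offset) and the transient_map submap are assumptions for
illustration and may not match the patch:

static vm_map_t transient_map;  /* hypothetical submap of the kernel map */

static int
transient_map_bio(struct bio *bp)
{
    vm_offset_t kva;

    /* Borrow KVA from a submap dedicated to transient i/o mappings. */
    kva = kmem_alloc_nofault(transient_map, ptoa(bp->bio_ma_n));
    if (kva == 0)
        return (ENOMEM);
    /* Enter the pages; the mapping is private to this i/o request. */
    pmap_qenter(kva, bp->bio_ma, bp->bio_ma_n);
    bp->bio_data = (caddr_t)kva + bp->bio_ma_offset;
    bp->bio_flags &= ~BIO_NOTMAPPED;
    return (0);
}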
>>
>> The swap- and malloc-backed md(4) is changed to accept unmapped BIOs.
>> The gpart providers indicate unmapped BIO support if the underlying
>> provider can do unmapped i/o. I also hacked ahci(4) to handle
>> unmapped i/o, but this should be changed to use the proper busdma
>> interface after Jeff's physbio patch is committed.
>>
>> In addition, the swap pager does unmapped swapping if the swap
>> partition indicated that it can do unmapped i/o. At Jeff's request,
>> the buffer allocation code may reserve KVA for an unmapped buffer in
>> advance. Unmapped page-in for the vnode pager is also implemented if
>> the filesystem supports it, but page-out is not. Both page-out and
>> the vnode-backed md(4) currently require mappings, mostly due to
>> the use of VOP_WRITE().
>>
>> The patch works in my test environment, where I used ahci-attached
>> SATA disks with GPT partitions, md(4) and UFS. I see no statistically
>> significant difference in buildworld -j 10 times on a 4-core machine
>> with HT. On the other hand, when computing sha1 over a 5GB file, the
>> system time was reduced by 30%.
>>
>> Unfinished items:
>> - Integration with physbio; this will be done after physbio is
>> committed to HEAD.
>> - The key per-architecture function needed for unmapped i/o is
>> pmap_copy_pages(). I have implemented it for amd64 and i386; it
>> still needs to be done for all other architectures.
>> - The sizing of the submap used for transient mapping of BIOs is
>> naive. It should be adjusted, especially for KVA-lean architectures.
>> - Conversion of the other filesystems. Low priority.
>>
>> I am interested in reviews, tests and suggestions. Note that this
>> currently only works for md(4) and ahci(4); for other drivers the
>> patched kernel falls back to mapped i/o.
>>
>>
> Here are a couple things for you to think about:
>
> 1. A while back, I developed the patch at
> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in trying to
> reduce the number of TLB shootdowns by the buffer map. The idea is simple:
> Replace the calls to pmap_q{enter,remove}() with calls to a new
> machine-dependent function that opportunistically sets the buffer's kernel
> virtual address to the direct map for physically contiguous pages.
> However, if the pages are not physically contiguous, it calls pmap_qenter()
> with the kernel virtual address from the buffer map.
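
A small illustrative sketch of that fast path, assuming amd64's direct map
and the usual struct buf fields; this is not the code in buf_maps5.patch:

static void
buf_map_pages(struct buf *bp)
{
    vm_paddr_t pa;
    int i;

    /* Check whether the buffer's pages happen to be physically contiguous. */
    pa = VM_PAGE_TO_PHYS(bp->b_pages[0]);
    for (i = 1; i < bp->b_npages; i++)
        if (VM_PAGE_TO_PHYS(bp->b_pages[i]) != pa + ptoa(i))
            break;
    if (i == bp->b_npages) {
        /*
         * Contiguous: point b_data at the direct map.  No PTE
         * writes now, and no TLB shootdown later on unmap.
         */
        bp->b_data = (caddr_t)PHYS_TO_DMAP(pa);
    } else {
        /* Not contiguous: fall back to KVA from the buffer map. */
        pmap_qenter((vm_offset_t)bp->b_kvabase, bp->b_pages,
            bp->b_npages);
        bp->b_data = bp->b_kvabase;
    }
}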
>
> This eliminated about half of the TLB shootdowns for a buildworld, because
> there is a decent amount of physical contiguity that occurs by "accident".
> Using a buddy allocator for physical page allocation tends to promote this
> contiguity. However, in a few places, it occurs by explicit action, e.g.,
> mapped files, including large executables, using superpage reservations.
>
> So, how does this fit with what you've done? You might think of using what
> I describe above as a kind of "fast path". As you can see from the patch,
> it's very simple and non-intrusive. If the pages aren't physically
> contiguous, then instead of using pmap_qenter(), you fall back to whatever
> approach for creating ephemeral mappings is appropriate to a given
> architecture.
I think these are complementary. Kib's patch gives us the fastest
possible path for user data. Alan's patch will improve metadata
performance for things that really require the buffer cache. I see no
reason not to clean up and commit both.
>
> 2. As for managing the ephemeral mappings on machines that don't support a
> direct map. I would suggest an approach that is loosely inspired by
> copying garbage collection (or the segment cleaners in log-structured file
> systems). Roughly, you manage the buffer map as a few spaces (or
> segments). When you create a new mapping in one of these spaces (or
> segments), you simply install the PTEs. When you decide to "garbage
> collect" a space (or spaces), then you perform a global TLB flush.
> Specifically, you do something like toggling the bit in the cr4 register
> that enables/disables support for the PG_G bit. If the spaces are
> sufficiently large, then the number of such global TLB flushes should be
> quite low. Every space would have an epoch number (or flush number). In
> the buffer, you would record the epoch number alongside the kernel virtual
> address. On access to the buffer, if the epoch number was too old, then
> you have to recreate the buffer's mapping in a new space.
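
Read literally, the bookkeeping could be as small as the following sketch
(all names invented for illustration; locking, space selection and bounds
checks are omitted, and reclaiming one space conservatively invalidates
every older mapping):

struct buf_space {
    vm_offset_t base;       /* first KVA of this space */
    vm_offset_t next;       /* next free KVA within it */
    vm_offset_t limit;
};

static u_long buf_map_epoch;        /* bumped on every global flush */
static u_long oldest_live_epoch;    /* mappings older than this are stale */

/* Recorded in the buffer alongside its kernel virtual address. */
struct buf_kva {
    vm_offset_t kva;
    u_long      epoch;
};

/*
 * On access: if the recorded epoch predates the last reclamation, the
 * space holding the old PTEs may already have been reused, so re-enter
 * the pages in the current space.  The new PTEs need no shootdown;
 * stale translations are only ever discarded by the batched global
 * flush below.
 */
static vm_offset_t
buf_kva_access(struct buf_kva *m, vm_page_t *pages, int npages,
    struct buf_space *cur)
{
    if (m->epoch < oldest_live_epoch) {
        m->kva = cur->next;
        cur->next += ptoa(npages);
        pmap_qenter(m->kva, pages, npages);
        m->epoch = buf_map_epoch;
    }
    return (m->kva);
}

/*
 * "Garbage collect" a space: one global TLB flush (e.g. by toggling
 * CR4.PGE so that even PG_G entries are dropped) invalidates every
 * mapping in it at once, instead of one shootdown per buffer.
 */
static void
buf_space_reclaim(struct buf_space *sp)
{
    oldest_live_epoch = ++buf_map_epoch;
    /* ... issue the global TLB flush on all CPUs here ... */
    sp->next = sp->base;
}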
Are the machines that don't have a direct map performance critical? My
expectation is that they are legacy or embedded. This seems like a great
project to do when the rest of the pieces are stable and fast. Until then
they could just use something like pbufs?
Jeff
>
> Alan