superpages for UMA

Konstantin Belousov kostikbel at gmail.com
Mon Aug 18 18:39:31 UTC 2014


On Mon, Aug 18, 2014 at 07:03:05PM +0400, Alexander V. Chernikov wrote:
> Hello list.
> 
> Currently UMA(9) uses PAGE_SIZE kegs to store items in.
> This is fine for most usage scenarios; however, there are some where a
> very large number of items is required.
> 
> I've run into this problem while using ipfw tables (radix based) with
> ~50k records. This is what
> `pmcstat -TS DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK -w1` looks like:
> PMC: [DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK] Samples: 2359 (100.0%) , 0 unresolved
> 
> %SAMP IMAGE      FUNCTION             CALLERS
>   28.7 kernel     rn_match             ipfw_lookup_table:21.7 rtalloc_fib_nolock:7.0
>   25.5 ipfw.ko    ipfw_chk             ipfw_check_hook
>    6.0 kernel     rn_lookup            ipfw_lookup_table
> 
> Some numbers: a table entry occupies 128 bytes, so we can store no more
> than 30 records in a single page-sized keg.
> 50k records require more than 1500 kegs.
> As far as I understand, the second-level TLB on modern Intel CPUs has
> 256 or 512 entries (for 4K pages), so using a large number of entries
> results in TLB misses happening constantly.
> 
> Other examples:
> Route tables (in the current implementation): struct rte occupies more
> than 128 bytes, and storing a full view (> 500k routes) would result in
> TLB misses happening all of the time.
> Various stateful packet processing: a modern SLB/firewall can have
> millions of states. Regardless of state size, PAGE_SIZE'd kegs are not
> the best choice.
> 
> All of these can be addressed:
> Ipfw tables/ipfw dynamic state allocation code can (and will) be
> rewritten to use uma+uma_zone_set_allocf (suggested by glebius),
> radix should simply be changed to a different lookup algorithm (as is
> happening in ipfw tables).
> 
> However, we may consider adding another UMA flag to allocate
> 2M/1G-sized kegs on request.
> (Additionally, the Intel Haswell arch has 512 entries in the STLB,
> shared? between 4k/2M, so it should help the former).
> 
> What do you think?
> 
Zones with small object sizes use uma_small_alloc() to request a
physical page and its KVA mapping. On amd64, uma_small_alloc() allocates
a physical page and returns the direct map address for the page. The
direct map is built from large pages (2MB, or 1GB if available). In this
sense, your allocations already use large pages for virtual memory
translations.
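
As a rough illustration of why a direct map address is already covered
by a large-page translation, here is a minimal userspace sketch. The
base constant and the sketch_* names are assumptions made for this
example; the kernel's own macro is PHYS_TO_DMAP(), and the real base
(DMAP_MIN_ADDRESS) depends on the release.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: illustrative direct map base, not taken from any header. */
#define SKETCH_DMAP_BASE 0xfffff80000000000ULL
#define SKETCH_2M_SHIFT  21   /* log2(2MB) */

/* Direct map: the KVA is the physical address at a fixed offset. */
static inline uint64_t
sketch_phys_to_dmap(uint64_t pa)
{
    /* The assumed base has its low 43 bits clear, so OR acts as ADD. */
    assert(pa < (1ULL << 43));
    return (SKETCH_DMAP_BASE | pa);
}

/* Index of the 2MB large page that translates a direct map address. */
static inline uint64_t
sketch_dmap_2m_index(uint64_t va)
{
    assert(va >= SKETCH_DMAP_BASE);
    return ((va - SKETCH_DMAP_BASE) >> SKETCH_2M_SHIFT);
}
```

Two 4K physical pages that fall inside the same 2MB-aligned physical
range get the same index here, i.e. they can share one large-page TLB
entry; pages from different 2MB ranges cannot.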

Zones are not local in the KVA, i.e. objects from the same zone are
usually far apart in the KVA.  Zones do not get dedicated submaps to
contain the zone-owned pages.
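
To make the locality point concrete, the following sketch counts how
many distinct 2MB regions a set of page addresses touches. The function
name and the quadratic scan are choices made for this example only;
nothing like it exists in UMA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_2M_SHIFT 21   /* log2(2MB) */

/*
 * Count the distinct 2MB-aligned regions covering the given addresses.
 * Each distinct region needs its own large-page TLB entry.
 * O(n^2) on purpose: simple, and fine for a small illustration.
 */
static inline size_t
sketch_count_2m_regions(const uint64_t *addrs, size_t n)
{
    size_t distinct = 0;

    assert(addrs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int seen = 0;

        for (size_t j = 0; j < i; j++) {
            if ((addrs[j] >> SKETCH_2M_SHIFT) ==
                (addrs[i] >> SKETCH_2M_SHIFT)) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return (distinct);
}
```

Pages packed into one 2MB range yield a count of 1, while the same
number of pages spread over memory can yield one region per page, which
is the situation described above for zone-owned pages.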

Note that the large-page TLB is usually relatively small.  E.g. on my
Nehalem machine, it has only 32 entries that can hold 2MB pages,
which gives 64MB of cached address space translations in
the best case.  You might try reducing the available memory to
see increased locality and a better DTLB hit ratio, if your load
can survive with less memory.
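
The numbers in this thread can be checked with a small sketch of the
arithmetic. The helper names are hypothetical; the inputs (32 entries
of 2MB, 50k records at about 30 per page) are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Best-case address space covered by a TLB: entries * page size. */
static inline uint64_t
sketch_tlb_reach(uint64_t entries, uint64_t page_size)
{
    return (entries * page_size);
}

/* Pages needed to hold nitems at per_page items each (rounded up). */
static inline uint64_t
sketch_pages_needed(uint64_t nitems, uint64_t per_page)
{
    assert(per_page != 0);
    return ((nitems + per_page - 1) / per_page);
}
```

With these, sketch_tlb_reach(32, 2MB) gives 64MB, and
sketch_pages_needed(50000, 30) gives 1667 pages, consistent with the
"more than 1500" estimate in the original message.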

More information about the freebsd-arch mailing list