About Transparent Superpages and Non-transparent superpages
Cedric Blancher
cedric.blancher at gmail.com
Sat Sep 21 01:16:07 UTC 2013
[repost, the previous email was stuck because I used an old email address]
On 21 September 2013 03:09, Cedric Blancher <cedric.blancher at gmail.com> wrote:
> On 20 September 2013 17:20, Sebastian Kuzminsky <S.Kuzminsky at f5.com> wrote:
>> On Sep 19, 2013, at 22:06 , Patrick Dung wrote:
>>
>>> >We at Line Rate (now F5) are developing support for 1 Gig superpages on amd64. We're basing our work on 9.1.0 for now.
>>> >
>>> >An early preview is available here:
>>> >
>>> >https://github.com/Seb-LineRate/freebsd/tree/freebsd-9.1.0-1gig-pages-NOT-READY-2
>>>
>>> That is cool.
>>>
>>> What type of applications can take advantage of the 1Gb page size?
>>> And is it transparent? Or applications need to be modified?
>>
>> It's transparent for the kernel: all of UMA and kmem_malloc()/kmem_free() is backed by 1 gig superpages.
>>
>> It's not transparent for userspace: applications need to pass a new flag to mmap() to get 1 gig pages.
>
> That may be the wrong approach. What happens if x86 gets more
> huge/largepage sizes like SPARC does (hint: Sign NDA with Intel and
> AMD and get surprised, and then allocate 16 more bits for mmap() if
> you wish to stick with your approach)? For example SPARC64 does 8k,
> 64k, 512k, 4M, 32M, 256M, 2GB and 256GB pages (actual page sizes
> differ from MMU to MMU implementation, and can be probed via pagesize
> -a).
>
> A much better option would be to follow the Solaris API, which has
> calls to enumerate the available page sizes and then set one for the
> heap, the stack or a given address range (the last is how large
> pages get used for file I/O via mmap()).
>
> For example ksh93 uses this to use 64k pages for the stack (this
> mainly aims at SPARC where 64k stack pages can be a real performance
> booster if you shuffle a lot of strings via stack):
> -----------
> int main(int argc, char *argv[])
> {
> #if _lib_memcntl
>     /* advise larger stack size */
>     struct memcntl_mha mha;
>     mha.mha_cmd = MHA_MAPSIZE_STACK;
>     mha.mha_flags = 0;
>     mha.mha_pagesize = 64 * 1024;
>     (void)memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
> #endif
>     return(sh_main(argc, argv, (Shinit_f)0));
> }
> -----------
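
To make the enumerate-then-set idea concrete, here is a minimal sketch (not ksh93 code) that picks a page size from what the MMU actually offers instead of hard-coding 64k. It assumes getpagesizes() from <sys/mman.h> is available on Solaris and FreeBSD and falls back to sysconf(_SC_PAGESIZE) elsewhere; probe_page_sizes(), best_page_size() and MAX_SIZES are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#if defined(__sun) || defined(__FreeBSD__)
#include <sys/mman.h>   /* getpagesizes() */
#endif

#define MAX_SIZES 16    /* arbitrary cap chosen for this sketch */

/* Fill sizes[] with the supported page sizes; returns how many were stored. */
static int probe_page_sizes(size_t sizes[], int nelem)
{
    assert(sizes != NULL);
    if (nelem <= 0)
        return 0;
#if defined(__sun) || defined(__FreeBSD__)
    int n = getpagesizes(sizes, nelem);
    if (n > 0)
        return n;
#endif
    /* Fallback: only the base page size is known. */
    long base = sysconf(_SC_PAGESIZE);
    sizes[0] = (base > 0) ? (size_t)base : 4096;
    return 1;
}

/* Largest entry in sizes[0..n) that does not exceed want; 0 if none fits. */
static size_t best_page_size(const size_t sizes[], int n, size_t want)
{
    assert(sizes != NULL);
    size_t best = 0;
    for (int i = 0; i < n; i++)
        if (sizes[i] <= want && sizes[i] > best)
            best = sizes[i];
    return best;
}
```

The value returned by best_page_size(sizes, n, 64 * 1024) would go into mha.mha_pagesize in the snippet above. If it returns 0, passing 0 through is still valid: per the manpage below, that lets the system choose the mapping size.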
>
> Below is the memcntl(2) manpage describing the API:
> ---------------------------------------
>
>
>
> System Calls memcntl(2)
>
>
>
> NAME
> memcntl - memory management control
>
> SYNOPSIS
> #include <sys/types.h>
> #include <sys/mman.h>
>
> int memcntl(caddr_t addr, size_t len, int cmd, caddr_t arg,
>     int attr, int mask);
>
>
> DESCRIPTION
> The memcntl() function allows the calling process to apply a
> variety of control operations over the address space
> identified by the mappings established for the address range
> [addr, addr + len).
>
>
> The addr argument must be a multiple of the pagesize as
> returned by sysconf(3C). The scope of the control operations
> can be further defined with additional selection criteria
> (in the form of attributes) according to the bit pattern
> contained in attr.
>
>
> The following attributes specify page mapping selection cri-
> teria:
>
> SHARED Page is mapped shared.
>
>
> PRIVATE Page is mapped private.
>
>
>
> The following attributes specify page protection selection
> criteria. The selection criteria are constructed by a bit-
> wise OR operation on the attribute bits and must match
> exactly.
>
> PROT_READ Page can be read.
>
>
> PROT_WRITE Page can be written.
>
>
> PROT_EXEC Page can be executed.
>
>
>
> The following criteria may also be specified:
>
> PROC_TEXT Process text.
>
>
> PROC_DATA Process data.
>
>
>
> The PROC_TEXT attribute specifies all privately mapped seg-
> ments with read and execute permission, and the PROC_DATA
> attribute specifies all privately mapped segments with write
> permission.
>
>
> Selection criteria can be used to describe various abstract
> memory objects within the address space on which to operate.
> If an operation shall not be constrained by the selection
> criteria, attr must have the value 0.
>
>
> The operation to be performed is identified by the argument
> cmd. The symbolic names for the operations are defined in
> <sys/mman.h> as follows:
>
> MC_LOCK
>
> Lock in memory all pages in the range with attributes
> attr. A given page may be locked multiple times through
> different mappings; however, within a given mapping,
> page locks do not nest. Multiple lock operations on the
> same address in the same process will all be removed
> with a single unlock operation. A page locked in one
> process and mapped in another (or visible through a dif-
> ferent mapping in the locking process) is locked in
> memory as long as the locking process does neither an
> implicit nor explicit unlock operation. If a locked map-
> ping is removed, or a page is deleted through file remo-
> val or truncation, an unlock operation is implicitly
> performed. If a writable MAP_PRIVATE page in the address
> range is changed, the lock will be transferred to the
> private page.
>
> The arg argument is not used, but must be 0 to ensure
> compatibility with potential future enhancements.
>
>
> MC_LOCKAS
>
> Lock in memory all pages mapped by the address space
> with attributes attr. The addr and len arguments are not
> used, but must be NULL and 0 respectively, to ensure
> compatibility with potential future enhancements. The
> arg argument is a bit pattern built from the flags:
>
> MCL_CURRENT Lock current mappings.
>
>
> MCL_FUTURE Lock future mappings.
>
> The value of arg determines whether the pages to be
> locked are those currently mapped by the address space,
> those that will be mapped in the future, or both. If
> MCL_FUTURE is specified, then all mappings subsequently
> added to the address space will be locked, provided suf-
> ficient memory is available.
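
The same two flags are accepted by mlockall(3C), which the manpage lists under SEE ALSO and which is the portable way to ask for this. A minimal sketch follows; the wrapper names are invented for the example, and the call may well fail (EPERM, ENOMEM or EAGAIN are typical) for an unprivileged process or one that runs into its locked-memory limit.

```c
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>

/*
 * Lock current and/or future mappings of the calling process.
 * flags must be built only from MCL_CURRENT and MCL_FUTURE.
 * Returns 0 on success, or the errno value reported by mlockall().
 */
static int lock_address_space(int flags)
{
    assert(flags != 0 && (flags & ~(MCL_CURRENT | MCL_FUTURE)) == 0);
    return (mlockall(flags) == 0) ? 0 : errno;
}

/* Drop all address space locks; returns 0 or the errno value. */
static int unlock_address_space(void)
{
    return (munlockall() == 0) ? 0 : errno;
}
```

Returning the errno value rather than -1 keeps the caller from having to read errno after an intervening library call has possibly overwritten it.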
>
>
> MC_SYNC
>
> Write to their backing storage locations all modified
> pages in the range with attributes attr. Optionally,
> invalidate cache copies. The backing storage for a
> modified MAP_SHARED mapping is the file the page is mapped
> to; the backing storage for a modified MAP_PRIVATE mapping
> is its swap area. The arg argument is a bit pattern
> built from the flags used to control the behavior of the
> operation:
>
> MS_ASYNC Perform asynchronous writes.
>
>
> MS_SYNC Perform synchronous writes.
>
>
> MS_INVALIDATE Invalidate mappings.
>
> MS_ASYNC Return immediately once all write operations
> are scheduled; with MS_SYNC the function will not return
> until all write operations are completed.
>
> MS_INVALIDATE Invalidate all cached copies of data in
> memory, so that further references to the pages will be
> obtained by the system from their backing storage loca-
> tions. This operation should be used by applications
> that require a memory object to be in a known state.
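
For the MAP_SHARED case the portable spelling is msync(3C), also listed under SEE ALSO. The sketch below writes through a shared file mapping and flushes it with MS_SYNC. The function name and the choice to create or truncate the file are assumptions made for this example; none of it is part of memcntl().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or truncate) path, copy len bytes into a MAP_SHARED mapping of it
 * and flush them to the file with a synchronous msync().
 * Returns 0 on success, -1 on failure with errno left from the failing call.
 */
static int write_through_mapping(const char *path, const void *data, size_t len)
{
    assert(path != NULL && data != NULL && len > 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int rc = -1;
    int saved;
    /* The file must cover the mapped range before it is touched. */
    if (ftruncate(fd, (off_t)len) == 0) {
        void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            memcpy(map, data, len);
            /* MS_SYNC: do not return until the write has completed. */
            rc = msync(map, len, MS_SYNC);
            saved = errno;
            munmap(map, len);
            errno = saved;
        }
    }
    saved = errno;
    close(fd);
    errno = saved;
    return rc;
}
```

Swapping MS_SYNC for MS_ASYNC would only schedule the write and return immediately, as described above.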
>
>
> MC_UNLOCK
>
> Unlock all pages in the range with attributes attr. The
> arg argument is not used, but must be 0 to ensure
> compatibility with potential future enhancements.
>
>
> MC_UNLOCKAS
>
> Remove address space memory locks and locks on all pages
> in the address space with attributes attr. The addr, len,
> and arg arguments are not used, but must be NULL, 0 and 0,
> respectively, to ensure compatibility with potential
> future enhancements.
>
>
> MC_HAT_ADVISE
>
> Advise system how a region of user-mapped memory will be
> accessed. The arg argument is interpreted as a "struct
> memcntl_mha *". The following members are defined in a
> struct memcntl_mha:
>
> uint_t mha_cmd;
> uint_t mha_flags;
> size_t mha_pagesize;
>
> The accepted values for mha_cmd are:
>
> MHA_MAPSIZE_VA
> MHA_MAPSIZE_STACK
> MHA_MAPSIZE_BSSBRK
>
> The mha_flags member is reserved for future use and must
> always be set to 0. The mha_pagesize member must be a
> valid size as obtained from getpagesizes(3C) or the con-
> stant value 0 to allow the system to choose an appropri-
> ate hardware address translation mapping size.
>
> MHA_MAPSIZE_VA sets the preferred hardware address
> translation mapping size of the region of memory from
> addr to addr + len. Both addr and len must be aligned to
> an mha_pagesize boundary. The entire virtual address
> region from addr to addr + len must not have any holes.
> Permissions within each mha_pagesize-aligned portion of
> the region must be consistent. When a size of 0 is
> specified, the system selects an appropriate size based
> on the size and alignment of the memory region, type of
> processor, and other considerations.
>
> MHA_MAPSIZE_STACK sets the preferred hardware address
> translation mapping size of the process main thread
> stack segment. The addr and len arguments must be NULL
> and 0, respectively.
>
> MHA_MAPSIZE_BSSBRK sets the preferred hardware address
> translation mapping size of the process heap. The addr
> and len arguments must be NULL and 0, respectively. See
> the NOTES section of the ppgsz(1) manual page for
> additional information on process heap alignment.
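
The MHA_MAPSIZE_VA alignment rule is the part a caller is most likely to get wrong, since both the start and the length have to be multiples of the requested size. A minimal sketch of the arithmetic, assuming page sizes are powers of two; the helper names are hypothetical and overflow of addr + len is not handled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if x is a non-zero power of two. */
static bool is_pow2(size_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* True if both addr and len are multiples of pagesize. */
static bool range_is_aligned(uintptr_t addr, size_t len, size_t pagesize)
{
    assert(is_pow2(pagesize));
    return ((addr | len) & (pagesize - 1)) == 0;
}

/*
 * Shrink [addr, addr + len) to the largest sub-range whose start and length
 * are both multiples of pagesize. Stores the result in *out_addr / *out_len
 * and returns false if no whole page of that size fits.
 */
static bool trim_to_aligned(uintptr_t addr, size_t len, size_t pagesize,
                            uintptr_t *out_addr, size_t *out_len)
{
    assert(is_pow2(pagesize));
    uintptr_t mask = (uintptr_t)pagesize - 1;
    uintptr_t start = (addr + mask) & ~mask;   /* round the start up */
    uintptr_t end = (addr + len) & ~mask;      /* round the end down */
    if (end <= start)
        return false;
    *out_addr = start;
    *out_len = (size_t)(end - start);
    return true;
}
```

For a 12M mapping that starts 4k past a 4M boundary, trim_to_aligned() with a 4M page size yields the 8M in the middle, which is the only part that could be advised with MHA_MAPSIZE_VA at that size.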
>
> The attr argument must be 0 for all MC_HAT_ADVISE
> operations.
>
>
>
> The mask argument must be 0; it is reserved for future use.
>
>
> Locks established with the lock operations are not inherited
> by a child process after fork(2). The memcntl() function
> fails if it attempts to lock more memory than a system-
> specific limit.
>
>
> Due to the potential impact on system resources, the opera-
> tions MC_LOCKAS, MC_LOCK, MC_UNLOCKAS, and MC_UNLOCK are
> restricted to privileged processes.
>
> USAGE
> The memcntl() function subsumes the operations of plock(3C).
>
>
> MC_HAT_ADVISE is intended to improve performance of applica-
> tions that use large amounts of memory on processors that
> support multiple hardware address translation mapping sizes;
> however, it should be used with care. Not all processors
> support all sizes with equal efficiency. Use of larger sizes
> may also introduce extra overhead that could reduce perfor-
> mance or available memory. Using large sizes for one appli-
> cation may reduce available resources for other applications
> and result in slower system wide performance.
>
> RETURN VALUES
> Upon successful completion, memcntl() returns 0; otherwise,
> it returns -1 and sets errno to indicate an error.
>
> ERRORS
> The memcntl() function will fail if:
>
> EAGAIN When the selection criteria match, some or all of
> the memory identified by the operation could not
> be locked when MC_LOCK or MC_LOCKAS was specified,
> some or all mappings in the address range [addr,
> addr + len) are locked for I/O when MC_HAT_ADVISE
> was specified, or the system has insufficient
> resources when MC_HAT_ADVISE was specified.
>
> The cmd is MC_LOCK or MC_LOCKAS and locking the
> memory identified by this operation would exceed a
> limit or resource control on locked memory.
>
> EBUSY When the selection criteria match, some or all of
> the addresses in the range [addr, addr + len) are
> locked and MC_SYNC with the MS_INVALIDATE option
> was specified.
>
>
> EINVAL The addr argument specifies invalid selection
> criteria or is not a multiple of the page size as
> returned by sysconf(3C); the addr and/or len
> argument does not have the value 0 when MC_LOCKAS
> or MC_UNLOCKAS is specified; the arg argument is
> not valid for the function specified; mha_pagesize
> or mha_cmd is invalid; or MC_HAT_ADVISE is
> specified and not all pages in the specified region
> have the same access permissions within the given
> size boundaries.
>
>
> ENOMEM When the selection criteria match, some or all of
> the addresses in the range [addr, addr + len) are
> invalid for the address space of a process or
> specify one or more pages which are not mapped.
>
>
> EPERM The {PRIV_PROC_LOCK_MEMORY} privilege is not
> asserted in the effective set of the calling pro-
> cess and MC_LOCK, MC_LOCKAS, MC_UNLOCK, or
> MC_UNLOCKAS was specified.
>
>
> ATTRIBUTES
> See attributes(5) for descriptions of the following attri-
> butes:
>
>
>
> ____________________________________
> | ATTRIBUTE TYPE | ATTRIBUTE VALUE |
> |________________|_________________|
> | MT-Level       | MT-Safe         |
> |________________|_________________|
>
>
> SEE ALSO
> ppgsz(1), fork(2), mmap(2), mprotect(2), getpagesizes(3C),
> mlock(3C), mlockall(3C), msync(3C), plock(3C), sysconf(3C),
> attributes(5), privileges(5)
>
>
>
>
>
>
>
>
> SunOS 5.11 Last change: 10 Apr 2007 6
> ---------------------------------------
>
> Ced
> --
> Cedric Blancher <cedric.blancher at gmail.com>
> Institute Pasteur
--
Cedric Blancher <cedric.blancher at gmail.com>
Institute Pasteur