Re: FreeBSD hugepages

From: Mark Johnston <markj_at_freebsd.org>
Date: Thu, 25 Jul 2024 22:34:43 UTC
On Thu, Jul 25, 2024 at 04:11:22PM -0500, Jake Freeland wrote:
> On 7/25/24 15:18, Mark Johnston wrote:
> > On Thu, Jul 25, 2024 at 02:47:16PM -0500, Jake Freeland wrote:
> > > On 7/25/24 14:02, Konstantin Belousov wrote:
> > > > On Thu, Jul 25, 2024 at 01:46:17PM -0500, Jake Freeland wrote:
> > > > > Hi there,
> > > > > 
> > > > > I have been steadily working on bringing Data Plane Development Kit (DPDK)
> > > > > on FreeBSD up to date with the Linux version. The most significant hurdle so
> > > > > far has been supporting concurrent DPDK processes, each with their own
> > > > > contiguous memory regions.
> > > > > 
> > > > > These contiguous regions are used by DPDK as a heap for allocating DMA
> > > > > buffers and other miscellaneous resources. Retrieving the underlying memory
> > > > > and mapping these regions is currently different on Linux and FreeBSD:
> > > > > 
> > > > > On Linux, hugepages are fetched from the kernel's pre-allocated hugepage
> > > > > pool and are mapped into virtual address space on DPDK initialization. Since
> > > > > the hugepages exist in a pool, multiple processes can reserve their own
> > > > > hugepages and operate concurrently.
> > > > > 
> > > > > On FreeBSD, DPDK uses an in-house contigmem kernel module that reserves a
> > > > > large contiguous region of memory on load. During DPDK initialization, the
> > > > > entire region is mapped into virtual address space. This leaves no memory
> > > > > for another independent DPDK process, so only one process can operate at a
> > > > > time.
> > > > > 
> > > > > I could modify the DPDK contigmem module to mimic Linux's hugepages, but I
> > > > > thought it would be better to integrate and upstream a hugepage-like
> > > > > interface directly in the FreeBSD kernel source. I am writing this email to
> > > > > see if anyone has any advice on the matter. I did not see any previous
> > > > > attempts at this in Phabricator or the commit log, but it is possible that I
> > > > > missed it. I have read about transparent superpage promotion, but that seems
> > > > > like a different mechanism altogether.
> > > > > 
> > > > > At a quick glance, the implementation seems straightforward: read some
> > > > > loader tunables, allocate persistent hugepages at boot time, and create a
> > > > > pseudo filesystem that supports creating and mapping hugepages. I could be
> > > > > underestimating the magnitude of this task, but that is why I'm asking for
> > > > > thoughts and advice :)
> > > > > 
> > > > > For reference, here is Linux's documentation on hugepages:
> > > > > https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
> > > > Are POSIX shm largepage objects enough (they were developed to support
> > > > DPDK)?  Look for shm_create_largepage(3).
> > > Yes, shm_create_largepage(2) looks promising, but I would like the ability
> > > to allocate these largepages at boot time when memory fragmentation is at a
> > > minimum. Perhaps a couple sysctl tunables could be added onto the
> > > vm.largepages node to specify a pagesize and allocate some number of pages
> > > at boot?
> > We could add an rc script which creates named largepage objects.  This
> > can be done using the posixshmcontrol utility.  That might not be early
> > enough during boot for some purposes.  In that case, we could have a
> > module which creates such objects from within the kernel.  This is
> > pretty straightforward to do; I wrote a dumb version of this for a
> > mips-specific project a few years ago, feel free to take code or
> > inspiration from it: https://people.freebsd.org/~markj/tlbdemo.c
> 
> Looks simple enough. Thanks for the example code.
> 
> > > It seems Linux had an interface similar to shm_create_largepage(2) back in
> > > v2.5, but they removed it in favor of their hugetlbfs filesystem. It would
> > > be nice to stay close to the file-backed Linux interface to maximize code
> > > sharing in userspace. It looks like the foundation for hugepages is there,
> > > but the interface for allocation and access needs to be extended.
> > POSIX shm objects have most of the properties one would want, I'd
> > expect, save the ability to access them via standard syscalls.  What
> > else is missing besides the ability to reserve memory at boot time?
> 
> Most notably, I would like the ability to allocate pages in a specific NUMA
> domain.

I thought this was already supported, but it seems not...

It should be very easy to implement: extend shm_largepage_conf to
include a NUMA domain parameter, and specify that domain when allocating
pages for the object (in shm_largepage_dotruncate(), the
vm_page_alloc_contig() call should become a
vm_page_alloc_contig_domain() call).

> Otherwise, in a perfect world, I'd like a unified interface for both
> Linux and FreeBSD. Linux hugepages are managed using standard system calls;
> files are mmap(2)'d into virtual address space from hugetlbfs and
> ftruncate(2)'d.

Largepage shm objects work this way as well.
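
To make the parallel concrete, here is a rough, untested sketch of how a
userspace consumer might share one code path.  Only the "open" step
differs; sizing with ftruncate(2) and mapping with mmap(2) are the same.
The helper names, the 0600 mode and the choice of
SHM_LARGEPAGE_ALLOC_DEFAULT are assumptions made for illustration, and
the non-FreeBSD branch assumes the path lies under a hugetlbfs mount.

```c
#include <sys/types.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Round len up to a whole number of pages; pagesize must be nonzero. */
size_t
round_up_to_pagesize(size_t len, size_t pagesize)
{
	return ((len + pagesize - 1) / pagesize * pagesize);
}

/*
 * Open the backing object (hypothetical helper).  On FreeBSD, path names
 * a POSIX shm largepage object and psind is an index into the array
 * returned by getpagesizes(3).  Elsewhere, path is assumed to be a file
 * under a hugetlbfs mount and psind is ignored.
 */
int
open_hugepage_object(const char *path, int psind)
{
#ifdef __FreeBSD__
	return (shm_create_largepage(path, O_CREAT | O_RDWR, psind,
	    SHM_LARGEPAGE_ALLOC_DEFAULT, 0600));
#else
	(void)psind;
	return (open(path, O_CREAT | O_RDWR, 0600));
#endif
}

/*
 * Size the object and map it (hypothetical helper).  len should already
 * be a multiple of the large page size.  Returns MAP_FAILED on error.
 */
void *
map_hugepage_object(int fd, size_t len)
{
	if (ftruncate(fd, (off_t)len) != 0)
		return (MAP_FAILED);
	return (mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
}
```

A caller would open the object, round the requested length with
round_up_to_pagesize(), and pass the result to map_hugepage_object().
Which psind corresponds to which page size is platform-dependent, so it
should be looked up with getpagesizes(3) rather than hard-coded.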

> A matching interface would not add an extra kernel
> entrypoint and even more importantly, it would ease the Linux-to-FreeBSD
> porting process for programs that use hugepages.