Re: FreeBSD hugepages

From: Jake Freeland <jake_at_technologyfriends.net>
Date: Thu, 25 Jul 2024 23:11:00 UTC
On 7/25/24 17:40, Mark Johnston wrote:
> On Thu, Jul 25, 2024 at 06:34:43PM -0400, Mark Johnston wrote:
>> On Thu, Jul 25, 2024 at 04:11:22PM -0500, Jake Freeland wrote:
>>> On 7/25/24 15:18, Mark Johnston wrote:
>>>> On Thu, Jul 25, 2024 at 02:47:16PM -0500, Jake Freeland wrote:
>>>>> On 7/25/24 14:02, Konstantin Belousov wrote:
>>>>>> On Thu, Jul 25, 2024 at 01:46:17PM -0500, Jake Freeland wrote:
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I have been steadily working on bringing Data Plane Development Kit (DPDK)
>>>>>>> on FreeBSD up to date with the Linux version. The most significant hurdle so
>>>>>>> far has been supporting concurrent DPDK processes, each with their own
>>>>>>> contiguous memory regions.
>>>>>>>
>>>>>>> These contiguous regions are used by DPDK as a heap for allocating DMA
>>>>>>> buffers and other miscellaneous resources. Retrieving the underlying memory
>>>>>>> and mapping these regions is currently different on Linux and FreeBSD:
>>>>>>>
>>>>>>> On Linux, hugepages are fetched from the kernel's pre-allocated hugepage
>>>>>>> pool and are mapped into virtual address space on DPDK initialization. Since
>>>>>>> the hugepages exist in a pool, multiple processes can reserve their own
>>>>>>> hugepages and operate concurrently.
>>>>>>>
>>>>>>> On FreeBSD, DPDK uses an in-house contigmem kernel module that reserves a
>>>>>>> large contiguous region of memory on load. During DPDK initialization, the
>>>>>>> entire region is mapped into virtual address space. This leaves no memory
>>>>>>> for another independent DPDK process, so only one process can operate at a
>>>>>>> time.
>>>>>>>
>>>>>>> I could modify the DPDK contigmem module to mimic Linux's hugepages, but I
>>>>>>> thought it would be better to integrate and upstream a hugepage-like
>>>>>>> interface directly in the FreeBSD kernel source. I am writing this email to
>>>>>>> see if anyone has any advice on the matter. I did not see any previous
>>>>>>> attempts at this in Phabricator or the commit log, but it is possible that I
>>>>>>> missed it. I have read about transparent superpage promotion, but that seems
>>>>>>> like a different mechanism altogether.
>>>>>>>
>>>>>>> At a quick glance, the implementation seems straightforward: read some
>>>>>>> loader tunables, allocate persistent hugepages at boot time, and create a
>>>>>>> pseudo filesystem that supports creating and mapping hugepages. I could be
>>>>>>> underestimating the magnitude of this task, but that is why I'm asking for
>>>>>>> thoughts and advice :)
>>>>>>>
>>>>>>> For reference, here is Linux's documentation on hugepages:
>>>>>>> https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
>>>>>> Are POSIX shm largepage objects enough?  (They were developed to support
>>>>>> DPDK.)  Look for shm_create_largepage(3).
>>>>> Yes, shm_create_largepage(2) looks promising, but I would like the ability
>>>>> to allocate these largepages at boot time when memory fragmentation is at a
>>>>> minimum. Perhaps a couple sysctl tunables could be added onto the
>>>>> vm.largepages node to specify a pagesize and allocate some number of pages
>>>>> at boot?
>>>> We could add an rc script which creates named largepage objects.  This
>>>> can be done using the posixshmcontrol utility.  That might not be early
>>>> enough during boot for some purposes.  In that case, we could have a
>>>> module which creates such objects from within the kernel.  This is
>>>> pretty straightforward to do; I wrote a dumb version of this for a
>>>> mips-specific project a few years ago, feel free to take code or
>>>> inspiration from it: https://people.freebsd.org/~markj/tlbdemo.c
>>> Looks simple enough. Thanks for the example code.
>>>
>>>>> It seems Linux had an interface similar to shm_create_largepage(2) back in
>>>>> v2.5, but they removed it in favor of their hugetlbfs filesystem. It would
>>>>> be nice to stay close to the file-backed Linux interface to maximize code
>>>>> sharing in userspace. It looks like the foundation for hugepages is there,
>>>>> but the interface for allocation and access needs to be extended.
>>>> POSIX shm objects have most of the properties one would want, I'd
>>>> expect, save the ability to access them via standard syscalls.  What
>>>> else is missing besides the ability to reserve memory at boot time?
>>> Most notably, I would like the ability to allocate pages in a specific NUMA
>>> domain.
>> I thought this was already supported, but it seems not...
> Thinking a bit more, I'm pretty sure I had just been using something
> like
>
> $ cpuset -n prefer:<domain> posixshmcontrol create -l 1G /largepage-1G-<domain>
>
> so didn't need an explicit NUMA configuration parameter.  In C one would
> use cpuset_setdomain(2) instead, but that's not as convenient.  So,
> imbuing a NUMA domain in struct shm_largepage_conf is still probably a
> reasonable thing to do.

I just looked at the code; this seems very manageable. I'll draft up a
review.
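
For reference, here is a rough C sketch of what the cpuset(1) invocation
quoted above does in-process: set a prefer-domain policy with
cpuset_setdomain(2), then create and size the largepage object. The helper
name largepage_create_on_domain() and the ENOSYS fallback are made up for
illustration, and the FreeBSD branch is untested, so treat it as a starting
point rather than a finished implementation.

```c
#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

#ifdef __FreeBSD__
#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/domainset.h>
#include <sys/mman.h>
#include <unistd.h>
#endif

/*
 * Hypothetical helper (not an existing API): create a largepage shm object
 * of `len` bytes, preferring pages from NUMA domain `domain`.  `psind` is an
 * index into the array filled in by getpagesizes(3).  Returns a file
 * descriptor, or -1 with errno set.
 */
int
largepage_create_on_domain(const char *path, int psind, int domain, off_t len)
{
	if (path == NULL || psind < 0 || domain < 0 || len < 0) {
		errno = EINVAL;
		return (-1);
	}
#ifdef __FreeBSD__
	domainset_t mask;
	int fd;

	if (domain >= DOMAINSET_SETSIZE) {
		errno = EINVAL;
		return (-1);
	}
	DOMAINSET_ZERO(&mask);
	DOMAINSET_SET(domain, &mask);
	/*
	 * Roughly "cpuset -n prefer:<domain>" for the calling process.  The
	 * policy stays in effect afterwards; restore it if that matters.
	 */
	if (cpuset_setdomain(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1, sizeof(mask),
	    &mask, DOMAINSET_POLICY_PREFER) != 0)
		return (-1);
	/* O_CREAT is implied by shm_create_largepage(). */
	fd = shm_create_largepage(path, O_RDWR, psind,
	    SHM_LARGEPAGE_ALLOC_DEFAULT, 0600);
	if (fd < 0)
		return (-1);
	/* The object gets its pages when it is sized. */
	if (ftruncate(fd, len) != 0) {
		close(fd);
		return (-1);
	}
	return (fd);
#else
	/* Assumption: no equivalent provided here for other systems. */
	errno = ENOSYS;
	return (-1);
#endif
}
```

A NUMA domain field in struct shm_largepage_conf would make the
cpuset_setdomain() step unnecessary.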

>> It should be very easy to implement: extend shm_largepage_conf to
>> include a NUMA domain parameter, and specify that domain when allocating
>> pages for the object (in shm_largepage_dotruncate(), the
>> vm_page_alloc_contig() call should become a
>> vm_page_alloc_contig_domain() call).
>>
>>> Otherwise, in a perfect world, I'd like a unified interface for both
>>> Linux and FreeBSD. Linux hugepages are managed using standard system calls;
>>> files are mmap(2)'d into virtual address space from hugetlbfs and
>>> ftruncate(2)'d.
>> largepage shm objects work this way as well.

After reading through the man page, this is quite apparent. Not sure how
I failed to make that connection. Anyway, this is starting to look easier
than I thought it would be. The only difference from a userspace
perspective that I can think of right now is how the pages are created
(e.g. hugetlbfs open(2) on Linux vs. shm_create_largepage(2) on FreeBSD).
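
To make that difference concrete, here is a minimal sketch of a shared
wrapper, assuming a hugetlbfs mount at /dev/hugepages on Linux. The names
pagesize_index(), hugepage_open() and hugepage_map() are hypothetical, and
this is not how DPDK does it today. Only the creation step is
platform-specific; sizing and mapping use the same calls on both systems.

```c
#include <sys/types.h>
#include <sys/mman.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

#ifdef __FreeBSD__
#include <sys/param.h>
#endif

/*
 * Index of `want` in a page-size array such as the one filled in by
 * getpagesizes(3), or -1 if that size is not present.
 */
int
pagesize_index(const size_t *sizes, int n, size_t want)
{
	for (int i = 0; i < n; i++)
		if (sizes[i] == want)
			return (i);
	return (-1);
}

/* Create or open a hugepage-backed object; `name` starts with '/'. */
int
hugepage_open(const char *name, size_t pagesize)
{
#ifdef __FreeBSD__
	size_t sizes[MAXPAGESIZES];
	int n, psind;

	n = getpagesizes(sizes, MAXPAGESIZES);
	if (n < 0)
		return (-1);
	psind = pagesize_index(sizes, n, pagesize);
	if (psind <= 0) {	/* index 0 is the base page size */
		errno = EINVAL;
		return (-1);
	}
	return (shm_create_largepage(name, O_RDWR, psind,
	    SHM_LARGEPAGE_ALLOC_DEFAULT, 0600));
#else
	/* Assumption: hugetlbfs for this page size is mounted here. */
	char path[256];

	(void)pagesize;
	if (snprintf(path, sizeof(path), "/dev/hugepages%s", name) >=
	    (int)sizeof(path)) {
		errno = ENAMETOOLONG;
		return (-1);
	}
	return (open(path, O_CREAT | O_RDWR, 0600));
#endif
}

/* Same on both systems: size the object, then map it shared. */
void *
hugepage_map(int fd, size_t len)
{
	if (ftruncate(fd, (off_t)len) != 0)
		return (MAP_FAILED);
	return (mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
}
```

In both cases `len` should be a multiple of the chosen page size.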

Thanks for the guidance, Mark and Konstantin.

Jake Freeland
>>> A matching interface would not add an extra kernel
>>> entrypoint and even more importantly, it would ease the Linux-to-FreeBSD
>>> porting process for programs that use hugepages.