amd64: change VM_KMEM_SIZE_SCALE to 1?
Bruce Evans
brde at optusnet.com.au
Tue Jul 27 16:45:36 UTC 2010
On Mon, 26 Jul 2010, Alan Cox wrote:
> On Mon, Jul 26, 2010 at 1:19 PM, Andriy Gapon <avg at freebsd.org> wrote:
>
>> on 26/07/2010 20:04 Matthew Fleming said the following:
>>> On Mon, Jul 26, 2010 at 9:07 AM, Andriy Gapon <avg at freebsd.org> wrote:
>>>> Does anyone know any reason why VM_KMEM_SIZE_SCALE on amd64 should not
>>>> be set to 1?
>>>> I mean things potentially breaking, or some unpleasant surprise for an
>>>> administrator/user...
Shouldn't it be a fraction (of about 1/(2**32)) so that you can map things
sparsely into about 2**64 bytes of KVA?
Actually mapping 2**64 bytes of KVA would take too many resources, but
does it take too many resources to reserve that amount and to be prepared
to actually map much more than we do now?
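For reference, VM_KMEM_SIZE_SCALE is a kernel config option, but the same
tuning can be done per machine at boot time with loader tunables, so no
rebuild is needed.  A hand-tuning sketch (tunable names from memory -- check
them against sys/kern/kern_malloc.c on your branch):

    # /boot/loader.conf: hand tuning instead of relying on the auto-sizing
    # kmem limit = physical memory / scale (still subject to min/max clamps)
    vm.kmem_size_scale="1"
    # ...or pin an absolute size instead:
    #vm.kmem_size="4096M"

Either knob overrides the compiled-in default.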
>>> As I understand it, it's merely a resource usage issue. amd64 needs
>>> page table entries for the expected virtual address space, so allowing
>>> more than e.g. 1/3 of physical memory means needing more PTEs. But
>>> the memory overhead isn't all that large IIRC: each 4k of physical memory
>>> devoted to PTEs maps 512 4k virtual pages, or 2MB, so e.g. it
>>> takes about 4MB reserved as PTE pages to map 2GB of kernel virtual
>>> address space.
That's not small, but isn't it 1024 times less due to 4MB pages in the
kernel? But I guess 4MB pages are no good for sparse mappings.
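To spell out that arithmetic (a back-of-the-envelope sketch in userland C,
not pmap code; amd64 with 4K pages and 8-byte PTEs assumed):

    #include <stdio.h>

    int
    main(void)
    {
        /* One 4K page-table page holds 512 PTEs, so it maps 512 * 4K = 2MB. */
        unsigned long long kva = 2ULL << 30;        /* 2GB of KVA to map */
        unsigned long long perptp = 512 * 4096ULL;  /* 2MB mapped per PT page */
        unsigned long long ptpages = kva / perptp;  /* 1024 PT pages */

        /* 1024 * 4K = 4MB of physical memory just for the PTEs. */
        printf("%lluMB of PT pages to map %lluGB of KVA\n",
            ptpages * 4096 / (1 << 20), kva >> 30);
        return (0);
    }

(2MB superpages on amd64 would remove the PT level for the ranges they cover,
but as noted above, superpages are no help for sparse mappings.)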
>> ...
>> Well, personally I would prefer the kernel eating a lot of memory over
>> getting a "kmem_map too small" panic. Unexpectedly large memory usage by
>> the kernel can be detected and diagnosed, and then proper limits and
>> (auto-)tuning could be put in place. A panic at some random allocation is
>> not that helpful.
>> Besides, presently there are more and more workloads that require a lot of
>> kernel memory - e.g. ZFS is gaining popularity.
>>
> Like what exactly? Since I increased the size of the kernel address space
> for amd64 to 512GB, and thus the size of the kernel heap was no longer
> limited by virtual address space size, but only by the auto-tuning based
> upon physical memory size, I am not aware of any "kmem_map too small" panics
> that are not ZFS/ARC related.
>
>> Hence, the question/suggestion.
>>
>> Of course, things can be tuned by hand, but I think that
>> VM_KMEM_SIZE_SCALE=1 would be a more reasonable default than current value.
>>
> Even this would not eliminate the ZFS/ARC panics. I have heard that some
> people must configure the kmem_map to 1.5 times a machine's physical memory
> size to avoid panics.
2**32 times larger would avoid this even better (up to 4GB physical memory)
:-).
With 512GB virtual and 4GB physical, 128 times larger (VM_KMEM_SIZE_SCALE =
1/128.0) is almost possible, and 32 times larger seems practical (leave 3/4 for
other things). However, it seems wrong to scale by physical memory
at all. If you are prepared to map 512GB, why not allow a significant
fraction of that (say 1/4) to be used for kmem? The only problem that
I see is that there will be more rounds of physical memory and disk
sizes increasing faster than virtual memory limits; on every round
algorithms based on sparse mappings break.
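The arithmetic behind those 128 and 32 figures, spelled out (illustrative
only):

    #include <stdio.h>

    #define GB (1ULL << 30)

    int
    main(void)
    {
        unsigned long long kva = 512 * GB;   /* amd64 KVA after Alan's change */
        unsigned long long physmem = 4 * GB; /* the example machine above */

        printf("kva / physmem       = %llu\n", kva / physmem);       /* 128 */
        printf("(kva / 4) / physmem = %llu\n", (kva / 4) / physmem); /*  32 */
        return (0);
    }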
> The reason is that unlike the traditional FreeBSD way
> of caching file data, the ZFS/ARC wants to have every page of cached data
> *mapped* (and wired) in the kernel address space.
Traditional BSD (Net/2 at least, and perhaps even FreeBSD-1) mapped
and wired every page of cached data (all ~2MB of it) sparsely into the
buffer map part of the kernel address space (all ~16MB or 32MB of it
in 386BSD or FreeBSD-early, but 256MB in FreeBSD-1.1.5). I like the
simplicity of this. It would have worked perfectly in FreeBSD-1.1.5
since physical memory and disk sizes were still much smaller than i386
address space. It would work adequately even now (since nbuf now only
needs to be large enough to limit thrashing of VMIO mappings).
> Over time, the available,
> unused space in the kmem_map becomes fragmented, and even though the ARC
> thinks that it has not reached its size limit, kmem_malloc() cannot find
> contiguous space to satisfy the allocation request. To see this described
> in great detail, do a web search for an e-mail by Ben Kelly with the subject
> "[patch] zfs kmem fragmentation".
This is exactly what happened several times with the buffer map(s) in
FreeBSD-[1-2][3-4?], except with memory sizes scaled by 3, then 2,
then 1 decimal orders of magnitude. In FreeBSD-1, plain malloc() was
used for buffers, and kmem_map was far too small (16MB) for this to
work well. In FreeBSD-[2-current], a much more complicated method is
used to allocate buffers (and to map VMIO pages into buffers). This
is essentially a private version of malloc() with lots of specialization
for buffers and a separate map so that it doesn't have to fight with
other users of malloc(). Despite its specialization, this still had
problems with fragmentation. It wasn't until FreeBSD-4 that the
specialization became complicated enough to mostly avoid these problems.
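
The failure mode Alan describes is plain external fragmentation; in
miniature (illustrative only, nothing like the real kmem_map bookkeeping):

    #include <stdio.h>

    /* Free extents left in a hypothetical map after a lot of churn (MB). */
    static unsigned long long freeruns[] = { 2, 1, 2, 1, 2, 1, 2, 1 };

    int
    main(void)
    {
        unsigned long long total = 0, largest = 0, want = 4;    /* MB */
        unsigned int i;

        for (i = 0; i < sizeof(freeruns) / sizeof(freeruns[0]); i++) {
            total += freeruns[i];
            if (freeruns[i] > largest)
                largest = freeruns[i];
        }
        /* 12MB free in total, but no 4MB run: the allocation fails anyway. */
        printf("free %lluMB, largest run %lluMB, %lluMB request %s\n",
            total, largest, want, largest >= want ? "fits" : "fails");
        return (0);
    }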
Bruce