[rfc] allow to boot with >= 256GB physmem

Sergey Kandaurov pluknet at gmail.com
Mon Feb 7 15:11:05 UTC 2011


On 22 January 2011 00:43, Alan Cox <alan.l.cox at gmail.com> wrote:
> On Fri, Jan 21, 2011 at 2:58 PM, Alan Cox <alan.l.cox at gmail.com> wrote:
>>
>> On Fri, Jan 21, 2011 at 11:44 AM, John Baldwin <jhb at freebsd.org> wrote:
>>>
>>> On Friday, January 21, 2011 11:09:10 am Sergey Kandaurov wrote:
>>> > Hello.
>>> >
>>> > Some time ago I ran into a problem booting with 400GB physmem.
>>> > The problem is that the vm.max_proc_mmap type overflows with
>>> > such a high value, which results in a broken mmap() syscall.
>>> > The max_proc_mmap value is a signed int, roughly calculated
>>> > in vmmapentry_rsrc_init() as a quotient of the u_long vm_kmem_size:
>>> > vm_kmem_size / sizeof(struct vm_map_entry) / 100.
>>> >
>>> > Although the value was quite low at the time it was introduced
>>> > in svn r57263 (e.g. the related commit log says:
>>> > "The value defaults to around 9000 for a 128MB machine."),
>>> > the problem is observed on amd64, where after r212784 the KVA
>>> > space is effectively bounded only by the physical memory size.
>>> >
>>> > With INT_MAX being 0x7fffffff and sizeof(struct vm_map_entry)
>>> > being 120, slightly less than 256GB is enough to be able
>>> > to reproduce the problem.
>>> >
>>> > I rewrote vmmapentry_rsrc_init() to set a large enough limit for
>>> > max_proc_mmap just to protect against integer overflow.
>>> > As it is also possible to tune this value at runtime, I added a
>>> > simple anti-shoot constraint to its sysctl handler as well.
>>> > I'm not sure, though, whether the second part is worth committing.
>>> >
>>> > As this patch may cause some bikeshedding,
>>> > I'd like to hear your comments before I commit it.
>>> >
>>> > http://plukky.net/~pluknet/patches/max_proc_mmap.diff
>>>
>>> Is there any reason we can't just make this variable and sysctl a long?
>>>
>>
>> Or just delete it.
>>
>> 1. Contrary to what the commit message says, this sysctl does not
>> effectively limit the number of vm map entries.  It only limits the number
>> that are created by one system call, mmap().  Other system calls create vm
>> map entries just as easily, for example, mprotect(), madvise(), mlock(), and
>> minherit().  Basically, anything that alters the properties of a mapping.
>> Thus, in 2000, after this sysctl was added, the same resource exhaustion
>> induced crash could have been reproduced by trivially changing the program
>> in PR/16573 to do an mprotect() or two.
>>
>> In a nutshell, if you want to really limit the number of vm map entries
>> that a process can allocate, the implementation is a bit more involved than
>> what was done for this sysctl.
>>
>> 2. UMA implements M_WAITOK, whereas the old zone allocator in 2000 did
>> not.  Moreover, vm map entries for user maps are allocated with M_WAITOK.
>> So, the exact crash reported in PR/16573 couldn't happen any longer.
>>
>
> Actually, I take back part of what I said here.  The old zone allocator did
> implement something like M_WAITOK, and that appears to have been used for
> user maps.  However, the crash described in PR/16573 was actually on the
> allocation of a vm map entry within the *kernel* address space for a process
> U area.  This type of allocation did not use the old zone allocator's
> equivalent to M_WAITOK.  However, we no longer have U areas, so the exact
> crash scenario is clearly no longer possible.  Interestingly, the sysctl in
> question has no direct effect on the allocation of kernel vm map entries.
>
> So, I remain skeptical that this sysctl is preventing any resource
> exhaustion based panics in the current kernel.  Again, I would be thrilled
> to see one or more people do some testing, such as rerunning the program
> from PR/16573.
>
>
>> 3. We now have the "vmemoryuse" resource limit.  When this sysctl was
>> defined, we didn't.  Limiting the virtual memory indirectly but effectively
>> limits the number of vm map entries that a process can allocate.
>>
>> In summary, I would do a little due diligence, for example, run the
>> program from PR/16573 with the limit disabled.  If you can't reproduce the
>> crash, in other words, nothing contradicts point #2 above, then I would just
>> delete this sysctl.
>>

I tried the test from PR/16573, running as root. Unmodified, it just quickly
hits the kern.maxproc limit. So I added signal(SIGCHLD, SIG_IGN); to avoid
creating zombie processes altogether and give it more of a workload. With
this change it also survived. The submitter reported that it crashed with
10000 iterations; even after increasing the limit to 1000000 I still
couldn't get it to crash.

* The testing was done with the max_proc_mmap part commented out.
That change effectively reverts r57263.

-- 
wbr,
pluknet
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vm_mmap_maxprocmmap.diff
Type: application/octet-stream
Size: 2328 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-hackers/attachments/20110207/4b7d8af5/vm_mmap_maxprocmmap.obj

