Memory reserves or lack thereof
Alan Cox
alc at rice.edu
Tue Nov 13 20:04:05 UTC 2012
On 11/12/2012 11:35, Alan Cox wrote:
> On 11/12/2012 07:36, Konstantin Belousov wrote:
>> On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
>>> On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov <kostikbel at gmail.com> wrote:
>>>
>>>> On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote:
>>>>> I have a memory subsystem design question that I'm hoping someone can
>>>>> answer.
>>>>> I've been looking at a machine that is completely out of memory, as in
>>>>>
>>>>> v_free_count = 0,
>>>>> v_cache_count = 0,
>>>>>
>>>>> I wondered how a machine could completely run out of memory like this,
>>>>> especially after finding a lack of interrupt storms or other pathologies
>>>>> that would tend to overcommit memory. So I started investigating.
>>>>> Most allocators come down to vm_page_alloc(), which has this guard:
>>>>>
>>>>> 	if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
>>>>> 		page_req = VM_ALLOC_SYSTEM;
>>>>> 	};
>>>>>
>>>>> 	if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
>>>>> 	    (page_req == VM_ALLOC_SYSTEM &&
>>>>> 	    cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
>>>>> 	    (page_req == VM_ALLOC_INTERRUPT &&
>>>>> 	    cnt.v_free_count + cnt.v_cache_count > 0)) {
>>>>>
>>>>> The key observation is that if VM_ALLOC_INTERRUPT is set, it will allocate
>>>>> every last page.
>>>>>
>>>>> From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare,
>>>>> perhaps only used from interrupt threads. Not so; see kmem_malloc() or
>>>>> uma_small_alloc(), which both contain this mapping:
>>>>> 	if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>>>> 		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>>>>> 	else
>>>>> 		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>>>>>
>>>>> Note that M_USE_RESERVE has been deprecated and is used in just a
>>>>> handful of places. Also note that lots of code paths come through these
>>>>> routines.
>>>>> What this means is that essentially _any_ allocation using M_NOWAIT will
>>>>> bypass whatever reserves have been held back and will take every last page
>>>>> available.
>>>>> There is no documentation stating M_NOWAIT has this side effect of
>>>>> essentially being privileged, so any innocuous piece of code that can't
>>>>> block will use it. And of course M_NOWAIT is literally used all over.
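>>>>>
>>>>> For illustration only, a perfectly ordinary (hypothetical) call site such as
>>>>>
>>>>> 	/* can't sleep here, we're holding a mutex */
>>>>> 	sc->sc_buf = malloc(len, M_DEVBUF, M_NOWAIT);
>>>>>
>>>>> ends up with VM_ALLOC_INTERRUPT semantics when a new page is needed and can
>>>>> take the very last free page, even though the author never asked for a
>>>>> privileged allocation.
>>>>>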
>>>>> It looks to me like the design of the BSD allocators is focused on
>>>>> recovery; they will give all pages away knowing the system can recover.
>>>>> Am I missing anything? I would have expected some small number of pages
>>>>> to be held in reserve just in case. And I didn't expect M_NOWAIT to be a
>>>>> sort of back door for grabbing memory.
>>>> Your analysis is right, there is nothing to add or correct.
>>>> This is the reason to strongly prefer M_WAITOK.
>>>>
>>> Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was
>>> well understood that it should only be used by interrupt handlers.
>>>
>>> The trouble is that M_NOWAIT conflates two orthogonal things. The obvious
>>> one is that the allocation shouldn't sleep. The other is how far we're
>>> willing to deplete the cache/free page queues.
>>>
>>> When fine-grained locking got sprinkled throughout the kernel, we all too
>>> often found ourselves wanting to do allocations without the possibility of
>>> blocking. So, M_NOWAIT became commonplace, where it wasn't before.
>>>
>>> This had the unintended consequence of introducing a lot of memory
>>> allocations in the top-half of the kernel, i.e., non-interrupt handling
>>> code, that were digging deep into the cache/free page queues.
>>>
>>> Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE"
>>> allocation is less likely to succeed than an "M_NOWAIT" allocation.
>>> However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it
>>> could only allocate a free page. M_USE_RESERVE said that it was ok to allocate
>>> a cached page even though M_NOWAIT was specified. Consequently, the system
>>> wouldn't dig as far into the free page queue if M_USE_RESERVE was
>>> specified, because it was allowed to reclaim a cached page.
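>>>
>>> Restating that old rule as a sketch (paraphrasing the behavior described
>>> above, not the actual pre-7.x source, and with made-up helper names):
>>>
>>> 	if ((flags & M_NOWAIT) != 0 && (flags & M_USE_RESERVE) == 0)
>>> 		m = take_free_page();		/* free queue only */
>>> 	else
>>> 		m = take_cached_or_free_page();	/* may reclaim a cached page */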
>>>
>>> In conclusion, I think it's time that we change M_NOWAIT so that it doesn't
>>> dig any deeper into the cache/free page queues than M_WAITOK does and
>>> reintroduce an M_USE_RESERVE-like flag that says dig deep into the
>>> cache/free page queues. The trouble is that we then need to identify all
>>> of those places that are implicitly depending on the current behavior of
>>> M_NOWAIT also digging deep into the cache/free page queues so that we can
>>> add an explicit M_USE_RESERVE.
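>>>
>>> For illustration, after such a change the rare call site that genuinely
>>> needs to dig deep would have to say so explicitly, along the lines of
>>>
>>> 	p = malloc(size, M_TEMP, M_NOWAIT | M_USE_RESERVE);
>>>
>>> while a plain M_NOWAIT allocation would stop at the same depth as M_WAITOK.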
>>>
>>> Alan
>>>
>>> P.S. I suspect that we should also increase the size of the "page reserve"
>>> that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How
>>> many legitimate users of a new M_USE_RESERVE-like flag in today's kernel
>>> could actually be satisfied by two pages?
>> I am almost sure that most of the people who use the M_NOWAIT flag do not
>> know about the 'allow a deeper drain of the free queue' effect. As such, I
>> believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only
>> expectation of a problematic place would be in the swapout path.
>>
>> I found a single explicit use of M_USE_RESERVE in the kernel,
>> so the flip is relatively simple.
> Agreed. Most recently I eliminated several uses from the arm pmap
> implementations. There is, however, one other use:
>
> ofed/include/linux/gfp.h:#define GFP_ATOMIC (M_NOWAIT | M_USE_RESERVE)
>
>> Below is the patch which I only compile-tested on amd64, and which booted
>> fine.
>>
>> Peter, could you please give it a run, to see obvious deadlocks, if any?
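>>
>> (Note: the hunks below call a new m2vm_flags() helper that is added
>> elsewhere in the patch and is not shown in the quoted hunks; under the
>> flipped M_USE_RESERVE meaning it presumably amounts to something like the
>> following sketch, where only M_NOWAIT | M_USE_RESERVE keeps the deep-drain
>> VM_ALLOC_INTERRUPT behavior:
>>
>> static __inline int
>> m2vm_flags(int malloc_flags, int pflags)
>> {
>>
>> 	if ((malloc_flags & (M_NOWAIT | M_USE_RESERVE)) ==
>> 	    (M_NOWAIT | M_USE_RESERVE))
>> 		pflags |= VM_ALLOC_INTERRUPT;
>> 	else
>> 		pflags |= VM_ALLOC_SYSTEM;
>> 	if (malloc_flags & M_ZERO)
>> 		pflags |= VM_ALLOC_ZERO;
>> 	return (pflags);
>> }
>>
>> The name and exact shape are a guess from the call sites; only the hunks
>> below are from the actual patch.)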
>>
>> diff --git a/sys/amd64/amd64/uma_machdep.c b/sys/amd64/amd64/uma_machdep.c
>> index dc9c307..ab1e869 100644
>> --- a/sys/amd64/amd64/uma_machdep.c
>> +++ b/sys/amd64/amd64/uma_machdep.c
>> @@ -29,6 +29,7 @@ __FBSDID("$FreeBSD$");
>>
>> #include <sys/param.h>
>> #include <sys/lock.h>
>> +#include <sys/malloc.h>
>> #include <sys/mutex.h>
>> #include <sys/systm.h>
>> #include <vm/vm.h>
>> @@ -48,12 +49,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>> int pflags;
>>
>> *flags = UMA_SLAB_PRIV;
>> - if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>> - pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
>> - else
>> - pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
>> - if (wait & M_ZERO)
>> - pflags |= VM_ALLOC_ZERO;
>> + pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
>> for (;;) {
>> m = vm_page_alloc(NULL, 0, pflags);
>> if (m == NULL) {
>> diff --git a/sys/arm/arm/vm_machdep.c b/sys/arm/arm/vm_machdep.c
>> index f60cdb1..75366e3 100644
>> --- a/sys/arm/arm/vm_machdep.c
>> +++ b/sys/arm/arm/vm_machdep.c
>> @@ -651,12 +651,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>> ret = ((void *)kmem_malloc(kmem_map, bytes, M_NOWAIT));
>> return (ret);
>> }
>> - if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>> - pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>> - else
>> - pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>> - if (wait & M_ZERO)
>> - pflags |= VM_ALLOC_ZERO;
>> + pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>> for (;;) {
>> m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
>> if (m == NULL) {
>> diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c
>> index 71caa29..2ce1ca6 100644
>> --- a/sys/fs/devfs/devfs_devs.c
>> +++ b/sys/fs/devfs/devfs_devs.c
>> @@ -121,7 +121,7 @@ devfs_alloc(int flags)
>> struct cdev *cdev;
>> struct timespec ts;
>>
>> - cdp = malloc(sizeof *cdp, M_CDEVP, M_USE_RESERVE | M_ZERO |
>> + cdp = malloc(sizeof *cdp, M_CDEVP, M_ZERO |
>> ((flags & MAKEDEV_NOWAIT) ? M_NOWAIT : M_WAITOK));
>> if (cdp == NULL)
>> return (NULL);
>> diff --git a/sys/ia64/ia64/uma_machdep.c b/sys/ia64/ia64/uma_machdep.c
>> index 37353ff..9f77762 100644
>> --- a/sys/ia64/ia64/uma_machdep.c
>> +++ b/sys/ia64/ia64/uma_machdep.c
>> @@ -46,12 +46,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>> int pflags;
>>
>> *flags = UMA_SLAB_PRIV;
>> - if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>> - pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>> - else
>> - pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>> - if (wait & M_ZERO)
>> - pflags |= VM_ALLOC_ZERO;
>> + pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>>
>> for (;;) {
>> m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
>> diff --git a/sys/mips/mips/uma_machdep.c b/sys/mips/mips/uma_machdep.c
>> index 798e632..24baef0 100644
>> --- a/sys/mips/mips/uma_machdep.c
>> +++ b/sys/mips/mips/uma_machdep.c
>> @@ -48,11 +48,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>> void *va;
>>
>> *flags = UMA_SLAB_PRIV;
>> -
>> - if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>> - pflags = VM_ALLOC_INTERRUPT;
>> - else
>> - pflags = VM_ALLOC_SYSTEM;
>> + pflags = m2vm_flags(wait, 0);
>>
>> for (;;) {
>> m = pmap_alloc_direct_page(0, pflags);
>
> This smells fishy, but not because of anything that you did. It appears
> that the mips uma_small_alloc() is unconditionally asking for a
> pre-zeroed page. I'll take a look at this later today.
>
I verified this. The current implementation of uma_small_alloc() on
MIPS unconditionally zeroes the page. Moreover, if M_ZERO is specified
to uma_small_alloc(), the same page is zeroed twice, once in
pmap_alloc_direct_page() and again in uma_small_alloc().
I expect to commit the following patch tomorrow. Kostik, it will
trivially conflict with your current patch.
Index: mips/include/pmap.h
===================================================================
--- mips/include/pmap.h (revision 242939)
+++ mips/include/pmap.h (working copy)
@@ -179,7 +179,6 @@ void pmap_kenter_temporary_free(vm_paddr_t pa);
void pmap_flush_pvcache(vm_page_t m);
int pmap_emulate_modified(pmap_t pmap, vm_offset_t va);
void pmap_grow_direct_page_cache(void);
-vm_page_t pmap_alloc_direct_page(unsigned int index, int req);
#endif /* _KERNEL */
Index: mips/mips/pmap.c
===================================================================
--- mips/mips/pmap.c (revision 242939)
+++ mips/mips/pmap.c (working copy)
@@ -163,6 +163,7 @@ static vm_page_t pmap_pv_reclaim(pmap_t locked_pma
static void pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t
va);
static pv_entry_t pmap_pvh_remove(struct md_page *pvh, pmap_t pmap,
vm_offset_t va);
+static vm_page_t pmap_alloc_direct_page(unsigned int index, int req);
static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va,
vm_page_t m, vm_prot_t prot, vm_page_t mpte);
static int pmap_remove_pte(struct pmap *pmap, pt_entry_t *ptq,
vm_offset_t va,
@@ -1041,7 +1042,7 @@ pmap_grow_direct_page_cache()
#endif
}
-vm_page_t
+static vm_page_t
pmap_alloc_direct_page(unsigned int index, int req)
{
vm_page_t m;
Index: mips/mips/uma_machdep.c
===================================================================
--- mips/mips/uma_machdep.c (revision 242939)
+++ mips/mips/uma_machdep.c (working copy)
@@ -50,12 +50,14 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8
*flags = UMA_SLAB_PRIV;
if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
- pflags = VM_ALLOC_INTERRUPT;
+ pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
else
- pflags = VM_ALLOC_SYSTEM;
+ pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
+ if (wait & M_ZERO)
+ pflags |= VM_ALLOC_ZERO;
for (;;) {
- m = pmap_alloc_direct_page(0, pflags);
+ m = vm_page_alloc_freelist(VM_FREELIST_DIRECT, pflags);
if (m == NULL) {
if (wait & M_NOWAIT)
return (NULL);