Memory reserves or lack thereof

Mon Nov 12 21:28:10 UTC 2012

This patch still doesn't address the issue of M_NOWAIT calls driving the memory the all the way down to 2 pages, right ? It would be nice to have M_NOWAIT just do non-sleep version of M_WAITOK and M_USE_RESERVE flag to dig deep. 

Sushanth 

--- On Mon, 11/12/12, Konstantin Belousov <kostikbel at gmail.com> wrote:

> From: Konstantin Belousov <kostikbel at gmail.com>
> Subject: Re: Memory reserves or lack thereof
> To: alc at freebsd.org
> Cc: pho at freebsd.org, "Sears, Steven" <Steven.Sears at netapp.com>, "freebsd-hackers at freebsd.org" <freebsd-hackers at freebsd.org>
> Date: Monday, November 12, 2012, 5:36 AM
> On Sun, Nov 11, 2012 at 03:40:24PM
> -0600, Alan Cox wrote:
> > On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov
> <kostikbel at gmail.com>wrote:
> > 
> > > On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears,
> Steven wrote:
> > > > I have a memory subsystem design question
> that I'm hoping someone can
> > > answer.
> > > >
> > > > I've been looking at a machine that is
> completely out of memory, as in
> > > >
> > > >  v_free_count = 0,
> > > >  v_cache_count = 0,
> > > >
> > > > I wondered how a machine could completely run
> out of memory like this,
> > > especially after finding a lack of interrupt
> storms or other pathologies
> > > that would tend to overcommit memory. So I started
> investigating.
> > > >
> > > > Most allocators come down to vm_page_alloc(),
> which has this guard:
> > > >
> > > >       if ((curproc
> == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
> > > >           
>    page_req = VM_ALLOC_SYSTEM;
> > > >       };
> > > >
> > > >       if
> (cnt.v_free_count + cnt.v_cache_count >
> cnt.v_free_reserved ||
> > > >       
>    (page_req == VM_ALLOC_SYSTEM &&
> > > >       
>    cnt.v_free_count + cnt.v_cache_count >
> > > cnt.v_interrupt_free_min) ||
> > > >       
>    (page_req == VM_ALLOC_INTERRUPT
> &&
> > > >       
>    cnt.v_free_count + cnt.v_cache_count >
> 0)) {
> > > >
> > > > The key observation is if VM_ALLOC_INTERRUPT
> is set, it will allocate
> > > every last page.
> > > >
> > > > >From the name one might expect
> VM_ALLOC_INTERRUPT to be somewhat rare,
> > > perhaps only used from interrupt threads. Not so,
> see kmem_malloc() or
> > > uma_small_alloc() which both contain this
> mapping:
> > > >
> > > >       if ((flags
> & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> > > >           
>    pflags = VM_ALLOC_INTERRUPT |
> VM_ALLOC_WIRED;
> > > >       else
> > > >           
>    pflags = VM_ALLOC_SYSTEM |
> VM_ALLOC_WIRED;
> > > >
> > > > Note that M_USE_RESERVE has been deprecated
> and is used in just a
> > > handful of places. Also note that lots of code
> paths come through these
> > > routines.
> > > >
> > > > What this means is essentially _any_
> allocation using M_NOWAIT will
> > > bypass whatever reserves have been held back and
> will take every last page
> > > available.
> > > >
> > > > There is no documentation stating M_NOWAIT
> has this side effect of
> > > essentially being privileged, so any innocuous
> piece of code that can't
> > > block will use it. And of course M_NOWAIT is
> literally used all over.
> > > >
> > > > It looks to me like the design goal of the
> BSD allocators is on
> > > recovery; it will give all pages away knowing it
> can recover.
> > > >
> > > > Am I missing anything? I would have expected
> some small number of pages
> > > to be held in reserve just in case. And I didn't
> expect M_NOWAIT to be a
> > > sort of back door for grabbing memory.
> > > >
> > >
> > > Your analysis is right, there is nothing to add or
> correct.
> > > This is the reason to strongly prefer M_WAITOK.
> > >
> > 
> > Agreed.  Once upon time, before SMPng, M_NOWAIT
> was rarely used.  It was
> > well understand that it should only be used by
> interrupt handlers.
> > 
> > The trouble is that M_NOWAIT conflates two orthogonal
> things.  The obvious
> > being that the allocation shouldn't sleep.  The
> other being how far we're
> > willing to deplete the cache/free page queues.
> > 
> > When fine-grained locking got sprinkled throughout the
> kernel, we all to
> > often found ourselves wanting to do allocations without
> the possibility of
> > blocking.  So, M_NOWAIT became commonplace, where
> it wasn't before.
> > 
> > This had the unintended consequence of introducing a
> lot of memory
> > allocations in the top-half of the kernel, i.e.,
> non-interrupt handling
> > code, that were digging deep into the cache/free page
> queues.
> > 
> > Also, ironically, in today's kernel an "M_NOWAIT |
> M_USE_RESERVE"
> > allocation is less likely to succeed than an "M_NOWAIT"
> allocation.
> > However, prior to FreeBSD 7.x, M_NOWAIT couldn't
> allocate a cached page; it
> > could only allocate a free page.  M_USE_RESERVE
> said that it ok to allocate
> > a cached page even though M_NOWAIT was specified. 
> Consequently, the system
> > wouldn't dig as far into the free page queue if
> M_USE_RESERVE was
> > specified, because it was allowed to reclaim a cached
> page.
> > 
> > In conclusion, I think it's time that we change
> M_NOWAIT so that it doesn't
> > dig any deeper into the cache/free page queues than
> M_WAITOK does and
> > reintroduce a M_USE_RESERVE-like flag that says dig
> deep into the
> > cache/free page queues.  The trouble is that we
> then need to identify all
> > of those places that are implicitly depending on the
> current behavior of
> > M_NOWAIT also digging deep into the cache/free page
> queues so that we can
> > add an explicit M_USE_RESERVE.
> > 
> > Alan
> > 
> > P.S. I suspect that we should also increase the size of
> the "page reserve"
> > that is kept for VM_ALLOC_INTERRUPT allocations in
> vm_page_alloc*().  How
> > many legitimate users of a new M_USE_RESERVE-like flag
> in today's kernel
> > could actually be satisfied by two pages?
> 
> I am almost sure that most of people who put the M_NOWAIT
> flag, do not
> know the 'allow the deeper drain of free queue' effect. As
> such, I believe
> we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My
> only expectations
> of the problematic places would be in the swapout path.
> 
> I found a single explicit use of M_USE_RESERVE in the
> kernel,
> so the flip is relatively simple.
> 
> Below is the patch which I only compile-tested on amd64, and
> which booted
> fine.
> 
> Peter, could you, please, give it a run, to see obvious
> deadlocks, if any ?
> 
> diff --git a/sys/amd64/amd64/uma_machdep.c
> b/sys/amd64/amd64/uma_machdep.c
> index dc9c307..ab1e869 100644
> --- a/sys/amd64/amd64/uma_machdep.c
> +++ b/sys/amd64/amd64/uma_machdep.c
> @@ -29,6 +29,7 @@ __FBSDID("$FreeBSD$");
>  
>  #include <sys/param.h>
>  #include <sys/lock.h>
> +#include <sys/malloc.h>
>  #include <sys/mutex.h>
>  #include <sys/systm.h>
>  #include <vm/vm.h>
> @@ -48,12 +49,7 @@ uma_small_alloc(uma_zone_t zone, int
> bytes, u_int8_t *flags, int wait)
>      int pflags;
>  
>      *flags = UMA_SLAB_PRIV;
> -    if ((wait &
> (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -        pflags =
> VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
> -    else
> -        pflags =
> VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
> -    if (wait & M_ZERO)
> -        pflags |=
> VM_ALLOC_ZERO;
> +    pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ
> | VM_ALLOC_WIRED);
>      for (;;) {
>          m =
> vm_page_alloc(NULL, 0, pflags);
>          if (m == NULL) {
> diff --git a/sys/arm/arm/vm_machdep.c
> b/sys/arm/arm/vm_machdep.c
> index f60cdb1..75366e3 100644
> --- a/sys/arm/arm/vm_machdep.c
> +++ b/sys/arm/arm/vm_machdep.c
> @@ -651,12 +651,7 @@ uma_small_alloc(uma_zone_t zone, int
> bytes, u_int8_t *flags, int wait)
>             
> ret = ((void *)kmem_malloc(kmem_map, bytes, M_NOWAIT));
>             
> return (ret);
>          }
> -        if ((wait &
> (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -           
> pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -        else
> -           
> pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -        if (wait &
> M_ZERO)
> -           
> pflags |= VM_ALLOC_ZERO;
> +        pflags =
> m2vm_flags(wait, VM_ALLOC_WIRED);
>          for (;;) {
>              m
> = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
>              if
> (m == NULL) {
> diff --git a/sys/fs/devfs/devfs_devs.c
> b/sys/fs/devfs/devfs_devs.c
> index 71caa29..2ce1ca6 100644
> --- a/sys/fs/devfs/devfs_devs.c
> +++ b/sys/fs/devfs/devfs_devs.c
> @@ -121,7 +121,7 @@ devfs_alloc(int flags)
>      struct cdev *cdev;
>      struct timespec ts;
>  
> -    cdp = malloc(sizeof *cdp, M_CDEVP,
> M_USE_RESERVE | M_ZERO |
> +    cdp = malloc(sizeof *cdp, M_CDEVP,
> M_ZERO |
>          ((flags &
> MAKEDEV_NOWAIT) ? M_NOWAIT : M_WAITOK));
>      if (cdp == NULL)
>          return (NULL);
> diff --git a/sys/ia64/ia64/uma_machdep.c
> b/sys/ia64/ia64/uma_machdep.c
> index 37353ff..9f77762 100644
> --- a/sys/ia64/ia64/uma_machdep.c
> +++ b/sys/ia64/ia64/uma_machdep.c
> @@ -46,12 +46,7 @@ uma_small_alloc(uma_zone_t zone, int
> bytes, u_int8_t *flags, int wait)
>      int pflags;
>  
>      *flags = UMA_SLAB_PRIV;
> -    if ((wait &
> (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -        pflags =
> VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -    else
> -        pflags =
> VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -    if (wait & M_ZERO)
> -        pflags |=
> VM_ALLOC_ZERO;
> +    pflags = m2vm_flags(wait,
> VM_ALLOC_WIRED);
>  
>      for (;;) {
>          m =
> vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
> diff --git a/sys/mips/mips/uma_machdep.c
> b/sys/mips/mips/uma_machdep.c
> index 798e632..24baef0 100644
> --- a/sys/mips/mips/uma_machdep.c
> +++ b/sys/mips/mips/uma_machdep.c
> @@ -48,11 +48,7 @@ uma_small_alloc(uma_zone_t zone, int
> bytes, u_int8_t *flags, int wait)
>      void *va;
>  
>      *flags = UMA_SLAB_PRIV;
> -
> -    if ((wait &
> (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -        pflags =
> VM_ALLOC_INTERRUPT;
> -    else
> -        pflags =
> VM_ALLOC_SYSTEM;
> +    pflags = m2vm_flags(wait, 0);
>  
>      for (;;) {
>          m =
> pmap_alloc_direct_page(0, pflags);
> diff --git a/sys/powerpc/aim/mmu_oea64.c
> b/sys/powerpc/aim/mmu_oea64.c
> index a491680..3e320b9 100644
> --- a/sys/powerpc/aim/mmu_oea64.c
> +++ b/sys/powerpc/aim/mmu_oea64.c
> @@ -1369,12 +1369,7 @@ moea64_uma_page_alloc(uma_zone_t
> zone, int bytes, u_int8_t *flags, int wait)
>      *flags = UMA_SLAB_PRIV;
>      needed_lock =
> !PMAP_LOCKED(kernel_pmap);
>  
> -        if ((wait &
> (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -               
> pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -        else
> -               
> pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -        if (wait & M_ZERO)
> -               
> pflags |= VM_ALLOC_ZERO;
> +    pflags = m2vm_flags(wait,
> VM_ALLOC_WIRED);
>  
>          for (;;) {
>              
>    m = vm_page_alloc(NULL, 0, pflags |
> VM_ALLOC_NOOBJ);
> diff --git a/sys/powerpc/aim/slb.c b/sys/powerpc/aim/slb.c
> index 162c7fb..3882bfa 100644
> --- a/sys/powerpc/aim/slb.c
> +++ b/sys/powerpc/aim/slb.c
> @@ -483,12 +483,7 @@ slb_uma_real_alloc(uma_zone_t zone, int
> bytes, u_int8_t *flags, int wait)
>          realmax =
> platform_real_maxaddr();
>  
>      *flags = UMA_SLAB_PRIV;
> -    if ((wait & (M_NOWAIT |
> M_USE_RESERVE)) == M_NOWAIT)
> -        pflags =
> VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
> -    else
> -        pflags =
> VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
> -    if (wait & M_ZERO)
> -        pflags |=
> VM_ALLOC_ZERO;
> +    pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ
> | VM_ALLOC_WIRED);
>  
>      for (;;) {
>          m =
> vm_page_alloc_contig(NULL, 0, pflags, 1, 0, realmax,
> diff --git a/sys/powerpc/aim/uma_machdep.c
> b/sys/powerpc/aim/uma_machdep.c
> index 39deb43..23a333f 100644
> --- a/sys/powerpc/aim/uma_machdep.c
> +++ b/sys/powerpc/aim/uma_machdep.c
> @@ -56,12 +56,7 @@ uma_small_alloc(uma_zone_t zone, int
> bytes, u_int8_t *flags, int wait)
>      int pflags;
>      
>      *flags = UMA_SLAB_PRIV;
> -    if ((wait &
> (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -        pflags =
> VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -    else
> -        pflags =
> VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -    if (wait & M_ZERO)
> -        pflags |=
> VM_ALLOC_ZERO;
> +    pflags = m2vm_flags(wait,
> VM_ALLOC_WIRED);
>  
>      for (;;) {
>          m =
> vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
> diff --git a/sys/sparc64/sparc64/vm_machdep.c
> b/sys/sparc64/sparc64/vm_machdep.c
> index cdb94c7..573ab3a 100644
> --- a/sys/sparc64/sparc64/vm_machdep.c
> +++ b/sys/sparc64/sparc64/vm_machdep.c
> @@ -501,14 +501,7 @@ uma_small_alloc(uma_zone_t zone, int
> bytes, u_int8_t *flags, int wait)
>      PMAP_STATS_INC(uma_nsmall_alloc);
>  
>      *flags = UMA_SLAB_PRIV;
> -
> -    if ((wait &
> (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -        pflags =
> VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -    else
> -        pflags =
> VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -
> -    if (wait & M_ZERO)
> -        pflags |=
> VM_ALLOC_ZERO;
> +    pflags = m2vm_flags(wait,
> VM_ALLOC_WIRED);
>  
>      for (;;) {
>          m =
> vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
> diff --git a/sys/vm/vm_kern.c b/sys/vm/vm_kern.c
> index 46e7f1c..e4c3704 100644
> --- a/sys/vm/vm_kern.c
> +++ b/sys/vm/vm_kern.c
> @@ -222,12 +222,7 @@ kmem_alloc_attr(vm_map_t map, vm_size_t
> size, int flags, vm_paddr_t low,
>      vm_object_reference(object);
>      vm_map_insert(map, object, offset, addr,
> addr + size, VM_PROT_ALL,
>          VM_PROT_ALL, 0);
> -    if ((flags & (M_NOWAIT |
> M_USE_RESERVE)) == M_NOWAIT)
> -        pflags =
> VM_ALLOC_INTERRUPT | VM_ALLOC_NOBUSY;
> -    else
> -        pflags =
> VM_ALLOC_SYSTEM | VM_ALLOC_NOBUSY;
> -    if (flags & M_ZERO)
> -        pflags |=
> VM_ALLOC_ZERO;
> +    pflags = m2vm_flags(flags,
> VM_ALLOC_NOBUSY);
>      VM_OBJECT_LOCK(object);
>      end_offset = offset + size;
>      for (; offset < end_offset; offset +=
> PAGE_SIZE) {
> @@ -296,14 +291,7 @@ kmem_alloc_contig(vm_map_t map,
> vm_size_t size, int flags, vm_paddr_t low,
>      vm_object_reference(object);
>      vm_map_insert(map, object, offset, addr,
> addr + size, VM_PROT_ALL,
>          VM_PROT_ALL, 0);
> -    if ((flags & (M_NOWAIT |
> M_USE_RESERVE)) == M_NOWAIT)
> -        pflags =
> VM_ALLOC_INTERRUPT | VM_ALLOC_NOBUSY;
> -    else
> -        pflags =
> VM_ALLOC_SYSTEM | VM_ALLOC_NOBUSY;
> -    if (flags & M_ZERO)
> -        pflags |=
> VM_ALLOC_ZERO;
> -    if (flags & M_NODUMP)
> -        pflags |=
> VM_ALLOC_NODUMP;
> +    pflags = m2vm_flags(flags,
> VM_ALLOC_NOBUSY);
>      VM_OBJECT_LOCK(object);
>      tries = 0;
>  retry:
> @@ -487,11 +475,7 @@ kmem_back(vm_map_t map, vm_offset_t
> addr, vm_size_t size, int flags)
>          entry->wired_count == 0
> && (entry->eflags & MAP_ENTRY_IN_TRANSITION)
>          == 0, ("kmem_back: entry
> not found or misaligned"));
>  
> -    if ((flags &
> (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -        pflags =
> VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -    else
> -        pflags =
> VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -
> +    pflags = m2vm_flags(flags,
> VM_ALLOC_WIRED);
>      if (flags & M_ZERO)
>          pflags |=
> VM_ALLOC_ZERO;
>      if (flags & M_NODUMP)
> diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
> index 70b8416..0286a6d 100644
> --- a/sys/vm/vm_page.h
> +++ b/sys/vm/vm_page.h
> @@ -344,6 +344,24 @@ extern struct mtx_padalign
> vm_page_queue_mtx;
>  #define   
> VM_ALLOC_COUNT_SHIFT    16
>  #define   
> VM_ALLOC_COUNT(count)    ((count) <<
> VM_ALLOC_COUNT_SHIFT)
>  
> +#ifdef M_NOWAIT
> +static inline int
> +m2vm_flags(int malloc_flags, int alloc_flags)
> +{
> +    int pflags;
> +
> +    if ((malloc_flags & (M_NOWAIT |
> M_USE_RESERVE)) == M_NOWAIT)
> +        pflags =
> VM_ALLOC_SYSTEM | alloc_flags;
> +    else
> +        pflags =
> VM_ALLOC_INTERRUPT | alloc_flags;
> +    if (malloc_flags & M_ZERO)
> +        pflags |=
> VM_ALLOC_ZERO;
> +    if (malloc_flags & M_NODUMP)
> +        pflags |=
> VM_ALLOC_NODUMP;
> +    return (pflags);
> +}
> +#endif
> +
>  void vm_page_busy(vm_page_t m);
>  void vm_page_flash(vm_page_t m);
>  void vm_page_io_start(vm_page_t m);
>