kernel memory allocator: UMA or malloc?
Rick Macklem
rmacklem at uoguelph.ca
Sat Mar 15 01:44:13 UTC 2014
John-Mark Gurney wrote:
> Rick Macklem wrote this message on Thu, Mar 13, 2014 at 18:22 -0400:
> > John-Mark Gurney wrote:
> > > Rick Macklem wrote this message on Wed, Mar 12, 2014 at 21:59 -0400:
> > > > John-Mark Gurney wrote:
> > > > > Rick Macklem wrote this message on Tue, Mar 11, 2014 at 21:32 -0400:
> > > > > > I've been working on a patch provided by wollman@, where
> > > > > > he uses UMA instead of malloc() to allocate an iovec array
> > > > > > for use by the NFS server's read.
> > > > > >
> > > > > > So, my question is:
> > > > > > When is it preferable to use UMA(9) vs malloc(9) if the
> > > > > > allocation is going to be a fixed size?
> > > > >
> > > > > UMA has benefits if the structure size is uniform and a
> > > > > non-power of 2..
> > > > > In this case, it can pack the items more densely; say, a 192
> > > > > byte allocation can fit 21 allocations in a 4k page versus
> > > > > malloc, which would round it up to 256 bytes, leaving only 16
> > > > > per page... These counts per page are probably different as
> > > > > UMA may keep some information in the page...
> > > > >
> > > > Ok, this one might apply. I need to look at the size.
> > > >
> > > > > It also has the benefit of being able to keep allocations
> > > > > "half alive"..
> > > > > "freed" objects can be partly initialized with references to
> > > > > buffers and other allocations still held by them... Then if
> > > > > the system needs to fully free your allocation, it can, and
> > > > > will call your function to release these remaining
> > > > > resources... look at the ctor/dtor uminit/fini functions in
> > > > > uma(9) for more info...
> > > > >
> > > > > uma also allows you to set a hard limit on the number of
> > > > > allocations the zone provides...
> > > > >
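For concreteness, here is a rough sketch of what setting up a dedicated
zone with a hard limit might look like; the zone name, item size, and
limit below are made up for illustration and are not from the actual
patch:

    static uma_zone_t iov_zone;
    struct iovec *iv;

    /* Fixed-size items, so no ctor/dtor or uminit/fini are needed here. */
    iov_zone = uma_zcreate("example_iov", 16 * sizeof(struct iovec),
        NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
    uma_zone_set_max(iov_zone, 1024);   /* hard cap on items handed out */

    iv = uma_zalloc(iov_zone, M_WAITOK);
    /* ... fill in and use the iovec array ... */
    uma_zfree(iov_zone, iv);
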
> > > > Yep. None of the above applies to this case, but thanks for the
> > > > good points for a future case. (I've seen where this gets used
> > > > for the "secondary zone" for mbufs+cluster.)
> > > >
> > > > > Hope this helps...
> > > > >
> > > > Yes, it did. Thanks.
> > > >
> > > > Does anyone know if there is a significant performance
> > > > difference if the allocation is a power of 2 and the "half
> > > > alive" cases don't apply?
> > >
> > > From my understanding, the malloc case is "slightly" slower as it
> > > needs to look up which bucket to use, but after the lookup, the
> > > buckets are UMA, so the performance will be the same...
> > >
> > > > Thanks all for your help, rick
> > > > ps: Garrett's patch switched to using a fixed size allocation
> > > > and using UMA(9).
> > > > Since I have found that a uma allocation request with M_WAITOK
> > > > can get the thread stuck sleeping in "btalloc", I am a bit shy
> > > > of using it when I've never
> > >
> > > Hmm... I took a look at the code, and if you're stuck in
> > > btalloc, either pause(9) isn't working, or you're looping, which
> > > probably means you're really low on memory...
> > >
> > Well, this was an i386 with the default of about 400Mbytes of
> > kernel memory (address space, if I understand it correctly). Since
> > it seemed to persist in this state, I assumed that it was looping
> > and, therefore, wasn't able to find a page sized and page aligned
> > chunk of kernel address space to use. (The rest of the system was
> > still running ok.)
>
> It looks like vm.phys_free would have some useful information about
> the availability of free memory... I'm not sure if this is where the
> allocators get their memory or not.. I was about to say it seemed
> weird we only have 16K as the largest allocation, but that's 16MEGs..
>
I can't reproduce it reliably. I saw it twice during several days of
testing.
> > I did email about this and since no one had a better
> > explanation/fix, I avoided the problem by using M_NOWAIT on the
> > m_getjcl() call.
> >
> > Although I couldn't reproduce this reliably, it seemed to happen
> > more easily when my code was doing a mix of MCLBYTES and
> > MJUMPAGESIZE cluster allocation. Again, just a hunch, but maybe
> > the MCLBYTES cluster allocations were fragmenting the address
> > space to the point where a page sized chunk aligned to a page
> > boundary couldn't be found.
>
> By definition, you would be out of memory if there is not a page free
> (that is aligned to a page boundary, which all pages are)...
>
> It'd be interesting to put a printf w/ the pause to see if it is
> looping, and to get a sysctl -a from the machine when it is
> happening...
>
> > Alternatively, the code for M_WAITOK is broken in some way not
> > obvious to me.
> >
> > Either way, I avoid it by using M_NOWAIT. I also fall back on:
> > MGET(..M_WAITOK);
> > MCLGET(..M_NOWAIT);
> > which has a "side effect" of draining the mbuf cluster zone if the
> > MCLGET(..M_NOWAIT) fails to get a cluster. (For some reason
> > m_getcl() and m_getjcl() do not drain the cluster zone when they
> > fail?)
>
> Why aren't you using m_getcl(9), which does both of the above
> automatically for you? And it is faster, since there is a special
> uma zone that has both an mbuf and an mbuf cluster paired up already?
>
Well, remember this is only done as a fallback if m_getjcl(..M_NOWAIT..)
fails (returns NULL).
--> It will only happen, rarely, when there are no easily allocatable
    clusters. For that case, I wanted something that will reliably get
    at least an mbuf without getting stuck in "btalloc".
If I used m_getcl(..M_NOWAIT..) it could still fail, and then I wouldn't
even have an mbuf.
If I used m_getcl(..M_WAITOK..) it could get stuck in "btalloc", since
memory is already constrained at this point (m_getjcl(..M_NOWAIT..) has
just failed).
Also (and I don't know why), only m_clget(..M_NOWAIT..) does a drain on
the mbuf cluster zone. This is not done by m_getcl() or m_getjcl(), from
what I saw when I looked at the code.
Note that since the above uses M_WAITOK for m_get() and M_NOWAIT for
m_clget(), it may get only an mbuf and no cluster, but I can live with
that.
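In other words, the allocation path in the patch looks roughly like the
sketch below; this is only an illustration, and the exact type/flags
arguments in the real code may differ:

    m = m_getjcl(M_NOWAIT, MT_DATA, 0, MJUMPAGESIZE);
    if (m == NULL) {
        /* Fall back to the old reliable way. */
        MGET(m, M_WAITOK, MT_DATA);   /* always returns an mbuf */
        MCLGET(m, M_NOWAIT);          /* per above: drains the cluster zone if it fails */
        if ((m->m_flags & M_EXT) == 0) {
            /* No cluster attached; make do with the mbuf's own data area. */
        }
    }
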
> > One of the advantages of having very old/small hardware to test
> > on;-)
>
> :)
>
> > > > had a problem with malloc(). Btw, this was for a pagesize
> > > > cluster allocation, so it might be related to the alignment
> > > > requirement (and running on a small i386, so the kernel
> > > > address space is relatively small).
> > >
> > > Yeh, if you put additional alignment requirements, that's
> > > probably it, but if you needed these alignment requirements, how
> > > was malloc satisfying your request?
> > >
> > This was for an m_getjcl(MJUMPAGESIZE, M_WAITOK..), so for this
> > case I've never done a malloc(). The code in head (which my patch
> > uses as a fallback when m_getjcl(..M_NOWAIT..) fails) does (as
> > above):
> > MGET(..M_WAITOK);
> > MCLGET(..M_NOWAIT);
>
> When that fails, an netstat -m would also be useful to see what the
> stats think of the availability of page size clusters...
>
This has never failed in testing. The case that would get stuck in
"btalloc" was a:
m_getjcl(..M_WAITOK..);
- same as m_getcl(), but sometimes asking for a MJUMPAGESIZE cluster
instead of a MCLBYTES cluster.
The current patch still does a m_getjcl() call, but with M_NOWAIT.
Then, if that returns NULL, it falls back to the old reliable way, as
above.
> > > > I do see that switching to a fixed size allocation to cover
> > > > the common case is a good idea, but I'm not sure if setting up
> > > > a uma zone is worth the effort over malloc()?
> > >
> > > I'd say it depends upon the size and the number of allocations...
> > > If you're allocating many megabytes of memory, and the wastage is
> > > 50%+, then think about it, but if it's just a few objects, then
> > > the coding time and maintenance isn't worth it..
> > >
> > Btw, I think the allocation is a power of 2. (It is a power of 2
> > times sizeof(struct iovec), and it looks to me that sizeof(struct
> > iovec) is a power of 2 as well: I know it is 8 on i386, and I think
> > most 64bit arches will make it 16, since it is a pointer and a
> > size_t.)
>
> yes, struct iovec is 16 on amd64...
>
> (kgdb) print sizeof(struct iovec)
> $1 = 16
>
> > This was part of Garrett's patch, so I'll admit I would have been
> > too lazy to do it.;-) Now it's in the current patch, so unless
> > there seems to be a reason to take it out..??
> >
> > Garrett mentioned that UMA(9) has a per-CPU cache. I'll admit I
> > don't
> > know what that implies?
>
> a per-CPU cache means that on an SMP system, you can lock the local
> pool instead of grabbing a global lock.. This will be MUCH faster, as
> the local lock won't have to bounce around CPUs like a global lock
> does, plus it should never contend, and contention is what really
> puts the brakes on sync primitives...
>
> > - I might guess that a per-CPU cache would be useful for items
> >   that get re-allocated a lot with minimal change to the data in
> >   the slab.
> >   --> It seems to me that if most of the bytes in the slab have
> >       the same bits, then you might improve hit rate on the CPU's
> >       memory caches, but since I haven't looked at this, I could
> >       be way off??
>
> caching will help some, but the lock is the main one...
>
> > - For this case, the iovec array that is allocated is filled in
> >   with different mbuf data addresses each time, so minimal change
> >   doesn't apply.
>
> So, this is where a UMA half alive object could be helpful... Say
> that you always need to allocate an iovec + 8 mbuf clusters to
> populate the iovec... What you can do is have a uma uminit function
> that allocates the memory for the iovec and 8 mbuf clusters, and
> populates the iovec w/ the correct addresses... Then when you call
> uma_zalloc, the iovec is already initialized, and you just go on
> your merry way instead of doing all that work... When you uma_zfree,
> you don't have to worry about losing the clusters, as the next
> uma_zalloc might return the same object w/ the clusters already
> present... When the system gets low on memory, it will call your
> fini function, which will need to free the clusters....
>
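If I understand that suggestion correctly, the uminit/fini pair would
look something like the rough sketch below; the structure, names, and
cluster count are made up for illustration, and error handling is
simplified:

    #define EX_IOVCNT   8               /* made-up cluster count */

    struct ex_iov_item {
        struct iovec    iov[EX_IOVCNT];
        struct mbuf     *m[EX_IOVCNT];
    };

    static uma_zone_t ex_iov_zone;

    static int
    ex_iov_init(void *mem, int size, int how)
    {
        struct ex_iov_item *it = mem;
        int i;

        for (i = 0; i < EX_IOVCNT; i++) {
            it->m[i] = m_getjcl(how, MT_DATA, 0, MJUMPAGESIZE);
            if (it->m[i] == NULL) {
                /* Undo what we did and fail the init. */
                while (--i >= 0)
                    m_freem(it->m[i]);
                return (ENOMEM);
            }
            it->iov[i].iov_base = mtod(it->m[i], void *);
            it->iov[i].iov_len = MJUMPAGESIZE;
        }
        return (0);
    }

    static void
    ex_iov_fini(void *mem, int size)
    {
        struct ex_iov_item *it = mem;
        int i;

        /* Only called when the item is given back to the VM. */
        for (i = 0; i < EX_IOVCNT; i++)
            m_freem(it->m[i]);
    }

    ex_iov_zone = uma_zcreate("example_iov_clust",
        sizeof(struct ex_iov_item), NULL, NULL,
        ex_iov_init, ex_iov_fini, UMA_ALIGN_PTR, 0);

The ctor/dtor slots are left NULL since, in this sketch, nothing needs
to change on every uma_zalloc/uma_zfree; the clusters just ride along
with the cached item until fini frees them.
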
> > - Does the per-CPU cache help w.r.t. UMA(9) internal code perf?
> >
> > So, lots of questions that I don't have an answer for. However,
> > unless there is a downside to using UMA(9) for this, the code is
> > written and I'm ok with it.
>
> Nope, not really...
>
> --
> John-Mark Gurney Voice: +1 415 225 5579
>
> "All that I will do, has been done, All that I have, has not."