Terrible NFS performance under 9.2-RELEASE?
John-Mark Gurney
jmg at funkthat.com
Tue Jan 28 00:28:31 UTC 2014
Rick Macklem wrote this message on Mon, Jan 27, 2014 at 18:47 -0500:
> John-Mark Gurney wrote:
> > Rick Macklem wrote this message on Sun, Jan 26, 2014 at 21:16 -0500:
> > > Btw, thanks go to Garrett Wollman for suggesting the change to
> > > MJUMPAGESIZE
> > > clusters.
> > >
> > > rick
> > > ps: If the attachment doesn't make it through and you want the
> > > patch, just
> > > email me and I'll send you a copy.
> >
> > The patch looks good, but we probably shouldn't change _readlink..
> > The chances of a link being >2k are pretty slim, and the chances of
> > the link being >32k are even smaller...
> >
> Yea, I already thought of that, actually. However, see below w.r.t.
> NFSv4.
>
> However, at this point I
> mostly want to find out if it is the long mbuf chain that causes problems
> for TSO-enabled network interfaces.
I agree, though a long mbuf chain is more of a driver issue than an
NFS issue...
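(Just for context on what I mean by "driver issue": when a transmit chain
has more segments than the NIC's TSO scatter/gather limit, the usual
driver-side fallback looks roughly like the untested sketch below; the
function name is made up and it isn't lifted from any particular driver.)

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/errno.h>

/*
 * Illustrative only: collapse an over-long transmit chain with
 * m_defrag() so the caller can retry, or drop it if that fails.
 */
static int
tso_fixup_chain(struct mbuf **m_head)
{
	struct mbuf *m;

	m = m_defrag(*m_head, M_NOWAIT);
	if (m == NULL) {
		m_freem(*m_head);
		*m_head = NULL;
		return (ENOBUFS);
	}
	*m_head = m;
	return (0);
}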
> > In fact, we might want to switch _readlink to MGET (could be
> > conditional
> > upon cnt) so that if it fits in an mbuf we don't allocate a cluster
> > for
> > it...
> >
> For NFSv4, what was an RPC for NFSv3 becomes one of several Ops. in
> a compound RPC. As such, there is no way to know how much additional
> RPC message there will be. So, although the readlink reply won't use
> much of the 4K allocation, replies for subsequent Ops. in the compound
> certainly could. (Is it more efficient to allocate 4K now and use
> part of it for subsequent message reply stuff or allocate additional
> mbuf clusters later for subsequent stuff, as required? On a small,
> memory-constrained machine, I suspect the latter is correct, but for
> the kind of hardware that has TSO scatter/gather enabled network
> interfaces, I'm not so sure. At this point, I wouldn't even say
> that using 4K clusters is going to be a win, and my hunch is that
> any win wouldn't apply to small, memory-constrained machines.)
Though the code that was patched wasn't using any partial buffers,
it was always allocating a new buffer... If the code in
_read/_readlinks starts using a previous mbuf chain, then obviously
things are different and I'd agree, always allocating a 2k/4k
cluster makes sense...
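To make the MGET-conditional-upon-cnt idea concrete, something like the
following untested sketch is what I had in mind (the function name is
made up; this isn't the actual server readlink code):

#include <sys/param.h>
#include <sys/mbuf.h>

static struct mbuf *
readlink_reply_mbuf(int cnt)
{
	struct mbuf *m;

	if (cnt <= MLEN) {
		/* Small symlink target: internal mbuf storage is enough. */
		MGET(m, M_WAITOK, MT_DATA);
	} else {
		/* Larger reply: mbuf with a 2K cluster already attached. */
		m = m_getcl(M_WAITOK, MT_DATA, 0);
	}
	return (m);
}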
> My test server has 256Mbytes of ram and it certainly doesn't show
> any improvement (big surprise;-), but it also doesn't show any
> degradation for the limited testing I've done.
I'm not too surprised; unless you're on a heavily loaded server pushing
>200MB/sec, the allocation cost is probably cheap enough that it
doesn't show up... Going to 4k means immediately half as many mbufs
are needed/allocated, and since they are page sized, they don't have the
problems of physical memory fragmentation, nor do they have to do an
IPI/TLB shootdown in the case of multipage allocations... (I'm
dealing w/ this for geli.)
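The 2k vs. 4k difference is just the cluster size requested at
allocation time; roughly, as an untested sketch (function name made up):

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * MJUMPAGESIZE clusters are page sized (4K on most platforms), so a
 * given reply needs about half as many chain segments as with the
 * standard 2K (MCLBYTES) clusters.
 */
static struct mbuf *
alloc_reply_segment(int use_4k)
{
	if (use_4k)
		return (m_getjcl(M_WAITOK, MT_DATA, 0, MJUMPAGESIZE));
	return (m_getcl(M_WAITOK, MT_DATA, 0));
}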
> Again, my main interest at this point is whether reducing the
> number of mbufs in the chain fixes the TSO issues. I think
> the question of whether or not 4K clusters are a performance
> improvement in general is an interesting one that comes later.
Another thing I noticed is that we are getting an mbuf and then
allocating a cluster... Is there a reason we aren't using something
like m_getm or m_getcl? We have a special uma zone that has an
mbuf and an mbuf cluster already paired, meaning we save some lock
operations for each segment allocated...
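In other words, instead of the two-step MGET + MCLGET pattern, a single
m_getcl() call pulls the pair from the packet zone. Untested sketch,
function names made up:

#include <sys/param.h>
#include <sys/mbuf.h>

/* Two-step pattern, as the code does now: mbuf first, then cluster. */
static struct mbuf *
alloc_segment_twostep(void)
{
	struct mbuf *m;

	MGET(m, M_WAITOK, MT_DATA);
	MCLGET(m, M_WAITOK);
	return (m);
}

/*
 * One call: the packet zone hands back an mbuf with a 2K cluster
 * already attached, saving a set of zone lock operations per segment.
 */
static struct mbuf *
alloc_segment_packetzone(void)
{
	return (m_getcl(M_WAITOK, MT_DATA, 0));
}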
--
John-Mark Gurney Voice: +1 415 225 5579
"All that I will do, has been done, All that I have, has not."