NFS server bottlenecks
Nikolay Denev
ndenev at gmail.com
Thu Oct 11 16:47:17 UTC 2012
On Oct 11, 2012, at 7:20 PM, Nikolay Denev <ndenev at gmail.com> wrote:
> On Oct 11, 2012, at 8:46 AM, Nikolay Denev <ndenev at gmail.com> wrote:
>
>>
>> On Oct 11, 2012, at 1:09 AM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>>
>>> Nikolay Denev wrote:
>>>> On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem at uoguelph.ca>
>>>> wrote:
>>>>
>>>>> Nikolay Denev wrote:
>>>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem at uoguelph.ca>
>>>>>> wrote:
>>>>>>
>>>>>>> Garrett Wollman wrote:
>>>>>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>>>>>>>> <rmacklem at uoguelph.ca> said:
>>>>>>>>
>>>>>>>>>> Simple: just use a separate mutex for each list that a cache entry
>>>>>>>>>> is on, rather than a global lock for everything. This would reduce
>>>>>>>>>> the mutex contention, but I'm not sure how significantly, since I
>>>>>>>>>> don't have the means to measure it yet.
>>>>>>>>>>
>>>>>>>>> Well, since the cache trimming is removing entries from the lists,
>>>>>>>>> I don't see how that can be done with a global lock for list updates?
>>>>>>>>
>>>>>>>> Well, the global lock is what we have now, but the cache trimming
>>>>>>>> process only looks at one list at a time, so not locking the list that
>>>>>>>> isn't being iterated over probably wouldn't hurt, unless there's some
>>>>>>>> mechanism (that I didn't see) for entries to move from one list to
>>>>>>>> another. Note that I'm considering each hash bucket a separate
>>>>>>>> "list". (One issue to worry about in that case would be cache-line
>>>>>>>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>>>>>>>> ought to be increased to reduce that.)
>>>>>>>>
>>>>>>> Yea, a separate mutex for each hash list might help. There is also the
>>>>>>> LRU list that all entries end up on, that gets used by the trimming
>>>>>>> code. (I think? I wrote this stuff about 8 years ago, so I haven't
>>>>>>> looked at it in a while.)
>>>>>>>
>>>>>>> Also, increasing the hash table size is probably a good idea,
>>>>>>> especially
>>>>>>> if you reduce how aggressively the cache is trimmed.
>>>>>>>
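The per-bucket locking idea being discussed can be sketched in userspace (Python standing in for the kernel C; every name here is hypothetical, not the actual nfsrvcache structures): each hash bucket gets its own lock, so request threads touching different buckets never contend, and the trimmer holds only one bucket lock at a time.

```python
import threading

NFSRVCACHE_HASHSIZE = 500   # a larger table spreads entries across more buckets

class DupReqCache:
    """Toy duplicate-request cache: one mutex per hash bucket, not one global."""

    def __init__(self, nbuckets=NFSRVCACHE_HASHSIZE):
        self.buckets = [dict() for _ in range(nbuckets)]
        self.locks = [threading.Lock() for _ in range(nbuckets)]

    def _idx(self, xid):
        return hash(xid) % len(self.buckets)

    def lookup_or_insert(self, xid, reply):
        """Return the cached reply for a retransmitted xid, else cache and return None."""
        i = self._idx(xid)
        with self.locks[i]:                # contention is limited to one bucket
            if xid in self.buckets[i]:
                return self.buckets[i][xid]
            self.buckets[i][xid] = reply
            return None

    def trim_bucket(self, i, keep):
        """Drop all but the newest `keep` entries of bucket i (dicts keep insertion order)."""
        with self.locks[i]:                # the trimmer never blocks other buckets
            b = self.buckets[i]
            for xid in list(b)[:max(0, len(b) - keep)]:
                del b[xid]
```

The cache-line contention Garrett mentions is not modeled here; in the kernel the per-bucket locks would also want padding/alignment so adjacent bucket locks don't share a cache line.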
>>>>>>>>> Only doing it once/sec would result in a very large cache when
>>>>>>>>> bursts of traffic arrive.
>>>>>>>>
>>>>>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>>>>>>
>>>>>>> This code was originally "production tested" on a server with
>>>>>>> 1Gbyte,
>>>>>>> so times have changed a bit;-)
>>>>>>>
>>>>>>>>> I'm not sure I see why doing it as a separate thread will improve
>>>>>>>>> things. There are N nfsd threads already (N can be bumped up to 256
>>>>>>>>> if you wish) and having a bunch more "cache trimming threads" would
>>>>>>>>> just increase contention, wouldn't it?
>>>>>>>>
>>>>>>>> Only one cache-trimming thread. The cache trim holds the (global)
>>>>>>>> mutex for much longer than any individual nfsd service thread has any
>>>>>>>> need to, and having N threads doing that in parallel is why it's so
>>>>>>>> heavily contended. If there's only one thread doing the trim, then
>>>>>>>> the nfsd service threads aren't spending time contending on the mutex
>>>>>>>> (it will be held less frequently and for shorter periods).
>>>>>>>>
>>>>>>> I think the little drc2.patch, which will keep the nfsd threads from
>>>>>>> acquiring the mutex and doing the trimming most of the time, might be
>>>>>>> sufficient. I still don't see why a separate trimming thread will be
>>>>>>> an advantage. I'd also be worried that the one cache trimming thread
>>>>>>> won't get the job done soon enough.
>>>>>>>
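A sketch of the behaviour drc2.patch is described as having (Python stand-in with hypothetical names; the real logic lives in the kernel's DRC code): the request path does a cheap size check first and only takes the trim mutex once the cache has grown past a highwater mark, so the common case touches no lock at all.

```python
import threading

class TrimmedCache:
    """Trim only when the entry count exceeds a highwater mark, so most
    requests skip the trim mutex entirely."""

    def __init__(self, highwater):
        self.highwater = highwater
        self.entries = []
        self.lock = threading.Lock()

    def add(self, entry):
        self.entries.append(entry)
        # cheap unlocked check first (racy, but a slightly late trim is harmless)
        if len(self.entries) > self.highwater:
            self.trim()

    def trim(self):
        with self.lock:
            excess = len(self.entries) - self.highwater
            if excess > 0:                 # re-check under the lock
                del self.entries[:excess]  # drop the oldest entries
```

The point of the pattern is the unlocked fast-path test: with a high enough highwater mark, most RPCs never reach `trim()` and therefore never contend on the mutex.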
>>>>>>> When I did production testing on a 1Gbyte server that saw a peak
>>>>>>> load of about 100RPCs/sec, it was necessary to trim aggressively.
>>>>>>> (Although I'd be tempted to say that a server with 1Gbyte is no
>>>>>>> longer relevant, I recall someone recently trying to run FreeBSD
>>>>>>> on an i486, although I doubt they wanted to run the nfsd on it.)
>>>>>>>
>>>>>>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>>>>>>> threads doing it would be a (I believe negligible) increase in RPC
>>>>>>>>> response times (the time the nfsd thread spends trimming the cache).
>>>>>>>>> As noted, I think this time would be negligible compared to disk I/O
>>>>>>>>> and network transit times in the total RPC response time?
>>>>>>>>
>>>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>>>>>>>> network connectivity, spinning on a contended mutex takes a
>>>>>>>> significant amount of CPU time. (For the current design of the NFS
>>>>>>>> server, it may actually be a win to turn off adaptive mutexes -- I
>>>>>>>> should give that a try once I'm able to do more testing.)
>>>>>>>>
>>>>>>> Have fun with it. Let me know when you have what you think is a good
>>>>>>> patch.
>>>>>>>
>>>>>>> rick
>>>>>>>
>>>>>>>> -GAWollman
>>>>>>>> _______________________________________________
>>>>>>>> freebsd-hackers at freebsd.org mailing list
>>>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>>>>>>>> To unsubscribe, send any mail to
>>>>>>>> "freebsd-hackers-unsubscribe at freebsd.org"
>>>>>>> _______________________________________________
>>>>>>> freebsd-fs at freebsd.org mailing list
>>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>>>>>>> To unsubscribe, send any mail to
>>>>>>> "freebsd-fs-unsubscribe at freebsd.org"
>>>>>>
>>>>>> My quest for IOPS over NFS continues :)
>>>>>> So far I'm not able to achieve more than about 3000 8K read requests
>>>>>> over NFS, while the server locally gives much more.
>>>>>> And this is all from a file that is completely in ARC cache, no disk
>>>>>> IO involved.
>>>>>>
>>>>> Just out of curiosity, why do you use 8K reads instead of 64K reads?
>>>>> Since the RPC overhead (including the DRC functions) is per RPC, doing
>>>>> fewer, larger RPCs should usually work better. (Sometimes large
>>>>> rsize/wsize values generate too large a burst of traffic for a network
>>>>> interface to handle, and then the rsize/wsize has to be decreased to
>>>>> avoid this issue.)
>>>>>
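Back-of-the-envelope arithmetic for the per-RPC overhead point above: for a fixed amount of data, 64K reads need one eighth as many READ RPCs (and hence DRC lookups) as 8K reads. A quick sketch:

```python
def rpcs_needed(total_bytes, rsize):
    """READ RPCs required to fetch total_bytes at a given rsize (ceiling division)."""
    return (total_bytes + rsize - 1) // rsize

GiB = 1 << 30
file_size = 32 * GiB                        # a 32G file, as in the benchmark runs

print(rpcs_needed(file_size, 8 * 1024))     # 4194304 RPCs at rsize=8K
print(rpcs_needed(file_size, 64 * 1024))    # 524288 RPCs at rsize=64K
```

Every one of those RPCs pays the fixed cost of nfsrvd_getcache()/nfsrvd_updatecache() and the mutex acquisition, which is why the RPC count, not just the byte count, matters.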
>>>>> And, although this experiment seems useful for testing patches that
>>>>> try and reduce DRC CPU overheads, most "real" NFS servers will be
>>>>> doing disk I/O.
>>>>>
>>>>
>>>> This is the default block size that Oracle and probably most other
>>>> databases use. It also uses larger blocks, but for small random reads
>>>> in OLTP applications this is what is used.
>>>>
>>> If the client is doing 8K reads, you could increase the read ahead
>>> "readahead=N" (N up to 16), to try and increase the bandwidth.
>>> (But if the CPU is 99% busy, then I don't think it will matter.)
>>
>> I'll try to check whether this can be set, as we are testing not only with the Linux NFS client,
>> but also with Oracle's so-called DirectNFS client that is built into the app.
>>
>>>
>>>>
>>>>>> I've snatched some sample DTrace script from the net : [
>>>>>> http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes
>>>>>> ]
>>>>>>
>>>>>> And modified it for our new NFS server :
>>>>>>
>>>>>> #!/usr/sbin/dtrace -qs
>>>>>>
>>>>>> fbt:kernel:nfsrvd_*:entry
>>>>>> {
>>>>>>     self->ts = timestamp;
>>>>>>     @counts[probefunc] = count();
>>>>>> }
>>>>>>
>>>>>> fbt:kernel:nfsrvd_*:return
>>>>>> / self->ts > 0 /
>>>>>> {
>>>>>>     this->delta = (timestamp - self->ts) / 1000000;
>>>>>> }
>>>>>>
>>>>>> fbt:kernel:nfsrvd_*:return
>>>>>> / self->ts > 0 && this->delta > 100 /
>>>>>> {
>>>>>>     @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
>>>>>> }
>>>>>>
>>>>>> fbt:kernel:nfsrvd_*:return
>>>>>> / self->ts > 0 /
>>>>>> {
>>>>>>     @dist[probefunc, "ms"] = quantize(this->delta);
>>>>>>     self->ts = 0;
>>>>>> }
>>>>>>
>>>>>> END
>>>>>> {
>>>>>>     printf("\n");
>>>>>>     printa("function %-20s %@10d\n", @counts);
>>>>>>     printf("\n");
>>>>>>     printa("function %s(), time in %s:%@d\n", @dist);
>>>>>>     printf("\n");
>>>>>>     printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
>>>>>> }
>>>>>>
>>>>>> And here's a sample output from one or two minutes during the run
>>>>>> of
>>>>>> Oracle's ORION benchmark
>>>>>> tool from a Linux machine, on a 32G file on NFS mount over 10G
>>>>>> ethernet:
>>>>>>
>>>>>> [16:01]root at goliath:/home/ndenev# ./nfsrvd.d
>>>>>> ^C
>>>>>>
>>>>>> function nfsrvd_access 4
>>>>>> function nfsrvd_statfs 10
>>>>>> function nfsrvd_getattr 14
>>>>>> function nfsrvd_commit 76
>>>>>> function nfsrvd_sentcache 110048
>>>>>> function nfsrvd_write 110048
>>>>>> function nfsrvd_read 283648
>>>>>> function nfsrvd_dorpc 393800
>>>>>> function nfsrvd_getcache 393800
>>>>>> function nfsrvd_rephead 393800
>>>>>> function nfsrvd_updatecache 393800
>>>>>>
>>>>>> function nfsrvd_access(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>>>>>> 1 | 0
>>>>>>
>>>>>> function nfsrvd_statfs(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>>>>>> 1 | 0
>>>>>>
>>>>>> function nfsrvd_getattr(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>>>>>> 1 | 0
>>>>>>
>>>>>> function nfsrvd_sentcache(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>>>>>> 1 | 0
>>>>>>
>>>>>> function nfsrvd_rephead(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>>> 1 | 0
>>>>>>
>>>>>> function nfsrvd_updatecache(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>>> 1 | 0
>>>>>>
>>>>>> function nfsrvd_getcache(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>>>>>> 1 | 1
>>>>>> 2 | 0
>>>>>> 4 | 1
>>>>>> 8 | 0
>>>>>>
>>>>>> function nfsrvd_write(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>>>>>> 1 | 5
>>>>>> 2 | 4
>>>>>> 4 | 0
>>>>>>
>>>>>> function nfsrvd_read(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>>>>>> 1 | 19
>>>>>> 2 | 3
>>>>>> 4 | 2
>>>>>> 8 | 0
>>>>>> 16 | 1
>>>>>> 32 | 0
>>>>>> 64 | 0
>>>>>> 128 | 0
>>>>>> 256 | 1
>>>>>> 512 | 0
>>>>>>
>>>>>> function nfsrvd_commit(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44
>>>>>> 1 |@@@@@@@ 14
>>>>>> 2 | 0
>>>>>> 4 |@ 1
>>>>>> 8 |@ 1
>>>>>> 16 | 0
>>>>>> 32 |@@@@@@@ 14
>>>>>> 64 |@ 2
>>>>>> 128 | 0
>>>>>>
>>>>>>
>>>>>> function nfsrvd_commit(), time in ms for >= 100 ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> < 100 | 0
>>>>>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>>> 150 | 0
>>>>>>
>>>>>> function nfsrvd_read(), time in ms for >= 100 ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> 250 | 0
>>>>>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>>> 350 | 0
>>>>>>
>>>>>>
>>>>>> Looks like the nfs server cache functions are quite fast, but
>>>>>> extremely frequently called.
>>>>>>
>>>>> Yep, they are called for every RPC.
>>>>>
>>>>> I may try coding up a patch that replaces the single mutex with
>>>>> one for each hash bucket, for TCP.
>>>>>
>>>>> I'll post if/when I get this patch to a testing/review stage, rick
>>>>>
>>>>
>>>> Cool.
>>>>
>>>> I've readjusted the precision of the dtrace script a bit, and I can now
>>>> see the following three functions taking most of the time:
>>>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache()
>>>>
>>>> This was recorded during an Oracle benchmark run called SLOB, which
>>>> caused 99% CPU load on the NFS server.
>>>>
>>> Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater?
>>> (Assuming the mounts are TCP ones.)
>>>
>>> Have fun with it, rick
>>>
>>
>> I had upped it, but probably not enough. I'm now running with vfs.nfsd.tcphighwater set
>> to some ridiculous number, and NFSRVCACHE_HASHSIZE set to 500.
>> So far it looks like a good improvement, as those functions no longer show up in the dtrace script output.
>> I'll run some more benchmarks and tests today.
>>
>> Thanks!
>>
>>>>
>>>>>> I hope someone can find this information useful.
>>>>>>
>>
>
> I haven't had the opportunity today to run more DB tests over NFS, as the DBA was busy with something else,
> but I tested large file transfers a bit.
> And I'm seeing something strange, probably not only NFS related but also ZFS and ARC related.
>
> When I first tested the drc2.patch I reported a huge bandwidth improvement,
> but now I think that was probably because the machine had been freshly rebooted, not just because of the patch.
> The patch surely improved things, especially CPU utilization combined with the increased cache.
> But today I again have a file completely cached in ZFS's ARC, which when transferred over NFS
> reaches about 300MB/s, whereas in some tests it reached 900+MB/s (as reported in my first email).
> Locally the file can be read at about 3GB/s, as reported by dd.
>
> Local:
> [17:51]root at goliath:/tank/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes transferred in 10.548485 secs (3053733573 bytes/sec)
>
> Over NFS:
> [17:48]root at spa:/mnt/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s
>
> The machines are almost idle during this transfer, and I can't see a reason why it can't reach the full
> bandwidth when it's just reading from RAM.
>
> I tried tracing again with DTrace to see what's happening, with this script:
>
> fbt:kernel:nfs*:entry
> {
>     /* self-> is thread-local, so the timestamp survives to the :return probe
>      * (a clause-local this-> would not persist between the two probes) */
>     self->ts = timestamp;
>     @counts[probefunc] = count();
> }
>
> fbt:kernel:nfs*:return
> / self->ts > 0 /
> {
>     @time[probefunc] = avg(timestamp - self->ts);
>     self->ts = 0;
> }
>
> END
> {
>     trunc(@counts, 10);
>     trunc(@time, 10);
>     printf("Top 10 called functions\n\n");
>     printa(@counts);
>     printf("\n\nTop 10 slowest functions\n\n");
>     printa(@time);
> }
>
> And here's the result (several seconds during the dd test):
>
> Top 10 called functions
> nfsrc_freecache 88849
> nfsrc_wanted 88849
> nfsrv_fillattr 88849
> nfsrv_postopattr 88849
> nfsrvd_read 88849
> nfsrvd_rephead 88849
> nfsrvd_updatecache 88849
> nfsvno_testexp 88849
> nfsrc_trimcache 177697
> nfsvno_getattr 177698
>
> Top 10 slowest functions
> nfsd_excred 5673
> nfsrc_freecache 5674
> nfsrv_postopattr 5970
> nfsrv_servertimer 6327
> nfssvc_nfscommon 6596
> nfsd_fhtovp 8000
> nfsrvd_read 8380
> nfssvc_program 92752
> nfsvno_read 124979
> nfsvno_fhtovp 1789523
>
> I might now try to trace what nfsvno_fhtovp() is doing and where it's spending its time.
>
> Any other ideas are welcome :)
>
To take the network out of the equation I redid the test by mounting the same filesystem over NFS on the server:
[18:23]root at goliath:~# mount -t nfs -o rw,hard,intr,tcp,nfsv3,rsize=1048576,wsize=1048576 localhost:/tank/spa_db/undo /mnt
[18:24]root at goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes transferred in 79.793343 secs (403696120 bytes/sec)
[18:25]root at goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes transferred in 12.033420 secs (2676900110 bytes/sec)
During the first run I saw several nfsd threads in top, along with dd, and again zero disk I/O.
There was an increase in memory usage because of the double buffering ARC -> buffer cache.
The second run was with all of the nfsd threads totally idle, reading directly from the buffer cache.