Re: optimising nfs and nfsd
- Reply: void : "Re: optimising nfs and nfsd"
- In reply to: void : "Re: optimising nfs and nfsd"
Date: Tue, 31 Oct 2023 00:31:50 UTC
On Mon, Oct 30, 2023 at 6:48 AM void <void@f-m.fm> wrote:
>
> Hi Rick, thanks for the info
>
> On Sun, 29 Oct 2023, at 20:28, Rick Macklem wrote:
>
> > In summary, if you are getting near wire speed and you
> > are comfortable with your security situation, then there
> > isn't much else to do.
>
> It seems to depend on the nature of the workload. Sometimes
> wire speed, sometimes half that. And then:
>
> 1. some clients - many reads of small files, hardly any writes
> 2. others - many reads, loads of writes
> 3. same as {1,2} above, huge files
> 4. how many clients access at once
> 5. how many clients of [1] and [2] types access at the same time

Well, here are a couple more things to look at:

Number of nfsd threads:
- I prefer to set the min/max to the same value (which is what the
  "-n" option on nfsd does). Then, after the server has been running
  for a while under production load, I do:
  # ps axHl | fgrep nfsd
  and look to see how many of the threads have a TIME of 0:00.00.
  (These are extra threads that are not needed.)
  If there is a moderate number of these, I consider it aok.
  If there are none of these, more threads could improve NFS performance.
  If there are lots of these, the number can be decreased, but idle
  threads don't result in much overhead, so I err on the large # side.
- If you have min set to less than max, the above trick doesn't work,
  but I'd say that if the command shows the max # of threads, the max
  could be increased.
  These numbers can be configured via options on the nfsd command line.
  If you aren't running nfsd in a jail, you can also fiddle with them
  via the sysctls:
  vfs.nfsd.minthreads
  vfs.nfsd.maxthreads
  The caveat is that, if the NFS server is also doing other things,
  increasing the number of nfsd threads can result in nfsd "hogging"
  the system.
  --> You might be forced to reduce the number of threads to avoid this.

I prefer to set min/max to the same value for a couple of reasons:
- The above trick for determining whether I have enough threads works.
- NFS traffic is very bursty. I want the threads to be sitting there
  ready to handle a burst of RPC requests, instead of the server code
  spinning up threads after it sees the burst of requests.
- Extra threads are not much overhead: an entry in the proc table
  plus a few Kbytes for a kernel stack.
  (Others will disagree with this, I suspect;-)

NFSv4 server hash table sizes:
Run "nfsstat -E -s" on the server after it has been up under
production load for a while. Look at the section near the end
called "Server:".
The number under "Clients" should be roughly the number of client
systems that have NFSv4 mounts against the server.
The two tunables:
vfs.nfsd.clienthashsize
vfs.nfsd.sessionhashsize
should be something like 10% of the number of Clients.

Then add the numbers under "Opens", "Locks" and "Delegs".
The two tunables:
vfs.nfsd.fhhashsize
vfs.nfsd.statehashsize
should be something like 5-10% of that total.

If the sizes are a lot less than the above, the nfsd will spend more
CPU rattling down rather long lists of entries, searching for a match.
The above four tunables must be set in /boot/loader.conf and the NFS
server system rebooted for the change to take effect.
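To tie the above together, here is a sketch of what the settings might
look like. The numbers are purely illustrative placeholders for a
made-up server; size yours from your own "nfsstat -E -s" output and the
"ps axHl | fgrep nfsd" check described above.

# /boot/loader.conf (takes effect after a reboot)
# something like 10% of the "Clients" count from nfsstat -E -s
vfs.nfsd.clienthashsize="20"
vfs.nfsd.sessionhashsize="20"
# something like 5-10% of Opens + Locks + Delegs
vfs.nfsd.fhhashsize="500"
vfs.nfsd.statehashsize="500"

# /etc/rc.conf (the -n option sets min/max threads to the same value)
nfs_server_enable="YES"
nfsv4_server_enable="YES"
nfs_server_flags="-u -t -n 64"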
Now, this one is in the "buyer beware" category...
NFS clients can do writes one of two ways (there are actually others,
but they aren't worth discussing):
A - Write/unstable, Write/unstable, ..., Commit
B - Write/file_sync, Write/file_sync, ...

After the Commit for (A) and after every Write for (B), the server is
required to have all data/metadata changes committed to stable storage,
so that a crash immediately after replying to the RPC will not result
in data loss/corruption.
The problem is that this can result in slow write performance for an
NFS server.
If you understand that data loss/corruption can occur after a server
crash/reboot and can live with that, an NFS server can be configured to
"cheat" and not commit the data/metadata to stable storage right away,
improving performance.
I'm no ZFS guy, but I think "sync=disabled" does this for ZFS.
You can also set:
vfs.nfsd.async=1
to make the NFS server reply that data has been File_sync'd, so that
the client never needs to do a Commit even when it specified Unstable.
*** Do this at your peril.
Back when I worked for a living, I did this on an NFS server that
stored undergrad student home dirs. The server was slow but solid, and
undergrads could have survived some corruption if the server did
crash/reboot (I don't recall that it ever did crash).
Again, I'm no ZFS guy, but I think that setting up a ZIL on a dedicated
fast storage device (or a mirrored pair of them) is the better/correct
way to deal with this.

NIC performance:
- Most NFS requests/replies are small (100-200 byte) messages that end
  up in their own net packet. This implies that a 1Gbps NIC might need
  to handle 1000+ messages in each direction per second, concurrently.
  --> I strongly suspect that not all 1Gbps NICs/drivers can handle
      1000+ sends and 1000+ receives per second. If a NIC cannot, that
      will impact NFS performance.
  A simple test that will load an NFS server this way is a "ls -lR" of
  a large subtree of small directories on the NFS mount.
  --> The fix is probably using a different NIC/driver.

>
> looking for an all-in-one synthetic tester if there's such a thing.
None that I am aware of. SPEC had (does SPEC still exist?) an NFS
server load benchmark, but it was not a freebie, so I have no access to
it. (If I recall correctly, you/your company had to become a SPEC
member, agree to the terms under which testing and publication of
results could be done, etc and so forth.)

rick

>
> Large single client transfers client to server are wire speed.
> Not tested much else, (not sure how), except with dd but that's
> not really a real-world workload. I'll try the things you suggested.
>
> what I can report now, on the server, so before nfs is considered:
>
> dd if=/dev/urandom of=test-128k.bin bs=128k count=64000 status=progress
> 8346009600 bytes (8346 MB, 7959 MiB) transferred 59.001s, 141 MB/s
>
> dd if=test-128k.bin of=/dev/null bs=128k status=progress
> 6550061056 bytes (6550 MB, 6247 MiB) transferred 3.007s, 2178 MB/s
>
> dd if=/dev/urandom of=test-4k.bin bs=4k count=2048000 status=progress
> 8301215744 bytes (8301 MB, 7917 MiB) transferred 78.063s, 106 MB/s
>
> dd if=test-4k.bin of=/dev/null bs=4k status=progress
> 7725998080 bytes (7726 MB, 7368 MiB) transferred 10.002s, 772 MB/s
>
> dd if=/dev/urandom of=test-512b.bin bs=512 count=16384000 status=progress
> 8382560256 bytes (8383 MB, 7994 MiB) transferred 208.019s, 40 MB/s
>
> dd if=test-512b.bin of=/dev/null bs=512 status=progress
> 8304610304 bytes (8305 MB, 7920 MiB) transferred 63.062s, 132 MB/s