silly write caching in nfs3
Rick Macklem
rmacklem at uoguelph.ca
Sat Feb 27 04:00:50 UTC 2016
Bruce Evans wrote:
> nfs3 is slower than in old versions of FreeBSD. I debugged one of the
> reasons today.
>
> Writes have apparently always done silly caching. Typical behaviour
> is for iozone writing a 512MB file where the file fits in the buffer
> cache/VMIO. The write is cached perfectly. But then when nfs_open()
> reopens the file, it calls vinvalbuf() to discard all of the cached
> data. Thus nfs write caching usually discards useful older data to
> make space for newer data that will never be used (unless the
> file is opened r/w and read using the same fd (and is not accessed
> for a setattr or advlock operation -- these call vinvalbuf() too, if
> NMODIFIED)). The discarding may be delayed for a long time. Then
> keeping the useless data causes even more older data to be discarded.
> Discarding it on close would at least prevent further loss. It would
> have to be committed on close before discarding it of course.
> Committing it on close does some good things even without discarding
> there, and in oldnfs it gives a bug that prevents discarding in open --
> see below.
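A minimal user-space sketch of the access pattern being described, just to
make it concrete (the path and sizes are placeholders, not Bruce's setup):
write a file on an NFS mount, close it without fsync(), then reopen it and
read it back.  Comparing "nfsstat -c" read RPC counts before and after shows
whether the client threw the freshly written buffers away on the reopen.

#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	static char buf[64 * 1024];
	const char *path = "/mnt/nfs/testfile";	/* placeholder NFS path */
	int fd, i;

	memset(buf, 'x', sizeof(buf));
	if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) == -1)
		err(1, "open for write");
	for (i = 0; i < 8; i++)
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			err(1, "write");
	close(fd);			/* no fsync(): NMODIFIED stays set */

	if ((fd = open(path, O_RDONLY)) == -1)	/* nfs_open() -> vinvalbuf() */
		err(1, "open for read");
	while (read(fd, buf, sizeof(buf)) > 0)
		;			/* reads go back to the server */
	close(fd);
	return (0);
}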
>
> nfs_open() does the discarding for different reasons in the NMODIFIED
> and !NMODIFIED cases. In the NMODIFIED case, it discards unconditionally.
> This case can be avoided by fsync() before close or setting the sysctl
> to commit in close. iozone does the fsync(). This helps in oldnfs but
> not in newnfs. With it, iozone on newnfs now behaves like it did on oldnfs
> 10-20 years ago. Something (perhaps just the timestamp bugs discussed
> later) "fixed" the discarding on oldnfs 5-10 years ago.
>
> I think not committing in close is supposed to be an optimization, but
> it is actually a pessimization for my kernel build tests (with object
> files on nfs, which I normally avoid). Builds certainly have to reopen
> files after writing them, to link them and perhaps to install them.
> This causes the discarding. My kernel build tests also do a lot of
> utimes() calls which cause the discarding before commit-on-close can
> avoid the above cause for it by clearing NMODIFIED. Enabling
> commit-on-close gives a small optimisation with oldnfs by avoiding all
> of the discarding except for utimes(). It reduces read RPCs by about
> 25% without increasing write RPCs or real time. It decreases real time
> by a few percent.
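For reference, the sort of timestamp-setting call a build makes looks
roughly like this (a stripped-down sketch, not taken from any particular
build tool; the target is whatever object or installed file was just
written):

#include <sys/time.h>
#include <err.h>

int
main(int argc, char **argv)
{
	struct timeval tv[2];

	if (argc != 2)
		errx(1, "usage: touchtimes file");
	if (gettimeofday(&tv[0], NULL) == -1)
		err(1, "gettimeofday");
	tv[1] = tv[0];			/* set atime and mtime to "now" */
	if (utimes(argv[1], tv) == -1)	/* setattr; discards cache if NMODIFIED */
		err(1, "utimes");
	return (0);
}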
>
> The other reason for discarding is because the timestamps changed -- you
> just wrote them, so the timestamps should have changed. Different bugs
> in comparing the timestamps gave different misbehaviours.
>
You could easily test to see if second-resolution timestamps make a
difference by redefining the NFS_TIMESPEC_COMPARE() macro
{ in sys/fs/nfsclient/nfsnode.h } so that it only compares the
tv_sec field and not the tv_nsec field.
--> Then the client would only think the mtime has changed when tv_sec
changes.
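For example, something like this (untested, and I'm quoting the stock
definition from memory):

/* stock definition: a change in either field counts as a change */
#define NFS_TIMESPEC_COMPARE(T1, T2)	(((T1)->tv_sec != (T2)->tv_sec) || \
	((T1)->tv_nsec != (T2)->tv_nsec))

/* seconds-only variant for the experiment */
#define NFS_TIMESPEC_COMPARE(T1, T2)	((T1)->tv_sec != (T2)->tv_sec)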
rick
> In old versions of FreeBSD and/or nfs, the timestamps had seconds
> granularity, so many changes were missed. This explains mysterious
> behaviours by iozone 10-20 years ago: the write caching is seen to
> work perfectly for most small total sizes, since all the writes take
> less than 1 second so the timestamps usually don't change (but sometimes
> the writes lie across a seconds boundary so the timestamps do change).
>
> oldnfs was fixed many years ago to use timestamps with nanoseconds
> resolution, but it doesn't suffer from the discarding in nfs_open()
> in the !NMODIFIED case, which is reached by either fsync() before close
> or commit on close. I think this is because it updates n_mtime to
> the server's new timestamp in nfs_writerpc(). This seems to be wrong,
> since the file might have been written to by other clients and then
> the change would not be noticed until much later if ever (setting the
> timestamp prevents seeing it change when it is checked later, but you
> might be able to see another metadata change).
>
> newnfs has quite different code for nfs_writerpc(). Most of it was
> moved to another function in another file. I understand this even
> less, but it doesn't seem to fetch the server's new timestamp or
> update n_mtime in the v3 case.
>
> There are many other reasons why nfs is slower than in old versions.
> One is that writes are more often done out of order. This tends to
> give a slowness factor of about 2 unless the server can fix up the
> order. I use an old server which can do the fixup for old clients but
> not for newer clients starting in about FreeBSD-9 (or 7?). I suspect
> that this is just because Giant locking in old clients gave accidental
> serialization. Multiple nfsiod's and/or nfsd's are clearly needed
> for performance if you have multiple NICs serving multiple mounts.
> Other cases are less clear. For the iozone benchmark, there is only
> 1 stream and multiple nfsiod's pessimize it into multiple streams that
> give buffers which arrive out of order on the server if the multiple
> nfsiod's are actually active. I use the following configuration to
> ameliorate this, but the slowness factor is still often about 2 for
> iozone:
> - limit nfsd's to 4
> - limit nfsiod's to 4
> - limit nfs i/o sizes to 8K. The server fs block size is 16K, and
> using a smaller block size usually helps by giving some delayed
> writes which can be clustered better. (The non-nfs parts of the
> server could be smarter and do this intentionally. The out-of-order
> buffers look like random writes to the server.) 16K i/o sizes
> otherwise work OK, but 32K i/o sizes are much slower for unknown
> reasons.
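For anyone wanting to try the same configuration, the knobs involved are
roughly the following (names from memory, so check them against your
version before trusting them):

# client fstab entry limiting the nfs i/o sizes to 8K
server:/export	/mnt/nfs	nfs	rw,rsize=8192,wsize=8192	0	0

# client sysctl capping the number of nfsiod threads
sysctl vfs.nfs.iodmax=4

# server rc.conf, running 4 nfsd threads
nfs_server_enable="YES"
nfs_server_flags="-u -t -n 4"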
>
> Bruce