Terrible NFS performance under 9.2-RELEASE?

Rick Macklem rmacklem at uoguelph.ca
Fri Jan 24 23:36:58 UTC 2014


J David wrote:
> INTRODUCTION
> 
> While researching NFS performance problems in our environment, I ran a
> series of four iozone tests twice, once with rsize=4096,wsize=4096 and
> again with rsize=8192,wsize=8192.  That produced additional avenues of
> inquiry leading to the identification of two separate potential
> problems with NFS performance.  This message discusses the first
> problem, apparently extraneous fixed-sized reads before writes.
> 
> In order to maximize network throughput (>16Gbit/sec) and minimize
> network latency (<0.1ms), these tests were performed using two FreeBSD
> guests on the same KVM host node with bridged virtio adapters having
> TSO and LRO enabled.  To remove local or network-backed virtual disks
> as a bottleneck, a 2GiB memory-backed-UFS-over-NFS filesystem served
> as the NFS export from the server to the client.
> 
> BENCHMARK DATA
> 
> Here is the iozone output against the 4k NFS mount:
> 
> $ iozone -e -I -s 1g -r 4k -i 0 -i 2
> Iozone: Performance Test of File I/O
>        Version $Revision: 3.420 $
> [...]
> Include fsync in write timing
> File size set to 1048576 KB
> Record Size 4 KB
> Command line used: iozone -e -s 1g -r 4k -i 0 -i 2
> Output is in Kbytes/sec
> Time Resolution = 0.000005 seconds.
> Processor cache size set to 1024 Kbytes.
> Processor cache line size set to 32 bytes.
> File stride size set to 17 * record size.
>                                                      random    random
>              KB  reclen   write  rewrite   read  reread   read     write
>         1048576       4   71167    73714                 27525     46349
> 
> iozone test complete.
> 
> Now, here are the read and write columns from nfsstat -s -w 1 over the
> entire run:
> 
> Read Write
> 0 0
> 0 0
> 0 0
> 0 11791  <-- sequential write test begins
> 0 17215
> 0 18260
> 0 17968
> 0 17810
> 0 17839
> 0 17959
> 0 17912
> 0 18180
> 0 18285
> 0 18636
> 0 18554
> 0 18178
> 0 17361
> 0 17803 <-- sequential rewrite test begins
> 0 18358
> 0 18188
> 0 18817
> 0 17757
> 0 18153
> 0 18924
> 0 19444
> 0 18775
> 0 18995
> 0 18198
> 0 18949
> 0 17978
> 0 18879
> 0 19055
> 7904 67 <-- random read test begins
> 7502 0
> 7194 0
> 7432 0
> 6995 0
> 6844 0
> 6730 0
> 6761 0
> 7011 0
> 7058 0
> 7477 0
> 7139 0
> 6793 0
> 7047 0
> 6402 0
> 6621 0
> 7111 0
> 6911 0
> 7413 0
> 7431 0
> 7047 0
> 7002 0
> 7104 0
> 6987 0
> 6849 0
> 6580 0
> 6268 0
> 6868 0
> 6775 0
> 6335 0
> 6588 0
> 6595 0
> 6587 0
> 6512 0
> 6861 0
> 6953 0
> 7273 0
> 5184 1688 <--  random write test begins
> 0 11795
> 0 11915
> 0 11916
> 0 11838
> 0 12035
> 0 11408
> 0 11780
> 0 11488
> 0 11836
> 0 11787
> 0 11824
> 0 12099
> 0 11863
> 0 12154
> 0 11127
> 0 11434
> 0 11815
> 0 11960
> 0 11510
> 0 11623
> 0 11714
> 0 11896
> 0 1637 <--  test finished
> 0 0
> 0 0
> 0 0
> 
> This looks exactly like you would expect.
> 
> Now, we re-run the exact same test with rsize/wsize set to 8k:
> 
> $ iozone -e -I -s 1g -r 4k -i 0 -i 2
> Iozone: Performance Test of File I/O
>        Version $Revision: 3.420 $
> Compiled for 64 bit mode.
> Build: freebsd
> 
> Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
>             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
>             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
>             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
>             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
>             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
>             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
>             Vangel Bojaxhi, Ben England, Vikentsi Lapa.
> 
> Run began: Fri Jan 24 07:45:06 2014
> 
> Include fsync in write timing
> File size set to 1048576 KB
> Record Size 4 KB
> Command line used: iozone -e -s 1g -r 4k -i 0 -i 2
> Output is in Kbytes/sec
> Time Resolution = 0.000005 seconds.
> Processor cache size set to 1024 Kbytes.
> Processor cache line size set to 32 bytes.
> File stride size set to 17 * record size.
>                                                      random    random
>              KB  reclen   write  rewrite   read  reread   read     write
>         1048576       4  124820    34850                 35996     16258
> 
> iozone test complete.
> 
> And here are the second-by-second counts of corresponding read/write
> operations from nfsstat:
> 
> Read Write
> 0 0
> 0 0
> 0 0
> 11 10200 <--  sequential write test begins
> 13 16109
> 10 15360
> 12 16055
> 4 15623
> 8 15533
> 9 16241
> 11 16045
> 1891 11874 <--  sequential rewrite test begins
> 4508 4508
> 4476 4476
> 4353 4353
> 4350 4351
> 4230 4229
> 4422 4422
> 4731 4731
> 4600 4600
> 4550 4550
> 4314 4314
> 4273 4273
> 4422 4422
> 4284 4284
> 4609 4609
> 4600 4600
> 4362 4361
> 4386 4387
> 4451 4451
> 4467 4467
> 4370 4370
> 4394 4394
> 4252 4252
> 4155 4155
> 4427 4428
> 4436 4435
> 4300 4300
> 3989 3989
> 4438 4438
> 4632 4632
> 6440 1836 <--  random read test begins
> 6681 0
> 6416 0
> 6290 0
> 6560 0
> 6569 0
> 6237 0
> 6331 0
> 6403 0
> 6310 0
> 6010 0
> 6023 0
> 6271 0
> 6068 0
> 6044 0
> 6003 0
> 5957 0
> 5683 0
> 5833 0
> 5774 0
> 5708 0
> 5810 0
> 5829 0
> 5787 0
> 5791 0
> 5838 0
> 5922 0
> 6002 0
> 6191 0
> 4162 2624 <--  random write test begins
> 3998 4068
> 4056 4100
> 3929 3969
> 4164 4128
> 4188 4173
> 4256 4164
> 3899 4068
> 4018 3933
> 4077 4114
> 4210 4222
> 4050 4116
> 4157 3983
> 4066 4122
> 4129 4162
> 4303 4199
> 4235 4412
> 4100 3932
> 4056 4254
> 4192 4186
> 4204 4120
> 4085 4138
> 4103 4146
> 4272 4193
> 3943 4028
> 4161 4163
> 4153 4131
> 4218 4017
> 4132 4233
> 4195 4101
> 4041 4145
> 3994 4086
> 4297 4193
> 4205 4341
> 4101 4103
> 4134 3962
> 4297 4408
> 4282 4242
> 4180 4175
> 4286 4216
> 4217 4397
> 4364 4253
> 3673 3720
> 3843 3807
> 4147 4183
> 4171 4181
> 4280 4224
> 4126 4158
> 3977 4074
> 4146 3919
> 4147 4362
> 4079 4060
> 3755 3760
> 4157 4130
> 4087 4109
> 4006 3873
> 3860 3967
> 3982 4048
> 4146 3963
> 4188 4203
> 4040 4063
> 3976 4046
> 3859 3815
> 4114 4193
> 393 447 <--  test finished
> 0 0
> 0 0
> 0 0
> 
> To go through the tests individually...
> 
> SEQUENTIAL WRITE TEST
> 
> The first test is sequential write, the ktrace for this test looks
> like this:
> 
>   3648 iozone   CALL  write(0x3,0x801500000,0x1000)
>   3648 iozone   GIO   fd 3 wrote 4096 bytes
>   3648 iozone   RET   write 4096/0x1000
>   3648 iozone   CALL  write(0x3,0x801500000,0x1000)
>   3648 iozone   GIO   fd 3 wrote 4096 bytes
>   3648 iozone   RET   write 4096/0x1000
> 
> It’s just writing 4k in a loop 262,144 times.
> 
> This test is substantially faster with wsize=8192 than wsize=4096
> (124,820 KiB/sec at 8k vs 71,167 KiB/sec at 4k).  We see from the
> nfsstat output that this is because writes are being coalesced: there
> are half as many write RPCs in the 8k test as in the 4k test
> (8k = 131,072, 4k = 262,144).
> 
> So far so good.
> 
> SEQUENTIAL REWRITE TEST
> 
> But when we move on to the rewrite test, there’s trouble in River
> City.  Here’s what it looks like in ktrace:
> 
>   3648 iozone   CALL  write(0x3,0x801500000,0x1000)
>   3648 iozone   GIO   fd 3 wrote 4096 bytes
>   3648 iozone   RET   write 4096/0x1000
>   3648 iozone   CALL  write(0x3,0x801500000,0x1000)
>   3648 iozone   GIO   fd 3 wrote 4096 bytes
>   3648 iozone   RET   write 4096/0x1000
> 
> As with the write test, it just writes 4k in a loop 262,144 times.
> 
> Yet despite being exactly the same sequence of calls, this test is
> less than half as fast with 8k NFS as with 4k NFS (34,850 KiB/sec at
> 8k vs 73,714 KiB/sec at 4k).  The only difference between this test
> and the previous one is that now the file exists.  And although the 4k
> NFS client writes it the same way as the first test and therefore gets
> almost exactly the same speed, the 8k NFS client is reading and
> writing in roughly equal proportion, and drops by a factor of 3.5x
> from its own performance on the first write test.  Note that the total
> number of writes/sec on the 4k NFS rewrite test hovers around 18,500.
> On the 8k NFS test, the reads and writes are roughly balanced at
> around 4,242 each.  So the interleaved reads and writes performed by
> the NFS client cut the total IOPS available by more than 50%, twice,
> leaving less than a quarter available for the application.  And these
> numbers roughly correlate with the observed speeds: 4,242 * 8 KiB =
> 33,936 KiB, 18,500 * 4 KiB = 74,000 KiB.  A total of 131,072 NFS
> writes are issued during the 8k NFS test, for a maximum possible write
> of 1 GiB.  Iozone is writing a total of 1 GiB (262,144 x 4 KiB).  So
> none of the read data is being used, and those extraneous reads appear
> to be the source of the performance drop.
> 
Yes, this is to be expected, because NFS will read in the buffer cache
block (8K) before modifying part of it. It hasn't always worked this
way in BSD (as in 4.4BSD): way back when I wrote this code in the
1980s, the client wouldn't first read in a block before doing a
partial write.

The problem with not reading the block before a partial write is that
the dirty data must then be written synchronously to the server if/when
the application reads a part of the block that was not just written.

Someone changed this a long time ago (well over a decade ago, I
believe; I looked in the commit log and hit the end of it before I
found the commit). Here's the comment from the old client code (cloned
for the new client):

	 * Issue a READ if B_CACHE is not set.  In special-append
	 * mode, B_CACHE is based on the buffer prior to the write
	 * op and is typically set, avoiding the read.  If a read
	 * is required in special append mode, the server will
	 * probably send us a short-read since we extended the file
	 * on our end, resulting in b_resid == 0 and, thusly,
	 * B_CACHE getting set.
	 *
	 * We can also avoid issuing the read if the write covers
	 * the entire buffer.  We have to make sure the buffer state
	 * is reasonable in this case since we will not be initiating
	 * I/O.  See the comments in kern/vfs_bio.c's getblk() for
	 * more information.
	 *
	 * B_CACHE may also be set due to the buffer being cached
	 * normally.
	 */

If a real client were predominantly running an app that did sequential
4K writes then, yes, wsize=4096 might be best w.r.t. performance.
(I doubt most real client mounts predominantly run apps doing 4K
writes, but the mount option is there to handle that case.)

Everything in the NFS client's caching is a compromise that experience
over time (much of it gained by others, not me) has shown to work well
for most real client file activity.

rick

> RANDOM READ TEST
> 
> Considering the random read test, things look better again.  Here’s
> what it looks like in ktrace:
> 
>   3947 iozone   CALL  lseek(0x3,0x34caa000,SEEK_SET)
>   3947 iozone   RET   lseek 885694464/0x34caa000
>   3947 iozone   CALL  read(0x3,0x801500000,0x1000)
>   3947 iozone   GIO   fd 3 read 4096 bytes
>   3947 iozone   RET   read 4096/0x1000
>   3947 iozone   CALL  lseek(0x3,0x3aa53000,SEEK_SET)
>   3947 iozone   RET   lseek 983904256/0x3aa53000
>   3947 iozone   CALL  read(0x3,0x801500000,0x1000)
>   3947 iozone   GIO   fd 3 read 4096 bytes
>   3947 iozone   RET   read 4096/0x1000
> 
> This test lseeks to a random location (on a 4k boundary) then reads
> 4k, in a loop, 262,144 times.
> 
> On this test, the 8k NFS moderately outperforms the 4k NFS (35,996
> KiB/sec at 8k vs 27,525 KiB/sec at 4k).  This is probably attributable
> to lucky adjacent requests, which are visible in the total number of
> reads.  4k NFS produces the expected 262,144 NFS read ops, but 8k NFS
> performs about 176,000, an average of about 6k per read, suggesting it
> was doing 8k reads of which the whole 8k was useful about 50% of the
> time.
> 
> RANDOM WRITE TEST
> 
> Finally, the random write test.  Here’s the ktrace output from this
> one:
> 
>   3262 iozone   CALL  lseek(0x3,0x29665000,SEEK_SET)
>   3262 iozone   RET   lseek 694571008/0x29665000
>   3262 iozone   CALL  write(0x3,0x801500000,0x1000)
>   3262 iozone   GIO   fd 3 wrote 4096 bytes
>   3262 iozone   RET   write 4096/0x1000
>   3262 iozone   CALL  lseek(0x3,0x486c000,SEEK_SET)
>   3262 iozone   RET   lseek 75939840/0x486c000
>   3262 iozone   CALL  write(0x3,0x801500000,0x1000)
>   3262 iozone   GIO   fd 3 wrote 4096 bytes
>   3262 iozone   RET   write 4096/0x1000
>   3262 iozone   CALL  lseek(0x3,0x33310000,SEEK_SET)
>   3262 iozone   RET   lseek 858849280/0x33310000
>   3262 iozone   CALL  write(0x3,0x801500000,0x1000)
>   3262 iozone   GIO   fd 3 wrote 4096 bytes
>   3262 iozone   RET   write 4096/0x1000
>   3262 iozone   CALL  lseek(0x3,0x2ae3000,SEEK_SET)
> 
> That’s all it is: lseek() then write(), repeated 262,144 times.  No
> reads, and again, all the lseeks are on 4k boundaries.
> 
> This test is dramatically slower with the 8k NFS, falling from 46,349
> KiB/sec for 4k NFS to 16,258 KiB/sec for 8k NFS.
> 
> Like the rewrite test, the nfsstat output shows an even mix of reads
> and writes during the 8k NFS run of this test, even though no reads
> are ever issued by the client application.  Unlike the 8k NFS rewrite
> test, which performs 131,072 NFS write operations, the 8k NFS random
> write test performs 262,025.  This strongly suggests it was converting
> each 4kiB write into an 8kiB read-modify-write cycle, with the
> commensurate performance penalty in almost every case, with a handful
> of writes (about 119) lucky enough to be adjacent.  Since the NFS
> server is supposed to be able to handle arbitrary-length writes at
> arbitrary file offsets, this appears to be unnecessary.
> 
> To confirm that this is the case, and to determine whether it is doing
> 4kiB or 8kiB writes, we will have to go to the packets.  Here is a
> sample from the random write test:
> 
> 09:31:52.338749 IP 172.20.20.169.1877003014 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 606494720
> 09:31:52.338877 IP 172.20.20.162.2049 > 172.20.20.169.1877003014: reply ok 8320 read
> 09:31:52.338944 IP 172.20.20.169.1877003015 > 172.20.20.162.2049: 4232 write fh 1326,127488/4 4096 (4096) bytes @ 606498816
> 09:31:52.338964 IP 172.20.20.169.1877003016 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 85237760
> 09:31:52.339037 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.], ack 4008446365, win 29118, options [nop,nop,TS val 4241448191 ecr 17929512], length 0
> 09:31:52.339076 IP 172.20.20.162.2049 > 172.20.20.169.1877003015: reply ok 160 write [|nfs]
> 09:31:52.339117 IP 172.20.20.162.2049 > 172.20.20.169.1877003016: reply ok 8320 read
> 09:31:52.339142 IP 172.20.20.169.719 > 172.20.20.162.2049: Flags [.], ack 8488, win 515, options [nop,nop,TS val 17929512 ecr 4241448191], length 0
> 09:31:52.339183 IP 172.20.20.169.1877003017 > 172.20.20.162.2049: 4232 write fh 1326,127488/4 4096 (4096) bytes @ 85241856
> 09:31:52.339201 IP 172.20.20.169.1877003018 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 100843520
> 09:31:52.339271 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.], ack 4369, win 29118, options [nop,nop,TS val 4241448191 ecr 17929512], length 0
> 09:31:52.339310 IP 172.20.20.162.2049 > 172.20.20.169.1877003017: reply ok 160 write [|nfs]
> 09:31:52.339332 IP 172.20.20.162.2049 > 172.20.20.169.1877003018: reply ok 8320 read
> 09:31:52.339355 IP 172.20.20.169.719 > 172.20.20.162.2049: Flags [.], ack 16976, win 515, options [nop,nop,TS val 17929512 ecr 4241448191], length 0
> 09:31:52.339408 IP 172.20.20.169.1877003019 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 330153984
> 09:31:52.339514 IP 172.20.20.162.2049 > 172.20.20.169.1877003019: reply ok 8320 read
> 09:31:52.339562 IP 172.20.20.169.1877003020 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 500056064
> 09:31:52.339669 IP 172.20.20.162.2049 > 172.20.20.169.1877003020: reply ok 8320 read
> 09:31:52.339728 IP 172.20.20.169.1877003022 > 172.20.20.162.2049: 128 read fh 1326,127488/4 8192 bytes @ 778158080
> 09:31:52.339758 IP 172.20.20.169.1877003021 > 172.20.20.162.2049: 4232 write fh 1326,127488/4 4096 (4096) bytes @ 500060160
> 09:31:52.339834 IP 172.20.20.162.2049 > 172.20.20.169.719: Flags [.], ack 9001, win 29118, options [nop,nop,TS val 4241448191 ecr 17929512], length 0
> 
> This confirms that the NFS client is indeed performing a read-write
> cycle.  However, the packets also show that the writes are 4kiB, not
> 8kiB, so no modify is occurring.  E.g. it reads 8kiB at offset
> 606494720 and then immediately writes the 4kiB from the application at
> offset 606498816 (which is 606494720+4096).  It’s not clear what the
> purpose of the read is in this scenario.
> 
What is needed is knowledge of what real apps do w.r.t. I/O. I would
suspect they use larger I/O sizes than 4K, but I don't know. The
default read/write size used by the stdio library would be a good
starting point, but I'll admit I don't know what that is, either.

> FOLLOW-UP TESTS
> 
> To help determine which direction to go next, I conducted two
> follow-up tests.  First, I re-ran the test with the default 64k
> rsize/wsize values, in an effort to confirm that it would do 64k reads
> followed by 4k writes.  It was necessary to do this with a much
> smaller file size due to the poor performance, and with
> net.inet.tcp.delacktime=10 on both client and server to work around a
> separate issue with TCP writes >64kiB, but the results are sufficient
> for this purpose:
> 
> $ iozone -e -s 32m -r 4k -i 0 -i 2
> Iozone: Performance Test of File I/O
>        Version $Revision: 3.420 $
> Compiled for 64 bit mode.
> Build: freebsd
> 
> [...]
> 
> Run began: Fri Jan 24 10:35:14 2014
> 
> Include fsync in write timing
> File size set to 32768 KB
> Record Size 4 KB
> Command line used: iozone -e -s 32m -r 4k -i 0 -i 2
> Output is in Kbytes/sec
> Time Resolution = 0.000005 seconds.
> Processor cache size set to 1024 Kbytes.
> Processor cache line size set to 32 bytes.
> File stride size set to 17 * record size.
>                                                      random    random
>              KB  reclen   write  rewrite   read  reread   read     write
>           32768       4   12678     5292                  5355       741
> 
> iozone test complete.
> 
> Although the use of TCP writes >64kiB largely mangles tcpdump’s
> ability to interpret the NFS messages due to fragmentation, it was
> possible to grep out enough to see that the suspected 64kiB read /
> 4kiB write pairs are indeed occurring:
> 
> 10:36:11.356519 IP 172.20.20.169.1874848588 > 172.20.20.162.2049: 128 read fh 1326,127488/5 65536 bytes @ 18087936
> 10:36:11.496905 IP 172.20.20.169.1874848606 > 172.20.20.162.2049: 4232 write fh 1326,127488/5 4096 (4096) bytes @ 18096128
> 
> As before, the 64kiB read appears not to be used.  The result is that
> roughly 94% of the network I/O spent on a 4kiB write is wasted, with
> commensurate impact on performance.
> 
> In an attempt to verify the theory that the data being read is not
> necessary, the 8k NFS test was retried with a Debian Linux guest.  The
> hypothesis was that if the Debian Linux client did not perform the
> reads, that would indicate that they were not necessary.  The results
> appeared to support that hypothesis.  Not only did the Linux client
> handily outperform the FreeBSD client on all four tests:
> 
> # iozone -e -s 1g -r 4k -i 0 -i 2
> Iozone: Performance Test of File I/O
>        Version $Revision: 3.397 $
> Compiled for 64 bit mode.
> Build: linux-AMD64
> [...]
> Include fsync in write timing
> File size set to 1048576 KB
> Record Size 4 KB
> Command line used: iozone -e -s 1g -r 4k -i 0 -i 2
> Output is in Kbytes/sec
> Time Resolution = 0.000001 seconds.
> Processor cache size set to 1024 Kbytes.
> Processor cache line size set to 32 bytes.
> File stride size set to 17 * record size.
>                                                      random    random
>              KB  reclen   write  rewrite   read  reread   read     write
>         1048576       4  142393   158101                 79471    133974
> 
> iozone test complete.
> 
> But it also did not perform any reads during the rewrite and random
> write tests, as shown by the server’s nfsstat output:
> 
> Read Write
> 0 0
> 0 0
> 0 0
> 0 1905 <--  sequential write test begins
> 0 19571
> 0 20066
> 0 20214
> 0 20036
> 0 20051
> 0 17221
> 0 13311 <--  rewrite test begins
> 0 20721
> 0 22237
> 0 21710
> 0 21619
> 0 21104
> 0 18339
> 6363 4040 <--  random read test begins
> 8906 0
> 9885 0
> 10921 0
> 12487 0
> 13311 0
> 14141 0
> 16598 0
> 16178 0
> 17138 0
> 17796 0
> 18765 0
> 16760 0
> 2816 6787 <--  random write test begins
> 0 20892
> 0 19528
> 0 20879
> 0 20758
> 0 17040
> 0 21327
> 0 19713 <--  tests finished
> 0 0
> 0 0
> 0 0
> 
> 
> CONCLUSION
> 
> The rewrite and random write iozone tests appear to demonstrate that
> in some cases the FreeBSD NFS client treats the rsize/wsize settings
> as a fixed block size rather than a not-to-exceed size.  The data
> suggests that it requests a full rsize-sized block read from the
> server before performing a 4kiB write issued by an application, and
> that the result of the read is not used.  Linux does not perform the
> same read-before-write strategy, and consequently achieves much higher
> throughput.
> 
I shouldn't say this, because I am not sure, but I believe that Linux
caches files in pages (which happen to be 4K on x86). As such, "-r 4k"
is optimal for Linux.

Try some iozone tests with other "-r" sizes. I suspect that with 1K or
2K you will see Linux doing the read before write, but I could be wrong.

You should also try both Linux and FreeBSD with "-r 32k" and
rsize=32768,wsize=32768 and see how they compare (you should collect
nfsstat RPC counts as well as iozone's transfer rates).

The 64K case seems to be broken for FreeBSD, I suspect due to some issue
related to the virtio driver using a 64K MTU, so I left that out of the above.
(It would be nice to resolve this, but I have no idea if glebius@ can help
w.r.t. this one.)

> The next step would probably be to examine the NFS implementation to
> determine the source of the apparently unnecessary reads.  If they
> could be eliminated, it may lead to performance improvements of up to
> 20x on some workloads.
> 
> Thanks!
> _______________________________________________
> freebsd-net at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscribe at freebsd.org"
> 

