Terrible NFS performance under 9.2-RELEASE?
J David
j.david.lists at gmail.com
Fri Jan 24 05:06:50 UTC 2014
On Thu, Jan 23, 2014 at 9:27 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> Well, my TCP is pretty rusty, but...
> Since your stats didn't show any jumbo frames, each IP
> datagram needs to fit in the MTU of 1500 bytes. NFS hands an mbuf
> list of just over 64K (or 32K) to TCP in a single sosend(), then TCP
> will generate about 45 (or about 23 for 32K) TCP segments and put
> each in an IP datagram, then hand it to the network device driver
> for transmission.
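(For reference, those segment counts follow from the usual 1460-byte
MSS, i.e. a 1500-byte MTU minus 20 bytes each of IP and TCP header.
A quick sanity check in sh, ceiling-dividing the write size by the MSS:

$ echo $(( (65536 + 1459) / 1460 ))
45
$ echo $(( (32768 + 1459) / 1460 ))
23
)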
This is *not* what happens with TSO/LRO.
With TSO, TCP generates IP datagrams of up to 64k and passes them
directly to the driver, which hands them to the hardware.
Furthermore, in this unique case (two virtual machines on the same
host and bridge with both TSO and LRO enabled end-to-end), the packet
is *never* fragmented. The host takes the 64k packet off one
guest's output ring and puts it onto the other guest's input ring,
intact.
This is, as you might expect, a *massive* performance win.
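(For anyone following along: whether a driver has TSO/LRO enabled shows
up in its ifconfig options, and both can be toggled at runtime. A quick
sketch, assuming an em0 interface; substitute whatever your driver is
actually called:

$ ifconfig em0 | grep options      # look for TSO4 and LRO in the list
$ ifconfig em0 tso lro             # enable both
$ ifconfig em0 -tso -lro           # disable both
)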
With TSO & LRO:
$ time iperf -c 172.20.20.162 -d
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 172.20.20.162, TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[ 5] local 172.20.20.169 port 60889 connected with 172.20.20.162 port 5001
[ 4] local 172.20.20.169 port 5001 connected with 172.20.20.162 port 44101
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-10.0 sec 17.0 GBytes 14.6 Gbits/sec
[ 4] 0.0-10.0 sec 17.4 GBytes 14.9 Gbits/sec
real 0m10.061s
user 0m0.229s
sys 0m7.711s
Without TSO & LRO:
$ time iperf -c 172.20.20.162 -d
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 172.20.20.162, TCP port 5001
TCP window size: 1.26 MByte (default)
------------------------------------------------------------
[ 5] local 172.20.20.169 port 22088 connected with 172.20.20.162 port 5001
[ 4] local 172.20.20.169 port 5001 connected with 172.20.20.162 port 48615
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-10.0 sec 637 MBytes 534 Mbits/sec
[ 4] 0.0-10.0 sec 767 MBytes 642 Mbits/sec
real 0m10.057s
user 0m0.231s
sys 0m3.935s
Look at the difference. In this bidirectional test, TSO is over 25x
faster while using not even 2x the CPU. This shows how essential
TSO/LRO is if you plan to move data at real-world speeds and still
have enough CPU left to operate on that data.
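The arithmetic, straight from the iperf numbers above (aggregate of
both directions, Gbits/sec over Mbits/sec) and the sys times:

$ echo 'scale=1; (14.6 + 14.9) * 1000 / (534 + 642)' | bc
25.0
$ echo 'scale=2; 7.711 / 3.935' | bc
1.95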
> I recall you saying you tried turning off TSO with no
> effect. You might also try turning off checksum offload. I doubt it will
> be where things are broken, but might be worth a try.
That was not me; that was someone else. If there is a problem with
NFS and TSO, the solution is *not* to disable TSO. That is, at best,
a workaround that produces much more CPU load and much less
throughput. The solution is to find the problem and fix it.
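That said, if someone does want to test the checksum-offload theory as
a diagnostic (not as a fix), that is a runtime toggle too. A sketch,
again assuming em0; note that on many drivers turning off transmit
checksum offload implicitly turns off TSO as well, since TSO depends
on it:

$ ifconfig em0 -txcsum -rxcsum     # disable tx/rx checksum offload
$ ifconfig em0 txcsum rxcsum tso   # restore, re-asserting tso explicitly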
More data to follow.
Thanks!