Terrible ix performance
Lawrence Stewart
lstewart at freebsd.org
Thu Jul 4 03:41:07 UTC 2013
On 07/04/13 13:06, Outback Dingo wrote:
> On Wed, Jul 3, 2013 at 10:01 PM, Lawrence Stewart <lstewart at freebsd.org
> <mailto:lstewart at freebsd.org>> wrote:
>
> On 07/04/13 10:18, Kevin Oberman wrote:
> > On Wed, Jul 3, 2013 at 4:21 PM, Steven Hartland
> <killing at multiplay.co.uk <mailto:killing at multiplay.co.uk>>wrote:
[snip]
> >>
> >> Out of interest have you tried limiting the number of queues?
> >>
> >> If not give it a try see if it helps, add the following to
> >> /boot/loader.conf:
> >> hw.ixgbe.num_queues=1
> >>
> >> If nothing else will give you another data point.
>
> As noted in my first post to this thread, if iperf is able to push a
> single flow at 8Gbps, then the NIC is unlikely to be the source of the
> problem and trying to tune it is a waste of time (at least at this
> stage).
>
> iperf tests memory-network-memory transfer speed without any disk
> involvement, so the fact that it can get 8Gbps and ftp is getting around
> 4Gbps implies that either the iperf TCP tuning is better (only likely to
> be relevant if the RTT is very large - Outback Dingo you still haven't
> provided us with the RTT) or the disk subsystem at one or both ends is
> slowing things down.
>
> Outback Dingo: can you please run another iperf test without the -w
> switch on both client and server to see if your send/receive window
> autotuning on both ends is working. If all is well, you should see the
> same results of ~8Gbps.
>
> >> You might also try SIFTR to analyze the behavior and perhaps even
> figure
> > out what the limiting factor might be.
> >
> > kldload siftr
> > See "Run-time Configuration" in the siftr(4) man page for details.
> >
> > I'm a little surprised Lawrence didn't already suggest this as he
> is one of
> > the authors. (The "Bugs" section is rather long and he might know
> that it
> > won't be useful in this case, but it has greatly helped me look at
> > performance issues.)
>
> siftr is useful if you suspect a TCP/netstack tuning issue. Given that
> iperf gets good results and the OP's tuning settings should be adequate
> to achieve good performance if the RTT is low (4MB
> sendbuf_max/recvbuf_max), I suspect the disk subsystem and/or VM is more
> likely to be the issue i.e. siftr data is probably irrelevant.
>
> Outback Dingo: Can you confirm you have appropriate tuning on both sides
> of the connection? You didn't specify if the loader.conf/sysctl.conf
> parameters you provided in the reply to Jack are only on one side of the
> connection or both.
>
>
> Yeah i concur, im starting to think the bottleneck is the zpool
>
>
> iperf -i 10 -t 20 -c 10.10.1.11 -l 2.5M
> ------------------------------------------------------------
> Client connecting to 10.10.1.11, TCP port 5001
> TCP window size: 257 KByte (default)
> ------------------------------------------------------------
> [ 3] local 10.10.1.178 port 47360 connected with 10.10.1.11 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 9.61 GBytes 8.26 Gbits/sec
> [ 3] 10.0-20.0 sec 8.83 GBytes 7.58 Gbits/sec
> [ 3] 0.0-20.0 sec 18.4 GBytes 7.92 Gbits/sec
> nas4free: /testing # iperf -i 10 -t 20 -c 10.10.1.11 -l 2.5M
> ------------------------------------------------------------
> Client connecting to 10.10.1.11, TCP port 5001
> TCP window size: 257 KByte (default)
> ------------------------------------------------------------
> [ 3] local 10.10.1.178 port 37691 connected with 10.10.1.11 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 5.29 GBytes 4.54 Gbits/sec
> [ 3] 10.0-20.0 sec 8.06 GBytes 6.93 Gbits/sec
> [ 3] 0.0-20.0 sec 13.4 GBytes 5.73 Gbits/sec
> nas4free: /testing # iperf -i 10 -t 20 -c 10.10.1.11 -l 2.5M
> ------------------------------------------------------------
> Client connecting to 10.10.1.11, TCP port 5001
> TCP window size: 257 KByte (default)
> ------------------------------------------------------------
> [ 3] local 10.10.1.178 port 17560 connected with 10.10.1.11 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 9.48 GBytes 8.14 Gbits/sec
> [ 3] 10.0-20.0 sec 8.68 GBytes 7.46 Gbits/sec
> [ 3] 0.0-20.0 sec 18.2 GBytes 7.80 Gbits/sec
> nas4free: /testing # iperf -i 10 -t 20 -c 10.10.1.11 -l 2.5M
> ------------------------------------------------------------
> Client connecting to 10.10.1.11, TCP port 5001
> TCP window size: 257 KByte (default)
> ------------------------------------------------------------
> [ 3] local 10.10.1.178 port 14729 connected with 10.10.1.11 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0-10.0 sec 7.81 GBytes 6.71 Gbits/sec
> [ 3] 10.0-20.0 sec 9.11 GBytes 7.82 Gbits/sec
> [ 3] 0.0-20.0 sec 16.9 GBytes 7.27 Gbits/sec
Ok. It does seem like your issue is VM/disk related rather than
network/protocol related in that case. Going forward, I suggest you
test with FTP as you make tweaks, to keep things as close to a raw TCP
bulk transfer as possible while still exercising the disks/VM, i.e.
don't use NFS/SSH/CIFS to evaluate the effectiveness of tuning tweaks.
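On the window-tuning point raised earlier in the thread: whether the 4MB
sendbuf_max/recvbuf_max caps are adequate depends on the RTT, which a quick
bandwidth-delay product calculation makes concrete (a sketch with illustrative
numbers, not measurements from your boxes):

```python
# Bandwidth-delay product: the TCP window (and hence the socket buffer
# cap) must cover bandwidth * RTT to keep the pipe full.

def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bytes in flight needed to fill the link at the given RTT."""
    return int(bandwidth_bps * rtt_seconds / 8)

# 10GbE at a 1 ms LAN RTT needs ~1.25 MB of window:
print(bdp_bytes(10e9, 0.001))        # 1250000
# A 4 MB buffer cap therefore covers RTTs up to ~3.3 ms at 10Gbps:
print(4 * 1024 * 1024 * 8 / 10e9)    # 0.0033554432
```

Which is why the RTT number keeps being asked for: on a low-RTT LAN the 4MB
caps are plenty, and window tuning drops out as a suspect.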
> The current configuration on both boxes is
> kernel="kernel"
> bootfile="kernel"
> kernel_options=""
> kern.hz="20000"
Why such a high hz setting? I'd suggest lowering it to 2000 on both
machines unless you have a good reason for it to be so high.
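For reference, that would mean changing the line in /boot/loader.conf to
something like the following (2000 is a suggestion rather than a magic
number; the stock FreeBSD default is HZ=1000):

```
kern.hz="2000"
```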
> hw.est.msr_info="0"
> hw.hptrr.attach_generic="0"
> kern.maxfiles="65536"
> kern.maxfilesperproc="50000"
> kern.cam.boot_delay="8000"
> autoboot_delay="5"
> isboot_load="YES"
> zfs_load="YES"
> hw.ixgbe.enable_aim=0
>
> and
> cat /etc/sysctl.conf
> # Disable core dump
> kern.coredump=0
> # System tuning
> net.inet.tcp.delayed_ack=0
> # System tuning
> net.inet.tcp.rfc1323=1
> # System tuning
> net.inet.tcp.sendspace=262144
> # System tuning
> net.inet.tcp.recvspace=262144
> # System tuning
> net.inet.tcp.sendbuf_max=4194304
> # System tuning
> net.inet.tcp.sendbuf_inc=262144
> # System tuning
> net.inet.tcp.sendbuf_auto=1
> # System tuning
> net.inet.tcp.recvbuf_max=4194304
> # System tuning
> net.inet.tcp.recvbuf_inc=262144
> # System tuning
> net.inet.tcp.recvbuf_auto=1
> # System tuning
> net.inet.udp.recvspace=65536
> # System tuning
> net.inet.udp.maxdgram=57344
> # System tuning
> net.local.stream.recvspace=65536
> # System tuning
> net.local.stream.sendspace=65536
> # System tuning
> kern.ipc.maxsockbuf=16777216
> # System tuning
> kern.ipc.somaxconn=8192
> # System tuning
> kern.ipc.nmbclusters=262144
> # System tuning
> kern.ipc.nmbjumbop=262144
> # System tuning
> kern.ipc.nmbjumbo9=131072
> # System tuning
> kern.ipc.nmbjumbo16=65536
> # System tuning
> kern.maxfiles=65536
> # System tuning
> kern.maxfilesperproc=50000
> # System tuning
> net.inet.icmp.icmplim=300
> # System tuning
> net.inet.icmp.icmplim_output=1
> # System tuning
> net.inet.tcp.path_mtu_discovery=0
> # System tuning
> hw.intr_storm_threshold=9000
Your network-related tuning looks good to me.
> Box A is
> zpool status
> pool: testing
> state: ONLINE
> scan: none requested
> config:
>
> NAME STATE READ WRITE CKSUM
> testing ONLINE 0 0 0
> da0.nop ONLINE 0 0 0
> da1.nop ONLINE 0 0 0
> da2.nop ONLINE 0 0 0
> da3.nop ONLINE 0 0 0
> da4.nop ONLINE 0 0 0
> da5.nop ONLINE 0 0 0
> da6.nop ONLINE 0 0 0
> da7.nop ONLINE 0 0 0
> da8.nop ONLINE 0 0 0
> da9.nop ONLINE 0 0 0
> da10.nop ONLINE 0 0 0
> da11.nop ONLINE 0 0 0
> da12.nop ONLINE 0 0 0
> da13.nop ONLINE 0 0 0
> da14.nop ONLINE 0 0 0
> da15.nop ONLINE 0 0 0
>
> fio --direct=1 --rw=randwrite --bs=4k --size=2G --numjobs=1 --runtime=60
> --group_reporting --name=randwrite
> fio: this platform does not support process shared mutexes, forcing use
> of threads. Use the 'thread' option to get rid of this warning.
> randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync,
> iodepth=1
> fio-2.0.15
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/150.9M/0K /s] [0 /38.7K/0 iops]
> [eta 00m:00s]
> randwrite: (groupid=0, jobs=1): err= 0: pid=101192: Wed Jul 3 23:01:09 2013
> write: io=2048.0MB, bw=147916KB/s, iops=36978 , runt= 14178msec
> clat (usec): min=9 , max=122101 , avg=24.17, stdev=229.23
> lat (usec): min=10 , max=122101 , avg=24.42, stdev=229.23
> clat percentiles (usec):
> | 1.00th=[ 11], 5.00th=[ 12], 10.00th=[ 14], 20.00th=[ 21],
> | 30.00th=[ 21], 40.00th=[ 22], 50.00th=[ 22], 60.00th=[ 23],
> | 70.00th=[ 23], 80.00th=[ 24], 90.00th=[ 29], 95.00th=[ 35],
> | 99.00th=[ 99], 99.50th=[ 114], 99.90th=[ 131], 99.95th=[ 137],
> | 99.99th=[ 181]
> bw (KB/s) : min=58200, max=223112, per=99.93%, avg=147815.61,
> stdev=31976.97
> lat (usec) : 10=0.01%, 20=15.49%, 50=82.15%, 100=1.39%, 250=0.96%
> lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=0.01%, 20=0.01%, 250=0.01%
> cpu : usr=11.05%, sys=87.08%, ctx=563, majf=0, minf=0
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=524288/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> WRITE: io=2048.0MB, aggrb=147915KB/s, minb=147915KB/s,
> maxb=147915KB/s, mint=14178msec, maxt=14178msec
> fio --direct=1 --rw=randread --bs=4k --size=2G --numjobs=1 --runtime=60
> --group_reporting --name=randread
> fio: this platform does not support process shared mutexes, forcing use
> of threads. Use the 'thread' option to get rid of this warning.
> randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.0.15
> Starting 1 process
> randread: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [r] [100.0% done] [292.9M/0K/0K /s] [74.1K/0 /0 iops]
> [eta 00m:00s]
> randread: (groupid=0, jobs=1): err= 0: pid=101304: Wed Jul 3 23:02:08 2013
> read : io=2048.0MB, bw=327578KB/s, iops=81894 , runt= 6402msec
> clat (usec): min=4 , max=20418 , avg=10.15, stdev=28.54
> lat (usec): min=4 , max=20418 , avg=10.27, stdev=28.54
> clat percentiles (usec):
> | 1.00th=[ 5], 5.00th=[ 6], 10.00th=[ 6], 20.00th=[ 8],
> | 30.00th=[ 10], 40.00th=[ 10], 50.00th=[ 10], 60.00th=[ 11],
> | 70.00th=[ 11], 80.00th=[ 11], 90.00th=[ 12], 95.00th=[ 13],
> | 99.00th=[ 22], 99.50th=[ 31], 99.90th=[ 77], 99.95th=[ 95],
> | 99.99th=[ 145]
> bw (KB/s) : min=290024, max=520016, per=100.00%, avg=328490.00,
> stdev=63941.66
> lat (usec) : 10=28.85%, 20=69.83%, 50=1.19%, 100=0.09%, 250=0.05%
> lat (msec) : 50=0.01%
> cpu : usr=18.08%, sys=81.57%, ctx=144, majf=0, minf=1
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=524288/w=0/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> READ: io=2048.0MB, aggrb=327577KB/s, minb=327577KB/s,
> maxb=327577KB/s, mint=6402msec, maxt=6402msec
>
>
> Box B
> zpool status
> pool: backup
> state: ONLINE
> scan: none requested
> config:
>
> NAME STATE READ WRITE CKSUM
> backup ONLINE 0 0 0
> mfid0.nop ONLINE 0 0 0
> mfid1.nop ONLINE 0 0 0
> mfid2.nop ONLINE 0 0 0
> mfid3.nop ONLINE 0 0 0
> mfid4.nop ONLINE 0 0 0
> mfid5.nop ONLINE 0 0 0
> mfid6.nop ONLINE 0 0 0
> mfid7.nop ONLINE 0 0 0
> mfid8.nop ONLINE 0 0 0
> mfid9.nop ONLINE 0 0 0
> mfid10.nop ONLINE 0 0 0
> mfid11.nop ONLINE 0 0 0
> mfid12.nop ONLINE 0 0 0
> mfid13.nop ONLINE 0 0 0
> mfid14.nop ONLINE 0 0 0
> mfid15.nop ONLINE 0 0 0
> mfid16.nop ONLINE 0 0 0
> mfid17.nop ONLINE 0 0 0
> mfid18.nop ONLINE 0 0 0
> mfid19.nop ONLINE 0 0 0
> mfid20.nop ONLINE 0 0 0
> mfid21.nop ONLINE 0 0 0
> mfid22.nop ONLINE 0 0 0
> mfid23.nop ONLINE 0 0 0
>
>
>
> fio --direct=1 --rw=randwrite --bs=4k --size=2G --numjobs=1 --runtime=60
> --group_reporting --name=randwrite
> fio: this platform does not support process shared mutexes, forcing use
> of threads. Use the 'thread' option to get rid of this warning.
> randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync,
> iodepth=1
> fio-2.0.15
> Starting 1 process
> Jobs: 1 (f=1): [w] [100.0% done] [0K/1948K/0K /s] [0 /487 /0 iops] [eta
> 00m:00s]
> randwrite: (groupid=0, jobs=1): err= 0: pid=101023: Thu Jul 4 03:03:05 2013
> write: io=65592KB, bw=1093.2KB/s, iops=273 , runt= 60002msec
> clat (usec): min=9 , max=157723 , avg=3654.65, stdev=5733.27
> lat (usec): min=9 , max=157724 , avg=3654.98, stdev=5733.29
> clat percentiles (usec):
> | 1.00th=[ 12], 5.00th=[ 13], 10.00th=[ 18], 20.00th=[ 23],
> | 30.00th=[ 25], 40.00th=[ 740], 50.00th=[ 756], 60.00th=[ 4048],
> | 70.00th=[ 5856], 80.00th=[ 7648], 90.00th=[ 9408], 95.00th=[10304],
> | 99.00th=[11584], 99.50th=[19072], 99.90th=[96768], 99.95th=[117248],
> | 99.99th=[140288]
> bw (KB/s) : min= 174, max= 2184, per=99.67%, avg=1089.37, stdev=392.80
> lat (usec) : 10=0.21%, 20=11.34%, 50=25.24%, 100=0.04%, 750=9.51%
> lat (usec) : 1000=5.17%
> lat (msec) : 2=0.30%, 4=7.89%, 10=33.89%, 20=5.99%, 50=0.28%
> lat (msec) : 100=0.05%, 250=0.10%
> cpu : usr=0.16%, sys=1.01%, ctx=10488, majf=0, minf=0
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=16398/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> WRITE: io=65592KB, aggrb=1093KB/s, minb=1093KB/s, maxb=1093KB/s,
> mint=60002msec, maxt=60002msec
>
> fio --direct=1 --rw=randread --bs=4k --size=2G --numjobs=1 --runtime=60
> --group_reporting --name=randread
> fio: this platform does not support process shared mutexes, forcing use
> of threads. Use the 'thread' option to get rid of this warning.
> randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> fio-2.0.15
> Starting 1 process
> randread: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [r] [-.-% done] [608.5M/0K/0K /s] [156K/0 /0 iops] [eta
> 00m:00s]
> randread: (groupid=0, jobs=1): err= 0: pid=101025: Thu Jul 4 03:04:35 2013
> read : io=2048.0MB, bw=637045KB/s, iops=159261 , runt= 3292msec
> clat (usec): min=3 , max=83 , avg= 5.25, stdev= 1.39
> lat (usec): min=3 , max=83 , avg= 5.32, stdev= 1.39
> clat percentiles (usec):
> | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 5], 20.00th=[ 5],
> | 30.00th=[ 5], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5],
> | 70.00th=[ 5], 80.00th=[ 6], 90.00th=[ 6], 95.00th=[ 6],
> | 99.00th=[ 10], 99.50th=[ 14], 99.90th=[ 22], 99.95th=[ 25],
> | 99.99th=[ 45]
> bw (KB/s) : min=621928, max=644736, per=99.72%, avg=635281.33,
> stdev=10139.68
> lat (usec) : 4=0.05%, 10=98.94%, 20=0.86%, 50=0.14%, 100=0.01%
> cpu : usr=14.83%, sys=85.14%, ctx=60, majf=0, minf=1
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=524288/w=0/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> READ: io=2048.0MB, aggrb=637044KB/s, minb=637044KB/s,
> maxb=637044KB/s, mint=3292msec, maxt=3292msec
So if I interpret the above correctly, Box A can crank out ~140MB/s
random write and ~300MB/s random read, while Box B manages ~1MB/s random
write and ~630MB/s random read?
A few thoughts:
- What's up with Box B's 1MB/s write bandwidth? I'm guessing something
fired up at the same time as your IO test and killed your random write
throughput.
- Random read/write is not really a useful test here, as FTP is
effectively a sequential streaming read/write workload; the random
read/write throughput is largely irrelevant to it.
- I recall some advice that zpools should not have more than about 8 or
10 disks in them, and that you should instead create multiple zpools if
you have more disks. Perhaps investigate the source of that rumour and,
if it's true, try creating 2 x 8-disk zpools in Box A and 3 x 8-disk
zpools in Box B and see if that changes things at all.
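On the last two points, a couple of untested command sketches (device names
taken from your zpool status output; the bs=128k block size is my choice as
a rough stand-in for a streaming workload, and the zpool commands DESTROY
the existing pool, so scratch data only):

```
# Sequential fio runs approximate the FTP workload far better than
# randread/randwrite:
fio --direct=1 --rw=write --bs=128k --size=2G --numjobs=1 --runtime=60 \
    --group_reporting --name=seqwrite
fio --direct=1 --rw=read --bs=128k --size=2G --numjobs=1 --runtime=60 \
    --group_reporting --name=seqread

# To test the narrower-pool theory on Box A (an analogous 3 x 8-disk
# split would apply to Box B's mfid devices):
zpool destroy testing
zpool create testing1 da0 da1 da2 da3 da4 da5 da6 da7
zpool create testing2 da8 da9 da10 da11 da12 da13 da14 da15
```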
Cheers,
Lawrence