high bandwidth tcp connection stalls on igb (was: igb enable_aim or flow_control causing tcp stalls?)

Steven Hartland killing at multiplay.co.uk
Mon Jul 18 16:11:51 UTC 2011


Confirmed with a blade-to-blade transfer. Also noticed that if two transfers are
happening at the same time, both will stall, not just one. ssh consoles don't
seem to be affected, only high-volume transfers like scp and rsync.

It also seems that the more active connections there are, the more likely a
stall is to happen.

Another thing I've noticed, looking at the trace from the source host in
wireshark, is a large number of "TCP ACKed lost segment" entries interspersed
with 8 - 16K IP packets, starting just after the ssh handshake. Might this be
relevant?
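
To pin down where a transfer goes quiet without scrolling through the capture by eye, a small script can look for long gaps between consecutive packets. The following is only a sketch: the function name, the 1 second threshold and the example timestamps are made up for illustration (the timestamps are merely shaped like the stall described in the quoted message below, from about 2.1s to about 65s). It assumes the relative timestamps have already been extracted from the pcap, for example with tshark's `-T fields -e frame.time_relative`.

```python
def find_stalls(timestamps, min_gap=1.0):
    """Return (start, end, gap) for each pause of at least min_gap seconds
    between consecutive packets."""
    ordered = sorted(timestamps)
    stalls = []
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap >= min_gap:
            stalls.append((previous, current, gap))
    return stalls


# Hypothetical relative timestamps in seconds, not taken from the real trace.
EXAMPLE = [0.0, 0.5, 1.2, 2.1, 65.0, 65.1, 65.3]

if __name__ == "__main__":
    for start, end, gap in find_stalls(EXAMPLE):
        print(f"stall from {start:.1f}s to {end:.1f}s ({gap:.1f}s)")
```

Running the same check on the captures from both ends would show whether the two hosts see the pause start and end at the same points.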

I've tried with as many hardware options disabled as I could find, but no change:
-tso -rxcsum -txcsum -lro -vlanhwtag
net.inet.tcp.tso=0
dev.igb.0.enable_aim=0
dev.igb.0.flow_control=0
dev.igb.1.enable_aim=0
dev.igb.1.flow_control=0

Here are the stats from the suspect device, which has just stalled again. Jack,
does anything look suspect here, and do you have any ideas what this might be?


dev.igb.0.%desc: Intel(R) PRO/1000 Network Connection version - 2.0.7
dev.igb.0.%driver: igb
dev.igb.0.%location: slot=0 function=0
dev.igb.0.%pnpinfo: vendor=0x8086 device=0x10e7 subvendor=0x15d9 subdevice=0x10e7 class=0x020000
dev.igb.0.%parent: pci5
dev.igb.0.nvm: -1
dev.igb.0.flow_control: 0
dev.igb.0.enable_aim: 0
dev.igb.0.rx_processing_limit: 100
dev.igb.0.link_irq: 4
dev.igb.0.dropped: 0
dev.igb.0.tx_dma_fail: 0
dev.igb.0.rx_overruns: 0
dev.igb.0.watchdog_timeouts: 0
dev.igb.0.device_control: 14424641
dev.igb.0.rx_control: 67141634
dev.igb.0.interrupt_mask: 4
dev.igb.0.extended_int_mask: 2147484159
dev.igb.0.tx_buf_alloc: 0
dev.igb.0.rx_buf_alloc: 0
dev.igb.0.fc_high_water: 58976
dev.igb.0.fc_low_water: 58960
dev.igb.0.queue0.interrupt_rate: 8000
dev.igb.0.queue0.txd_head: 266
dev.igb.0.queue0.txd_tail: 266
dev.igb.0.queue0.no_desc_avail: 0
dev.igb.0.queue0.tx_packets: 9462610
dev.igb.0.queue0.rxd_head: 891
dev.igb.0.queue0.rxd_tail: 890
dev.igb.0.queue0.rx_packets: 15326075
dev.igb.0.queue0.rx_bytes: 19146964251
dev.igb.0.queue0.lro_queued: 0
dev.igb.0.queue0.lro_flushed: 0
dev.igb.0.queue1.interrupt_rate: 8000
dev.igb.0.queue1.txd_head: 225
dev.igb.0.queue1.txd_tail: 225
dev.igb.0.queue1.no_desc_avail: 0
dev.igb.0.queue1.tx_packets: 15985904
dev.igb.0.queue1.rxd_head: 999
dev.igb.0.queue1.rxd_tail: 998
dev.igb.0.queue1.rx_packets: 25696231
dev.igb.0.queue1.rx_bytes: 32902117763
dev.igb.0.queue1.lro_queued: 0
dev.igb.0.queue1.lro_flushed: 0
dev.igb.0.queue2.interrupt_rate: 8000
dev.igb.0.queue2.txd_head: 157
dev.igb.0.queue2.txd_tail: 157
dev.igb.0.queue2.no_desc_avail: 0
dev.igb.0.queue2.tx_packets: 12697405
dev.igb.0.queue2.rxd_head: 778
dev.igb.0.queue2.rxd_tail: 777
dev.igb.0.queue2.rx_packets: 20780810
dev.igb.0.queue2.rx_bytes: 26096219675
dev.igb.0.queue2.lro_queued: 0
dev.igb.0.queue2.lro_flushed: 0
dev.igb.0.queue3.interrupt_rate: 8000
dev.igb.0.queue3.txd_head: 242
dev.igb.0.queue3.txd_tail: 242
dev.igb.0.queue3.no_desc_avail: 0
dev.igb.0.queue3.tx_packets: 11831167
dev.igb.0.queue3.rxd_head: 111
dev.igb.0.queue3.rxd_tail: 110
dev.igb.0.queue3.rx_packets: 18590831
dev.igb.0.queue3.rx_bytes: 25894011731
dev.igb.0.queue3.lro_queued: 0
dev.igb.0.queue3.lro_flushed: 0
dev.igb.0.queue4.interrupt_rate: 8000
dev.igb.0.queue4.txd_head: 841
dev.igb.0.queue4.txd_tail: 841
dev.igb.0.queue4.no_desc_avail: 0
dev.igb.0.queue4.tx_packets: 13540958
dev.igb.0.queue4.rxd_head: 835
dev.igb.0.queue4.rxd_tail: 834
dev.igb.0.queue4.rx_packets: 21880643
dev.igb.0.queue4.rx_bytes: 28291440234
dev.igb.0.queue4.lro_queued: 0
dev.igb.0.queue4.lro_flushed: 0
dev.igb.0.queue5.interrupt_rate: 8000
dev.igb.0.queue5.txd_head: 941
dev.igb.0.queue5.txd_tail: 941
dev.igb.0.queue5.no_desc_avail: 0
dev.igb.0.queue5.tx_packets: 11124540
dev.igb.0.queue5.rxd_head: 214
dev.igb.0.queue5.rxd_tail: 213
dev.igb.0.queue5.rx_packets: 18048214
dev.igb.0.queue5.rx_bytes: 22957384083
dev.igb.0.queue5.lro_queued: 0
dev.igb.0.queue5.lro_flushed: 0
dev.igb.0.queue6.interrupt_rate: 8000
dev.igb.0.queue6.txd_head: 782
dev.igb.0.queue6.txd_tail: 783
dev.igb.0.queue6.no_desc_avail: 0
dev.igb.0.queue6.tx_packets: 13581988
dev.igb.0.queue6.rxd_head: 504
dev.igb.0.queue6.rxd_tail: 503
dev.igb.0.queue6.rx_packets: 21590520
dev.igb.0.queue6.rx_bytes: 29030489548
dev.igb.0.queue6.lro_queued: 0
dev.igb.0.queue6.lro_flushed: 0
dev.igb.0.queue7.interrupt_rate: 8000
dev.igb.0.queue7.txd_head: 961
dev.igb.0.queue7.txd_tail: 961
dev.igb.0.queue7.no_desc_avail: 0
dev.igb.0.queue7.tx_packets: 14163482
dev.igb.0.queue7.rxd_head: 38
dev.igb.0.queue7.rxd_tail: 37
dev.igb.0.queue7.rx_packets: 23149606
dev.igb.0.queue7.rx_bytes: 29114500225
dev.igb.0.queue7.lro_queued: 0
dev.igb.0.queue7.lro_flushed: 0
dev.igb.0.mac_stats.excess_coll: 0
dev.igb.0.mac_stats.single_coll: 0
dev.igb.0.mac_stats.multiple_coll: 0
dev.igb.0.mac_stats.late_coll: 0
dev.igb.0.mac_stats.collision_count: 0
dev.igb.0.mac_stats.symbol_errors: 0
dev.igb.0.mac_stats.sequence_errors: 0
dev.igb.0.mac_stats.defer_count: 0
dev.igb.0.mac_stats.missed_packets: 0
dev.igb.0.mac_stats.recv_no_buff: 0
dev.igb.0.mac_stats.recv_undersize: 0
dev.igb.0.mac_stats.recv_fragmented: 0
dev.igb.0.mac_stats.recv_oversize: 0
dev.igb.0.mac_stats.recv_jabber: 0
dev.igb.0.mac_stats.recv_errs: 0
dev.igb.0.mac_stats.crc_errs: 0
dev.igb.0.mac_stats.alignment_errs: 0
dev.igb.0.mac_stats.coll_ext_errs: 0
dev.igb.0.mac_stats.xon_recvd: 0
dev.igb.0.mac_stats.xon_txd: 0
dev.igb.0.mac_stats.xoff_recvd: 0
dev.igb.0.mac_stats.xoff_txd: 0
dev.igb.0.mac_stats.total_pkts_recvd: 165067073
dev.igb.0.mac_stats.good_pkts_recvd: 165062852
dev.igb.0.mac_stats.bcast_pkts_recvd: 7827
dev.igb.0.mac_stats.mcast_pkts_recvd: 20
dev.igb.0.mac_stats.rx_frames_64: 18346
dev.igb.0.mac_stats.rx_frames_65_127: 2395695
dev.igb.0.mac_stats.rx_frames_128_255: 6686114
dev.igb.0.mac_stats.rx_frames_256_511: 9501896
dev.igb.0.mac_stats.rx_frames_512_1023: 14475414
dev.igb.0.mac_stats.rx_frames_1024_1522: 131985387
dev.igb.0.mac_stats.good_octets_recvd: 214093372362
dev.igb.0.mac_stats.good_octets_txd: 7388817393
dev.igb.0.mac_stats.total_pkts_txd: 102387885
dev.igb.0.mac_stats.good_pkts_txd: 102387885
dev.igb.0.mac_stats.bcast_pkts_txd: 4
dev.igb.0.mac_stats.mcast_pkts_txd: 0
dev.igb.0.mac_stats.tx_frames_64: 662
dev.igb.0.mac_stats.tx_frames_65_127: 102263884
dev.igb.0.mac_stats.tx_frames_128_255: 44518
dev.igb.0.mac_stats.tx_frames_256_511: 25033
dev.igb.0.mac_stats.tx_frames_512_1023: 18188
dev.igb.0.mac_stats.tx_frames_1024_1522: 35600
dev.igb.0.mac_stats.tso_txd: 0
dev.igb.0.mac_stats.tso_ctx_fail: 0
dev.igb.0.interrupts.asserts: 27233325
dev.igb.0.interrupts.rx_pkt_timer: 165061054
dev.igb.0.interrupts.rx_abs_timer: 0
dev.igb.0.interrupts.tx_pkt_timer: 0
dev.igb.0.interrupts.tx_abs_timer: 165062852
dev.igb.0.interrupts.tx_queue_empty: 102387060
dev.igb.0.interrupts.tx_queue_min_thresh: 0
dev.igb.0.interrupts.rx_desc_min_thresh: 0
dev.igb.0.interrupts.rx_overrun: 0
dev.igb.0.host.breaker_tx_pkt: 0
dev.igb.0.host.host_tx_pkt_discard: 0
dev.igb.0.host.rx_pkt: 1798
dev.igb.0.host.breaker_rx_pkts: 0
dev.igb.0.host.breaker_rx_pkt_drop: 0
dev.igb.0.host.tx_good_pkt: 825
dev.igb.0.host.breaker_tx_pkt_drop: 0
dev.igb.0.host.rx_good_bytes: 214093406063
dev.igb.0.host.tx_good_bytes: 7388817393
dev.igb.0.host.length_errors: 0
dev.igb.0.host.serdes_violation_pkt: 0
dev.igb.0.host.header_redir_missed: 0
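
A rough way to scan a dump like this for queues whose TX descriptor ring is not idle is to compare txd_head and txd_tail per queue. The sketch below assumes the text comes from `sysctl dev.igb.0`; the function names are made up for illustration, and the sample is an excerpt of the output above. In that output queue6 is the only queue where the two values differ (782 vs 783). Whether that means anything is for someone who knows the driver to say, since a snapshot taken with traffic in flight could look the same.

```python
import re
from collections import defaultdict

# Matches per-queue counters such as "dev.igb.0.queue6.txd_head: 782".
QUEUE_LINE = re.compile(r"^dev\.igb\.(\d+)\.queue(\d+)\.(\w+):\s*(-?\d+)\s*$")


def parse_queue_stats(sysctl_text):
    """Return {(unit, queue): {counter: value}} for the per-queue lines."""
    stats = defaultdict(dict)
    for line in sysctl_text.splitlines():
        match = QUEUE_LINE.match(line.strip())
        if match:
            unit, queue, name, value = match.groups()
            stats[(int(unit), int(queue))][name] = int(value)
    return dict(stats)


def tx_rings_not_idle(stats):
    """List (unit, queue, head, tail) where txd_head and txd_tail differ."""
    pending = []
    for (unit, queue), counters in sorted(stats.items()):
        head = counters.get("txd_head")
        tail = counters.get("txd_tail")
        if head is not None and tail is not None and head != tail:
            pending.append((unit, queue, head, tail))
    return pending


# Excerpt copied from the dump above (queues 5 and 6 only).
SAMPLE = """\
dev.igb.0.queue5.txd_head: 941
dev.igb.0.queue5.txd_tail: 941
dev.igb.0.queue6.txd_head: 782
dev.igb.0.queue6.txd_tail: 783
"""

if __name__ == "__main__":
    for unit, queue, head, tail in tx_rings_not_idle(parse_queue_stats(SAMPLE)):
        print(f"igb{unit} queue{queue}: txd_head={head} txd_tail={tail}")
```

Taking the dump a few times during a stall and comparing the results would show whether the same queue stays out of step or whether it is just timing.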

----- Original Message ----- 
From: "Steven Hartland" <killing at multiplay.co.uk>
To: "Kevin Oberman" <kob6558 at gmail.com>
Cc: <freebsd-net at freebsd.org>
Sent: Monday, July 18, 2011 12:20 PM
Subject: Re: igb enable_aim or flow_control causing tcp stalls?


> ----- Original Message ----- 
> From: "Kevin Oberman" <kob6558 at gmail.com>
>>
>> Use "tcpdump -s0 -w file.pcap host remote-system" to see how it fails. You
>> may want to capture on both ends. Then use wireshark (in ports) to analyze
>> the data.
>>
>> There are other tools to provide other types of analysis, depending on the
>> type of problem.
>
> I've managed to get a capture from both ends but it doesn't really make too
> much sense to me. You can clearly see the stall, which starts at the 2.1 second
> mark and recovers at the 65 second mark, but what's causing it is a mystery.
>
> I've attached what I believe is the relevant snippet from each trace.
>
> At this point I believe I've eliminated aim and flow_control, as these were
> both off when this test was performed.
>
> Any advice would be appreciated.
>
> The layout for this test was:-
> Source (7.0-RELEASE-p2 on em0) -> Cisco 6509 -> supermicro blade -> Target
> (8.2-RELEASE on igb0)
>
> I'm going to try to eliminate the Cisco next by going between two blades
> on the local supermicro blade switch.
>
>    Regards
>    Steve
>
>
>
> ================================================
> This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the 
> event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any 
> information contained in it.
>
> In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337
> or return the E.mail to postmaster at multiplay.co.uk.

