epair failure in production on 11.1-STABLE (r328930) ? weird!
Dr Josef Karthauser
joe at truespeed.com
Mon Jul 2 21:11:40 UTC 2018
We’re experiencing a strange production failure with epair (which we’re using to talk to VIMAGE jails).
FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb 6 16:05:59 GMT 2018 root at s5:/usr/obj/usr/src/sys/TRUESPEED amd64
It looks like epair has suddenly stopped forwarding packets between the two ends of each pair. The server has been up for 82 days and was working fine, but packets are no longer being forwarded across any epair on the system (we’ve got around 30 epairs on the host). The result is a sudden ARP resolution failure which is affecting all services. :(
Here’s the test. On a working machine this works fine:
# Create an epair and put an IP address on it, so we can generate ARP traffic with ping.
root at magnesium:/usr/home/systems # ifconfig epair create
epair7a
root at magnesium:/usr/home/systems # ifconfig epair7a up
root at magnesium:/usr/home/systems # ifconfig epair7b up
root at magnesium:/usr/home/systems # ifconfig epair7a inet 10.140.0.1/30
# Generate ARP traffic over the epair… should see arp requests on epair7b.
root at magnesium:/usr/home/systems # ping 10.140.0.2
PING 10.140.0.2 (10.140.0.2): 56 data bytes
# Watch traffic coming in from the epair
root at magnesium:/usr/home/systems # tcpdump -i epair7b
10:22:27.446651 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 28
10:22:28.475086 ARP, Request who-has 10.140.0.2 tell 10.140.0.1, length 28
^C
2 packets captured
2 packets received by filter
0 packets dropped by kernel
Works fine.
However, on the failing machine no packets are forwarded any more (remember, it had been working fine for a few months and then suddenly stopped :( ).
root at s5:/usr/home/systems # ifconfig epair create
epair19a
root at s5:/usr/home/systems # ifconfig epair19a up
root at s5:/usr/home/systems # ifconfig epair19b up
root at s5:/usr/home/systems # ifconfig epair19a inet 10.140.0.1/30
root at s5:/usr/home/systems # ping 10.140.0.2
PING 10.140.0.2 (10.140.0.2): 56 data bytes
root at s5:/usr/home/systems # tcpdump -ni epair19a
09:24:20.396384 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 28
09:24:21.404737 ARP, Request who-has 10.130.0.2 tell 10.130.0.1, length 28
^C
root at s5:/usr/home/systems # tcpdump -ni epair19b
[Tumbleweed: no traffic seen]
^C
Has anyone seen this before? We’re going to reboot and see if that fixes the problem.
The failing kernel in question is:
FreeBSD s5 11.1-STABLE FreeBSD 11.1-STABLE #2 r328930: Tue Feb 6 16:05:59 GMT 2018 root at s5:/usr/obj/usr/src/sys/TRUESPEED amd64
Break break. We’ve just seen Bugzilla report 22710, which says that epair fails when the queue limit (net.link.epair.netisr_maxqlen) is hit. We’ve just introduced a high-bandwidth service on this machine, so that is probably what caused the issue.
We’ve currently got a value of:
net.link.epair.netisr_maxqlen: 2100
root at s5:/usr/home/systems # netstat -Q
Configuration:
Setting                     Current     Limit
Thread count                      1         1
Default queue limit             256     10240
Dispatch policy              direct       n/a
Threads bound to CPUs      disabled       n/a
Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1    256   flow  default   ---
igmp       2    256 source  default   ---
rtsock     3    256 source  default   ---
arp        4    256 source  default   ---
ether      5    256 source   direct   ---
ip6        6    256   flow  default   ---
epair      8   2100    cpu  default   CD-
Workstreams:
WSID CPU Name     Len WMark     Disp'd  HDisp'd   QDrops     Queued    Handled
   0   0 ip         0   253  385468689        0        0   49360754  434829441
   0   0 igmp       0     0          0        0        0          0          0
   0   0 rtsock     0     5          0        0        0       1144       1144
   0   0 arp        0     0    5573045        0        0          0    5573045
   0   0 ether      0     0 1125223166        0        0          0 1125223166
   0   0 ip6        0     4         90        0        0    1220274    1220364
   0   0 epair      0  2100          0        0      214 4994675481 4994675481
But we can’t see how much of the queue is currently being used, or what size we need to set it to.
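One possible way to read the netstat -Q output, sketched here under assumptions: if Len is the current queue length and WMark the highest length seen so far (that is my reading of the column names, not something confirmed in this thread), then a workstream whose WMark has reached its protocol's QLimit, or whose QDrops is non-zero, has overflowed at some point. The helper below, check_netisr, is a hypothetical name and not part of FreeBSD. It only parses the text format shown above and prints such workstreams.

```shell
#!/bin/sh
# check_netisr (hypothetical helper): reads `netstat -Q` output on stdin and
# prints every workstream that has recorded queue drops or whose high
# watermark has reached the queue limit listed in the Protocols section.
# Assumption: the column layout matches the output quoted in this message.
check_netisr() {
    awk '
        /^Configuration:/ { section = "conf";  next }
        /^Protocols:/     { section = "proto"; next }
        /^Workstreams:/   { section = "ws";    next }

        # Protocols rows: Name Proto QLimit Policy Dispatch Flags
        section == "proto" && NF == 6 && $2 ~ /^[0-9]+$/ {
            qlimit[$1] = $3
            next
        }

        # Workstreams rows:
        # WSID CPU Name Len WMark Dispd HDispd QDrops Queued Handled
        section == "ws" && NF == 10 && $1 ~ /^[0-9]+$/ {
            name = $3
            hit_limit = (name in qlimit) && ($5 + 0 >= qlimit[name] + 0)
            if ($8 + 0 > 0 || hit_limit)
                printf "%s cpu=%s len=%s wmark=%s qlimit=%s qdrops=%s\n",
                       name, $2, $4, $5, qlimit[name], $8
        }
    '
}
```

It would be run as "netstat -Q | check_netisr". Against the output above, only the epair workstream matches (WMark 2100 against a QLimit of 2100, with 214 QDrops), which fits the theory that the queue limit was hit. It does not explain why forwarding stays broken afterwards.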
And why has hitting the queue limit broken it entirely?
Help!
Cheers,
Joe