TCP stack lock contention with short-lived connections
Julien Charbon
jcharbon at verisign.com
Thu Feb 27 10:33:09 UTC 2014
Hi,
On 07/11/13 14:55, Julien Charbon wrote:
> On Mon, 04 Nov 2013 22:21:04 +0100, Julien Charbon
> <jcharbon at verisign.com> wrote:
>> just a follow-up of vBSDCon discussions about FreeBSD TCP
>> performances with short-lived connections. In summary: <snip>
>>
>> I have put technical and how-to-repeat details in below PR:
>>
>> kern/183659: TCP stack lock contention with short-lived connections
>> http://www.freebsd.org/cgi/query-pr.cgi?pr=183659
>>
>> We are currently working on this performance improvement effort; it
>> will impact only the TCP locking strategy, not the TCP stack logic
>> itself. We will share the patches we made on freebsd-net for review
>> and improvement suggestions; in any case this change will require
>> enough eyeballs to avoid introducing tricky race conditions into the
>> TCP stack.
Just a follow-up on this TCP performance improvement task:
1. Our first related patch has been applied in HEAD:
Decrease lock contention within the TCP accept case by removing
the INP_INFO lock from tcp_usr_accept.
http://svnweb.freebsd.org/base?view=revision&revision=261242
Thanks to the reviewers.
2. We studied another lock contention issue related to INP_INFO, seen
when TCP connections in the TIME_WAIT state are cleaned up. The context:
This lock contention was found while investigating why our Intel 10G
NIC was dropping packets on receive even with plenty of free bandwidth.
To study this issue we computed the distribution of TCP connection
durations (i.e. the time between the first SYN and the last ACK of a
given TCP session) at the maximum rate of TCP connections per second
_without_ a single packet drop, using FreeBSD 10.0-RELEASE (see attached
conntime-bsd10.0-release.pdf).
In this graph, the X axis is the time in seconds at which the SYN was
received, and the Y axis is the TCP session duration in milliseconds
(i.e. the difference between the time the first SYN was received and
the time the last ACK was sent).
As you can see, every 500ms the TCP connection time rises in spikes.
This periodicity led us to tcp_slowtimo(), which calls
tcp_tw_2msl_scan() with the INP_INFO lock held.
Our theory is: every 500ms there is a competition for the INP_INFO
lock between tcp_slowtimo() and tcp_input(), and tcp_input() is called
directly from the NIC RX interrupt handler. Thus, for the whole
duration of the tcp_tw_2msl_scan() call, packets are no longer dequeued
from the NIC RX rings, and once all RX rings are full the NIC starts
dropping packets on receive.
The calculation of the time needed to fill up all available RX ring
descriptors supports this theory:
- 40k TCP connections per second (with 5 packets received per TCP
connection) = 200k packets per second
- We use 4 RX queues of 2048 descriptors each = 8192 RX descriptors
overall
Time to fill up all available NIC descriptors at 200k packets per
second: 8192/200000 = 40.96 milliseconds
This is consistent with what we see on the
conntime-bsd10.0-release.pdf graph.
To confirm this theory, we introduced a new lock (see attached patch
tw-lock.patch) to protect the TIME_WAIT global list instead of using
INP_INFO; the TIME_WAIT states are now cleaned up one by one, in order
to prioritize the NIC interrupt handler over tcp_tw_2msl_scan(). See
attached conntime-bsd10.0-patched.pdf: no more spikes, and the maximum
TCP connection rate without dropping a single packet becomes:
- FreeBSD 10.0: 40k connections/s
- FreeBSD 10.0 + patch: 53k connections/s
Obviously, there are various ways to mitigate this lock contention:
- Introduce a new time-wait lock, as proposed in the attached patch
- Call tcp_tw_2msl_scan() more often under high workload
- Use INP_INFO_TRY_WLOCK() in tcp_tw_2msl_scan() to clean up time-wait
objects only when nobody else holds the INP_INFO lock
- Etc.
The strategy is to prioritize packet reception over time-wait object
cleanup, because:
- we hate dropping packets on receive when the bandwidth is far from
full
- the maximum number of time-wait objects in use is configurable
(net.inet.tcp.maxtcptw)
- in case of time-wait object memory exhaustion, the current behavior
is already optimal: the oldest time-wait object is recycled and
directly reused.
We chose the time-wait lock approach because it fits well with our
long-term goal of completely removing INP_INFO lock contention
everywhere in the TCP stack.
Any thoughts on this particular behavior?
--
Julien
-------------- next part --------------
A non-text attachment was scrubbed...
Name: conntime-bsd10.0-release.pdf
Type: application/pdf
Size: 13253 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-net/attachments/20140227/2cf9738a/attachment-0002.pdf>
-------------- next part --------------
diff --git a/sys/netinet/tcp_timer.c b/sys/netinet/tcp_timer.c
index bde7503..b45a9ea 100644
--- a/sys/netinet/tcp_timer.c
+++ b/sys/netinet/tcp_timer.c
@@ -144,9 +144,7 @@ tcp_slowtimo(void)
VNET_LIST_RLOCK_NOSLEEP();
VNET_FOREACH(vnet_iter) {
CURVNET_SET(vnet_iter);
- INP_INFO_WLOCK(&V_tcbinfo);
- (void) tcp_tw_2msl_scan(0);
- INP_INFO_WUNLOCK(&V_tcbinfo);
+ tcp_tw_2msl_scan();
CURVNET_RESTORE();
}
VNET_LIST_RUNLOCK_NOSLEEP();
diff --git a/sys/netinet/tcp_timer.h b/sys/netinet/tcp_timer.h
index 3115fb3..c04723a 100644
--- a/sys/netinet/tcp_timer.h
+++ b/sys/netinet/tcp_timer.h
@@ -178,7 +178,8 @@ extern int tcp_fast_finwait2_recycle;
void tcp_timer_init(void);
void tcp_timer_2msl(void *xtp);
struct tcptw *
- tcp_tw_2msl_scan(int _reuse); /* XXX temporary */
+ tcp_tw_2msl_reuse(void); /* XXX temporary? */
+void tcp_tw_2msl_scan(void);
void tcp_timer_keep(void *xtp);
void tcp_timer_persist(void *xtp);
void tcp_timer_rexmt(void *xtp);
diff --git a/sys/netinet/tcp_timewait.c b/sys/netinet/tcp_timewait.c
index 7e6128b..0230b88 100644
--- a/sys/netinet/tcp_timewait.c
+++ b/sys/netinet/tcp_timewait.c
@@ -49,6 +49,7 @@ __FBSDID("$FreeBSD$");
#include <sys/socketvar.h>
#include <sys/protosw.h>
#include <sys/random.h>
+#include <sys/refcount.h>
#include <vm/uma.h>
@@ -98,13 +99,59 @@ static int maxtcptw;
* The timed wait queue contains references to each of the TCP sessions
* currently in the TIME_WAIT state. The queue pointers, including the
* queue pointers in each tcptw structure, are protected using the global
- * tcbinfo lock, which must be held over queue iteration and modification.
+ * timewait lock, which must be held over queue iteration and modification.
*/
static VNET_DEFINE(TAILQ_HEAD(, tcptw), twq_2msl);
#define V_twq_2msl VNET(twq_2msl)
-static void tcp_tw_2msl_reset(struct tcptw *, int);
-static void tcp_tw_2msl_stop(struct tcptw *);
+/* Global timewait lock */
+static VNET_DEFINE(struct rwlock, tw_lock);
+#define V_tw_lock VNET(tw_lock)
+
+#define TW_LOCK_INIT(tw, d) rw_init_flags(&(tw), (d), 0)
+#define TW_LOCK_DESTROY(tw) rw_destroy(&(tw))
+#define TW_RLOCK(tw) rw_rlock(&(tw))
+#define TW_WLOCK(tw) rw_wlock(&(tw))
+#define TW_RUNLOCK(tw) rw_runlock(&(tw))
+#define TW_WUNLOCK(tw) rw_wunlock(&(tw))
+#define TW_LOCK_ASSERT(tw) rw_assert(&(tw), RA_LOCKED)
+#define TW_RLOCK_ASSERT(tw) rw_assert(&(tw), RA_RLOCKED)
+#define TW_WLOCK_ASSERT(tw) rw_assert(&(tw), RA_WLOCKED)
+#define TW_UNLOCK_ASSERT(tw) rw_assert(&(tw), RA_UNLOCKED)
+
+/*
+ * tw_pcbref() bumps the reference count on a tw in order to maintain
+ * the stability of a tw pointer despite the tw lock being released.
+ */
+static void
+tw_pcbref(struct tcptw *tw)
+{
+ KASSERT(tw->tw_refcount > 0, ("%s: refcount 0", __func__));
+ refcount_acquire(&tw->tw_refcount);
+}
+
+/*
+ * Drop a refcount on a tw elevated using tw_pcbref().  The tw lock is
+ * released before returning; returns 1 if the last reference was
+ * dropped and the tw was freed.
+ */
+static int
+tw_pcbrele(struct tcptw *tw)
+{
+ TW_WLOCK_ASSERT(V_tw_lock);
+ KASSERT(tw->tw_refcount > 0, ("%s: refcount 0", __func__));
+
+ if (!refcount_release(&tw->tw_refcount)) {
+ TW_WUNLOCK(V_tw_lock);
+ return (0);
+ }
+
+ uma_zfree(V_tcptw_zone, tw);
+ TW_WUNLOCK(V_tw_lock);
+ return (1);
+}
+
+static void tcp_tw_2msl_reset(struct tcptw *, int rearm);
+static void tcp_tw_2msl_stop(struct tcptw *, int reuse);
static int
tcptw_auto_size(void)
@@ -171,6 +218,7 @@ tcp_tw_init(void)
else
uma_zone_set_max(V_tcptw_zone, maxtcptw);
TAILQ_INIT(&V_twq_2msl);
+ TW_LOCK_INIT(V_tw_lock, "tcptw");
}
#ifdef VIMAGE
@@ -179,11 +227,14 @@ tcp_tw_destroy(void)
{
struct tcptw *tw;
- INP_INFO_WLOCK(&V_tcbinfo);
- while((tw = TAILQ_FIRST(&V_twq_2msl)) != NULL)
- tcp_twclose(tw, 0);
- INP_INFO_WUNLOCK(&V_tcbinfo);
+ TW_WLOCK(V_tw_lock);
+	while ((tw = TAILQ_FIRST(&V_twq_2msl)) != NULL) {
+		tcp_twclose(tw, 0);
+		TW_WLOCK(V_tw_lock);
+	}
+ TW_WUNLOCK(V_tw_lock);
+ TW_LOCK_DESTROY(V_tw_lock);
uma_zdestroy(V_tcptw_zone);
}
#endif
@@ -204,7 +255,7 @@ tcp_twstart(struct tcpcb *tp)
int isipv6 = inp->inp_inc.inc_flags & INC_ISIPV6;
#endif
- INP_INFO_WLOCK_ASSERT(&V_tcbinfo); /* tcp_tw_2msl_reset(). */
+ INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
INP_WLOCK_ASSERT(inp);
if (V_nolocaltimewait) {
@@ -229,7 +280,7 @@ tcp_twstart(struct tcpcb *tp)
tw = uma_zalloc(V_tcptw_zone, M_NOWAIT);
if (tw == NULL) {
- tw = tcp_tw_2msl_scan(1);
+ tw = tcp_tw_2msl_reuse();
if (tw == NULL) {
tp = tcp_close(tp);
if (tp != NULL)
@@ -238,6 +289,7 @@ tcp_twstart(struct tcpcb *tp)
}
}
tw->tw_inpcb = inp;
+ refcount_init(&tw->tw_refcount, 1);
/*
* Recover last window size sent.
@@ -356,7 +408,6 @@ tcp_twcheck(struct inpcb *inp, struct tcpopt *to, struct tcphdr *th,
int thflags;
tcp_seq seq;
- /* tcbinfo lock required for tcp_twclose(), tcp_tw_2msl_reset(). */
INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
INP_WLOCK_ASSERT(inp);
@@ -458,11 +509,11 @@ tcp_twclose(struct tcptw *tw, int reuse)
inp = tw->tw_inpcb;
KASSERT((inp->inp_flags & INP_TIMEWAIT), ("tcp_twclose: !timewait"));
KASSERT(intotw(inp) == tw, ("tcp_twclose: inp_ppcb != tw"));
- INP_INFO_WLOCK_ASSERT(&V_tcbinfo); /* tcp_tw_2msl_stop(). */
+ INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
INP_WLOCK_ASSERT(inp);
tw->tw_inpcb = NULL;
- tcp_tw_2msl_stop(tw);
+ tcp_tw_2msl_stop(tw, reuse);
inp->inp_ppcb = NULL;
in_pcbdrop(inp);
@@ -494,11 +545,6 @@ tcp_twclose(struct tcptw *tw, int reuse)
} else
in_pcbfree(inp);
TCPSTAT_INC(tcps_closed);
- crfree(tw->tw_cred);
- tw->tw_cred = NULL;
- if (reuse)
- return;
- uma_zfree(V_tcptw_zone, tw);
}
int
@@ -616,34 +662,83 @@ tcp_tw_2msl_reset(struct tcptw *tw, int rearm)
INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
INP_WLOCK_ASSERT(tw->tw_inpcb);
+
+ TW_WLOCK(V_tw_lock);
if (rearm)
TAILQ_REMOVE(&V_twq_2msl, tw, tw_2msl);
tw->tw_time = ticks + 2 * tcp_msl;
TAILQ_INSERT_TAIL(&V_twq_2msl, tw, tw_2msl);
+ TW_WUNLOCK(V_tw_lock);
}
static void
-tcp_tw_2msl_stop(struct tcptw *tw)
+tcp_tw_2msl_stop(struct tcptw *tw, int reuse)
{
INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
+
+ TW_WLOCK(V_tw_lock);
TAILQ_REMOVE(&V_twq_2msl, tw, tw_2msl);
+ crfree(tw->tw_cred);
+ tw->tw_cred = NULL;
+
+ if (!reuse) {
+ tw_pcbrele(tw);
+ return;
+ }
+
+ TW_WUNLOCK(V_tw_lock);
}
struct tcptw *
-tcp_tw_2msl_scan(int reuse)
+tcp_tw_2msl_reuse(void)
{
	struct tcptw *tw;
+
INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
+
+ TW_WLOCK(V_tw_lock);
+ tw = TAILQ_FIRST(&V_twq_2msl);
+ if (tw == NULL) {
+ TW_WUNLOCK(V_tw_lock);
+		return (NULL);
+ }
+ TW_WUNLOCK(V_tw_lock);
+
+ INP_WLOCK(tw->tw_inpcb);
+ tcp_twclose(tw, 1);
+
+ return (tw);
+}
+
+void
+tcp_tw_2msl_scan(void)
+{
+
+ struct tcptw *tw;
for (;;) {
+ TW_RLOCK(V_tw_lock);
tw = TAILQ_FIRST(&V_twq_2msl);
- if (tw == NULL || (!reuse && (tw->tw_time - ticks) > 0))
+ if (tw == NULL || ((tw->tw_time - ticks) > 0)) {
+ TW_RUNLOCK(V_tw_lock);
break;
+ }
+ tw_pcbref(tw);
+ TW_RUNLOCK(V_tw_lock);
+
+ /* Close timewait state */
+ INP_INFO_WLOCK(&V_tcbinfo);
+
+ TW_WLOCK(V_tw_lock);
+		if (tw_pcbrele(tw)) {
+			INP_INFO_WUNLOCK(&V_tcbinfo);
+			continue;
+		}
+
+ KASSERT(tw->tw_inpcb != NULL,
+ ("%s: tw->tw_inpcb == NULL", __func__));
+
INP_WLOCK(tw->tw_inpcb);
- tcp_twclose(tw, reuse);
- if (reuse)
- return (tw);
+ tcp_twclose(tw, 0);
+ INP_INFO_WUNLOCK(&V_tcbinfo);
}
- return (NULL);
}
diff --git a/sys/netinet/tcp_var.h b/sys/netinet/tcp_var.h
index e3197e5..b44672d 100644
--- a/sys/netinet/tcp_var.h
+++ b/sys/netinet/tcp_var.h
@@ -355,6 +355,7 @@ struct tcptw {
TAILQ_ENTRY(tcptw) tw_2msl;
void *tw_pspare; /* TCP_SIGNATURE */
u_int *tw_spare; /* TCP_SIGNATURE */
+ u_int tw_refcount; /* refcount */
};
#define intotcpcb(ip) ((struct tcpcb *)(ip)->inp_ppcb)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: conntime-bsd10.0-patched.pdf
Type: application/pdf
Size: 10422 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-net/attachments/20140227/2cf9738a/attachment-0003.pdf>