Changes in the network interface queueing handoff model

Robert Watson rwatson at FreeBSD.org
Sun Jul 30 14:04:49 UTC 2006


One of the ideas that I, Scott Long, and a few others have been bouncing 
around for some time is a restructuring of the network interface packet 
transmission API to reduce the number of locking operations and allow network 
device drivers increased control of the queueing behavior.  Right now, it 
works something like the following:

- When a network protocol wants to transmit, it calls the ifnet's link layer
   output routine via ifp->if_output() with the ifnet pointer, packet,
   destination address information, and route information.

- The link layer (e.g., ether_output() + ether_output_frame()) encapsulates
   the packet as necessary, performs a link layer address translation (such as
   ARP), and hands off to the ifnet driver via a call to IFQ_HANDOFF(), which
   accepts the ifnet pointer and packet.

- The ifnet layer enqueues the packet in the ifnet send queue (ifp->if_snd),
   and then looks at the driver's IFF_DRV_OACTIVE flag to determine if it needs
   to "start" output by the driver.  If the driver is already active, it
   doesn't, and otherwise, it does.

- The driver dequeues the packet from ifp->if_snd, performs any driver
   encapsulation and wrapping, and notifies the hardware.  In modern hardware,
   this consists of hooking the data of the packet up to the descriptor ring
   and notifying the hardware to pick it up via DMA.  In older hardware, the
   driver would perform a series of I/O operations to send the entire packet
   directly to the card via a system bus.
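
The steps above can be condensed into a small user-space model.  This is only
a sketch: legacy_ifnet, legacy_pkt, legacy_handoff() and legacy_start() are
hypothetical stand-ins for struct ifnet, struct mbuf, IFQ_HANDOFF() and the
driver's if_start routine, and all locking is left out.

```c
#include <assert.h>
#include <stddef.h>

#define LEGACY_QMAX	4	/* hypothetical send queue depth */
#define LEGACY_OACTIVE	0x1	/* stand-in for IFF_DRV_OACTIVE */

struct legacy_pkt { int id; };

struct legacy_ifnet {
	struct legacy_pkt *snd[LEGACY_QMAX];	/* stand-in for ifp->if_snd */
	int head, len;
	int drv_flags;
	int ring_free;		/* free transmit descriptors */
	int sent;		/* packets handed to the "hardware" */
	void (*if_start)(struct legacy_ifnet *);
};

/* Driver start routine: drain the send queue into the descriptor ring. */
static void
legacy_start(struct legacy_ifnet *ifp)
{
	while (ifp->len > 0) {
		if (ifp->ring_free == 0) {
			/* Ring full: mark the driver active and stop. */
			ifp->drv_flags |= LEGACY_OACTIVE;
			return;
		}
		ifp->head = (ifp->head + 1) % LEGACY_QMAX;
		ifp->len--;
		ifp->ring_free--;
		ifp->sent++;
	}
}

/* ifnet-layer handoff: enqueue, then start the driver unless it is active. */
static int
legacy_handoff(struct legacy_ifnet *ifp, struct legacy_pkt *p)
{
	assert(ifp != NULL && p != NULL);
	if (ifp->len == LEGACY_QMAX)
		return (-1);	/* queue full, packet dropped */
	ifp->snd[(ifp->head + ifp->len) % LEGACY_QMAX] = p;
	ifp->len++;
	if ((ifp->drv_flags & LEGACY_OACTIVE) == 0)
		ifp->if_start(ifp);
	return (0);
}
```

Note that even when the ring has room, every packet takes a round trip through
the software queue before it reaches the ring.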

Why change this?  A few reasons:

- The ifnet layer send queue is becoming decreasingly useful over time.  Most
   modern hardware has a significant number of slots in its transmit descriptor
   ring, tuned for the performance of the hardware, etc, which is the effective
   transmit queue in practice.  The additional queue depth doesn't increase
   throughput substantially (if at all) but does consume memory.

- On extremely fast hardware (with respect to CPU speed), the queue remains
   essentially empty, so we pay the cost of enqueueing and dequeuing a packet
   from an empty queue.

- The ifnet send queue is a separately locked object from the device driver,
   meaning that for a single enqueue/dequeue pair, we pay an extra four lock
   operations (two for insert, two for remove) per packet.

- Synthetic link layer drivers, such as if_vlan, have no need for queueing at
   all, so for them the cost of queueing can be eliminated entirely.

- With the change, IFF_DRV_OACTIVE is no longer inspected by the link layer,
   only by the driver, which helps eliminate a latent race condition involving
   use of the flag.
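
To make the lock-count argument concrete, here is a toy accounting of the
mutex operations on the queued path.  It is a sketch under assumptions:
cnt_lock only counts acquire and release calls rather than providing any real
mutual exclusion, and all cnt_* names are made up for illustration.

```c
#include <assert.h>

struct cnt_lock { unsigned ops; };	/* counts lock + unlock calls */

static void cnt_acquire(struct cnt_lock *l) { l->ops++; }
static void cnt_release(struct cnt_lock *l) { l->ops++; }

struct cnt_queue {
	struct cnt_lock lock;	/* stand-in for the ifq mutex */
	int len;
};

static void
cnt_enqueue(struct cnt_queue *q)
{
	cnt_acquire(&q->lock);
	q->len++;
	cnt_release(&q->lock);
}

static void
cnt_dequeue(struct cnt_queue *q)
{
	cnt_acquire(&q->lock);
	if (q->len > 0)
		q->len--;
	cnt_release(&q->lock);
}

/* Push n packets through an otherwise empty queue; return ifq lock ops. */
static unsigned
cnt_queued_lock_ops(unsigned n)
{
	struct cnt_queue q = { { 0 }, 0 };
	unsigned i;

	for (i = 0; i < n; i++) {
		cnt_enqueue(&q);
		cnt_dequeue(&q);
	}
	assert(q.len == 0);
	return (q.lock.ops);
}
```

In this model each packet costs four operations on the queue lock, on top of
whatever the driver does with its own mutex.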

The proposed change is simple: right now one or more enqueue operations
occur, and then a call to ifp->if_start() is made to notify the driver that
it may need to do something (if the IFF_DRV_OACTIVE flag isn't set).  In the
new world
order, the driver is directly passed the mbuf, and may then choose to queue it 
or otherwise handle it as it sees fit.  The immediate practical benefit is 
clear: if the queueing at the ifnet layer is unnecessary, it is entirely 
avoided, skipping enqueue, dequeue, and four mutex operations.  This applies 
immediately for VLAN processing, but also means that for modern gigabit cards, 
the hardware queue (which will be used anyway) is the only queue necessary.
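
A rough user-space sketch of the new dispatch may help.  The direct_* names
are hypothetical stand-ins (direct_ifnet for struct ifnet, direct_pkt for
struct mbuf); only the if_start / if_startmbuf split and the global enable
knob are taken from the proposal.

```c
#include <assert.h>
#include <stddef.h>

struct direct_pkt { int id; };

struct direct_ifnet {
	int queued;	/* packets placed on the ifnet send queue */
	int direct;	/* packets handed straight to the driver */
	void (*if_start)(struct direct_ifnet *);
	int (*if_startmbuf)(struct direct_ifnet *, struct direct_pkt *);
};

static int direct_startmbuf_enabled = 1;	/* models net.startmbuf_enabled */

/* Old-style entry point: the driver would drain the queue here. */
static void
direct_start(struct direct_ifnet *ifp)
{
	(void)ifp;
}

/* New-style entry point: the driver owns the packet from here on. */
static int
direct_startmbuf(struct direct_ifnet *ifp, struct direct_pkt *p)
{
	(void)p;
	ifp->direct++;
	return (0);
}

/* Handoff: use the direct path when it is enabled and implemented. */
static int
direct_handoff(struct direct_ifnet *ifp, struct direct_pkt *p)
{
	assert(ifp != NULL && p != NULL);
	if (direct_startmbuf_enabled && ifp->if_startmbuf != NULL)
		return (ifp->if_startmbuf(ifp, p));
	ifp->queued++;
	ifp->if_start(ifp);
	return (0);
}
```

A driver that leaves if_startmbuf as NULL keeps the old behaviour, so drivers
can be converted one at a time.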

There are a few downsides, of course:

- For older hardware without its own queueing, the queue is still required --
   not only that, but we've now introduced an unconditional function pointer
   invocation, which on older hardware has a more significant relative cost
   than it has on more recent CPUs.

- If drivers still require or use a queue, they must now synchronize access to
   the queue.  The obvious choices are to use the ifq lock (and restore the
   above four lock operations), or to use the driver mutex (and risk higher
   contention).  Right now, if the driver is busy (driver mutex held) then an
   enqueue is still possible, but with this change and a single mutex
   protecting the send queue and driver, that is no longer possible.
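
The synchronization trade-off in the second point can be illustrated with a
toy model.  Everything here is an assumption made for illustration: sync_lock
is a plain flag rather than a real mutex, and a failed sync_trylock() stands
in for a thread that would otherwise have to wait.

```c
#include <assert.h>
#include <stdbool.h>

struct sync_lock { bool held; };

static bool
sync_trylock(struct sync_lock *l)
{
	if (l->held)
		return (false);
	l->held = true;
	return (true);
}

static void
sync_unlock(struct sync_lock *l)
{
	assert(l->held);
	l->held = false;
}

struct sync_dev {
	struct sync_lock drv_mtx;	/* stand-in for the driver mutex */
	struct sync_lock ifq_mtx;	/* stand-in for the ifq mutex */
	int qlen;
};

/* Option 1: the queue keeps its own lock (four extra lock ops return). */
static bool
sync_enqueue_ifq_lock(struct sync_dev *d)
{
	if (!sync_trylock(&d->ifq_mtx))
		return (false);
	d->qlen++;
	sync_unlock(&d->ifq_mtx);
	return (true);
}

/* Option 2: the driver mutex also protects the queue. */
static bool
sync_enqueue_drv_lock(struct sync_dev *d)
{
	if (!sync_trylock(&d->drv_mtx))
		return (false);	/* a real caller would have to wait here */
	d->qlen++;
	sync_unlock(&d->drv_mtx);
	return (true);
}
```

With the driver mutex held, the first variant still enqueues, while the
second cannot make progress until the driver releases its mutex.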

Attached is a patch that maintains the current if_start, but adds 
if_startmbuf.  If a device driver implements if_startmbuf and the global 
sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the 
driver will be used.  Otherwise, if_start is used.  I have modified the if_em 
driver to implement if_startmbuf also.  If there is no packet backlog in the 
if_snd queue, it directly places the packet in the transmit descriptor ring. 
If there is a backlog, it uses the if_snd queue protected by driver mutex, 
rather than a separate ifq mutex.

In some basic local micro-benchmarks, I saw a 5% improvement in UDP 0-byte 
payload PPS on UP, and a 10% improvement on SMP.  I saw a 1.7% performance 
improvement in the bulk serving of 1k files over HTTP.  These are only 
micro-benchmarks, and reflect a configuration in which the CPU is unable to 
keep up with the output rate of the 1gbps ethernet card in the device, so 
reductions in host CPU usage are immediately visible in increased output as 
the CPU is able to better keep up with the network hardware.  Other 
configurations are also of interest, especially ones in which 
the network device is unable to keep up with the CPU, resulting in more 
queueing.

Conceptual review as well as benchmarking, etc, would be most welcome.

Robert N M Watson
Computer Laboratory
University of Cambridge
-------------- next part --------------
--- //depot/vendor/freebsd/src/sys/dev/em/if_em.c	2006/07/27 00:46:24
+++ //depot/user/rwatson/ifnet/src/sys/dev/em/if_em.c	2006/07/29 18:43:14
@@ -735,6 +735,95 @@
 	EM_UNLOCK(sc);
 }
 
+static int
+em_startmbuf(struct ifnet *ifp, struct mbuf *m)
+{
+        struct mbuf    *m_head;
+        struct em_softc *sc = ifp->if_softc;
+	struct ifqueue *ifq = (struct ifqueue *)&ifp->if_snd;
+
+	/*
+	 * Three cases:
+	 *
+	 * (1) Interface isn't running, link is down, or is already active,
+	 *     etc, simply enqueue.
+	 *
+	 * (2) The interface is running, not too busy, and we have no mbufs
+	 *     in the ifnet send queue, so try to hand directly to hardware.
+	 *
+	 * (3) The interface is running, but we have a backlog.  Insert the
+	 *     current mbuf into the queue and process in-order, if possible.
+	 */
+	EM_LOCK(sc);
+	if (((ifp->if_drv_flags & (IFF_DRV_RUNNING|IFF_DRV_OACTIVE)) !=
+	    IFF_DRV_RUNNING) || !sc->link_active) {
+		if (_IF_QFULL(ifq)) {
+			_IF_DROP(ifq);
+			EM_UNLOCK(sc);
+			m_freem(m);
+			return (ENOBUFS);
+		}
+		_IF_ENQUEUE(ifq, m);
+		EM_UNLOCK(sc);
+		return (0);
+	}
+
+	/*
+	 * XXXRW: Various cases here have historically counted as successes,
+	 * but perhaps they should return ENOBUFS?
+	 */
+	if (_IF_QLEN(ifq) == 0) {
+	 	/*
+		 * em_encap() can modify our pointer, and/or make it NULL on
+		 * failure.  In that event, we can't enqueue.
+		 */
+		if (em_encap(sc, &m)) {
+			if (m == NULL) {
+				EM_UNLOCK(sc);
+				return (0);
+			}
+			ifp->if_drv_flags |= IFF_DRV_OACTIVE;
+			_IF_PREPEND(ifq, m);
+			EM_UNLOCK(sc);
+			return (0);
+		}
+		BPF_MTAP(ifp, m);
+		ifp->if_timer = EM_TX_TIMEOUT;
+		EM_UNLOCK(sc);
+		return (0);
+	}
+
+	if (_IF_QFULL(ifq)) {
+		_IF_DROP(ifq);
+		EM_UNLOCK(sc);
+		m_freem(m);
+		return (ENOBUFS);
+	}
+	_IF_ENQUEUE(ifq, m);
+
+	while (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) {
+		IFQ_DRV_DEQUEUE(&ifp->if_snd, m_head);
+		if (m_head == NULL)
+			break;
+	 	/*
+		 * em_encap() can modify our pointer, and/or make it NULL on
+		 * failure.  In that event, we can't requeue.
+		 */
+		if (em_encap(sc, &m_head)) {
+			if (m_head == NULL)
+				break;
+			ifp->if_drv_flags |= IFF_DRV_OACTIVE;
+			IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
+			break;
+		}
+		BPF_MTAP(ifp, m_head);
+		ifp->if_timer = EM_TX_TIMEOUT;
+	}
+
+	EM_UNLOCK(sc);
+	return (0);
+}
+
 /*********************************************************************
  *  Ioctl entry point
  *
@@ -2154,6 +2243,7 @@
 	ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST;
 	ifp->if_ioctl = em_ioctl;
 	ifp->if_start = em_start;
+	ifp->if_startmbuf = em_startmbuf;
 	ifp->if_watchdog = em_watchdog;
 	IFQ_SET_MAXLEN(&ifp->if_snd, sc->num_tx_desc - 1);
 	ifp->if_snd.ifq_drv_maxlen = sc->num_tx_desc - 1;
--- //depot/vendor/freebsd/src/sys/net/if.c	2006/07/09 06:06:25
+++ //depot/user/rwatson/ifnet/src/sys/net/if.c	2006/07/26 17:32:50
@@ -2486,28 +2486,111 @@
 	(ifp->if_start)(ifp);
 }
 
+static int	startmbuf_enabled;
+SYSCTL_INT(_net, OID_AUTO, startmbuf_enabled, CTLFLAG_RW, &startmbuf_enabled,
+    0, "");
+
+/*
+ * XXXRW:
+ *
+ * if_var.h and the interface handoff are some of the nastiest pieces of the
+ * BSD network stack.  Generations of hacks, variants, inconsistency, and
+ * foolishness have resulted in essentially unreadable code.  For example,
+ * why are the ifq_* interfaces the ones that use the default ifnet send
+ * queue, and the if_* interfaces the ones that use alternative queues,
+ * possibly with no ifnet at all?  And why do some interfaces return errno
+ * values, but others booleans?
+ */
+
+/*
+ * Handoff function for simple ifnet structures.  Returns an errno value.
+ */
 int
-if_handoff(struct ifqueue *ifq, struct mbuf *m, struct ifnet *ifp, int adjust)
+ifq_handoff(struct ifnet *ifp, struct mbuf *m, int adjust)
+{
+	int error, len, startmbuf;
+	short mflags;
+
+	len = m->m_pkthdr.len;
+	mflags = m->m_flags;
+
+	if (startmbuf_enabled && ifp->if_startmbuf != NULL)
+		startmbuf = 1;
+	else
+		startmbuf = 0;
+
+	if (startmbuf)
+		error = ifp->if_startmbuf(ifp, m);
+	else
+		IFQ_ENQUEUE(&ifp->if_snd, m, error);
+	if (error == 0) {
+		ifp->if_obytes += len + adjust;
+		if (mflags & (M_BCAST|M_MCAST))
+			ifp->if_omcasts++;
+	}
+	if (!startmbuf && (ifp->if_drv_flags & IFF_DRV_OACTIVE) == 0)
+		if_start(ifp);
+	return (error);
+}
+
+/*
+ * Handoff function for an ifqueue with an optionally affiliated ifnet.
+ * Returns a boolean.
+ */
+int
+if_handoff(struct ifqueue *ifq, struct mbuf *m, struct ifnet *ifp,
+    int adjust)
+{
+	int len, active, startmbuf, success;
+	short mflags;
+
+	active = 0;
+	len = m->m_pkthdr.len;
+	mflags = m->m_flags;
+
+	if (startmbuf_enabled && ifp != NULL && ifp->if_startmbuf != NULL)
+		startmbuf = 1;
+	else
+		startmbuf = 0;
+
+	if (startmbuf)
+		success = (ifp->if_startmbuf(ifp, m) == 0);
+	else {
+		IF_LOCK(ifq);
+		if (_IF_QFULL(ifq)) {
+			_IF_DROP(ifq);
+			m_freem(m);
+			success = 0;
+		} else {
+			_IF_ENQUEUE(ifq, m);
+			success = 1;
+		}
+		IF_UNLOCK(ifq);
+		if (ifp != NULL && !(ifp->if_drv_flags & IFF_DRV_OACTIVE))
+			if_start(ifp);
+	}
+	if (success && ifp != NULL) {
+		ifp->if_obytes += len + adjust;
+		if (mflags & (M_BCAST|M_MCAST))
+			ifp->if_omcasts++;
+	}
+	return (success);
+}
+
+/*
+ * Utility function to be used by device drivers when they need to enqueue a
+ * packet to an interface-related queue rather than immediately delivering.
+ */
+int
+if_startmbuf_enqueue(struct ifqueue *ifq, struct mbuf *m)
 {
-	int active = 0;
 
-	IF_LOCK(ifq);
 	if (_IF_QFULL(ifq)) {
 		_IF_DROP(ifq);
-		IF_UNLOCK(ifq);
 		m_freem(m);
 		return (0);
 	}
-	if (ifp != NULL) {
-		ifp->if_obytes += m->m_pkthdr.len + adjust;
-		if (m->m_flags & (M_BCAST|M_MCAST))
-			ifp->if_omcasts++;
-		active = ifp->if_drv_flags & IFF_DRV_OACTIVE;
-	}
 	_IF_ENQUEUE(ifq, m);
-	IF_UNLOCK(ifq);
-	if (ifp != NULL && !active)
-		if_start(ifp);
 	return (1);
 }
 
--- //depot/vendor/freebsd/src/sys/net/if_var.h	2006/06/19 22:21:22
+++ //depot/user/rwatson/ifnet/src/sys/net/if_var.h	2006/07/30 10:11:54
@@ -162,7 +162,8 @@
 		(struct ifnet *, struct sockaddr **, struct sockaddr *);
 	struct	ifaddr	*if_addr;	/* pointer to link-level address */
 	void	*if_spare2;		/* spare pointer 2 */
-	void	*if_spare3;		/* spare pointer 3 */
+	int	(*if_startmbuf)		/* enqueue and start output */
+		(struct ifnet *, struct mbuf *);
 	int	if_drv_flags;		/* driver-managed status flags */
 	u_int	if_spare_flags2;	/* spare flags 2 */
 	struct  ifaltq if_snd;		/* output queue (includes altq) */
@@ -370,12 +371,15 @@
 		mtx_unlock(&Giant);					\
 } while (0)
 
+int	ifq_handoff(struct ifnet *ifp, struct mbuf *m, int adjust);
 int	if_handoff(struct ifqueue *ifq, struct mbuf *m, struct ifnet *ifp,
 	    int adjust);
+int	if_startmbuf_enqueue(struct ifqueue *ifq, struct mbuf *m);
+
+#define	IF_HANDOFF_ADJ(ifq, m, ifp, adj)	\
+	if_handoff((struct ifqueue *)ifq, m, ifp, adj)
 #define	IF_HANDOFF(ifq, m, ifp)			\
 	if_handoff((struct ifqueue *)ifq, m, ifp, 0)
-#define	IF_HANDOFF_ADJ(ifq, m, ifp, adj)	\
-	if_handoff((struct ifqueue *)ifq, m, ifp, adj)
 
 void	if_start(struct ifnet *);
 
@@ -459,25 +463,8 @@
 #define	IFQ_INC_DROPS(ifq)		((ifq)->ifq_drops++)
 #define	IFQ_SET_MAXLEN(ifq, len)	((ifq)->ifq_maxlen = (len))
 
-/*
- * The IFF_DRV_OACTIVE test should really occur in the device driver, not in
- * the handoff logic, as that flag is locked by the device driver.
- */
-#define	IFQ_HANDOFF_ADJ(ifp, m, adj, err)				\
-do {									\
-	int len;							\
-	short mflags;							\
-									\
-	len = (m)->m_pkthdr.len;					\
-	mflags = (m)->m_flags;						\
-	IFQ_ENQUEUE(&(ifp)->if_snd, m, err);				\
-	if ((err) == 0) {						\
-		(ifp)->if_obytes += len + (adj);			\
-		if (mflags & M_MCAST)					\
-			(ifp)->if_omcasts++;				\
-		if (((ifp)->if_drv_flags & IFF_DRV_OACTIVE) == 0)	\
-			if_start(ifp);					\
-	}								\
+#define	IFQ_HANDOFF_ADJ(ifp, m, adj, err) do {				\
+	err = ifq_handoff(ifp, m, adj);					\
 } while (0)
 
 #define	IFQ_HANDOFF(ifp, m, err)					\

