Changes in the network interface queueing handoff model
Robert Watson
rwatson at FreeBSD.org
Sun Jul 30 14:04:49 UTC 2006
One of the ideas that I, Scott Long, and a few others have been bouncing
around for some time is a restructuring of the network interface packet
transmission API to reduce the number of locking operations and allow network
device drivers increased control of the queueing behavior. Right now, it
works something like the following:
- When a network protocol wants to transmit, it calls the ifnet's link layer
output routine via ifp->if_output() with the ifnet pointer, packet,
destination address information, and route information.
- The link layer (e.g., ether_output() + ether_output_frame()) encapsulates
the packet as necessary, performs a link layer address translation (such as
ARP), and hands off to the ifnet driver via a call to IFQ_HANDOFF(), which
accepts the ifnet pointer and packet.
- The ifnet layer enqueues the packet in the ifnet send queue (ifp->if_snd),
and then looks at the driver's IFF_DRV_OACTIVE flag to determine if it needs
to "start" output by the driver. If the driver is already active, it
doesn't, and otherwise, it does.
- The driver dequeues the packet from ifp->if_snd, performs any driver
encapsulation and wrapping, and notifies the hardware. In modern hardware,
this consists of hooking the data of the packet up to the descriptor ring
and notifying the hardware to pick it up via DMA. In older hardware, the
driver would perform a series of I/O operations to send the entire packet
directly to the card via a system bus.
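To make the current flow concrete, here is a minimal user-space sketch of the
enqueue-then-start handoff described above. It is not kernel code: toy_ifnet,
toy_handoff, toy_start and the fixed-size queue are hypothetical stand-ins
for struct ifnet, IFQ_HANDOFF(), the driver's if_start routine and
ifp->if_snd, and locking is left out entirely.

```c
#include <stddef.h>

#define QLEN 4

struct pkt { int id; };

/* Toy stand-in for struct ifnet: a bounded FIFO plus an "output active" flag. */
struct toy_ifnet {
    struct pkt *q[QLEN];
    int head, len;
    int oactive;                          /* models IFF_DRV_OACTIVE */
    int sent;                             /* packets handed to "hardware" */
    void (*start)(struct toy_ifnet *);    /* models ifp->if_start */
};

static int toy_enqueue(struct toy_ifnet *ifp, struct pkt *p)
{
    if (ifp->len == QLEN)
        return -1;                        /* queue full: caller drops */
    ifp->q[(ifp->head + ifp->len) % QLEN] = p;
    ifp->len++;
    return 0;
}

static struct pkt *toy_dequeue(struct toy_ifnet *ifp)
{
    struct pkt *p;

    if (ifp->len == 0)
        return NULL;
    p = ifp->q[ifp->head];
    ifp->head = (ifp->head + 1) % QLEN;
    ifp->len--;
    return p;
}

/* Driver start routine: drain the send queue into the hardware. */
static void toy_start(struct toy_ifnet *ifp)
{
    while (toy_dequeue(ifp) != NULL)
        ifp->sent++;
}

/* Link layer handoff: always enqueue, then start the driver if it is idle. */
static int toy_handoff(struct toy_ifnet *ifp, struct pkt *p)
{
    if (toy_enqueue(ifp, p) != 0)
        return -1;
    if (!ifp->oactive)
        ifp->start(ifp);
    return 0;
}
```

Note that even when the driver is idle and the queue is empty, the packet
still makes a round trip through the queue before it reaches the driver.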
Why change this? A few reasons:
- The ifnet layer send queue is becoming decreasingly useful over time. Most
modern hardware has a significant number of slots in its transmit descriptor
ring, tuned for the performance of the hardware, etc, which is the effective
transmit queue in practice. The additional queue depth doesn't increase
throughput substantially (if at all) but does consume memory.
- On extremely fast hardware (with respect to CPU speed), the queue remains
essentially empty, so we pay the cost of enqueueing and dequeuing a packet
from an empty queue.
- The ifnet send queue is a separately locked object from the device driver,
meaning that for a single enqueue/dequeue pair, we pay an extra four lock
operations (two for insert, two for remove) per packet.
- For synthetic link layer drivers, such as if_vlan, which have no need for
queueing at all, the change eliminates the cost of queueing entirely.
- With the change, IFF_DRV_OACTIVE is no longer inspected by the link layer,
only by the driver, which helps eliminate a latent race condition involving
use of the flag.
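The lock accounting can be illustrated with a small counting model. This is
only a sketch under simplifying assumptions: counted_lock, send_queued and
send_direct are hypothetical names, the "locks" just count acquire and
release calls, and one packet is sent with no contention.

```c
/* A "lock" that only counts how many acquire/release operations it sees. */
struct counted_lock { int ops; };

static void lk(struct counted_lock *l)   { l->ops++; }   /* acquire */
static void unlk(struct counted_lock *l) { l->ops++; }   /* release */

struct toy_if {
    struct counted_lock ifq_lock;   /* models the ifnet send queue mutex */
    struct counted_lock drv_lock;   /* models the driver mutex */
    int qlen;
    int sent;
};

/* Current model: insert under the ifq lock, remove under it again. */
static void send_queued(struct toy_if *ifp)
{
    lk(&ifp->ifq_lock);             /* insert: 2 operations */
    ifp->qlen++;
    unlk(&ifp->ifq_lock);

    lk(&ifp->drv_lock);             /* driver start routine */
    lk(&ifp->ifq_lock);             /* remove: 2 more operations */
    ifp->qlen--;
    unlk(&ifp->ifq_lock);
    ifp->sent++;
    unlk(&ifp->drv_lock);
}

/* Proposed model: the packet goes straight to the driver. */
static void send_direct(struct toy_if *ifp)
{
    lk(&ifp->drv_lock);
    ifp->sent++;
    unlk(&ifp->drv_lock);
}
```

In this model send_queued() performs four operations on ifq_lock on top of
the two on drv_lock, while send_direct() performs only the two on drv_lock.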
The proposed change is simple: right now one or more enqueue operations
occur, then a call to ifp->if_start() is made to notify the driver that it
may need to do something (if the IFF_DRV_OACTIVE flag isn't set). In the new world
order, the driver is directly passed the mbuf, and may then choose to queue it
or otherwise handle it as it sees fit. The immediate practical benefit is
clear: if the queueing at the ifnet layer is unnecessary, it is entirely
avoided, skipping enqueue, dequeue, and four mutex operations. This applies
immediately for VLAN processing, but also means that for modern gigabit cards,
the hardware queue (which will be used anyway) is the only queue necessary.
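As a rough illustration of the VLAN case, the sketch below shows a
pseudo-interface that tags the packet and passes it straight to its parent's
direct-handoff routine, with no queue of its own. The types and functions
(toy_mbuf, toy_ifnet, vlan_startmbuf, phys_startmbuf) are hypothetical and
much simpler than the real if_vlan code.

```c
#include <stddef.h>

struct toy_mbuf { int vlan_tag; };

struct toy_ifnet {
    /* models the proposed if_startmbuf method */
    int (*startmbuf)(struct toy_ifnet *, struct toy_mbuf *);
    struct toy_ifnet *parent;   /* set only for the VLAN pseudo-interface */
    int tag;                    /* VLAN tag applied by the pseudo-interface */
    int tx_count;               /* packets accepted by the physical driver */
    int last_tag;               /* tag seen on the last accepted packet */
};

/* Physical driver: pretend the packet goes onto the descriptor ring. */
static int phys_startmbuf(struct toy_ifnet *ifp, struct toy_mbuf *m)
{
    ifp->tx_count++;
    ifp->last_tag = m->vlan_tag;
    return 0;
}

/* VLAN pseudo-driver: tag and pass through; nothing is ever queued here. */
static int vlan_startmbuf(struct toy_ifnet *ifp, struct toy_mbuf *m)
{
    m->vlan_tag = ifp->tag;
    return ifp->parent->startmbuf(ifp->parent, m);
}
```

With a handoff of this shape, the only queue a packet can ever sit in is
whatever the physical driver chooses to maintain.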
There are a few downsides, of course:
- For older hardware without its own queueing, the queue is still required --
not only that, but we've now introduced an unconditional function pointer
invocation, which has a more significant relative cost on older hardware
than on more recent CPUs.
- If drivers still require or use a queue, they must now synchronize access to
the queue. The obvious choices are to use the ifq lock (and restore the
above four lock operations), or to use the driver mutex (and risk higher
contention). Right now, if the driver is busy (driver mutex held) then an
enqueue is still possible, but with this change and a single mutex
protecting the send queue and driver, that is no longer possible.
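The contention trade-off can be modelled in a few lines. This is a
deliberately simplified sketch: toy_lock is a hypothetical non-blocking flag
rather than a real mutex, so a busy lock shows up as a failed attempt,
whereas a real caller would block or spin until the lock is released.

```c
#include <stdbool.h>

/* Non-blocking toy lock: try_lock() fails if the lock is already held. */
struct toy_lock { bool held; };

static bool try_lock(struct toy_lock *l)
{
    if (l->held)
        return false;
    l->held = true;
    return true;
}

static void unlock(struct toy_lock *l) { l->held = false; }

struct toy_if {
    struct toy_lock drv_lock;   /* models the driver mutex */
    struct toy_lock ifq_lock;   /* models the separate ifq mutex */
    int qlen;
};

/* Separate ifq lock: enqueue does not care whether the driver is busy. */
static bool enqueue_separate(struct toy_if *ifp)
{
    if (!try_lock(&ifp->ifq_lock))
        return false;
    ifp->qlen++;
    unlock(&ifp->ifq_lock);
    return true;
}

/* Single lock: enqueue has to wait for the driver mutex. */
static bool enqueue_shared(struct toy_if *ifp)
{
    if (!try_lock(&ifp->drv_lock))
        return false;
    ifp->qlen++;
    unlock(&ifp->drv_lock);
    return true;
}
```

If drv_lock is held when both functions are called, enqueue_separate()
still succeeds while enqueue_shared() cannot make progress until the driver
releases the lock.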
Attached is a patch that maintains the current if_start, but adds
if_startmbuf. If a device driver implements if_startmbuf and the global
sysctl net.startmbuf_enabled is set to 1, then the if_startmbuf path in the
driver will be used. Otherwise, if_start is used. I have modified the if_em
driver to implement if_startmbuf also. If there is no packet backlog in the
if_snd queue, it directly places the packet in the transmit descriptor ring.
If there is a backlog, it uses the if_snd queue protected by driver mutex,
rather than a separate ifq mutex.
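The dispatch and the direct-versus-backlog decision can be sketched in user
space as follows. This is a simplified model and not the attached patch:
toy_drv, toy_ifnet, toy_handoff and em_like_startmbuf are hypothetical names,
the descriptor ring is reduced to a slot counter, the legacy path is a bare
call counter, and the driver mutex is omitted.

```c
#include <stddef.h>

#define RING_SLOTS  2
#define BACKLOG_MAX 4

struct toy_mbuf { int id; };

struct toy_drv {
    int ring_used;                          /* occupied descriptor slots */
    int last_tx_id;                         /* last packet put on the ring */
    struct toy_mbuf *backlog[BACKLOG_MAX];  /* models ifp->if_snd */
    int backlog_len;
};

struct toy_ifnet {
    struct toy_drv *sc;
    int (*if_startmbuf)(struct toy_ifnet *, struct toy_mbuf *); /* may be NULL */
    int legacy_calls;                       /* times the old path was taken */
};

static int startmbuf_enabled = 1;           /* models net.startmbuf_enabled */

static struct toy_mbuf *backlog_pop(struct toy_drv *sc)
{
    struct toy_mbuf *m = sc->backlog[0];
    int i;

    for (i = 1; i < sc->backlog_len; i++)
        sc->backlog[i - 1] = sc->backlog[i];
    sc->backlog_len--;
    return m;
}

static void ring_put(struct toy_drv *sc, struct toy_mbuf *m)
{
    sc->ring_used++;
    sc->last_tx_id = m->id;
}

/* Driver entry point: go direct if there is no backlog, else queue in order. */
static int em_like_startmbuf(struct toy_ifnet *ifp, struct toy_mbuf *m)
{
    struct toy_drv *sc = ifp->sc;

    if (sc->backlog_len == 0 && sc->ring_used < RING_SLOTS) {
        ring_put(sc, m);                    /* no backlog: straight to ring */
        return 0;
    }
    if (sc->backlog_len == BACKLOG_MAX)
        return -1;                          /* backlog full: drop */
    sc->backlog[sc->backlog_len++] = m;
    while (sc->backlog_len > 0 && sc->ring_used < RING_SLOTS)
        ring_put(sc, backlog_pop(sc));      /* drain oldest first */
    return 0;
}

/* Handoff: use the direct path only if enabled and implemented. */
static int toy_handoff(struct toy_ifnet *ifp, struct toy_mbuf *m)
{
    if (startmbuf_enabled && ifp->if_startmbuf != NULL)
        return ifp->if_startmbuf(ifp, m);
    ifp->legacy_calls++;                    /* stand-in for enqueue + if_start */
    return 0;
}
```

Queueing the new packet behind the backlog before draining keeps
transmission in order even when the ring has just freed up space.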
In some basic local micro-benchmarks, I saw a 5% improvement in UDP 0-byte
payload PPS on UP, and a 10% improvement on SMP. I saw a 1.7% performance
improvement in the bulk serving of 1k files over HTTP. These are only
micro-benchmarks, and reflect a configuration in which the CPU is unable to
keep up with the output rate of the 1gbps ethernet card in the device, so
reductions in host CPU usage are immediately visible in increased output as
the CPU is able to better keep up with the network hardware. Other
configurations are also of interest, especially ones in which
the network device is unable to keep up with the CPU, resulting in more
queueing.
Conceptual review as well as benchmarking, etc, would be most welcome.
Robert N M Watson
Computer Laboratory
University of Cambridge
-------------- next part --------------
--- //depot/vendor/freebsd/src/sys/dev/em/if_em.c 2006/07/27 00:46:24
+++ //depot/user/rwatson/ifnet/src/sys/dev/em/if_em.c 2006/07/29 18:43:14
@@ -735,6 +735,95 @@
EM_UNLOCK(sc);
}
+static int
+em_startmbuf(struct ifnet *ifp, struct mbuf *m)
+{
+ struct mbuf *m_head;
+ struct em_softc *sc = ifp->if_softc;
+ struct ifqueue *ifq = (struct ifqueue *)&ifp->if_snd;
+
+ /*
+ * Three cases:
+ *
+ * (1) Interface isn't running, link is down, or is already active,
+ * etc, simply enqueue.
+ *
+ * (2) The interface is running, not too busy, and we have no mbufs
+ * in the ifnet send queue, so try to hand directly to hardware.
+ *
+ * (3) The interface is running, but we have a backlog. Insert the
+ * current mbuf into the queue and process in-order, if possible.
+ */
+ EM_LOCK(sc);
+ if (((ifp->if_drv_flags & (IFF_DRV_RUNNING|IFF_DRV_OACTIVE)) !=
+ IFF_DRV_RUNNING) || !sc->link_active) {
+ if (_IF_QFULL(ifq)) {
+ _IF_DROP(ifq);
+ EM_UNLOCK(sc);
+ m_freem(m);
+ return (ENOBUFS);
+ }
+ _IF_ENQUEUE(ifq, m);
+ EM_UNLOCK(sc);
+ return (0);
+ }
+
+ /*
+ * XXXRW: Various cases here have historically counted as successes,
+ * but perhaps they should return ENOBUFS?
+ */
+ if (_IF_QLEN(ifq) == 0) {
+ /*
+ * em_encap() can modify our pointer, and or make it NULL on
+ * failure. In that event, we can't enqueue.
+ */
+ if (em_encap(sc, &m)) {
+ if (m == NULL) {
+ EM_UNLOCK(sc);
+ return (0);
+ }
+ ifp->if_drv_flags |= IFF_DRV_OACTIVE;
+ _IF_PREPEND(ifq, m);
+ EM_UNLOCK(sc);
+ return (0);
+ }
+ BPF_MTAP(ifp, m);
+ ifp->if_timer = EM_TX_TIMEOUT;
+ EM_UNLOCK(sc);
+ return (0);
+ }
+
+ if (_IF_QFULL(ifq)) {
+ _IF_DROP(ifq);
+ EM_UNLOCK(sc);
+ m_freem(m);
+ return (ENOBUFS);
+ }
+ _IF_ENQUEUE(ifq, m);
+
+ while (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) {
+ IFQ_DRV_DEQUEUE(&ifp->if_snd, m_head);
+ if (m_head == NULL)
+ break;
+ /*
+ * em_encap() can modify our pointer, and or make it NULL on
+ * failure. In that event, we can't requeue.
+ */
+ if (em_encap(sc, &m_head)) {
+ if (m_head == NULL)
+ break;
+ ifp->if_drv_flags |= IFF_DRV_OACTIVE;
+ IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
+ break;
+ }
+ BPF_MTAP(ifp, m_head);
+ ifp->if_timer = EM_TX_TIMEOUT;
+ }
+
+ EM_UNLOCK(sc);
+ return (0);
+}
+
/*********************************************************************
* Ioctl entry point
*
@@ -2154,6 +2243,7 @@
ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST;
ifp->if_ioctl = em_ioctl;
ifp->if_start = em_start;
+ ifp->if_startmbuf = em_startmbuf;
ifp->if_watchdog = em_watchdog;
IFQ_SET_MAXLEN(&ifp->if_snd, sc->num_tx_desc - 1);
ifp->if_snd.ifq_drv_maxlen = sc->num_tx_desc - 1;
--- //depot/vendor/freebsd/src/sys/net/if.c 2006/07/09 06:06:25
+++ //depot/user/rwatson/ifnet/src/sys/net/if.c 2006/07/26 17:32:50
@@ -2486,28 +2486,111 @@
(ifp->if_start)(ifp);
}
+static int startmbuf_enabled;
+SYSCTL_INT(_net, OID_AUTO, startmbuf_enabled, CTLFLAG_RW, &startmbuf_enabled,
+ 0, "");
+
+/*
+ * XXXRW:
+ *
+ * if_var.h and the interface handoff are some of the nastiest pieces of the
+ * BSD network stack. Generations of hacks, variants, inconsistency, and
+ * foolishness have resulted in essentially unreadable code. For example,
+ * why are the ifq_* interfaces the ones that use the default ifnet send
+ * queue, and the if_* interfaces the ones that use alternative queues,
+ * possibly with no ifnet at all? And why do some interfaces return errno
+ * values, but others booleans?
+ */
+
+/*
+ * Handoff function for simple ifnet structures. Returns an errno value.
+ */
int
-if_handoff(struct ifqueue *ifq, struct mbuf *m, struct ifnet *ifp, int adjust)
+ifq_handoff(struct ifnet *ifp, struct mbuf *m, int adjust)
+{
+ int error, len, startmbuf;
+ short mflags;
+
+ len = m->m_pkthdr.len;
+ mflags = m->m_flags;
+
+ if (startmbuf_enabled && ifp->if_startmbuf != NULL)
+ startmbuf = 1;
+ else
+ startmbuf = 0;
+
+ if (startmbuf)
+ error = ifp->if_startmbuf(ifp, m);
+ else
+ IFQ_ENQUEUE(&ifp->if_snd, m, error);
+ if (error == 0) {
+ ifp->if_obytes += len + adjust;
+ if (mflags & (M_BCAST|M_MCAST))
+ ifp->if_omcasts++;
+ }
+ if (!startmbuf && (ifp->if_drv_flags & IFF_DRV_OACTIVE) == 0)
+ if_start(ifp);
+ return (error);
+}
+
+/*
+ * Handoff function for an ifqueue with an optionally affiliated ifnet.
+ * Returns a boolean.
+ */
+int
+if_handoff(struct ifqueue *ifq, struct mbuf *m, struct ifnet *ifp,
+ int adjust)
+{
+ int len, active, startmbuf, success;
+ short mflags;
+
+ active = 0;
+ len = m->m_pkthdr.len;
+ mflags = m->m_flags;
+
+ if (startmbuf_enabled && ifp != NULL && ifp->if_startmbuf != NULL)
+ startmbuf = 1;
+ else
+ startmbuf = 0;
+
+ if (startmbuf)
+ success = (ifp->if_startmbuf(ifp, m) == 0);
+ else {
+ IF_LOCK(ifq);
+ if (_IF_QFULL(ifq)) {
+ _IF_DROP(ifq);
+ m_freem(m);
+ success = 0;
+ } else {
+ _IF_ENQUEUE(ifq, m);
+ success = 1;
+ }
+ IF_UNLOCK(ifq);
+ if (ifp != NULL && !(ifp->if_drv_flags & IFF_DRV_OACTIVE))
+ if_start(ifp);
+ }
+ if (success && ifp != NULL) {
+ ifp->if_obytes += len + adjust;
+ if (mflags & (M_BCAST|M_MCAST))
+ ifp->if_omcasts++;
+ }
+ return (success);
+}
+
+/*
+ * Utility function to be used by device drivers when they need to enqueue a
+ * packet to an interface-related queue rather than immediately delivering.
+ */
+int
+if_startmbuf_enqueue(struct ifqueue *ifq, struct mbuf *m)
{
- int active = 0;
- IF_LOCK(ifq);
if (_IF_QFULL(ifq)) {
_IF_DROP(ifq);
- IF_UNLOCK(ifq);
m_freem(m);
return (0);
}
- if (ifp != NULL) {
- ifp->if_obytes += m->m_pkthdr.len + adjust;
- if (m->m_flags & (M_BCAST|M_MCAST))
- ifp->if_omcasts++;
- active = ifp->if_drv_flags & IFF_DRV_OACTIVE;
- }
_IF_ENQUEUE(ifq, m);
- IF_UNLOCK(ifq);
- if (ifp != NULL && !active)
- if_start(ifp);
return (1);
}
--- //depot/vendor/freebsd/src/sys/net/if_var.h 2006/06/19 22:21:22
+++ //depot/user/rwatson/ifnet/src/sys/net/if_var.h 2006/07/30 10:11:54
@@ -162,7 +162,8 @@
(struct ifnet *, struct sockaddr **, struct sockaddr *);
struct ifaddr *if_addr; /* pointer to link-level address */
void *if_spare2; /* spare pointer 2 */
- void *if_spare3; /* spare pointer 3 */
+ int (*if_startmbuf) /* enqueue and start output */
+ (struct ifnet *, struct mbuf *);
int if_drv_flags; /* driver-managed status flags */
u_int if_spare_flags2; /* spare flags 2 */
struct ifaltq if_snd; /* output queue (includes altq) */
@@ -370,12 +371,15 @@
mtx_unlock(&Giant); \
} while (0)
+int ifq_handoff(struct ifnet *ifp, struct mbuf *m, int adjust);
int if_handoff(struct ifqueue *ifq, struct mbuf *m, struct ifnet *ifp,
int adjust);
+int if_startmbuf_enqueue(struct ifqueue *ifq, struct mbuf *m);
+
+#define IF_HANDOFF_ADJ(ifq, m, ifp, adj) \
+ if_handoff((struct ifqueue *)ifq, m, ifp, adj)
#define IF_HANDOFF(ifq, m, ifp) \
if_handoff((struct ifqueue *)ifq, m, ifp, 0)
-#define IF_HANDOFF_ADJ(ifq, m, ifp, adj) \
- if_handoff((struct ifqueue *)ifq, m, ifp, adj)
void if_start(struct ifnet *);
@@ -459,25 +463,8 @@
#define IFQ_INC_DROPS(ifq) ((ifq)->ifq_drops++)
#define IFQ_SET_MAXLEN(ifq, len) ((ifq)->ifq_maxlen = (len))
-/*
- * The IFF_DRV_OACTIVE test should really occur in the device driver, not in
- * the handoff logic, as that flag is locked by the device driver.
- */
-#define IFQ_HANDOFF_ADJ(ifp, m, adj, err) \
-do { \
- int len; \
- short mflags; \
- \
- len = (m)->m_pkthdr.len; \
- mflags = (m)->m_flags; \
- IFQ_ENQUEUE(&(ifp)->if_snd, m, err); \
- if ((err) == 0) { \
- (ifp)->if_obytes += len + (adj); \
- if (mflags & M_MCAST) \
- (ifp)->if_omcasts++; \
- if (((ifp)->if_drv_flags & IFF_DRV_OACTIVE) == 0) \
- if_start(ifp); \
- } \
+#define IFQ_HANDOFF_ADJ(ifp, m, adj, err) do { \
+ err = ifq_handoff(ifp, m, adj); \
} while (0)
#define IFQ_HANDOFF(ifp, m, err) \