svn commit: r260370 - in projects/random_number_generator: share/man/man4 share/mk sys/dev/e1000 sys/dev/ixgbe sys/dev/netmap sys/net sys/rpc tools/tools/netmap
Mark Murray
markm at FreeBSD.org
Mon Jan 6 14:56:01 UTC 2014
Author: markm
Date: Mon Jan 6 14:56:00 2014
New Revision: 260370
URL: http://svnweb.freebsd.org/changeset/base/260370
Log:
MFC - tracking commit.
Modified:
projects/random_number_generator/share/man/man4/netmap.4
projects/random_number_generator/share/mk/bsd.sys.mk
projects/random_number_generator/sys/dev/e1000/if_em.c
projects/random_number_generator/sys/dev/e1000/if_igb.c
projects/random_number_generator/sys/dev/e1000/if_lem.c
projects/random_number_generator/sys/dev/ixgbe/ixgbe.c
projects/random_number_generator/sys/dev/netmap/if_em_netmap.h
projects/random_number_generator/sys/dev/netmap/if_igb_netmap.h
projects/random_number_generator/sys/dev/netmap/if_lem_netmap.h
projects/random_number_generator/sys/dev/netmap/if_re_netmap.h
projects/random_number_generator/sys/dev/netmap/ixgbe_netmap.h
projects/random_number_generator/sys/dev/netmap/netmap.c
projects/random_number_generator/sys/dev/netmap/netmap_freebsd.c
projects/random_number_generator/sys/dev/netmap/netmap_generic.c
projects/random_number_generator/sys/dev/netmap/netmap_kern.h
projects/random_number_generator/sys/dev/netmap/netmap_mbq.c
projects/random_number_generator/sys/dev/netmap/netmap_mbq.h
projects/random_number_generator/sys/dev/netmap/netmap_mem2.c
projects/random_number_generator/sys/dev/netmap/netmap_mem2.h
projects/random_number_generator/sys/dev/netmap/netmap_vale.c
projects/random_number_generator/sys/net/netmap.h
projects/random_number_generator/sys/net/netmap_user.h
projects/random_number_generator/sys/rpc/svc.h
projects/random_number_generator/tools/tools/netmap/bridge.c
projects/random_number_generator/tools/tools/netmap/nm_util.c
projects/random_number_generator/tools/tools/netmap/nm_util.h
projects/random_number_generator/tools/tools/netmap/pcap.c
projects/random_number_generator/tools/tools/netmap/pkt-gen.c
projects/random_number_generator/tools/tools/netmap/vale-ctl.c
Directory Properties:
projects/random_number_generator/ (props changed)
projects/random_number_generator/share/man/man4/ (props changed)
projects/random_number_generator/sys/ (props changed)
Modified: projects/random_number_generator/share/man/man4/netmap.4
==============================================================================
--- projects/random_number_generator/share/man/man4/netmap.4 Mon Jan 6 14:39:10 2014 (r260369)
+++ projects/random_number_generator/share/man/man4/netmap.4 Mon Jan 6 14:56:00 2014 (r260370)
@@ -1,4 +1,4 @@
-.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
+.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
@@ -27,434 +27,546 @@
.\"
.\" $FreeBSD$
.\"
-.Dd October 18, 2013
+.Dd January 4, 2014
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
+.br
+.Nm VALE
+.Nd a fast VirtuAl Local Ethernet using the netmap API
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
-(reaching 14.88 Mpps with a single core at less than 1 GHz)
for both userspace and kernel clients.
-Userspace clients can use the netmap API
-to send and receive raw packets through physical interfaces
-or ports of the
-.Xr VALE 4
-switch.
-.Pp
-.Nm VALE
-is a very fast (reaching 20 Mpps per port)
-and modular software switch,
-implemented within the kernel, which can interconnect
-virtual ports, physical devices, and the native host stack.
-.Pp
-.Nm
-uses a memory mapped region to share packet buffers,
-descriptors and queues with the kernel.
-Simple
-.Pa ioctl()s
-are used to bind interfaces/ports to file descriptors and
-implement non-blocking I/O, whereas blocking I/O uses
-.Pa select()/poll() .
+It runs on FreeBSD and Linux,
+and includes
+.Nm VALE ,
+a very fast and modular in-kernel software switch/dataplane.
+.Pp
.Nm
-can exploit the parallelism in multiqueue devices and
-multicore systems.
+and
+.Nm VALE
+are one order of magnitude faster than sockets, bpf or
+native switches based on
+.Xr tun/tap 4 ,
+reaching 14.88 Mpps with much less than one core on a 10 Gbit NIC,
+and 20 Mpps per core for VALE ports.
+.Pp
+Userspace clients can dynamically switch NICs into
+.Nm
+mode and send and receive raw packets through
+memory mapped buffers.
+A selectable file descriptor supports
+synchronization and blocking I/O.
+.Pp
+Similarly,
+.Nm VALE
+can dynamically create switch instances and ports,
+providing high speed packet I/O between processes,
+virtual machines, NICs and the host stack.
.Pp
-For the best performance,
+For best performance,
.Nm
requires explicit support in device drivers;
-a generic emulation layer is available to implement the
+however, the
.Nm
-API on top of unmodified device drivers,
+API can be emulated on top of unmodified device drivers,
at the price of reduced performance
-(but still better than what can be achieved with
-sockets or BPF/pcap).
+(but still better than sockets or BPF/pcap).
.Pp
-For a list of devices with native
+In the rest of this (long) manual page we document
+various aspects of the
.Nm
-support, see the end of this manual page.
-.Sh OPERATION - THE NETMAP API
+and
+.Nm VALE
+architecture, features and usage.
+.Pp
+.Sh ARCHITECTURE
.Nm
-clients must first
-.Pa open("/dev/netmap") ,
-and then issue an
-.Pa ioctl(fd, NIOCREGIF, (struct nmreq *)arg)
-to bind the file descriptor to a specific interface or port.
+supports raw packet I/O through a
+.Em port ,
+which can be connected to a physical interface
+.Em ( NIC ) ,
+to the host stack,
+or to a
+.Nm VALE
+switch.
+Ports use preallocated circular queues of buffers
+.Em ( rings )
+residing in an mmapped region.
+There is one ring for each transmit/receive queue of a
+NIC or virtual port.
+An additional ring pair connects to the host stack.
+.Pp
+After binding a file descriptor to a port, a
+.Nm
+client can send or receive packets in batches through
+the rings, and possibly implement zero-copy forwarding
+between ports.
+.Pp
+All NICs operating in
+.Nm
+mode use the same memory region,
+accessible to all processes that own
+.Nm /dev/netmap
+file descriptors bound to NICs.
+.Nm VALE
+ports instead use separate memory regions.
+.Pp
+.Sh ENTERING AND EXITING NETMAP MODE
+Ports and rings are created and controlled through a file descriptor,
+created by opening a special device
+.Dl fd = open("/dev/netmap", O_RDWR);
+and then bound to a specific port with an
+.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
+.Pp
.Nm
has multiple modes of operation controlled by the
-content of the
-.Pa struct nmreq
-passed to the
-.Pa ioctl() .
-In particular, the
-.Em nr_name
-field specifies whether the client operates on a physical network
-interface or on a port of a
-.Nm VALE
-switch, as indicated below. Additional fields in the
-.Pa struct nmreq
-control the details of operation.
+.Vt struct nmreq
+argument.
+.Va arg.nr_name
+specifies the port name, as follows:
.Bl -tag -width XXXX
-.It Dv Interface name (e.g. 'em0', 'eth1', ... )
-The data path of the interface is disconnected from the host stack.
-Depending on additional arguments,
-the file descriptor is bound to the NIC (one or all queues),
-or to the host stack.
+.It Dv OS network interface name (e.g. 'em0', 'eth1', ... )
+the data path of the NIC is disconnected from the host stack,
+and the file descriptor is bound to the NIC (one or all queues),
+or to the host stack;
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
-The file descriptor is bound to port YYY of a VALE switch called XXX,
-where XXX and YYY are arbitrary alphanumeric strings.
+the file descriptor is bound to port YYY of a VALE switch called XXX,
+both dynamically created if necessary.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
-matching the name of any existing interface.
-.Pp
-The switch and the port are created if not existing.
-.It Dv valeXXX:ifname (ifname is an existing interface)
-Flags in the argument control whether the physical interface
-(and optionally the corrisponding host stack endpoint)
-are connected or disconnected from the VALE switch named XXX.
-.Pp
-In this case the
-.Pa ioctl()
-is used only for configuring the VALE switch, typically through the
-.Nm vale-ctl
-command.
-The file descriptor cannot be used for I/O, and should be
-.Pa close()d
-after issuing the
-.Pa ioctl().
+be the name of any existing OS network interface.
.El
.Pp
-The binding can be removed (and the interface returns to
-regular operation, or the virtual port destroyed) with a
-.Pa close()
-on the file descriptor.
-.Pp
-The processes owning the file descriptor can then
-.Pa mmap()
-the memory region that contains pre-allocated
-buffers, descriptors and queues, and use them to
-read/write raw packets.
+On return,
+.Va arg
+indicates the size of the shared memory region,
+and the number, size and location of all the
+.Nm
+data structures, which can be accessed by mmapping the memory
+.Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+.Pp
Non blocking I/O is done with special
-.Pa ioctl()'s ,
-whereas the file descriptor can be passed to
-.Pa select()/poll()
-to be notified about incoming packet or available transmit buffers.
-.Ss DATA STRUCTURES
-The data structures in the mmapped memory are described below
-(see
-.Xr sys/net/netmap.h
-for reference).
-All physical devices operating in
+.Xr ioctl 2 ;
+.Xr select 2
+and
+.Xr poll 2
+on the file descriptor permit blocking I/O.
+.Xr epoll 2
+and
+.Xr kqueue 2
+are not supported on
.Nm
-mode use the same memory region,
-shared by the kernel and all processes who own
-.Pa /dev/netmap
-descriptors bound to those devices
-(NOTE: visibility may be restricted in future implementations).
-Virtual ports instead use separate memory regions,
-shared only with the kernel.
-.Pp
-All references between the shared data structure
-are relative (offsets or indexes). Some macros help converting
-them into actual pointers.
+file descriptors.
+.Pp
+While a NIC is in
+.Nm
+mode, the OS will still believe the interface is up and running.
+OS-generated packets for that NIC end up in a
+.Nm
+ring, and another ring is used to send packets into the OS network stack.
+A
+.Xr close 2
+on the file descriptor removes the binding,
+and returns the NIC to normal mode (reconnecting the data path
+to the host stack), or destroys the virtual port.
+.Pp
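+The following minimal sketch (error handling omitted; the port name
+"em0" is just an example) shows the whole sequence:
+.Bd -literal
+#include <fcntl.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <net/netmap.h>
+#include <net/netmap_user.h>
+
+struct nmreq req;
+void *mem;
+int fd;
+
+fd = open("/dev/netmap", O_RDWR);	/* control device */
+memset(&req, 0, sizeof(req));
+req.nr_version = NETMAP_API;		/* always set the API version */
+strncpy(req.nr_name, "em0", sizeof(req.nr_name));
+ioctl(fd, NIOCREGIF, &req);		/* put em0 in netmap mode */
+mem = mmap(0, req.nr_memsize, PROT_READ | PROT_WRITE,
+    MAP_SHARED, fd, 0);			/* map rings and buffers */
+struct netmap_if *nifp = NETMAP_IF(mem, req.nr_offset);
+.Ed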
+.Sh DATA STRUCTURES
+The data structures in the mmapped memory region are detailed in
+.Xr sys/net/netmap.h ,
+which is the ultimate reference for the
+.Nm
+API. The main structures and fields are indicated below:
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
-indicates the number of rings supported by an interface, their
-sizes, and the offsets of the
-.Pa netmap_rings
-associated to the interface.
-.Pp
-.Pa struct netmap_if
-is at offset
-.Pa nr_offset
-in the shared memory region is indicated by the
-field in the structure returned by the
-.Pa NIOCREGIF
-(see below).
.Bd -literal
struct netmap_if {
- char ni_name[IFNAMSIZ]; /* name of the interface. */
- const u_int ni_version; /* API version */
- const u_int ni_rx_rings; /* number of rx ring pairs */
- const u_int ni_tx_rings; /* if 0, same as ni_rx_rings */
- const ssize_t ring_ofs[]; /* offset of tx and rx rings */
+ ...
+ const uint32_t ni_flags; /* properties */
+ ...
+ const uint32_t ni_tx_rings; /* NIC tx rings */
+ const uint32_t ni_rx_rings; /* NIC rx rings */
+ const uint32_t ni_extra_tx_rings; /* extra tx rings */
+ const uint32_t ni_extra_rx_rings; /* extra rx rings */
+ ...
};
.Ed
+.Pp
+Indicates the number of available rings
+.Pa ( struct netmap_rings )
+and their position in the mmapped region.
+The number of tx and rx rings
+.Pa ( ni_tx_rings , ni_rx_rings )
+normally depends on the hardware.
+NICs also have an extra tx/rx ring pair connected to the host stack.
+.Em NIOCREGIF
+can request additional tx/rx rings,
+to be used between multiple processes/threads
+accessing the same
+.Nm
+port.
.It Dv struct netmap_ring (one per ring)
-Contains the positions in the transmit and receive rings to
-synchronize the kernel and the application,
-and an array of
-.Pa slots
-describing the buffers.
-'reserved' is used in receive rings to tell the kernel the
-number of slots after 'cur' that are still in usr
-indicates how many slots starting from 'cur'
-the
-.Pp
-Each physical interface has one
-.Pa netmap_ring
-for each hardware transmit and receive ring,
-plus one extra transmit and one receive structure
-that connect to the host stack.
.Bd -literal
struct netmap_ring {
- const ssize_t buf_ofs; /* see details */
- const uint32_t num_slots; /* number of slots in the ring */
- uint32_t avail; /* number of usable slots */
- uint32_t cur; /* 'current' read/write index */
- uint32_t reserved; /* not refilled before current */
-
- const uint16_t nr_buf_size;
- uint16_t flags;
-#define NR_TIMESTAMP 0x0002 /* set timestamp on *sync() */
-#define NR_FORWARD 0x0004 /* enable NS_FORWARD for ring */
-#define NR_RX_TSTMP 0x0008 /* set rx timestamp in slots */
- struct timeval ts;
- struct netmap_slot slot[0]; /* array of slots */
+ ...
+ const uint32_t num_slots; /* slots in each ring */
+ const uint32_t nr_buf_size; /* size of each buffer */
+ ...
+ uint32_t head; /* (u) first buf owned by user */
+ uint32_t cur; /* (u) wakeup position */
+ const uint32_t tail; /* (k) first buf owned by kernel */
+ ...
+ uint32_t flags;
+ struct timeval ts; /* (k) time of last rxsync() */
+ ...
+ struct netmap_slot slot[0]; /* array of slots */
}
.Ed
.Pp
-In transmit rings, after a system call 'cur' indicates
-the first slot that can be used for transmissions,
-and 'avail' reports how many of them are available.
-Before the next netmap-related system call on the file
-descriptor, the application should fill buffers and
-slots with data, and update 'cur' and 'avail'
-accordingly, as shown in the figure below:
+Implements transmit and receive rings, with read/write
+pointers, metadata and an array of
+.Pa slots
+describing the buffers.
+.Pp
+.It Dv struct netmap_slot (one per buffer)
.Bd -literal
-
- cur
- |----- avail ---| (after syscall)
- v
- TX [*****aaaaaaaaaaaaaaaaa**]
- TX [*****TTTTTaaaaaaaaaaaa**]
- ^
- |-- avail --| (before syscall)
- cur
+struct netmap_slot {
+ uint32_t buf_idx; /* buffer index */
+ uint16_t len; /* packet length */
+ uint16_t flags; /* buf changed, etc. */
+ uint64_t ptr; /* address for indirect buffers */
+};
.Ed
-In receive rings, after a system call 'cur' indicates
-the first slot that contains a valid packet,
-and 'avail' reports how many of them are available.
-Before the next netmap-related system call on the file
-descriptor, the application can process buffers and
-release them to the kernel updating
-'cur' and 'avail' accordingly, as shown in the figure below.
-Receive rings have an additional field called 'reserved'
-to indicate how many buffers before 'cur' are still
-under processing and cannot be released.
+.Pp
+Describes a packet buffer, which normally is identified by
+an index and resides in the mmapped region.
+.It Dv packet buffers
+Fixed size (normally 2 KB) packet buffers allocated by the kernel.
+.El
+.Pp
+The offset of the
+.Pa struct netmap_if
+in the mmapped region is indicated by the
+.Pa nr_offset
+field in the structure returned by
+.Pa NIOCREGIF .
+From there, all other objects are reachable through
+relative references (offsets or indexes).
+Macros and functions in <net/netmap_user.h>
+help convert them into actual pointers:
+.Pp
+.Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset);
+.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
+.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
+.Pp
+.Dl char *buf = NETMAP_BUF(ring, buffer_index);
+.Sh RINGS, BUFFERS AND DATA I/O
+.Va Rings
+are circular queues of packets with three indexes/pointers
+.Va ( head , cur , tail ) ;
+one slot is always kept empty.
+The ring size
+.Va ( num_slots )
+should not be assumed to be a power of two.
+.br
+(NOTE: older versions of netmap used head/count format to indicate
+the content of a ring).
+.Pp
+.Va head
+is the first slot available to userspace;
+.br
+.Va cur
+is the wakeup point:
+select/poll will unblock when
+.Va tail
+passes
+.Va cur ;
+.br
+.Va tail
+is the first slot reserved to the kernel.
+.Pp
+Slot indexes MUST only move forward;
+for convenience, the function
+.Dl nm_ring_next(ring, index)
+returns the next index modulo the ring size.
+.Pp
+.Va head
+and
+.Va cur
+are only modified by the user program;
+.Va tail
+is only modified by the kernel.
+The kernel only reads/writes the
+.Vt struct netmap_ring
+slots and buffers
+during the execution of a netmap-related system call.
+The only exceptions are slots (and buffers) in the range
+.Va tail\ . . . head-1 ,
+that are explicitly assigned to the kernel.
+.Pp
+.Ss TRANSMIT RINGS
+On transmit rings, after a
+.Nm
+system call, slots in the range
+.Va head\ . . . tail-1
+are available for transmission.
+User code should fill the slots sequentially
+and advance
+.Va head
+and
+.Va cur
+past slots ready to transmit.
+.Va cur
+may be moved further ahead if the user code needs
+more slots before further transmissions (see
+.Sx SCATTER GATHER I/O ) .
+.Pp
+At the next NIOCTXSYNC/select()/poll(),
+slots up to
+.Va head-1
+are pushed to the port, and
+.Va tail
+may advance if further slots have become available.
+Below is an example of the evolution of a TX ring:
+.Pp
.Bd -literal
- cur
- |-res-|-- avail --| (after syscall)
- v
- RX [**rrrrrrRRRRRRRRRRRR******]
- RX [**...........rrrrRRR******]
- |res|--|<avail (before syscall)
- ^
- cur
+ after the syscall, slots between cur and tail are (a)vailable
+ head=cur tail
+ | |
+ v v
+ TX [.....aaaaaaaaaaa.............]
+
+ user creates new packets to (T)ransmit
+ head=cur tail
+ | |
+ v v
+ TX [.....TTTTTaaaaaa.............]
+
+ NIOCTXSYNC/poll()/select() sends packets and reports new slots
+ head=cur tail
+ | |
+ v v
+ TX [..........aaaaaaaaaaa........]
.Ed
-.It Dv struct netmap_slot (one per packet)
-contains the metadata for a packet:
+.Pp
+select() and poll() will block if there is no space in the ring, i.e.
+.Dl ring->cur == ring->tail
+and return when new slots have become available.
+.Pp
+High speed applications may want to amortize the cost of system calls
+by preparing as many packets as possible before issuing them.
+.Pp
+A transmit ring with pending transmissions has
+.Dl ring->head != ring->tail + 1 (modulo the ring size).
+The function
+.Va int nm_tx_pending(ring)
+implements this test.
+.Pp
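+As an illustration, the following sketch fills a transmit ring
+(fd and nifp as set up in
+.Sx ENTERING AND EXITING NETMAP MODE ;
+.Va payload
+and
+.Va payload_len
+are hypothetical):
+.Bd -literal
+struct netmap_ring *ring = NETMAP_TXRING(nifp, 0);
+uint32_t i = ring->cur;
+
+while (i != ring->tail) {	/* while slots are available */
+	struct netmap_slot *slot = &ring->slot[i];
+	char *buf = NETMAP_BUF(ring, slot->buf_idx);
+
+	memcpy(buf, payload, payload_len);
+	slot->len = payload_len;
+	i = nm_ring_next(ring, i);
+}
+ring->head = ring->cur = i;	/* pass slots to the kernel */
+ioctl(fd, NIOCTXSYNC, NULL);	/* and push them out */
+.Ed
+.Pp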
+.Ss RECEIVE RINGS
+On receive rings, after a
+.Nm
+system call, the slots in the range
+.Va head\& . . . tail-1
+contain received packets.
+User code should process them and advance
+.Va head
+and
+.Va cur
+past slots it wants to return to the kernel.
+.Va cur
+may be moved further ahead if the user code wants to
+wait for more packets
+without returning all the previous slots to the kernel.
+.Pp
+At the next NIOCRXSYNC/select()/poll(),
+slots up to
+.Va head-1
+are returned to the kernel for further receives, and
+.Va tail
+may advance to report new incoming packets.
+.br
+Below is an example of the evolution of an RX ring:
.Bd -literal
-struct netmap_slot {
- uint32_t buf_idx; /* buffer index */
- uint16_t len; /* packet length */
- uint16_t flags; /* buf changed, etc. */
-#define NS_BUF_CHANGED 0x0001 /* must resync, buffer changed */
-#define NS_REPORT 0x0002 /* tell hw to report results
- * e.g. by generating an interrupt
- */
-#define NS_FORWARD 0x0004 /* pass packet to the other endpoint
- * (host stack or device)
- */
-#define NS_NO_LEARN 0x0008
-#define NS_INDIRECT 0x0010
-#define NS_MOREFRAG 0x0020
-#define NS_PORT_SHIFT 8
-#define NS_PORT_MASK (0xff << NS_PORT_SHIFT)
-#define NS_RFRAGS(_slot) ( ((_slot)->flags >> 8) & 0xff)
- uint64_t ptr; /* buffer address (indirect buffers) */
-};
+ after the syscall, there are some (h)eld and some (R)eceived slots
+ head cur tail
+ | | |
+ v v v
+ RX [..hhhhhhRRRRRRRR..........]
+
+ user advances head and cur, releasing some slots and holding others
+ head cur tail
+ | | |
+ v v v
+ RX [..*****hhhRRRRRR...........]
+
+ NIOCRXSYNC/poll()/select() recovers slots and reports new packets
+ head cur tail
+ | | |
+ v v v
+ RX [.......hhhRRRRRRRRRRRR....]
.Ed
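+.Pp
+A matching receive loop can be sketched as follows
+(consume_pkt() is a hypothetical handler):
+.Bd -literal
+struct netmap_ring *ring = NETMAP_RXRING(nifp, 0);
+uint32_t i;
+
+ioctl(fd, NIOCRXSYNC, NULL);	/* fetch new packets */
+for (i = ring->cur; i != ring->tail; i = nm_ring_next(ring, i)) {
+	struct netmap_slot *slot = &ring->slot[i];
+
+	consume_pkt(NETMAP_BUF(ring, slot->buf_idx), slot->len);
+}
+ring->head = ring->cur = i;	/* return slots to the kernel */
+.Ed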
-The flags control how the the buffer associated to the slot
-should be managed.
-.It Dv packet buffers
-are normally fixed size (2 Kbyte) buffers allocated by the kernel
-that contain packet data. Buffers addresses are computed through
-macros.
-.El
-.Bl -tag -width XXX
-Some macros support the access to objects in the shared memory
-region. In particular,
-.It NETMAP_TXRING(nifp, i)
-.It NETMAP_RXRING(nifp, i)
-return the address of the i-th transmit and receive ring,
-respectively, whereas
-.It NETMAP_BUF(ring, buf_idx)
-returns the address of the buffer with index buf_idx
-(which can be part of any ring for the given interface).
-.El
.Pp
-Normally, buffers are associated to slots when interfaces are bound,
-and one packet is fully contained in a single buffer.
-Clients can however modify the mapping using the
-following flags:
-.Ss FLAGS
+.Sh SLOTS AND PACKET BUFFERS
+Normally, packets should be stored in the netmap-allocated buffers
+assigned to slots when ports are bound to a file descriptor.
+One packet is fully contained in a single buffer.
+.Pp
+The following flags affect slot and buffer processing:
.Bl -tag -width XXX
.It NS_BUF_CHANGED
-indicates that the buf_idx in the slot has changed.
-This can be useful if the client wants to implement
-some form of zero-copy forwarding (e.g. by passing buffers
-from an input interface to an output interface), or
-needs to process packets out of order.
+it MUST be used when the buf_idx in the slot is changed.
+This can be used to implement
+zero-copy forwarding; see
+.Sx ZERO-COPY FORWARDING
+and the sketch after this list.
.Pp
-The flag MUST be used whenever the buffer index is changed.
.It NS_REPORT
-indicates that we want to be woken up when this buffer
-has been transmitted. This reduces performance but insures
-a prompt notification when a buffer has been sent.
+reports when this buffer has been transmitted.
Normally,
.Nm
notifies transmit completions in batches, hence signals
-can be delayed indefinitely. However, we need such notifications
-before closing a descriptor.
+can be delayed indefinitely. This flag helps detect
+when packets have been sent and a file descriptor can be closed.
.It NS_FORWARD
-When the device is open in 'transparent' mode,
-the client can mark slots in receive rings with this flag.
-For all marked slots, marked packets are forwarded to
-the other endpoint at the next system call, thus restoring
-(in a selective way) the connection between the NIC and the
-host stack.
+When a ring is in 'transparent' mode (see
+.Sx TRANSPARENT MODE ) ,
+packets marked with this flag are forwarded to the other endpoint
+at the next system call, thus restoring (in a selective way)
+the connection between a NIC and the host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
-packet should not be used in the learning bridge
+packet must not be used in the learning bridge code.
.It NS_INDIRECT
-indicates that the packet's payload is not in the netmap
-supplied buffer, but in a user-supplied buffer whose
-user virtual address is in the 'ptr' field of the slot.
+indicates that the packet's payload is in a user-supplied buffer,
+whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
-.Em This is only supported on the transmit ring of virtual ports
+.br
+This is only supported on the transmit ring of
+.Nm VALE
+ports, and it helps reduce data copies in the interconnection
+of virtual machines.
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
+.El
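+.Pp
+As an example, the buffer swap at the core of zero-copy forwarding
+between a receive slot
+.Va rs
+and a transmit slot
+.Va ts
+(on ports sharing the same memory region) can be sketched as:
+.Bd -literal
+uint32_t tmp = ts->buf_idx;
+
+ts->buf_idx = rs->buf_idx;	/* hand the rx buffer to tx */
+ts->len = rs->len;
+ts->flags |= NS_BUF_CHANGED;	/* mandatory after changing buf_idx */
+rs->buf_idx = tmp;		/* recycle the old tx buffer */
+rs->flags |= NS_BUF_CHANGED;
+.Ed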
+.Sh SCATTER GATHER I/O
+Packets can span multiple slots if the
+.Va NS_MOREFRAG
+flag is set in all but the last slot.
The maximum length of a chain is 64 buffers.
-.Em This is only supported on virtual ports
-.It NS_RFRAGS(slot)
-on receive rings, returns the number of remaining buffers
-in a packet, including this one.
-Slots with a value greater than 1 also have NS_MOREFRAG set.
-The length refers to the individual buffer, there is no
-field for the total length.
+This is normally used with
+.Nm VALE
+ports when connecting virtual machines, as they generate large
+TSO segments that are not split unless they reach a physical device.
.Pp
-On transmit rings, if NS_DST is set, it is passed to the lookup
-function, which can use it e.g. as the index of the destination
-port instead of doing an address lookup.
-.El
+NOTE: The length field always refers to the individual
+fragment; there is no field carrying the total length of a packet.
+.Pp
+On receive rings the macro
+.Va NS_RFRAGS(slot)
+indicates the remaining number of slots for this packet,
+including the current one.
+Slots with a value greater than 1 also have NS_MOREFRAG set.
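+.Pp
+For instance, posting a packet that spans
+.Va n
+fragments on a transmit ring can be sketched as follows
+(frag and frag_len are hypothetical arrays):
+.Bd -literal
+for (j = 0; j < n; j++) {	/* one slot per fragment */
+	struct netmap_slot *slot = &ring->slot[i];
+
+	memcpy(NETMAP_BUF(ring, slot->buf_idx), frag[j], frag_len[j]);
+	slot->len = frag_len[j];
+	slot->flags = (j < n - 1) ? NS_MOREFRAG : 0;
+	i = nm_ring_next(ring, i);
+}
+.Ed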
.Sh IOCTLS
.Nm
-supports some ioctl() to synchronize the state of the rings
-between the kernel and the user processes, plus some
-to query and configure the interface.
-The former do not require any argument, whereas the latter
-use a
-.Pa struct nmreq
-defined as follows:
+uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
+for non-blocking I/O. They take no argument.
+Two more ioctls (NIOCGINFO, NIOCREGIF) are used
+to query and configure ports, with the following argument:
.Bd -literal
struct nmreq {
- char nr_name[IFNAMSIZ];
- uint32_t nr_version; /* API version */
-#define NETMAP_API 4 /* current version */
- uint32_t nr_offset; /* nifp offset in the shared region */
- uint32_t nr_memsize; /* size of the shared region */
- uint32_t nr_tx_slots; /* slots in tx rings */
- uint32_t nr_rx_slots; /* slots in rx rings */
- uint16_t nr_tx_rings; /* number of tx rings */
- uint16_t nr_rx_rings; /* number of tx rings */
- uint16_t nr_ringid; /* ring(s) we care about */
-#define NETMAP_HW_RING 0x4000 /* low bits indicate one hw ring */
-#define NETMAP_SW_RING 0x2000 /* we process the sw ring */
-#define NETMAP_NO_TX_POLL 0x1000 /* no gratuitous txsync on poll */
-#define NETMAP_RING_MASK 0xfff /* the actual ring number */
- uint16_t nr_cmd;
-#define NETMAP_BDG_ATTACH 1 /* attach the NIC */
-#define NETMAP_BDG_DETACH 2 /* detach the NIC */
-#define NETMAP_BDG_LOOKUP_REG 3 /* register lookup function */
-#define NETMAP_BDG_LIST 4 /* get bridge's info */
- uint16_t nr_arg1;
- uint16_t nr_arg2;
- uint32_t spare2[3];
+ char nr_name[IFNAMSIZ]; /* (i) port name */
+ uint32_t nr_version; /* (i) API version */
+ uint32_t nr_offset; /* (o) nifp offset in mmap region */
+ uint32_t nr_memsize; /* (o) size of the mmap region */
+ uint32_t nr_tx_slots; /* (o) slots in tx rings */
+ uint32_t nr_rx_slots; /* (o) slots in rx rings */
+ uint16_t nr_tx_rings; /* (o) number of tx rings */
+ uint16_t nr_rx_rings; /* (o) number of tx rings */
+ uint16_t nr_ringid; /* (i) ring(s) we care about */
+ uint16_t nr_cmd; /* (i) special command */
+ uint16_t nr_arg1; /* (i) extra arguments */
+ uint16_t nr_arg2; /* (i) extra arguments */
+ ...
};
-
.Ed
-A device descriptor obtained through
+.Pp
+A file descriptor obtained through
.Pa /dev/netmap
-also supports the ioctl supported by network devices.
+also supports the ioctl supported by network devices, see
+.Xr netintro 4 .
.Pp
-The netmap-specific
-.Xr ioctl 2
-command codes below are defined in
-.In net/netmap.h
-and are:
.Bl -tag -width XXXX
.It Dv NIOCGINFO
-returns EINVAL if the named device does not support netmap.
+returns EINVAL if the named port does not support netmap.
Otherwise, it returns 0 and (advisory) information
-about the interface.
+about the port.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
-.Pa nr_memsize
-indicates the size of the netmap
-memory region. Physical devices all share the same memory region,
-whereas VALE ports may have independent regions for each port.
-These sizes can be set through system-wise sysctl variables.
-.Pa nr_tx_slots, nr_rx_slots
+.Bl -tag -width XX
+.It Pa nr_memsize
+indicates the size of the
+.Nm
+memory region. NICs in
+.Nm
+mode all share the same memory region,
+whereas
+.Nm VALE
+ports have independent regions for each port.
+.It Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
-.Pa nr_tx_rings, nr_rx_rings
+.It Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
-.Pa sysctl
-or
-.Pa ethtool .
+.Xr ethtool 8 ) .
+.El
.It Dv NIOCREGIF
-puts the interface named in nr_name into netmap mode, disconnecting
-it from the host stack, and/or defines which rings are controlled
-through this file descriptor.
+binds the port named in
+.Va nr_name
+to the file descriptor. For a physical device this also switches it into
+.Nm
+mode, disconnecting
+it from the host stack.
+Multiple file descriptors can be bound to the same port,
+with proper synchronization left to the user.
+.Pp
On return, it gives the same info as NIOCGINFO, and nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
-Possible values for nr_ringid are
+.Va nr_ringid
+selects which rings are controlled through this file descriptor.
+Possible values are:
.Bl -tag -width XXXXX
.It 0
-default, all hardware rings
+(default) all hardware rings
.It NETMAP_SW_RING
-the ``host rings'' connecting to the host stack
-.It NETMAP_HW_RING + i
-the i-th hardware ring
+the ``host rings'', connecting to the host stack.
+.It NETMAP_HW_RING | i
+the i-th hardware ring.
.El
+.Pp
By default, a
-.Nm poll
+.Xr poll 2
or
-.Nm select
+.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
-.Nm NETMAP_NO_TX_SYNC
-to nr_ringid.
-But normally you should keep this feature unless you are using
-separate file descriptors for the send and receive rings, because
-otherwise packets are pushed out only if NETMAP_TXSYNC is called,
-or the send queue is full.
-.Pp
-.Pa NIOCREGIF
-can be used multiple times to change the association of a
-file descriptor to a ring pair, always within the same device.
+.Va NETMAP_NO_TX_POLL
+into the value written to
+.Va nr_ringid .
+When this flag is set,
+packets are transmitted only when
+.Va ioctl(NIOCTXSYNC)
+is called, when select()/poll() are invoked with a write event
+(POLLOUT/wfdset), or when the transmit ring is full.
.Pp
When registering a virtual interface that is dynamically created to a
.Xr vale 4
@@ -467,6 +579,164 @@ number of slots available for transmissi
tells the hardware of consumed packets, and asks for newly available
packets.
.El
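+.Pp
+For example, binding a file descriptor to hardware ring 2 only
+can be sketched as follows (with
+.Va req
+prepared as for NIOCREGIF above):
+.Bd -literal
+req.nr_ringid = NETMAP_HW_RING | 2;	/* hardware ring 2 only */
+ioctl(fd, NIOCREGIF, &req);
+.Ed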
+.Sh SELECT AND POLL
+.Xr select 2
+and
+.Xr poll 2
+on a
+.Nm
+file descriptor process rings as indicated in
+.Sx TRANSMIT RINGS
+and
+.Sx RECEIVE RINGS
+when write (POLLOUT) and read (POLLIN) events are requested.
+.Pp
+Both block if no slots are available in the ring
+.Va ( ring->cur == ring->tail ) .
+.Pp
+Packets in transmit rings are normally pushed out even without
+requesting write events. Passing the NETMAP_NO_TX_POLL flag to
+.Em NIOCREGIF
+disables this feature.
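+.Pp
+A minimal blocking receive loop might look like the following sketch,
+where receive_packets() is a hypothetical handler such as the loop in
+.Sx RECEIVE RINGS :
+.Bd -literal
+#include <poll.h>
+
+struct pollfd pfd = { .fd = fd, .events = POLLIN };
+
+for (;;) {
+	poll(&pfd, 1, 2000);	/* unblocks when tail passes cur */
+	if (pfd.revents & POLLIN)
+		receive_packets(ring);
+}
+.Ed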
+.Sh LIBRARIES
+The
+.Nm
+API is supposed to be used directly, both because of its simplicity and
+for efficient integration with applications.
+.Pp
+For convenience, the
+.Va <net/netmap_user.h>
+header provides a few macros and functions to ease creating
+a file descriptor and doing I/O with a
+.Nm
+port. These are loosely modeled after the
+.Xr pcap 3
+API, to ease porting of libpcap-based applications to
+.Nm .
+To use these extra functions, programs should
+.Dl #define NETMAP_WITH_LIBS
+before
+.Dl #include <net/netmap_user.h>
+.Pp
+The following functions are available:
+.Bl -tag -width XXXXX
+.It Va struct nm_desc_t * nm_open(const char *ifname, const char *ring_name, int flags, int ring_flags)
+similar to
+.Xr pcap_open ,
+binds a file descriptor to a port.
+.Bl -tag -width XX
+.It Va ifname
+is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
+.Nm VALE
+port.
+.It Va flags
+can be set to
+.Va NETMAP_SW_RING
+to bind to the host ring pair,
+or to NETMAP_HW_RING to bind to a specific ring.
+With NETMAP_HW_RING,
+.Va ring_name
+is interpreted as a string or an integer indicating the ring to use.
+.It Va ring_flags
+is copied directly into the ring flags, to specify additional parameters
+such as NR_TIMESTAMP or NR_FORWARD.
+.El
+.It Va int nm_close(struct nm_desc_t *d)
+closes the file descriptor, unmaps memory, frees resources.
+.It Va int nm_inject(struct nm_desc_t *d, const void *buf, size_t size)
+similar to pcap_inject(), pushes a packet to a ring, returns the size
+of the packet if successful, or 0 on error;
+.It Va int nm_dispatch(struct nm_desc_t *d, int cnt, nm_cb_t cb, u_char *arg)
+similar to pcap_dispatch(), applies a callback to incoming packets
+.It Va u_char * nm_nextpkt(struct nm_desc_t *d, struct nm_hdr_t *hdr)
+similar to pcap_next(), fetches the next packet
+.El
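+.Pp
+A pcap-style receiver built on these helpers might be sketched as
+follows (the callback signature mirrors
+.Xr pcap 3
+handlers and may vary with the header version):
+.Bd -literal
+#define NETMAP_WITH_LIBS
+#include <net/netmap_user.h>
+
+static void
+rx_cb(u_char *arg, const struct nm_hdr_t *h, const u_char *data)
+{
+	/* inspect the packet header and payload here */
+}
+
+struct nm_desc_t *d = nm_open("netmap:em0", NULL, 0, 0);
+int done = 0;			/* set elsewhere to terminate */
+
+while (!done)
+	nm_dispatch(d, -1, rx_cb, NULL);	/* -1: all pending packets */
+nm_close(d);
+.Ed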
+.Sh SUPPORTED DEVICES
+.Nm
+natively supports the following devices:
+.Pp
+On FreeBSD:
+.Xr em 4 ,
+.Xr igb 4 ,
+.Xr ixgbe 4 ,
+.Xr lem 4 ,
+.Xr re 4 .
+.Pp
+On Linux
+.Xr e1000 4 ,
+.Xr e1000e 4 ,
+.Xr igb 4 ,
+.Xr ixgbe 4 ,
+.Xr mlx4 4 ,
+.Xr forcedeth 4 ,
+.Xr r8169 4 .
+.Pp
+NICs without native support can still be used in
+.Nm
+mode through emulation. Performance is inferior to native netmap
+mode but still significantly higher than sockets, and approaching
+that of in-kernel solutions such as Linux's
+.Xr pktgen .
+.Pp
+Emulation is also available for devices with native netmap support,
+which can be used for testing or performance comparison.
+The sysctl variable
+.Va dev.netmap.admode
+globally controls how netmap mode is implemented.
+.Sh SYSCTL VARIABLES AND MODULE PARAMETERS
+Some aspects of the operation of
+.Nm
+are controlled through sysctl variables on FreeBSD
+.Em ( dev.netmap.* )
+and module parameters on Linux
+.Em ( /sys/module/netmap_lin/parameters/* ) :
*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***