Latency issues with buf_ring

Barney Cordoba barney_cordoba at yahoo.com
Wed Dec 5 12:32:59 UTC 2012



--- On Tue, 12/4/12, Bruce Evans <brde at optusnet.com.au> wrote:

> From: Bruce Evans <brde at optusnet.com.au>
> Subject: Re: Latency issues with buf_ring
> To: "Andre Oppermann" <oppermann at networx.ch>
> Cc: "Adrian Chadd" <adrian at FreeBSD.org>, "Barney Cordoba" <barney_cordoba at yahoo.com>, "John Baldwin" <jhb at FreeBSD.org>, freebsd-net at FreeBSD.org
> Date: Tuesday, December 4, 2012, 10:31 PM
> On Tue, 4 Dec 2012, Andre Oppermann wrote:
> 
> > For most if not all ethernet drivers from 100Mbit/s the TX DMA rings
> > are so large that buffering at the IFQ level doesn't make sense
> > anymore and only adds latency.
> 
> I found sort of the opposite for bge at 1Gbps.  Most or all bge NICs
> have a tx ring size of 512.  The ifq length is the tx ring size minus
> 1 (511).  I needed to expand this to imax(2 * tick / 4, 10000) to
> maximize pps.  This does bad things to latency and worse things to
> caching (512 buffers might fit in the L2 cache, but 10000 buffers
> bust any reasonable cache as they are cycled through), but I only
> tried to optimize tx pps.
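
As a concrete illustration of where that ifq length lives: a pre-iflib
driver sizes its software send queue in its attach routine, and the
experiment above amounts to replacing the usual "ring size minus 1"
constant with a much larger value.  A minimal sketch, using bge-style
names (BGE_TX_RING_CNT is the 512-entry ring size; nothing here is an
actual patch from this thread):

	/* Stock sizing: soft queue the same size as the tx ring, minus 1. */
	IFQ_SET_MAXLEN(&ifp->if_snd, BGE_TX_RING_CNT - 1);
	ifp->if_snd.ifq_drv_maxlen = BGE_TX_RING_CNT - 1;
	IFQ_SET_READY(&ifp->if_snd);

	/*
	 * Experimental sizing described above: presumably enough soft
	 * buffering to ride out about half a clock tick of scheduling
	 * latency at ~1 Mpps, with a floor of 10000 entries.  'tick' is
	 * microseconds per hz tick and imax() is the kernel's integer max.
	 */
	IFQ_SET_MAXLEN(&ifp->if_snd, imax(2 * tick / 4, 10000));
	ifp->if_snd.ifq_drv_maxlen = imax(2 * tick / 4, 10000);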
> 
> > So it could simply directly put everything into the TX DMA and not
> > even try to soft-queue.  If the TX DMA ring is full ENOBUFS is
> > returned instead of filling yet another queue.
> 
> That could work, but upper layers currently don't understand ENOBUFS
> at all, so it would work poorly now.  Also, 512 entries is not many,
> so even if upper layers understood ENOBUFS it is not easy for them to
> _always_ respond fast enough to keep the tx active, unless there are
> upstream buffers with many more than 512 entries.  There needs to be
> enough buffering somewhere so that the tx ring can be replenished
> almost instantly from the buffer, to handle the worst-case latency
> for the threads generating new (unbuffered) packets.  At the line rate
> of ~1.5 Mpps for 1 Gbps, the maximum latency that can be covered by
> 512 entries is only 340 usec.
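
The 340 usec figure falls straight out of the minimum frame time: a
minimal ethernet frame occupies 64 bytes plus 8 bytes of preamble plus
a 12-byte inter-frame gap = 84 bytes = 672 bits on the wire, so at
1 Gbps:

	line rate       = 1e9 / 672 bits   ~= 1.49 Mpps  (~0.67 usec/packet)
	ring drain time = 512 * 0.672 usec ~= 344 usec

i.e. a full 512-entry ring buys the upper layers only about a third of
a millisecond to produce the next batch of packets.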
> 
> > However there are ALTQ interactions and other mechanisms which have
> > to be considered too, making it a bit more involved.
> 
> I didn't try to handle ALTQ or even optimize for TCP.
> 
> More details: to maximize pps, the main detail is to ensure that the
> tx ring never becomes empty.  The tx then transmits as fast as
> possible.  This requires some watermark processing, but FreeBSD has
> almost none for tx rings.  The following normally happens for packet
> generators like ttcp and netsend:
> 
> - loop calling send() or sendto() until the tx ring (and also any
>   upstream buffers) fill up.  Then ENOBUFS is returned.
> 
> - watermark processing is broken in the user API at this point.  There
>   is no way for the application to wait for the ENOBUFS condition to
>   go away (select() and poll() don't work).  Applications use poor
>   workarounds:
> 
> - old (~1989) ttcp sleeps for 18 msec when send() returns ENOBUFS.
>   This was barely good enough for 1 Mbps ethernet (line rate ~1500 pps
>   is 27 per 18 msec, so IFQ_MAXLEN = 50 combined with just a 1-entry
>   tx ring provides a safety factor of about 2).  Expansion of the tx
>   ring size to 512 makes this work with 10 Mbps ethernet too.
>   Expansion of the ifq to 511 gives another factor of 2.  After losing
>   the safety factor of 2, we can now handle 40 Mbps ethernet, and are
>   only a factor of 25 short for 1 Gbps.  My hardware can't do line
>   rate for small packets -- it can only do 640 kpps.  Thus ttcp is
>   only a factor of 11 short of supporting the hardware at 1 Gbps.
> 
>   This assumes that sleeps of 18 msec are actually possible, which
>   they aren't with HZ = 100 giving a granularity of 10 msec, so that
>   sleep(18 msec) actually sleeps for an average of 23 msec.  -current
>   uses the bad default of HZ = 1000.  With that, sleep(18 msec) would
>   average 18.5 msec.  Of course, ttcp should sleep for more like 1
>   msec if that is possible.  Then the average sleep is 1.5 msec.  ttcp
>   can keep up with the hardware with that, and is only slightly behind
>   the hardware with the worst-case sleep of 2 msec (512+511 packets
>   generated every 2 msec is 511.5 kpps).
> 
>   I normally use old ttcp, except I modify it to sleep for 1 msec
>   instead of 18 in one version, and in another version I remove the
>   sleep so that it busy-waits in a loop that calls send() which almost
>   always returns ENOBUFS.  The latter wastes a lot of CPU, but is
>   almost good enough for throughput testing.
> 
> - newer ttcp tries to program the sleep time in microseconds.  This
>   doesn't really work, since the sleep granularity is normally at
>   least a millisecond, and even if it could be the 340 microseconds
>   needed by bge with no ifq (see above, and better divide the 340 by
>   2), then this is quite short and would take almost as much CPU as
>   busy-waiting.  I consider HZ = 1000 to be another form of
>   polling/busy-waiting and don't use it except for testing.
> 
> - netrate/netsend also uses a programmed sleep time.  This doesn't
>   really work, as above.  netsend also tries to limit its rate based
>   on sleeping.  This is further from working, since even finer-grained
>   sleeps are needed to limit the rate accurately than to keep up with
>   the maximum rate.
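
The userland side of the workarounds above reduces to a loop of this
shape (a sketch only, not the actual ttcp or netsend source; the
destination address, port and the 1 msec sleep are illustrative):

	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <arpa/inet.h>
	#include <errno.h>
	#include <string.h>
	#include <unistd.h>

	int
	main(void)
	{
		struct sockaddr_in sin;
		char payload[18];
		int s;

		s = socket(AF_INET, SOCK_DGRAM, 0);
		memset(&sin, 0, sizeof(sin));
		sin.sin_family = AF_INET;
		sin.sin_port = htons(9);	/* discard port, illustrative */
		sin.sin_addr.s_addr = inet_addr("192.0.2.1");	/* TEST-NET */
		memset(payload, 0, sizeof(payload));

		for (;;) {
			if (sendto(s, payload, sizeof(payload), 0,
			    (struct sockaddr *)&sin, sizeof(sin)) == -1) {
				if (errno != ENOBUFS)
					break;
				/*
				 * Old ttcp slept 18 msec here; ~1 msec (or no
				 * sleep at all, busy-waiting) is what it takes
				 * to keep a 1 Gbps tx ring from running dry,
				 * subject to the sleep granularity discussed
				 * above.
				 */
				usleep(1000);
			}
		}
		return (0);
	}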
> 
> Watermark processing at the kernel level is not quite as broken.  It
> is mostly non-existent, but partly works, sort of accidentally.  The
> difference is now that there is a tx "eof" or "completion" interrupt
> which indicates the condition corresponding to the ENOBUFS condition
> going away, so that the kernel doesn't have to poll for this.  This
> is not really an "eof" interrupt (unless bge is programmed insanely,
> to interrupt only after the tx ring is completely empty).  It acts as
> primitive watermarking.  bge can be programmed to interrupt after
> having sent every N packets (strictly, after every N buffer
> descriptors, but for small packets these are the same).  When there
> are more than N packets to start, say M, this acts as a watermark at
> M-N packets.  bge is normally misprogrammed with N = 10.  At the line
> rate of 1.5 Mpps, this asks for an interrupt rate of 150 kHz, which is
> far too high and is usually unreachable, so reaching the line rate is
> impossible due to the CPU load from the interrupts.  I use N = 384 or
> 256 so that the interrupt rate is not the dominant limit.  However,
> N = 10 is better for latency and works under light loads.  It also
> reduces the amount of buffering needed.
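
Putting numbers on the coalescing factor: the interrupt rate is just
the packet rate divided by N, so at the ~1.5 Mpps line rate

	N = 10:    1.5 Mpps / 10   = 150,000 interrupts/sec
	N = 256:   1.5 Mpps / 256 ~=   5,900 interrupts/sec
	N = 384:   1.5 Mpps / 384 ~=   3,900 interrupts/sec

which is why N = 10 burns the CPU on interrupt overhead long before
line rate is reached, while N = 256-384 keeps the interrupt rate in the
few-kHz range at the cost of coarser completion feedback and more
required buffering.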
> 
> The ifq works more as part of accidental watermarking than as a
> buffer.  It is the same size as the tx ring (actually 1 smaller for
> bogus reasons), so it is not really useful as a buffer.  However, with
> no explicit watermarking, any separate buffer like the ifq provides a
> sort of watermark at the boundary between the buffers.  The usefulness
> of this would be most obvious if the tx "eof" interrupt were actually
> for eof (perhaps that is what it was originally).  Then on the eof
> interrupt, there is no time at all to generate new packets, and the
> time when the tx is idle can be minimized by keeping pre-generated
> packets handy where they can be copied to the tx ring at tx "eof"
> interrupt time.  A buffer of about the same size as the tx ring (or
> maybe 1/4 the size) is enough for this.
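
The "copy pre-generated packets into the tx ring at 'eof' interrupt
time" step is just the driver's start routine being run from the tx
completion path.  A generic pre-iflib sketch of that routine (the foo_*
names, softc fields and encap helper are hypothetical; only the
IFQ_DRV_* macros and the IFF_DRV_OACTIVE flag are the real ifnet API):

	static void
	foo_start_locked(struct ifnet *ifp)
	{
		struct foo_softc *sc = ifp->if_softc;
		struct mbuf *m;

		while (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) {
			if (sc->foo_tx_free < FOO_TX_RESERVE) {
				/* Ring (nearly) full: wait for tx completion. */
				ifp->if_drv_flags |= IFF_DRV_OACTIVE;
				break;
			}
			IFQ_DRV_DEQUEUE(&ifp->if_snd, m);
			if (m == NULL)
				break;
			/* Load the mbuf chain into tx descriptors. */
			if (foo_encap(sc, &m) != 0) {
				if (m != NULL)
					IFQ_DRV_PREPEND(&ifp->if_snd, m);
				ifp->if_drv_flags |= IFF_DRV_OACTIVE;
				break;
			}
		}
	}

The completion handler reclaims descriptors, clears IFF_DRV_OACTIVE and
calls this again; how much free space to require before doing so is
exactly the watermark question discussed next.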
> 
> OTOH, with bge misprogrammed to interrupt after every 10 tx packets,
> the ifq is useless for its watermark purposes.  The watermark is
> effectively in the tx ring, and very strangely placed there at 10
> below the top (ring full).  Normally tx watermarks are placed near the
> bottom (ring empty).  They must not be placed too near the bottom,
> else there would not be enough time to replenish the ring between the
> time when the "eof" (really, the "watermark") interrupt is received
> and when the tx runs dry.  They should not be placed too near the top
> like they are in -current's bge, else the point of having a large tx
> ring is defeated and there are too many interrupts.  However, when
> they are placed near the top, latency requirements are reduced.
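
In budget terms, a watermark W descriptors above empty leaves the host
about W * 0.67 usec to refill the ring before the tx goes idle
(illustrative positions for a 512-entry ring):

	watermark at 502 (10 below full):  ~337 usec budget, but ~150 kHz interrupts
	watermark at 256 (middle):         ~172 usec budget
	watermark at  10 (near empty):       ~7 usec budget, unmeetable in practice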
> 
> I recently worked on buffering for sio and noticed similar related
> problems for tx watermarks.  Don't laugh -- serial i/o 1 character at
> a time at 3.686400 Mbps has much the same timing requirements as
> ethernet i/o 1 packet at a time at 1 Gbps.  Each serial character
> takes ~2.7 usec and each minimal ethernet packet takes ~0.67 usec.
> With tx "ring" sizes of 128 and 512 respectively, the ring times for
> full to empty are 347 usec for serial i/o and 341 usec for ethernet
> i/o.  Strangely, tx is harder than rx because:
> - perfection is possible and easier to measure for tx.  It consists of
>   just keeping at least 1 entry in the tx ring at all times.  Latency
>   must be kept below ~340 usec to have any chance of this.  This is
>   not so easy to achieve under _all_ loads.
> - for rx, you have an external source generating the packets, so you
>   don't have to worry about latency affecting the generators.
> - the need for watermark processing is better known for rx, since it
>   obviously doesn't work to generate the rx "eof" interrupt near the
>   top.
> The serial timing was actually harder to satisfy, because I worked on
> it on a 366 MHz CPU while I worked on bge on a 2 GHz CPU, and even the
> 2 GHz CPU couldn't keep up with line rate (so from full to empty takes
> 800 usec).
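
The serial numbers work out the same way: with the usual 8N1 framing
(start bit, 8 data bits, stop bit) each character costs 10 bits on the
wire, so

	char time      = 10 / 3.6864e6    ~= 2.71 usec
	128-entry FIFO = 128 * 2.71 usec  ~= 347 usec

which is why a 128-entry serial tx "ring" and a 512-entry gigabit
ethernet tx ring impose almost the same refill deadline.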
> 
> It turned out that the best position for the tx low watermark is about
> 1/4 or 1/2 from the bottom for both sio and bge.  It must be fairly
> high, else the latency requirements are not met.  In the middle is a
> good general position.  Although it apparently "wastes" half of the
> ring to make the latency requirements easier to meet (without very
> system-dependent tuning), the efficiency lost from this is reasonably
> small.
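
A low watermark at half the ring translates into a tx completion
handler along these lines (again a sketch; the foo_* names and helpers
are hypothetical, the flag handling is the stock ifnet idiom, and
foo_start_locked() is the start routine sketched earlier):

	static void
	foo_txeof(struct foo_softc *sc)
	{
		struct ifnet *ifp = sc->foo_ifp;

		/* Reclaim descriptors the hardware has finished with. */
		while (sc->foo_tx_free < FOO_TX_RING_CNT &&
		    foo_tx_done(sc, sc->foo_tx_cons)) {
			foo_free_tx_desc(sc, sc->foo_tx_cons);
			sc->foo_tx_cons = (sc->foo_tx_cons + 1) % FOO_TX_RING_CNT;
			sc->foo_tx_free++;
		}

		/*
		 * Low watermark at half the ring: only restart the software
		 * queue once a useful amount of space has opened up, instead
		 * of kicking it for every reclaimed descriptor.
		 */
		if (sc->foo_tx_free >= FOO_TX_RING_CNT / 2) {
			ifp->if_drv_flags &= ~IFF_DRV_OACTIVE;
			if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd))
				foo_start_locked(ifp);
		}
	}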
> 
> Bruce
> 

I'm sure that Bill Paul is a nice man, but referencing drivers that were
written from a template and never properly load-tested doesn't really
illustrate anything. All of his drivers are functional, but optimized for
nothing.

BC

