Latency issues with buf_ring
Barney Cordoba
barney_cordoba at yahoo.com
Wed Dec 5 12:32:59 UTC 2012
--- On Tue, 12/4/12, Bruce Evans <brde at optusnet.com.au> wrote:
> From: Bruce Evans <brde at optusnet.com.au>
> Subject: Re: Latency issues with buf_ring
> To: "Andre Oppermann" <oppermann at networx.ch>
> Cc: "Adrian Chadd" <adrian at FreeBSD.org>, "Barney Cordoba" <barney_cordoba at yahoo.com>, "John Baldwin" <jhb at FreeBSD.org>, freebsd-net at FreeBSD.org
> Date: Tuesday, December 4, 2012, 10:31 PM
> On Tue, 4 Dec 2012, Andre Oppermann wrote:
>
> > For most if not all ethernet drivers from 100Mbit/s the TX DMA
> > rings are so large that buffering at the IFQ level doesn't make
> > sense anymore and only adds latency.
>
> I found sort of the opposite for bge at 1 Gbps.  Most or all bge NICs
> have a tx ring size of 512.  The ifq length is the tx ring size minus
> 1 (511).  I needed to expand this to imax(2 * tick / 4, 10000) to
> maximize pps.  This does bad things to latency and worse things to
> caching (512 buffers might fit in the L2 cache, but 10000 buffers
> bust any reasonable cache as they are cycled through), but I only
> tried to optimize tx pps.
>
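For reference, the ifq length in question is set in the driver's attach
routine.  A minimal sketch of roughly the stock idiom plus the expansion
described above (BGE_TX_RING_CNT is 512 in if_bge.c; imax() is the
kernel's integer max and tick the microseconds per clock tick; the
expanded value is the experiment above, not committed code):

    /* Stock bge attach: ifq depth is one less than the tx ring. */
    ifp->if_snd.ifq_drv_maxlen = BGE_TX_RING_CNT - 1;
    IFQ_SET_MAXLEN(&ifp->if_snd, BGE_TX_RING_CNT - 1);
    IFQ_SET_READY(&ifp->if_snd);

    /* Expansion used for the pps experiment: at least 10000 entries. */
    ifp->if_snd.ifq_drv_maxlen = imax(2 * tick / 4, 10000);
    IFQ_SET_MAXLEN(&ifp->if_snd, ifp->if_snd.ifq_drv_maxlen);
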
> > So it could simply directly put everything into the TX DMA and not
> > even try to soft-queue.  If the TX DMA ring is full ENOBUFS is
> > returned instead of filling yet another queue.
>
> That could work, but upper layers currently don't understand ENOBUFS
> at all, so it would work poorly now.  Also, 512 entries is not many,
> so even if upper layers understood ENOBUFS it is not easy for them to
> _always_ respond fast enough to keep the tx active, unless there are
> upstream buffers with many more than 512 entries.  There needs to be
> enough buffering somewhere so that the tx ring can be replenished
> almost instantly from the buffer, to handle the worst-case latency
> for the threads generating new (unbuffered) packets.  At the line
> rate of ~1.5 Mpps for 1 Gbps, the maximum latency that can be covered
> by 512 entries is only 340 usec.
>
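The 340 usec figure is easy to check.  A small standalone program,
assuming minimal 64-byte frames (84 bytes on the wire once preamble/SFD
and the inter-frame gap are counted):

    #include <stdio.h>

    int
    main(void)
    {
        /* 64 + 8 (preamble/SFD) + 12 (IFG) bytes = 672 bit times. */
        double pps = 1e9 / 672.0;               /* ~1.488 Mpps at 1 Gbps */
        double drain_usec = 512.0 / pps * 1e6;  /* ~344 usec */

        printf("line rate %.3f Mpps; 512-entry tx ring lasts %.0f usec\n",
            pps / 1e6, drain_usec);
        return (0);
    }
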
> > However there are ALTQ interactions and other mechanisms which have
> > to be considered too, making it a bit more involved.
>
> I didn't try to handle ALTQ or even optimize for TCP.
>
> More details: to maximize pps, the main detail is to ensure that the
> tx ring never becomes empty.  The tx then transmits as fast as
> possible.  This requires some watermark processing, but FreeBSD has
> almost none for tx rings.  The following normally happens for packet
> generators like ttcp and netsend:
>
> - loop calling send() or sendto() until the tx ring (and also any
>   upstream buffers) fill up.  Then ENOBUFS is returned.
>
> - watermark processing is broken in the user API at this point.
>   There is no way for the application to wait for the ENOBUFS
>   condition to go away (select() and poll() don't work).
>   Applications use poor workarounds:
>
> - old (~1989) ttcp sleeps for 18 msec when send() returns ENOBUFS.
>   This was barely good enough for 1 Mbps ethernet (line rate ~1500
>   pps is 27 per 18 msec, so IFQ_MAXLEN = 50 combined with just a
>   1-entry tx ring provides a safety factor of about 2).  Expansion of
>   the tx ring size to 512 makes this work with 10 Mbps ethernet too.
>   Expansion of the ifq to 511 gives another factor of 2.  After
>   losing the safety factor of 2, we can now handle 40 Mbps ethernet,
>   and are only a factor of 25 short for 1 Gbps.  My hardware can't do
>   line rate for small packets -- it can only do 640 kpps.  Thus ttcp
>   is only a factor of 11 short of supporting the hardware at 1 Gbps.
>
>   This assumes that sleeps of 18 msec are actually possible, which
>   they aren't with HZ = 100 giving a granularity of 10 msec, so that
>   sleep(18 msec) actually sleeps for an average of 23 msec.  -current
>   uses the bad default of HZ = 1000.  With that, sleep(18 msec) would
>   average 18.5 msec.  Of course, ttcp should sleep for more like 1
>   msec if that is possible.  Then the average sleep is 1.5 msec.
>   ttcp can keep up with the hardware with that, and is only slightly
>   behind the hardware with the worst-case sleep of 2 msec (512+511
>   packets generated every 2 msec is 511.5 kpps).
>
>   I normally use old ttcp, except I modify it to sleep for 1 msec
>   instead of 18 in one version, and in another version I remove the
>   sleep so that it busy-waits in a loop that calls send() which
>   almost always returns ENOBUFS (both variants are sketched just
>   after this list).  The latter wastes a lot of CPU, but is almost
>   good enough for throughput testing.
>
> - newer ttcp tries to program the sleep time in microseconds.  This
>   doesn't really work, since the sleep granularity is normally at
>   least a millisecond, and even if it could be the 340 microseconds
>   needed by bge with no ifq (see above, and better divide the 340 by
>   2), then this is quite short and would take almost as much CPU as
>   busy-waiting.  I consider HZ = 1000 to be another form of
>   polling/busy-waiting and don't use it except for testing.
>
> - netrate/netsend also uses a programmed sleep time.  This doesn't
>   really work, as above.  netsend also tries to limit its rate based
>   on sleeping.  This is further from working, since even
>   finer-grained sleeps are needed to limit the rate accurately than
>   to keep up with the maximum rate.
>
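A minimal sketch (not the real ttcp or netsend source) of the two
workarounds described in the list above: sleep a fixed interval when
send() reports ENOBUFS, or busy-wait by retrying immediately.  Here s
is assumed to be an already-connected UDP socket, buf/len a pre-built
payload, and sleep_usec == 0 selects the busy-wait variant; the
function name and parameters are invented:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <errno.h>
    #include <unistd.h>

    static void
    blast(int s, const char *buf, size_t len, long npkts,
        useconds_t sleep_usec)
    {
        for (long sent = 0; sent < npkts; ) {
            if (send(s, buf, len, 0) != -1) {
                sent++;                 /* packet accepted by the stack */
                continue;
            }
            if (errno != ENOBUFS)
                break;                  /* real error; give up */
            if (sleep_usec != 0)
                usleep(sleep_usec);     /* granularity is really ~1/HZ */
            /* else busy-wait: retry send() immediately */
        }
    }
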
> Watermark processing at the kernel level is not quite as broken.  It
> is mostly non-existent, but partly works, sort of accidentally.  The
> difference is now that there is a tx "eof" or "completion" interrupt
> which indicates the condition corresponding to the ENOBUFS condition
> going away, so that the kernel doesn't have to poll for this.  This
> is not really an "eof" interrupt (unless bge is programmed insanely,
> to interrupt only after the tx ring is completely empty).  It acts as
> primitive watermarking.  bge can be programmed to interrupt after
> having sent every N packets (strictly, after every N buffer
> descriptors, but for small packets these are the same).  When there
> are more than N packets to start, say M, this acts as a watermark at
> M-N packets.
>
> bge is normally misprogrammed with N = 10.  At the line rate of
> 1.5 Mpps, this asks for an interrupt rate of 150 kHz, which is far
> too high and is usually unreachable, so reaching the line rate is
> impossible due to the CPU load from the interrupts.  I use N = 384 or
> 256 so that the interrupt rate is not the dominant limit.  However,
> N = 10 is better for latency and works under light loads.  It also
> reduces the amount of buffering needed.
>
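The interrupt-rate arithmetic behind those N values, at the ~1.5 Mpps
small-packet line rate used above:

    #include <stdio.h>

    int
    main(void)
    {
        double pps = 1e9 / 672.0;       /* ~1.488 Mpps, minimal frames */
        int n[] = { 10, 256, 384 };

        for (int i = 0; i < 3; i++)
            printf("interrupt every %3d packets -> %6.1f k irq/s\n",
                n[i], pps / n[i] / 1e3);
        return (0);
    }

N = 10 asks for roughly 149 k interrupts/sec; N = 256 and 384 reduce
that to about 5.8 k and 3.9 k.
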
> The ifq works more as part of accidental watermarking than as a
> buffer.  It is the same size as the tx ring (actually 1 smaller for
> bogus reasons), so it is not really useful as a buffer.  However,
> with no explicit watermarking, any separate buffer like the ifq
> provides a sort of watermark at the boundary between the buffers.
> The usefulness of this would be most obvious if the tx "eof"
> interrupt were actually for eof (perhaps that is what it was
> originally).  Then on the eof interrupt, there is no time at all to
> generate new packets, and the time when the tx is idle can be
> minimized by keeping pre-generated packets handy where they can be
> copied to the tx ring at tx "eof" interrupt time.  A buffer of about
> the same size as the tx ring (or maybe 1/4 the size) is enough for
> this.
>
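A rough sketch of what "copy pre-generated packets to the tx ring at tx
eof interrupt time" looks like in an ifq-based driver.  The xx_* names
and softc fields are invented for illustration; IFQ_DRV_DEQUEUE() is
the standard macro a driver uses to pull the next mbuf off its ifq:

    /* Move queued (pre-generated) packets into free tx descriptors. */
    static void
    xx_start(struct ifnet *ifp)
    {
        struct xx_softc *sc = ifp->if_softc;
        struct mbuf *m;

        while (sc->tx_free > 0) {
            IFQ_DRV_DEQUEUE(&ifp->if_snd, m);
            if (m == NULL)
                break;          /* ifq empty: the ring may now run dry */
            xx_encap(sc, m);    /* place the mbuf in the tx ring */
            sc->tx_free--;
        }
    }

    /* Tx "eof"/completion interrupt: reclaim finished descriptors and
     * immediately top the ring back up from the ifq. */
    static void
    xx_txeof(struct xx_softc *sc)
    {
        /* ... walk completed descriptors, freeing mbufs and
         * incrementing sc->tx_free ... */
        if (sc->tx_free > 0)
            xx_start(sc->xx_ifp);
    }
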
> OTOH, with bge misprogrammed to interrupt after every 10 tx packets,
> the ifq is useless for its watermark purposes.  The watermark is
> effectively in the tx ring, and very strangely placed there at 10
> below the top (ring full).  Normally tx watermarks are placed near
> the bottom (ring empty).  They must not be placed too near the
> bottom, else there would not be enough time to replenish the ring
> between the time when the "eof" (really, the "watermark") interrupt
> is received and when the tx runs dry.  They should not be placed too
> near the top like they are in -current's bge, else the point of
> having a large tx ring is defeated and there are too many interrupts.
> However, when they are placed near the top, latency requirements are
> reduced.
>
> I recently worked on buffering for sio and noticed similar problems
> for tx watermarks.  Don't laugh -- serial i/o 1 character at a time
> at 3.686400 Mbps has much the same timing requirements as ethernet
> i/o 1 packet at a time at 1 Gbps.  Each serial character takes ~2.7
> usec and each minimal ethernet packet takes ~0.67 usec.  With tx
> "ring" sizes of 128 and 512 respectively, the ring times from full to
> empty are 347 usec for serial i/o and 341 usec for ethernet i/o (the
> serial figure is checked in the small sketch after the list below).
> Strangely, tx is harder than rx because:
> - perfection is possible and easier to measure for tx.  It consists
>   of just keeping at least 1 entry in the tx ring at all times.
>   Latency must be kept below ~340 usec to have any chance of this.
>   This is not so easy to achieve under _all_ loads.
> - for rx, you have an external source generating the packets, so you
>   don't have to worry about latency affecting the generators.
> - the need for watermark processing is better known for rx, since it
>   obviously doesn't work to generate the rx "eof" interrupt near the
>   top.
> The serial timing was actually harder to satisfy, because I worked on
> it on a 366 MHz CPU while I worked on bge on a 2 GHz CPU, and even
> the 2 GHz CPU couldn't keep up with line rate (so from full to empty
> takes 800 usec).
>
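The serial side of that comparison checks out the same way as the
ethernet numbers earlier, assuming 8N1 framing (10 bit times per
character) at 3.6864 Mbps:

    #include <stdio.h>

    int
    main(void)
    {
        double char_usec = 10.0 / 3.6864;   /* ~2.71 usec per character */

        printf("%.2f usec/char; 128-entry tx \"ring\" lasts %.0f usec\n",
            char_usec, 128.0 * char_usec);
        return (0);
    }
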
> It turned out that the best position for the tx low watermark is
> about 1/4 or 1/2 from the bottom for both sio and bge.  It must be
> fairly high, else the latency requirements are not met.  In the
> middle is a good general position.  Although it apparently "wastes"
> half of the ring to make the latency requirements easier to meet
> (without very system-dependent tuning), the efficiency lost from this
> is reasonably small.
>
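For concreteness, the placement being described might look like this in
the completion path of the hypothetical xx driver sketched earlier (the
names are still invented).  The queue is only restarted once about half
of the 512-entry ring has drained, which still leaves roughly 170 usec
of slack at 1 Gbps before the ring runs dry:

    #define XX_TX_RING_CNT  512
    #define XX_TX_LOWAT     (XX_TX_RING_CNT / 2)    /* middle of the ring */

    static void
    xx_txeof(struct xx_softc *sc)
    {
        /* ... reclaim completed descriptors, increasing sc->tx_free ... */

        /* Refill in large batches: only clear OACTIVE and restart the
         * queue once the occupancy has fallen to the low watermark. */
        if (sc->tx_free >= XX_TX_LOWAT) {
            sc->xx_ifp->if_drv_flags &= ~IFF_DRV_OACTIVE;
            xx_start(sc->xx_ifp);
        }
    }
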
> Bruce
>
I'm sure that Bill Paul is a nice man, but referencing drivers that were
written from a template and never properly load tested doesn't really
illustrate anything. All of his drivers are functional but optimized for
nothing.
BC