[Bug 264836] arm/arm/busdma_machdep-v6.c: bounce page accounting leak (noticed with high traffic ftdi usb serial devices)

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 22 Jun 2022 21:53:03 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264836

            Bug ID: 264836
           Summary: arm/arm/busdma_machdep-v6.c: bounce page accounting
                    leak (noticed with high traffic ftdi usb serial
                    devices)
           Product: Base System
           Version: 13.1-STABLE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: arm
          Assignee: freebsd-arm@FreeBSD.org
          Reporter: jcfyecrayz@liamekaens.com

In bus_dmamap_unload(), the free_bpages and reserved_bpages counters appear to
be updated with unprotected read-modify-write operations.  When two updates
interleave, one is lost, and the resulting accounting looks like a page leak.
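
To illustrate the kind of lost update described above, here is a minimal
userland sketch.  It is not the kernel code: struct zone_counters and both
functions are hypothetical names, and the interleaving is replayed by hand in
a single thread so that the result is deterministic.

```c
#include <assert.h>

/* Hypothetical stand-in for the two bounce zone counters in the report. */
struct zone_counters {
	int free_bpages;
	int reserved_bpages;
};

/*
 * Replay two unload paths, A and B, that return `a` and `b` reserved
 * pages.  B runs to completion between A's loads and A's stores, which
 * is what preemption in the middle of a read-modify-write can produce.
 */
static void
unload_interleaved(struct zone_counters *z, int a, int b)
{
	assert(a >= 0 && b >= 0);

	/* A loads both counters, then is preempted. */
	int a_free = z->free_bpages;
	int a_reserved = z->reserved_bpages;

	/* B performs its complete update. */
	z->free_bpages += b;
	z->reserved_bpages -= b;

	/* A resumes and stores results computed from stale values. */
	z->free_bpages = a_free + a;
	z->reserved_bpages = a_reserved - a;
}

/* The same two updates with no interleaving, for comparison. */
static void
unload_serialized(struct zone_counters *z, int a, int b)
{
	assert(a >= 0 && b >= 0);

	z->free_bpages += a;
	z->reserved_bpages -= a;
	z->free_bpages += b;
	z->reserved_bpages -= b;
}
```

Starting from free_bpages = 10 and reserved_bpages = 5 with a = 2 and b = 3,
the serialized version ends at 15 free and 0 reserved, while the interleaved
version ends at 12 free and 3 reserved.  B's three pages stay counted as
reserved although no map holds them, which is the apparent leak.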

This was noticed on a 2GB quad core i.MX6 system that has more than one device
attached via FTDI-based USB serial connections.  This system happens to be
using FTDI FT4232H quad port chips, but the problem is more general.

There is a latency timer setting in FTDI chips that is used to set the interval
at which short packets of data are flushed from the USB endpoint by the FTDI
chip (which has some internal buffer memory).  The default latency is 16 ms. 
We had set the latency to 4 ms to get data more quickly.

We started noticing problems with slower USB responses and eventually the
network stack would be affected as well.  In the system in question, it fairly
reliably "locked up" (couldn't ssh any more, trouble spawning processes when
logged in on the serial port).

In the locked up state, the usb/usbus0.xplr thread of the usb system process
was hung and the system could not process usb messages (this i.MX6 system has
an ehci USB controller).

The typical stack dump for usbus0.xplr when things were hung is:

   13 100029 usb                 usbus0.xplr         sched_switch+0x9d4
mi_switch+0x184 sleepq_wait+0x2c _cv_wait+0x1bc usbd_do_request_flags+0x4bc
usbd_req_get_port_status+0x44 uhub_explore+0xc4 uhub_explore+0x8f8
uhub_explore+0x8f8 usb_bus_explore+0x150 usb_process+0x124 fork_exit+0xc0
swi_exit+0

Once we noticed that hw.busdma.zone0.free_bpages was steadily decreasing,
eventually down to zero, we started investigating why there appeared to be a
leak of bounce pages (dtrace was helpful here).

That's when we found what appears to be the vulnerability in
bus_dmamap_unload().  This code has been this way for more than a decade, but
the problem takes a large number of transactions to show up and was
particularly hard to find.  For a long time, we worked around it by detecting
the symptoms and rebooting the system to recover, which is hardly ideal.  It
would take weeks to months to appear, depending on USB traffic load.  Adjusting
the FTDI latency timer to 0 ms (forcing packet delivery on every high speed
microframe) finally made it happen more quickly.

Even if someone else has enough traffic to experience the same problem, 

I will submit a patch for review.  Early results seem promising (in particular
the free bounce page accounting now does not show what looks like a leak).
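
For readers unfamiliar with this class of fix, the usual way to close a race
like this is to do both counter updates under one lock.  The sketch below is
a userland analogy only and is not the patch mentioned above: zone_lock,
struct map_state and release_reserved() are hypothetical names, and a pthread
mutex stands in for whatever lock the kernel code would use.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for the zone counters and per-map state. */
struct zone_counters {
	int free_bpages;
	int reserved_bpages;
};

struct map_state {
	int pagesreserved;
};

/* Assumed lock protecting the counters; the name is illustrative. */
static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Return a map's reserved pages to the free pool.  Both counters are
 * modified while holding the lock, so a concurrent caller cannot load
 * stale values between this caller's read and write.
 */
static void
release_reserved(struct zone_counters *z, struct map_state *m)
{
	assert(m->pagesreserved >= 0);

	pthread_mutex_lock(&zone_lock);
	z->free_bpages += m->pagesreserved;
	z->reserved_bpages -= m->pagesreserved;
	m->pagesreserved = 0;
	pthread_mutex_unlock(&zone_lock);
}
```

Holding the lock across both updates also keeps the two counters consistent
with each other, so an observer that takes the same lock never sees pages
that are counted as neither free nor reserved.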

This was originally noticed quite a while ago on 11.x, but it has been
confirmed on 13.x as well.

As an indication of system load, with the 0 ms latency timer we see more than
100 bounce pages per second (based on hw.busdma.zone0.total_bounced), and the
load due to interrupts is about 15%.  The high rate of bounce page and
interrupt activity gives ample opportunity for preemption at just the right
time to trigger the accounting leak.

Work sponsored by: Microchip Technology, Inc.
