[Bug 264836] arm/arm/busdma_machdep-v6.c: bounce page accounting leak (noticed with high traffic ftdi usb serial devices)
Date: Wed, 22 Jun 2022 21:53:03 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264836

    Bug ID: 264836
   Summary: arm/arm/busdma_machdep-v6.c: bounce page accounting leak (noticed with high traffic ftdi usb serial devices)
   Product: Base System
   Version: 13.1-STABLE
  Hardware: Any
        OS: Any
    Status: New
  Severity: Affects Some People
  Priority: ---
 Component: arm
  Assignee: freebsd-arm@FreeBSD.org
  Reporter: jcfyecrayz@liamekaens.com

In bus_dmamap_unload(), the counters for free_bpages and reserved_bpages appear to be vulnerable to unprotected read-modify-write operations that result in accounting that looks like a page leak.

This was noticed on a 2GB quad core i.MX6 system that has more than one device attached via FTDI based USB serial connection. This system happens to be using FTDI US4232H quad port chips, but the problem is more general.

There is a latency timer setting in FTDI chips that is used to set the interval at which short packets of data are flushed from the USB endpoint by the FTDI chip (which has some internal buffer memory). The default latency is 16 ms. We had set the latency to 4 ms to get data more quickly.

We started noticing problems with slower USB responses and eventually the network stack would be affected as well. In the system in question, it fairly reliably "locked up" (couldn't ssh any more, trouble spawning processes when logged in on the serial port). In the locked up state, the usb/usbus0.xplr thread of the usb system process was hung and the system could not process usb messages (this i.MX6 system has an ehci USB controller).

The typical stack dump for usbus0.xplr when things were hung is:

   13 100029 usb usbus0.xplr sched_switch+0x9d4 mi_switch+0x184 sleepq_wait+0x2c _cv_wait+0x1bc usbd_do_request_flags+0x4bc usbd_req_get_port_status+0x44 uhub_explore+0xc4 uhub_explore+0x8f8 uhub_explore+0x8f8 usb_bus_explore+0x150 usb_process+0x124 fork_exit+0xc0 swi_exit+0

Once we noticed that hw.busdma.zone0.free_pages was steadily decrementing, eventually down to zero, we started investigating (dtrace was helpful here) why there appeared to be a leak of bounce pages. That's when we found what appears to be the vulnerability in bus_dmamap_unload().

This code has been this way for more than a decade, but it takes a lot of transactions for the problem to occur, and it was particularly hard to find. For a long time, we worked around the problem by detecting the symptoms and rebooting the system to recover, which is hardly ideal. It would take weeks to months to show up, depending on USB traffic load. Adjusting the FTDI latency timer to 0 ms (force packet delivery on every high speed microframe) finally made this happen more quickly.

Even if no one else has enough traffic to experience the same problem, I will submit a patch for review. Early results seem promising (in particular, the free bounce page accounting no longer shows what looks like a leak).

This was originally noticed quite a while ago on 11.x, but it has been confirmed on 13.x as well.

As an indication of system load, with the 0 ms latency timer we see more than 100 bounce pages per second (based on hw.busdma.zone0.total_bounced), and the load due to interrupts is about 15%. The high rate of bounce page and interrupt activity gives plenty of opportunity for preemption at just the right time to trigger the accounting leak.

Work sponsored by: Microchip Technology, Inc.
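
For illustration only, here is a minimal user-space sketch of the lost-update pattern described in this report. It is not the kernel code and not the proposed patch. The names struct zone_model, rmw_load(), rmw_store(), interleaved_unload() and serialized_unload() are hypothetical, and the preemption is simulated by ordering the steps by hand so that the result is deterministic.

```c
/* Hypothetical model of a bounce zone: only the free page counter. */
struct zone_model {
	int free_bpages;
};

/* First half of a non-atomic "counter += n": read into a private copy. */
static int
rmw_load(const struct zone_model *z)
{
	return (z->free_bpages);
}

/* Second half: write back the private copy plus n. */
static void
rmw_store(struct zone_model *z, int snapshot, int n)
{
	z->free_bpages = snapshot + n;
}

/*
 * Two contexts each return one page, but context B preempts context A
 * between A's load and A's store. A then overwrites B's update, so the
 * counter rises by one instead of two: one page appears to have leaked.
 */
static int
interleaved_unload(struct zone_model *z)
{
	int a = rmw_load(z);	/* A reads the counter */
	int b = rmw_load(z);	/* B preempts A and reads the same value */

	rmw_store(z, b, 1);	/* B stores value + 1 */
	rmw_store(z, a, 1);	/* A resumes and stores the same value + 1 */
	return (z->free_bpages);
}

/*
 * The same two updates with each read-modify-write completed before the
 * next one starts, which is what holding a lock across the update would
 * guarantee. Both page returns are counted.
 */
static int
serialized_unload(struct zone_model *z)
{
	int a = rmw_load(z);
	rmw_store(z, a, 1);

	int b = rmw_load(z);
	rmw_store(z, b, 1);
	return (z->free_bpages);
}
```

Starting both models at 10 free pages, the interleaved version ends at 11 and the serialized version at 12. Each occurrence of the bad interleaving permanently loses one page from the count, which matches a free page counter that only ever drifts downward under heavy load.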