Partial cacheline flush problems on ARM and MIPS
Ian Lepore
freebsd at damnhippie.dyndns.org
Thu Aug 23 23:10:14 UTC 2012
On Thu, 2012-08-23 at 15:50 -0600, Warner Losh wrote:
> On Aug 23, 2012, at 3:28 PM, Ian Lepore wrote:
> > A recent innocuous change to the USB driver code caused intermittant
> > errors in the umass(4) driver on ARM and MIPS platforms, and this
>
> I think the proper solution is to segregate DMA and non-DMA parts of structures so that you don't have both sharing a cache line.
>
> I also wonder why we don't allocate the DMA memory for these structures separately from the non-DMA parts. This would eliminate the USB_CACHE_BYTES kludge (which is CPU dependent, not arch dependent) and move the knowledge of this junk into busdma layer where it belongs. From my understanding of the issue, this would completely eliminate the problem forever!
>
> Sharing a cacheline between something that is DMA aware and something that is just begging for trouble. We're doing more work than we need to to support this dubious feature and we'd be miles ahead if we could not share at all.
>
> Warner
>
It seems to me that what we have here is a new type of constraint on DMA
operations, and we need a way to communicate that constraint from the
part of the platform support code that knows about it to the drivers and
driver support code that needs to know. The way we communicate DMA
constraints is with a busdma tag, but right now that tag only
communicates constraints that were needed for ISA and PCI busses, namely
buffer alignment, boundary-crossing restrictions, and exclusion
regions.
Now we have a new type of constraint, I think of it as "granularity".
In effect, we have a DMA system that can only do DMA in cacheline sized
chunks. Even when the IO size -- and thus the number of "bits on the
wire" -- is less than the cacheline size, at the end of the DMA
operation (which includes the software-assisted coherency operations)
the number of bytes in memory that may be modified is the size of a
cacheline. This is because "the DMA system" is not just the engine that
moves bytes around, it's the combination of hardware and software that
work together to maintain cache coherency.
Ideally we'd find a way to communicate this new constraint using the
existing mechanism, the busdma tag, and ideally we'd not have to change
every existing call to bus_dma_tag_create() to add a new parm. As I
understand it, parent tags are now passed down through the newbus
hierarchy consistantly, such that a tag at the nexus level could
describe a platform requirement such as granularity, and all devices and
the helper code they use will have access to that constraint via
inheritance from ancestors' tags. Maybe we could have a new flavor of
bus_dma_tag_create() that takes a struct of parms, and existing calls
wouldn't have to be changed.
Communicating the constraint is only part of the problem; it also has to
be easy for drivers to work with that constraint, especially drivers
that are not targeted specifically at platforms with granular DMA. I
think we can achieve a huge chunk of that purely within the arm/mips
implementation of bus_dmamem_alloc(), but even so there would be a new
conceptual limitation on using that routine: it is specifically for
allocating DMA buffers, and that means that there will be a new a rule
that the CPU cannot access any memory within that buffer while an IO
operation is in progress.
I'd also like to say there's a new rule that you cannot subdivide a
buffer obtained from bus_dmamem_alloc() into multiple buffers, or into a
combination of DMA and CPU-accessed data. That would be bad news for
the USB subsystem, and perhaps other drivers. If this idea is either
impossible or particularly contentious, then I guess we'd need some new
helper routines so that a driver can subdivide the memory in a way that
doesn't violate any constraints implied by the tag used to allocate the
buffer.
Not all IO occurs using buffers obtained from bus_dmamem_alloc(), and I
doubt we can reasonably ever require that it be so. I think the only
hope we have of handling that problem is to bounce the requests that
don't meet the granularity constraint, just as we'd have to do if the
caller-supplied buffer fell into an exclusion zone or violated an
alignment or boundary constraint. When I've tossed this idea out in the
past there was instant resistance. Yeah, bounce buffers are massively
inefficient, but my experience has been that most of the IO that isn't
aligned and sized to a multiple of a cacheline is small IO (a few to a
few dozen bytes). I've never seen a case of page-sized or larger IO
requests that required partial-cacheline handling. I'm sure some
examples exist, but they're probably more the exception than the rule.
(And the bad performance you'd get from bouncing and copying massive
bulk data flow would be a powerful incentive to track down the culprit
and improve the driver.)
-- Ian
More information about the freebsd-arm
mailing list