RFC: bus_get_cpus(9)
John Baldwin
jhb at freebsd.org
Thu Feb 19 15:37:31 UTC 2015
On Thursday, February 19, 2015 07:28:14 AM Adrian Chadd wrote:
> Hi,
>
> On 19 February 2015 at 06:46, John Baldwin <jhb at freebsd.org> wrote:
> > One of the next steps for NUMA device-awareness is a way to let drivers
> > know which CPUs are ideal to use for interrupts (and in particular this
> > is targeted at multiqueue NICs that want to create a TX/RX ring pair per
> > CPU). However, for modern Intel systems at least, it is usually best to
> > use CPUs from the physical processor package that contains the I/O hub
> > that a device connects to (e.g. to allow DDIO to work).
> >
> > The PoC API I came up with is a new bus method called bus_get_cpus() that
> > returns a requested cpuset for a given device. It accepts an enum for the
> > second parameter that says the type of cpuset being requested. Currently
> > two values are supported:
> > - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
> >   the device when NUMA is enabled)
> >
> > - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
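> >
> > In rough terms the interface looks something like this (a sketch of the
> > shape just described; see the patch for the exact prototype):
> >
> >     enum cpu_sets {
> >             LOCAL_CPUS,     /* all CPUs local to the device */
> >             INTR_CPUS,      /* subset suitable for interrupt handling */
> >     };
> >
> >     int bus_get_cpus(device_t dev, enum cpu_sets op, cpuset_t *set);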
> >
> > For a NIC driver the expectation is that the driver will call
> > 'bus_get_cpus(dev, INTR_CPUS, &set)' and create queues for each of the
> > CPUs in 'set'. (In my current patchset I have updated igb(4) to use this
> > approach.)
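> >
> > Sketched out (assuming the prototype above, and eliding error handling
> > and the actual queue setup), the driver side looks roughly like:
> >
> >     cpuset_t set;
> >     int cpu;
> >
> >     /* Ask the bus which CPUs should service this device's interrupts. */
> >     if (bus_get_cpus(dev, INTR_CPUS, &set) != 0)
> >             CPU_COPY(&all_cpus, &set);      /* fall back to all CPUs */
> >
> >     /* Create one TX/RX queue pair per CPU in the returned set. */
> >     CPU_FOREACH(cpu) {
> >             if (!CPU_ISSET(cpu, &set))
> >                     continue;
> >             /* ... allocate a queue and bind its interrupt to 'cpu' ... */
> >     }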
> >
> > For systems that do not support NUMA (or if it is not enabled in the
> > kernel config), LOCAL_CPUS is mapped to 'all_cpus' by default in the
> > 'root_bus' driver. INTR_CPUS is also mapped to 'all_cpus' by default.
> >
> > The x86 interrupt code maintains its own set of interrupt CPUs, which
> > this patch now exposes via INTR_CPUS in the x86 nexus driver.
> >
> > The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
> > LOCAL_CPUS set when _PXM exists and NUMA is enabled. They also AND
> > (intersect) the global INTR_CPUS set from the nexus driver with the
> > per-domain set from _PXM to generate a local INTR_CPUS set for child
> > devices.
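> >
> > In cpuset terms that intersection is roughly the following (a sketch;
> > the helper that extracts the _PXM domain's CPUs is hypothetical):
> >
> >     cpuset_t intr_cpus, domain_cpus;
> >
> >     /* Global interrupt CPUs inherited from the parent (nexus). */
> >     bus_get_cpus(device_get_parent(dev), INTR_CPUS, &intr_cpus);
> >     /* CPUs in the NUMA domain named by this device's _PXM. */
> >     acpi_get_pxm_cpus(dev, &domain_cpus);   /* hypothetical helper */
> >     /* Local INTR_CPUS = domain CPUs that are also interrupt CPUs. */
> >     CPU_AND(&intr_cpus, &domain_cpus);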
> >
> > The current patch can be found here:
> >
> > https://github.com/bsdjhb/freebsd/compare/bsdjhb:master...numa_bus_get_cpus
> >
> > It includes a few other fixes besides the implementation of bus_get_cpus()
> > (and some things have already been committed, such as
> > taskqueue_start_threads_cpuset() and CPU_COUNT()):
> > - It fixes the x86 interrupt code to exclude modern SMT threads from the
> >   default interrupt set. (Previously only Pentium 4-era HTT threads were
> >   excluded.)
> >
> > - It has a sample conversion of igb(4) to this interface (albeit ugly,
> >   using #if's).
> >
> > Longer term I think I would like to make the INTR_CPUS thing a bit more
> > formal. In particular, Solaris allows you to alter the set of CPUs that
> > handle interrupts via prctl (or a tool named something close to that). I
> > think I would like to have a dedicated global cpuset for that (but not
> > named "2", it would be a new WHICH level). That would allow userland to
> > use cpuset to alter the set of CPUs that handle interrupts in case you
> > wanted to use SMT, for example. I think that if we do this, all ithreads
> > would have their cpusets hang off of this set instead of the root set
> > (which I believe would also remove some of the recent special-case
> > handling for ithreads). The one uglier part about this is that we should
> > probably then have a way to notify drivers that INTR_CPUS changed so that
> > they could try to cope gracefully. I think that's a bit of a longer
> > horizon thing, but for now I think bus_get_cpus() is a good next step.
> >
> > What do other folks think? (And yes, I know it needs a manpage before it
> > goes in, but I'd rather get the API agreed on before polishing that.)
>
> So I'd rather have something slightly more descriptive for the
> iteration of CPU ids and cpusets for each queue in a driver.
>
> E.g., you're iterating over the interrupt CPU set, but it's still
> up to the driver to walk the cpuset and figure out which queue
> goes to which CPU.
>
> For RSS I'm going to take this stuff and have the driver call into RSS
> to do this. It'll provide the deviceid, and it'll return:
>
> * how many queues there are; and
> * for each queue id, the cpuset that the interrupts/taskqueues should
>   run on.
>
> It already does this, but it currently has no idea about the
> underlying device topology.
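>
> Roughly, what it already does amounts to this sketch (names from
> memory; check the RSS headers for the exact prototypes):
>
>     u_int nq, q, cpu;
>
>     nq = rss_getnumbuckets();           /* number of RSS buckets/queues */
>     for (q = 0; q < nq; q++)
>             cpu = rss_getcpu(q);        /* CPU assigned to bucket 'q' */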
>
> (Later on, for rebalancing, we'll want the returned cpusets to be
> top-level cpusets per RSS bucket that we can change and have
> everything cascade "right", but that's a later thing to worry about.)
>
> For the RSS and non-RSS case, I can see situations where the admin may
> wish to define a mapping from queues to cpusets that isn't necessarily
> the 1:1 mapping you've done. For example, saying "I want 16 queues,
> but I have four CPUs, here's how they're mapped." Right now drivers do
> round-robin, but it's hard-coded. If we have an iteration API like
> what exists for RSS then we can hide the policy config in the bus code
> and have it check kenv/hints at boot time for what the config should
> be.
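>
> The hard-coded round-robin amounts to something like this sketch
> ('set', 'nqueues', and 'queue_cpu' are stand-ins here):
>
>     int cpus[MAXCPU], ncpus = 0, cpu, q;
>
>     /* Flatten the cpuset into an array of CPU ids. */
>     CPU_FOREACH(cpu)
>             if (CPU_ISSET(cpu, &set))
>                     cpus[ncpus++] = cpu;
>     /* 16 queues onto 4 CPUs: queue q lands on cpus[q % 4]. */
>     for (q = 0; q < nqueues; q++)
>             queue_cpu[q] = cpus[q % ncpus];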
>
> So I'd like the API to be a little higher level so we can do
> interesting things with it.
There's nothing preventing the RSS code from calling bus_get_cpus() internally
to populate the info it returns in its APIs.
That is, I imagine something like:
#ifdef RSS
	/* 'fetch_rss_info' is a stand-in for whatever query API RSS grows. */
	queue_info = fetch_rss_info(dev);
	for (i = 0; i < queue_info->nqueues; i++) {
		/* create a queue bound to queue_info->queue[i].cpu */
	}
#else
	/* Use bus_get_cpus() directly and create one queue per CPU (1:1). */
#endif
That is, I think RSS should provide a layer on top of new-bus, not be a
bus_foo API. At some point all drivers might only have the #ifdef RSS case
and not use bus_get_cpus() directly at all, but it doesn't seem like the RSS
API is quite there yet.
--
John Baldwin