Time sharing for interrupt threads
Date: Wed, 04 May 2022 21:48:39 UTC
My recent changes to the softclock threads (raising their priority) were to address a livelock issue in which a constant stream of packets could starve timeout events. However, in the discussion on the review (https://reviews.freebsd.org/D33693), several other livelock cases were raised. As a result, I've pursued a possibly more general solution to livelock among interrupt threads.

Previously I have toyed with the notion of extending our existing interrupt API to support cooperative time-sharing among interrupt handlers (where handlers would have to explicitly reschedule themselves before returning). This would be similar to what some drivers have tried to do in the past by scheduling tasks on taskqueues to defer work. However, this approach was a bit clunky (as was the implementation), since it requires driver changes. It also requires drivers to try to understand how much work is "too much work" and to know when to yield. Using tasks on taskqueues had similar problems. Another issue with the taskqueue approach is that a task would try to yield by rescheduling itself, but this didn't actually work: the taskqueue thread would just see the requeued task and run it again right away, so the task always ended up running to completion. iflib attempts to ameliorate this requeueing problem by forcing all iflib devices to use a single pool of per-CPU threads, so that each thread can at least round-robin among the tasks of multiple devices. However, this doesn't achieve time-sharing with other interrupt handlers, whether for non-iflib NIC drivers or for non-network devices, and it still leaves drivers having to guess when to yield.

Instead, the approach I now think might be most workable is forced time-sharing via preemption from clock interrupts, similar to what is done for userspace time-sharing threads. However, this would be a separate tier of time-sharing that applies only among ithreads. I have implemented an initial prototype of this in ULE and Drew has tested it in some specific livelock use cases.

The general idea of this approach is that when sched_clock() (which runs at stathz) notices that an ithread has used a full time slice without going idle, the ithread's priority is demoted by 4 (to force it into a separate runq), and if there is a runnable thread waiting at the ithread's original priority, a preemption is scheduled when the clock interrupt finishes. The ithread's "base" priority is restored if the ithread goes idle (the priority isn't actually restored until the idle ithread is resumed via sched_wakeup(), but that gives the desired behavior of undoing the demotion once the ithread stops running and goes idle). One other constraint is that ithreads will only demote as far as PRI_MAX_ITHD and no lower, so ithreads will still always be preferred to non-ithreads.

The changes to implement this can be found on the swi_refactor branch here: https://github.com/freebsd/freebsd-src/compare/main...bsdjhb:swi_refactor
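To make the policy above a bit more concrete, here is a tiny standalone model of it. This is purely an illustration, not the code on the branch: the constants, struct, and helper names below are all made up for the example, and the real changes live in ULE's clock and wakeup paths.

/*
 * Toy user-space model of the demotion policy described above.  Not the
 * kernel code from the branch; the constants and structure here are
 * assumptions made up purely for illustration.
 */
#include <stdio.h>

#define	PRI_MIN_ITHD	8	/* placeholder values for the example */
#define	PRI_MAX_ITHD	31
#define	RQ_PPQ		4	/* one runq bucket, i.e. the demotion step */

struct toy_ithread {
	int	base_pri;	/* priority assigned when the ithread was created */
	int	cur_pri;	/* current, possibly demoted, priority */
};

/*
 * Model of the clock-tick case: the ithread has consumed a full slice
 * without going idle, so demote it by one runq bucket, but never below
 * PRI_MAX_ITHD so ithreads stay preferred to non-ithreads.
 */
static void
toy_demote(struct toy_ithread *it)
{
	it->cur_pri += RQ_PPQ;
	if (it->cur_pri > PRI_MAX_ITHD)
		it->cur_pri = PRI_MAX_ITHD;
}

/*
 * Model of the wakeup-after-idle case: undo any demotion by restoring
 * the ithread's base priority.
 */
static void
toy_restore(struct toy_ithread *it)
{
	it->cur_pri = it->base_pri;
}

int
main(void)
{
	/* A hardware handler starting a couple of buckets above PRI_MIN_ITHD. */
	struct toy_ithread it = { PRI_MIN_ITHD + 8, PRI_MIN_ITHD + 8 };
	int slice;

	for (slice = 1; slice <= 5; slice++) {
		toy_demote(&it);
		printf("after %d full slice(s): priority %d\n", slice, it.cur_pri);
	}
	toy_restore(&it);
	printf("after going idle:      priority %d\n", it.cur_pri);
	return (0);
}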
One thing Drew found when testing this initially, though, is that while non-iflib NIC drivers (which run their handlers at PI_NET) could preempt each other, an iflib driver (which runs its gtaskqueue threads at a hardcoded PI_SOFT) could not, or at least not as well.

My suspicion is that it simply takes several full slices for a PI_NET ithread to demote down to PI_SOFT, where the iflib gtaskqueue thread would actually get a chance to run, and that even with a steady stream of traffic the non-iflib driver occasionally catches up and briefly goes idle, which resets its priority back to PI_NET. As a quick hack, Drew changed iflib to use PI_NET, and with that change the iflib driver was able to preempt the non-iflib driver smoothly.

From that experience, I suspect that if we do want to pursue this approach, we may need to collapse our notion of ithread priorities a bit. That is, we may want priorities to look something like this:

    PI_REALTIME  (PI_MIN_ITHD + 0)  /* Reserved for "unusual" hardware like maybe clocks. */
    PI_HARDWARE  (PI_MIN_ITHD + 4)  /* All "normal" hardware like NET/DISK/etc. */
    PI_SOFT      (PI_MIN_ITHD + 8)  /* Software handlers like netisr and others. */

This way, once a hardware handler uses up a time slice, it ends up yielding to all other hardware handlers, and if it uses two slices it can't even starve swis (see the short worked example further below). It's not really clear to me that we need hard lines between different classes of interrupts nowadays, especially if we force time-sharing between long-running handlers. It's debatable that NVMe devices (PI_DISK) should always be preempted by NICs (PI_NET), for example. In the review Warner had proposed a separate PI_FLASH, perhaps for flash storage, that might sit above PI_NET, but I think it might be simplest to let the system auto-adjust to interrupt load via time-sharing instead.

In terms of softclock, I'm not quite sure whether it should run at PI_HARDWARE or PI_SOFT in the above model. I think it probably still wants to run at PI_HARDWARE, but it would probably still work even at PI_SOFT in at least some livelock workloads.

If we do decide to move forward with some sort of time-sharing, I think we can simplify iflib drivers by removing the global taskqueue thread pool and reverting to dedicated ithreads. This would avoid the need for iflib drivers to figure out when to yield and instead let the handlers run to completion.

One further historical note: the reason we have different priorities for ithreads at all is mostly an artifact of moving away from spl. In the old interrupt masking implementation in FreeBSD 4.x and before, pending interrupts (both hardware and software) were tracked in a bitmask (in fact, there were only 24 bits for hardware interrupts, so we enforced artificial interrupt sharing by tying multiple hardware interrupts to the same bit), and pending interrupts were handled via an ffs() loop over the bitmask during splx(). When we added ithreads we kept priorities that reflected the relative order of the bits set aside in the pending bitmask. That is certainly where the SWI_* values come from, and the relative ordering of the PI_* values is, I think, mostly a SWAG based on the fact that at the time network devices needed lower latency than disks (which were all spinning rust). At least some versions of Solaris (the one my Solaris Internals book describes) don't assign differing priorities to ithreads at all: all ithreads ran at the same priority, and I think the version of Solaris in question had per-CPU ithreads that handlers were queued to when an interrupt asserted. I'm less familiar with what other systems do in terms of the relative priority of device interrupts.
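Coming back to the collapsed layout sketched above, here is the short worked example I referred to. Again this is a toy program, not kernel code, and the PI_MIN_ITHD base value is an arbitrary placeholder; it just shows where a PI_HARDWARE handler lands after each full slice of one-bucket (4) demotion.

/*
 * Toy illustration of the collapsed priority layout combined with
 * one-bucket demotion per full slice; PI_MIN_ITHD is a placeholder.
 */
#include <stdio.h>

#define	PI_MIN_ITHD	8			/* placeholder for the example */
#define	PI_REALTIME	(PI_MIN_ITHD + 0)	/* "unusual" hardware, e.g. clocks */
#define	PI_HARDWARE	(PI_MIN_ITHD + 4)	/* all "normal" hardware: NET/DISK/etc. */
#define	PI_SOFT		(PI_MIN_ITHD + 8)	/* software handlers such as netisr */

int
main(void)
{
	int pri = PI_HARDWARE;

	printf("fresh handler:    %2d  (PI_HARDWARE: preferred over swis)\n", pri);
	pri += 4;
	printf("after one slice:  %2d  (== PI_SOFT: yields to fresh hardware "
	    "handlers, shares a runq with swis)\n", pri);
	pri += 4;
	printf("after two slices: %2d  (below PI_SOFT: can no longer starve "
	    "swis; further demotion stops at PRI_MAX_ITHD)\n", pri);
	return (0);
}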
Given all that, my question is whether other folks think that time-sharing of interrupt threads is an idea worth pursuing (vs. having ithreads always run to completion)? If so, do folks have feedback on the design or on the question of collapsing ithread priorities?

-- 
John Baldwin