stop_cpus_hard when multiple CPUs are panicking from an NMI

Attilio Rao attilio at freebsd.org
Fri Nov 16 00:16:51 UTC 2012


On Thu, Nov 15, 2012 at 11:47 PM, Ryan Stone <rysto32 at gmail.com> wrote:
> On Thu, Nov 15, 2012 at 6:41 PM, Attilio Rao <attilio at freebsd.org> wrote:
>>
>> On Thu, Nov 15, 2012 at 10:58 PM, Ryan Stone <rysto32 at gmail.com> wrote:
>> > At work we have some custom watchdog hardware that sends an NMI upon
>> > expiry.  We've modified the kernel to panic when it receives the
>> > watchdog NMI.  I've been trying the "stop_scheduler_on_panic" mode, and
>> > I've discovered that when my watchdog expires, the system gets
>> > completely wedged.  After some digging, I've discovered that I have
>> > multiple CPUs getting the watchdog NMI and trying to panic
>> > concurrently.  One of the CPUs wins, and the rest spin forever in this
>> > code:
>>
>> Quick question: can you control the way your watchdog sends the NMI?
>> Like only to BSP rather than broadcast, etc.
>> This is tied to the very unique situation that you cannot really
>> deliver the (second) NMI.
>>
>> Attilio
>>
>>
>> --
>> Peace can only be achieved by understanding - A. Einstein
>
>
> I don't believe that I can, but I can check.  In any case I can imagine
> other places where this could be an issue.  hwpmc works with NMIs, right?
> So an hwpmc bug could trigger the same kind of issues if two CPUs that
> concurrently called pmc_intr both tripped over the same bug.

Frankly, I think that what you were trying to do is in some way the
right approach, modulo a clean interface.

I don't understand why the "spinlock" wants to spin forever, as it can
never recover. Stopping the CPUs that get into the "spinlock" is
perfectly fine.
There are only 2 things to consider:
1) I think we need a new KPI for that: a function in
$arch/include/cpu.h that takes care of stopping a CPU in an MI way,
for example cpu_self_stop(). This needs to be implemented for all the
architectures, but that can be done easily because it will basically
be what cpustop_handler() and similar functions already do.
2) The "fake spinlock" path will call that function. The only thing
to debate IMHO is whether we want to make that conditional on
stop_scheduler_on_panic or not. To be honest, stopping the CPU seems
the best approach to me in any case, but I'm open to hearing what you
think.
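
To make 1) and 2) a bit more concrete, here is a rough user-space model
in C. It is only a sketch, not kernel code: panic_cpu, cpu_stopped[],
panic_enter() and the cpu argument to cpu_self_stop() are names and
signatures made up for illustration, and the real function would park
the CPU and never return instead of setting a flag. It models the
unconditional variant, where every CPU that loses the race is stopped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAXCPU	4
#define NOCPU	(-1)

/* Hypothetical stand-ins for kernel state; the names are illustrative. */
static atomic_int panic_cpu = NOCPU;	/* CPU that owns the panic, if any */
static atomic_bool cpu_stopped[MAXCPU];	/* simulated set of stopped CPUs */

/*
 * Model of the proposed MI entry point.  In a kernel, each architecture
 * would park the calling CPU here and never return.  This user-space
 * model only records the stop and returns so the result can be inspected.
 */
static void
cpu_self_stop(int cpu)
{
	assert(cpu >= 0 && cpu < MAXCPU);
	atomic_store(&cpu_stopped[cpu], true);
}

/*
 * Model of the race at panic entry.  Returns true if 'cpu' owns the panic
 * and should carry on, false if it lost the race and was stopped.
 */
static bool
panic_enter(int cpu)
{
	int expected = NOCPU;

	if (atomic_compare_exchange_strong(&panic_cpu, &expected, cpu))
		return (true);		/* first CPU in wins */
	if (expected == cpu)
		return (true);		/* recursive panic on the owner */
	cpu_self_stop(cpu);		/* loser: stop instead of spinning */
	return (false);
}
```

In this model, if CPU 2 calls panic_enter() first and CPU 0 calls it
afterwards, CPU 2 keeps ownership and CPU 0 ends up marked as stopped
rather than spinning on a lock it can never take.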

Comments?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


More information about the freebsd-hackers mailing list